BluePhos: An automated pipeline for optimizing the synthesis and analysis of blue phosphorescent materials.
The BluePhos pipeline is an automated computational tool that streamlines the development and analysis of blue phosphorescent materials. It combines computational chemistry with machine learning to predict and refine the properties of candidate compounds for light-emitting technologies.
The BluePhos pipeline functions like an automated assembly line: a structured yet adaptable workflow distributes tasks across available computing resources. It optimizes batch processing and resource allocation, handling each molecule independently for streamlined operation.
The current version of the pipeline comprises the following sequential tasks:
Planned enhancements include:
git clone https://github.com/ssec-jhu/bluephos.git
Navigate to the bluephos directory and create the blue_env environment using Conda:
cd bluephos
conda env create -f blue_env.yml
conda activate blue_env
After cloning, navigate to the project directory:
cd bluephos
python bluephos_pipeline.py [options]
Argument | Required | Type | Default | Description |
---|---|---|---|---|
--halides | No | String | None | Path to the CSV file containing halides data. Required when no input directory or ligand SMILES CSV file is specified. |
--acids | No | String | None | Path to the CSV file containing boronic acids data. Required when no input directory or ligand SMILES CSV file is specified. |
--features | Yes | String | None | Path to the element feature file used for neural network predictions. |
--train | Yes | String | None | Path to the train stats file used to normalize input data. |
--weights | Yes | String | None | Path to the full energy model weights file for the neural network. |
--input_dir | No | String | None | Directory containing input parquet files for rerun mode (mode 2). Used when no ligand SMILES CSV file is provided. |
--out-dir | No | String | None | Directory where the pipeline's output files will be saved. If not specified, defaults to the current directory. |
--t_nn | No | Float | 1.5 | Threshold for the neural network 'z' score. Candidates with an absolute 'z' score below this threshold will be considered. |
--t_ste | No | Float | 1.9 | Threshold for 'ste' (Singlet-Triplet Energy gap). Candidates with an absolute 'ste' value below this threshold will be considered. |
--t_dft | No | Float | 2.0 | Threshold for 'dft' (dft_energy_diff). Candidates with an absolute 'dft' value below this threshold will be considered. |
--ligand_smiles | No | String | None | Path to the CSV file containing ligand SMILES data. If provided, the pipeline runs in mode 3. |
--no_xtb | No | Bool | False | Disable xTB optimization. By default (flag absent), xTB optimization is enabled; pass --no_xtb to disable it. |
- If a ligand SMILES CSV file (--ligand_smiles) is provided, the pipeline operates in mode 3.
- If an input directory (--input_dir) is specified and no ligand SMILES CSV file is provided, the pipeline operates in mode 2.
- If neither a ligand SMILES CSV file nor an input directory is provided, the pipeline defaults to mode 1.
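The mode-selection rules above can be sketched as follows. This is an illustrative sketch, not the actual BluePhos source; the parameter names simply mirror the CLI options:

```python
def select_mode(ligand_smiles=None, input_dir=None):
    """Return the pipeline mode implied by the provided inputs (sketch only)."""
    if ligand_smiles is not None:
        return 3  # ligand SMILES CSV supplied
    if input_dir is not None:
        return 2  # rerun mode on existing parquet files
    return 1  # default: generate candidates from halides and boronic acids

print(select_mode(ligand_smiles="ligands.csv"))  # -> 3
print(select_mode(input_dir="parquet/"))         # -> 2
print(select_mode())                             # -> 1
```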
python bluephos_pipeline.py --halides path/to/halides.csv --acids path/to/acids.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5
python bluephos_pipeline.py --input_dir path/to/parquet_directory --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5
python bluephos_pipeline.py --ligand_smiles path/to/ligand_smiles.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5
python bluephos_pipeline.py --halides path/to/halides.csv --acids path/to/acids.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5 --t_nn 2.0 --t_ste 2.5
python bluephos_pipeline.py --halides path/to/halides.csv --acids path/to/acids.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5 --dft_package ase
python bluephos_pipeline.py --halides path/to/halides.csv --acids path/to/acids.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5 --no_xtb
tox -e run-pipeline -- --halide /path/to/aromatic_halides.csv --acid /path/to/aromatic_boronic_acids.csv --feature /path/to/element_features.csv --train /path/to/train_stats.csv --weight /path/to/model_weights.pt -o /path/to/output_dir/
Replace /path/to/... with the actual paths to your datasets and parameter files.
To run the pipeline using example data provided in the repository:
tox -e run-pipeline -- --halide ./tests/input/aromatic_halides_with_id.csv --acid ./tests/input/aromatic_boronic_acids_with_id.csv --feature ./bluephos/parameters/element_features.csv --train ./bluephos/parameters/train_stats.csv --weight ./bluephos/parameters/full_energy_model_weights.pt -o .
This command uses test data to demonstrate the pipeline's functionality and is ideal for initial testing and familiarization.
Pandas can be used to read and analyze Parquet files.
import pandas as pd

# Load a pipeline output file and summarize its numeric columns
df = pd.read_parquet('08ca147e-f618-11ee-b38f-eab1f408aca3-8.parquet')
print(df.describe())
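Beyond summary statistics, the score thresholds described earlier can be applied directly to a loaded DataFrame. The column names below ('z', 'ste', 'ligand_identifier') are assumptions for illustration; check them against your actual output schema:

```python
import pandas as pd

# Small in-memory stand-in for a loaded pipeline output file.
# Column names ('z', 'ste', 'ligand_identifier') are assumed, not verified.
df = pd.DataFrame({
    "ligand_identifier": ["L1", "L2", "L3"],
    "z": [0.4, -2.1, 1.2],
    "ste": [1.0, 0.5, 3.0],
})

# Keep candidates within the default pipeline thresholds (|z| < 1.5, |ste| < 1.9)
candidates = df[(df["z"].abs() < 1.5) & (df["ste"].abs() < 1.9)]
print(candidates["ligand_identifier"].tolist())  # -> ['L1']
```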
DuckDB provides an efficient way to query Parquet files directly using SQL syntax.
import duckdb as ddb

# Query the Parquet file directly with SQL; no intermediate load step is needed
query_result = ddb.query('''SELECT * FROM '08ca147e-f618-11ee-b38f-eab1f408aca3-8.parquet' LIMIT 10''')
print(query_result.to_df())
We welcome contributions! Please see our CONTRIBUTING.md for guidelines on how to contribute to this project.