This is the repository for MolScribe, an image-to-graph model that translates a molecular image to its chemical structure. Try our demo on HuggingFace!
If you use MolScribe in your research, please cite our paper.
@article{MolScribe,
    title = {{MolScribe}: Robust Molecular Structure Recognition with Image-to-Graph Generation},
    author = {Yujie Qian and Jiang Guo and Zhengkai Tu and Zhening Li and Connor W. Coley and Regina Barzilay},
    journal = {Journal of Chemical Information and Modeling},
    publisher = {American Chemical Society ({ACS})},
    doi = {10.1021/acs.jcim.2c01480},
    year = 2023,
}
Please check out our subsequent works on parsing chemical diagrams:
Option 1: Install MolScribe with pip
pip install MolScribe
Option 2: Run the following commands to clone the repository and install the package and its dependencies
git clone git@github.com:thomas0809/MolScribe.git
cd MolScribe
python setup.py install
Download the MolScribe checkpoint from HuggingFace Hub and predict molecular structures:
import torch
from molscribe import MolScribe
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download('yujieq/MolScribe', 'swin_base_char_aux_1m.pth')
model = MolScribe(ckpt_path, device=torch.device('cpu'))
output = model.predict_image_file('assets/example.png', return_atoms_bonds=True, return_confidence=True)
The output is a dictionary in the following format:
{
'smiles': 'Fc1ccc(-c2cc(-c3ccccc3)n(-c3ccccc3)c2)cc1',
'molfile': '***',
'confidence': 0.9175,
'atoms': [{'atom_symbol': '[Ph]', 'x': 0.5714, 'y': 0.9523, 'confidence': 0.9127}, ... ],
'bonds': [{'bond_type': 'single', 'endpoint_atoms': [0, 1], 'confidence': 0.9999}, ... ]
}
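The returned dictionary can be post-processed with plain Python. The following is a minimal sketch, assuming the output format shown above, that flags atoms and bonds whose confidence falls below a cutoff. The helper name `low_confidence_items` and the 0.95 threshold are hypothetical and not part of the MolScribe API; the sample dictionary is an abbreviated copy of the example above, not fresh model output.

```python
# Hypothetical helper for illustration; it is not part of the MolScribe API.
def low_confidence_items(output, threshold=0.95):
    """Return (kind, index, confidence) for each atom or bond below threshold."""
    flagged = []
    for kind in ("atoms", "bonds"):
        # 'atoms' and 'bonds' are only present when return_atoms_bonds=True.
        for index, item in enumerate(output.get(kind, [])):
            if item["confidence"] < threshold:
                flagged.append((kind, index, item["confidence"]))
    return flagged


# Abbreviated sample in the documented format (values copied from the example above).
sample_output = {
    "smiles": "Fc1ccc(-c2cc(-c3ccccc3)n(-c3ccccc3)c2)cc1",
    "confidence": 0.9175,
    "atoms": [{"atom_symbol": "[Ph]", "x": 0.5714, "y": 0.9523, "confidence": 0.9127}],
    "bonds": [{"bond_type": "single", "endpoint_atoms": [0, 1], "confidence": 0.9999}],
}

for kind, index, confidence in low_confidence_items(sample_output):
    print(f"{kind}[{index}] confidence={confidence:.4f}")
```

A check like this may help decide which predictions to review by hand; a suitable threshold depends on the image source and should be chosen empirically.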
Please refer to molscribe/interface.py and notebook/predict.ipynb for details and other available APIs.
For development or reproducing the experiments, please follow the instructions below.
Install the required packages
pip install -r requirements.txt
For training or evaluation, please download the corresponding datasets to data/.
Training data:
Datasets | Description |
---|---|
USPTO Download | Downloaded from USPTO, Grant Red Book. |
PubChem Download | Molecules are downloaded from PubChem, and images are dynamically rendered during training. |
Benchmarks:
Category | Datasets | Description |
---|---|---|
Synthetic Download | Indigo, ChemDraw | Images are rendered by Indigo and ChemDraw. |
Realistic Download | CLEF, UOB, USPTO, Staker, ACS | CLEF, UOB, and USPTO are downloaded from https://github.com/Kohulan/OCSR_Review. Staker is downloaded from https://drive.google.com/drive/folders/16OjPwQ7bQ486VhdX4DWpfYzRsTGgJkSu. ACS is a new dataset that we collected ourselves. |
Perturbed Download | CLEF, UOB, USPTO, Staker | Downloaded from https://github.com/bayer-science-for-a-better-life/Img2Mol/ |
Our model checkpoints can be downloaded from Dropbox or HuggingFace Hub.
Model architecture:
Download the model checkpoint to reproduce our experiments:
mkdir -p ckpts
wget -P ckpts https://huggingface.co/yujieq/MolScribe/resolve/main/swin_base_char_aux_1m680k.pth
python predict.py --model_path ckpts/swin_base_char_aux_1m680k.pth --image_path assets/example.png
The MolScribe prediction interface is in molscribe/interface.py. See the Python script predict.py or the Jupyter notebook notebook/predict.ipynb for example usage.
bash scripts/eval_uspto_joint_chartok_1m680k.sh
The script uses one GPU and a batch size of 64 by default. If more GPUs are available, update NUM_GPUS_PER_NODE and BATCH_SIZE for faster evaluation.
bash scripts/train_uspto_joint_chartok_1m680k.sh
The script uses four GPUs and a batch size of 256 by default. Training the model takes about one day on four A100 GPUs.
During training, we use a modified version of Indigo (included in molscribe/indigo/).
We implement a standalone evaluation script, evaluate.py. Example usage:
python evaluate.py \
--gold_file data/real/acs.csv \
--pred_file output/uspto/swin_base_char_aux_1m680k/prediction_acs.csv \
--pred_field post_SMILES
Predictions should be saved in a CSV file with two columns: image_id for the index (must match the gold file) and SMILES for the predicted SMILES. If the prediction column has a different name, specify it with --pred_field.
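As an illustration of the expected file layout, the sketch below writes a prediction CSV with the image_id and SMILES columns described above, using only the Python standard library. The function name `write_predictions`, the image IDs, and the SMILES strings are made-up placeholders; the example writes to an in-memory buffer, whereas in practice you would open a file (with `newline=""`) and pass that handle instead.

```python
import csv
import io


# Hypothetical helper for illustration; it is not part of the MolScribe codebase.
def write_predictions(rows, handle, smiles_field="SMILES"):
    """Write (image_id, smiles) pairs as a CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=["image_id", smiles_field])
    writer.writeheader()
    for image_id, smiles in rows:
        writer.writerow({"image_id": image_id, smiles_field: smiles})


# Placeholder predictions; real image_id values must match the gold file.
predictions = [("0", "CCO"), ("1", "c1ccccc1")]

buffer = io.StringIO()
write_predictions(predictions, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

If the SMILES column is written under another name (for example by passing a different `smiles_field`), that name is what would be given to --pred_field.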
The result contains three scores: