otori-bird / retrosynthesis

MIT License

Root-aligned SMILES: A Tight Representation for Chemical Reaction Prediction

[Figure: cross-attention]

[arxiv] [chemical science]

This directory contains the source code for the article by Zhong et al.: Root-aligned SMILES: A Tight Representation for Chemical Reaction Prediction.

In this work, we propose root-aligned SMILES (R-SMILES), which specifies a tightly aligned one-to-one mapping between the product and the reactant SMILES, to narrow the string representation discrepancy for more efficient retrosynthesis. Here we provide the source code of our method.
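To make the idea of string discrepancy concrete, the sketch below compares the character-level edit distance between a product SMILES and two ways of writing the same reactants. It is a hand-written toy example: the esterification SMILES and the `levenshtein` helper are illustrative assumptions, not code or output from this repository, and the actual method works on atom-mapped reactions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


# Hypothetical example strings, written by hand for illustration only.
product = "CC(=O)OCC"                # ethyl acetate
aligned_reactants = "CC(=O)O.OCC"    # acetic acid + ethanol, starting from matching atoms
unaligned_reactants = "CCO.CC(=O)O"  # same molecules, different order and starting atoms

aligned_distance = levenshtein(product, aligned_reactants)
unaligned_distance = levenshtein(product, unaligned_reactants)
print("aligned:", aligned_distance, "unaligned:", unaligned_distance)
```

In this example the aligned reactant string differs from the product only by the inserted `.O`, while the unaligned string needs more edits to reach the same product.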

Data and Model

USPTO-50K: https://github.com/Hanjun-Dai/GLN

USPTO-MIT: https://github.com/wengong-jin/nips17-rexgen/blob/master/USPTO/data.zip

USPTO-FULL: https://github.com/Hanjun-Dai/GLN

Our augmented datasets, checkpoints and 200 examples of attention maps: https://drive.google.com/drive/folders/1c15h6TNU6MSNXzqB6dQVMWOs2Aae8hs6?usp=sharing

Environment Preparation

Please make sure Anaconda is installed. Choose the PyTorch and cudatoolkit versions that match your machine. According to OpenNMT-py, the PyTorch version must be at least 1.6.

conda create -n r-smiles python=3.7
conda activate r-smiles
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
pip install pandas==1.3.4
pip install textdistance==4.2.2
conda install rdkit=2020.09.1.0 -c rdkit
pip install OpenNMT-py==2.2.0
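After installation, a quick check such as the following can confirm that the installed PyTorch meets the 1.6 minimum. This is a minimal sketch: `version_at_least` is a hypothetical helper written for this example, and it ignores local suffixes such as `+cu113`.

```python
import re


def version_at_least(version: str, minimum: tuple) -> bool:
    """Return True if a version string such as '1.12.1+cu113' is >= minimum."""
    # Keep only the leading numeric components; suffixes like '+cu113' are ignored.
    numbers = re.match(r"\d+(?:\.\d+)*", version)
    if numbers is None:
        raise ValueError(f"cannot parse version: {version!r}")
    parsed = tuple(int(part) for part in numbers.group(0).split("."))
    # Pad with zeros so that (1, 6) and (1, 6, 0) compare as equal.
    width = max(len(parsed), len(minimum))
    parsed += (0,) * (width - len(parsed))
    minimum = tuple(minimum) + (0,) * (width - len(minimum))
    return parsed >= minimum


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        status = "OK" if version_at_least(torch.__version__, (1, 6)) else "too old"
        print(torch.__version__, status)
```

Comparing integer tuples rather than raw strings avoids the pitfall where "1.10" sorts before "1.6".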

Overview of the workflow

We follow the OpenNMT architecture to train the Transformer. The workflow is covered step by step in the sections below.

We have placed all the config files in the pretrain_finetune and train-from-scratch folders and categorized them by task. You can use the config files directly with OpenNMT commands or modify them to suit your needs (for example, to use a different dataset or one of our prepared checkpoints).

Data preprocessing

Pretrain and Finetune

(If you want to pretrain the model, generate all the finetune datasets in advance so that a full vocab can be built.)

Train from scratch

Translate and score

P2R / R2P / P2S / S2R

P2S2R

Generate your own R-SMILES

Translate with our prepared checkpoint

After downloading our checkpoint files, you should start by creating a translation config file with the following content:

model: <the path of trained model>
src: <the path of input>
output: <the path of output>
gpu: 0
beam_size: 10
n_best: 10
batch_size: 8192
batch_type: tokens
max_length: 1000
seed: 0
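When several checkpoints or input files need to be evaluated, the config above can also be generated from a script. The sketch below renders the same fields; `render_translate_config` and the example paths are hypothetical and should be replaced with your own.

```python
import tempfile
from pathlib import Path


def render_translate_config(model, src, output, gpu=0, beam_size=10, n_best=10):
    """Return the text of a translation config with the fields listed above."""
    lines = [
        f"model: {model}",
        f"src: {src}",
        f"output: {output}",
        f"gpu: {gpu}",
        f"beam_size: {beam_size}",
        f"n_best: {n_best}",
        "batch_size: 8192",
        "batch_type: tokens",
        "max_length: 1000",
        "seed: 0",
    ]
    return "\n".join(lines) + "\n"


# Placeholder paths for illustration; they do not refer to real files.
config_text = render_translate_config(
    model="checkpoints/model.pt",
    src="data/src-test.txt",
    output="results/predictions.txt",
)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as workdir:
    config_path = Path(workdir) / "translate.yml"
    config_path.write_text(config_text)
    print(config_path.read_text())
```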

Then you can run the OpenNMT command to get the predictions:

onmt_translate -config <the path of config file>
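With `n_best: 10`, the output file is expected to hold 10 consecutive candidate lines per input line. The following sketch regroups such lines per input and removes the spaces between tokens. It assumes space-separated SMILES tokens, and `group_predictions` is a hypothetical helper, not part of OpenNMT-py or this repository.

```python
def group_predictions(lines, n_best=10):
    """Split a flat list of prediction lines into one ranked list per input."""
    # Drop the newline and the spaces that separate SMILES tokens.
    cleaned = [line.strip().replace(" ", "") for line in lines]
    if len(cleaned) % n_best != 0:
        raise ValueError("number of lines is not a multiple of n_best")
    return [cleaned[i:i + n_best] for i in range(0, len(cleaned), n_best)]


# Made-up output for two inputs with n_best=2, for illustration only.
example_output = [
    "C C ( = O ) O . O C C\n",
    "C C O . C C ( = O ) Cl\n",
    "C C ( = O ) O\n",
    "C C O\n",
]
grouped = group_predictions(example_output, n_best=2)
for rank_list in grouped:
    print(rank_list)
```

For a real run, read the lines from the file given as `output` in the config and pass the same `n_best` value that was used for translation.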

Results

Forward Prediction

Top-20 exact match accuracy on the USPTO-MIT.

Reagents Separated:

Model Top-1 Top-2 Top-5 Top-10 Top-20
Molecular Transformer 90.5 93.7 95.3 96.0 96.5
MEGAN 89.3 92.7 95.6 96.7 97.5
Augmented Transformer 91.9 95.4 97.0 - -
Chemformer 92.8 - 94.9 95.0 -
Ours 92.3 95.9 97.5 98.1 98.5

Reagents Mixed:

Model Top-1 Top-2 Top-5 Top-10 Top-20
Molecular Transformer 88.7 92.1 94.2 94.9 95.4
MEGAN 86.3 90.3 94.0 95.4 96.6
Augmented Transformer 90.4 94.6 96.5 - -
Chemformer 91.3 - 93.7 94.0 -
Ours 91.0 95.0 96.8 97.0 97.3

Retrosynthesis

Top-50 exact match accuracy on the USPTO-50K.

Model Top-1 Top-3 Top-5 Top-10 Top-20 Top-50
GraphRetro 53.7 68.3 72.2 75.5 - -
RetroPrime 51.4 70.8 74.0 76.1 - -
AT 53.5 - 81.0 85.7 - -
LocalRetro 53.4 77.5 85.9 92.4 - 97.7
Ours(P2S2R) 49.1 68.4 75.8 82.2 85.1 88.7
Ours(P2R) 56.3 79.2 86.2 91.0 93.1 94.6

Top-50 exact match accuracy on the USPTO-MIT.

Model Top-1 Top-3 Top-5 Top-10 Top-20 Top-50
LocalRetro 54.1 73.7 79.4 84.4 - 90.4
AutopSynRoute 54.1 71.8 76.9 81.8 - -
RetroTRAE 58.3 - - - - -
Ours(P2R) 60.3 78.2 83.2 87.3 89.7 91.6

Top-50 exact match accuracy on the USPTO-FULL.

Model Top-1 Top-3 Top-5 Top-10 Top-20 Top-50
RetroPrime 44.1 - - 68.5 - -
AT 46.2 - - 73.3 - -
LocalRetro 39.1 53.3 58.4 63.7 67.5 70.7
Ours(P2R) 48.9 66.6 72.0 76.4 80.4 83.1

Top-10 accuracy of product-to-synthon on the USPTO-50K.

Model Top-1 Top-3 Top-5 Top-10
G2Gs 75.8 83.9 85.3 85.6
GraphRetro 70.8 92.2 93.7 94.5
RetroPrime 65.6 87.7 92.0 -
Ours 75.2 94.4 97.9 99.1

Top-10 accuracy of synthon-to-reactant on the USPTO-50K.

Model Top-1 Top-3 Top-5 Top-10
G2Gs 61.1 81.5 86.7 90.0
GraphRetro 75.6 87.7 92.9 96.3
RetroPrime 73.4 87.9 89.8 90.4
Ours 73.9 91.9 95.2 97.4
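As a reading aid for the tables above, top-k exact match accuracy counts an example as correct when the ground truth appears among the first k ranked predictions. The sketch below is a simplified illustration, not the repository's scoring script: `top_k_accuracy` is a hypothetical helper, the data is made up, and it compares raw strings, so it assumes predictions and targets were canonicalized beforehand.

```python
def top_k_accuracy(predictions, targets, ks=(1, 3, 5, 10)):
    """Return {k: accuracy in percent}; predictions[i] is ranked best-first."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one example is required")
    hits = {k: 0 for k in ks}
    for ranked, target in zip(predictions, targets):
        # 1-based rank of the first exact match, or None if it never appears.
        rank = ranked.index(target) + 1 if target in ranked else None
        for k in ks:
            if rank is not None and rank <= k:
                hits[k] += 1
    return {k: 100.0 * hits[k] / len(targets) for k in ks}


# Made-up ranked predictions for four examples.
predictions = [
    ["CC(=O)O.OCC", "CC(=O)Cl.OCC"],  # correct at rank 1
    ["CCO", "CCN"],                   # correct at rank 2
    ["c1ccccc1", "C1CCCCC1"],         # never correct
    ["CCBr", "CCCl"],                 # correct at rank 1
]
targets = ["CC(=O)O.OCC", "CCN", "CCO", "CCBr"]

accuracy = top_k_accuracy(predictions, targets, ks=(1, 2))
print(accuracy)  # {1: 50.0, 2: 75.0}
```

Because a hit at rank k also counts for every larger k, the accuracy in each table row can only stay the same or increase from left to right.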

Acknowledgement

OpenNMT-py: https://github.com/OpenNMT/OpenNMT-py