wangxr0526 / RetroPrime

Code for Single-step Retrosynthesis model Retroprime
MIT License
33 stars 6 forks source link

RetroPrime

This is the code for the "RetroPrime: A Diverse, Plausible and Transformer-based Method for Single-Step Retrosynthesis Predictions" Chemical Engineering Journal.
Preprint Chemrxiv. \ To implement our models we were based on OpenNMT-py (v0.4.1).

RetroPrime Pipline

Pipline

Install requirements

Create a new conda environment:

conda install create -n retroprime_env python=3.6
conda activate retroprime_env
conda install -c conda-forge rdkit
conda install pandas tqdm six
conda install pytorch==1.5.0 torchvision cudatoolkit=10.1 -c pytorch

Then,

cd RetroPrime_root/
pip install -e .
cd RetroPrime_root/retroprime/dataprocess/packeage/SmilesEnumerator/
pip install -e .

This step installs the Smiles enumerator. https://github.com/EBjerrum/SMILES-enumeration, or you can also use RDKit's own enumeration function to replace these parts of this code.

Then,

cd RetroPrime_root/retroprime/transformer_model/
pip install -e .

Dataset

USPTO-50K: https://github.com/connorcoley/retrosim/blob/master/retrosim/data/data_processed.csv
USPTO 1976_sep2016: https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873

Data Processing

cd RetroPrime_root/
mkdir RetroPrime_root/databox
cd RetroPrime_root/retroprime/dataprocess/

You can put the data set (csv) in the path shown below:

RetroPrime_root/databox/dataset_name/dataset.csv

You can follow the sequence number of the *.sh script. Or you can change the file path in the script to handle your own reaction data. Reaction dataset like this:

id,reactants>reagents>production,dataset
US09371281B2,[OH-:1].O[NH2:4].C[O:6][C:7](=O)[c:9]1[cH:10][cH:11][c:12]2[cH:13][cH:14][n:15]([CH2:18][c:19]3[cH:20][cH:21][c:22]([O:25][CH3:26])[cH:23][cH:24]3)[c:16]2[cH:17]1>>[OH:1][NH:4][C:7](=[O:6])[c:9]1[cH:10][cH:11][c:12]2[cH:13][cH:14][n:15]([CH2:18][c:19]3[cH:20][cH:21][c:22]([O:25][CH3:26])[cH:23][cH:24]3)[c:16]2[cH:17]1,train
US07842713B2,[C:26]([CH3:27])([CH3:28])([CH3:29])[O:30][C:31](=[O:32])[c:33]1[cH:34][c:35]2[c:36]([c:43]([OH:45])[cH:44]1)[CH2:37][C:38]([CH3:40])([CH2:41][OH:42])[O:39]2.F[c:55]1[cH:54][cH:53][c:52]([S:49]([CH:46]2[CH2:47][CH2:48]2)(=[O:50])=[O:51])[cH:57][cH:56]1>>[C:26]([CH3:27])([CH3:28])([CH3:29])[O:30][C:31](=[O:32])[c:33]1[cH:34][c:35]2[c:36]([c:43]([O:45][c:55]3[cH:54][cH:53][c:52]([S:49]([CH:46]4[CH2:47][CH2:48]4)(=[O:50])=[O:51])[cH:57][cH:56]3)[cH:44]1)[CH2:37][C:38]([CH3:40])([CH2:41][OH:42])[O:39]2,train
US07642277B2,OOC(c1cccc(Cl)c1)=[O:9].[Cl:12][c:13]1[c:14]([CH2:19][CH:20]=[CH2:21])[cH:15][cH:16][cH:17][cH:18]1>>[O:9]1[CH:20]([CH2:19][c:14]2[c:13]([Cl:12])[cH:18][cH:17][cH:16][cH:15]2)[CH2:21]1,val
US04837349,CC(C)(C)[O:8][C:6]([NH:5][C@H:4]([C:3]([O:2][CH3:1])=[O:16])[CH:13]([CH3:14])[CH3:15])=[O:7].O=S(=O)(O[Si:25]([CH3:26])([CH3:27])[C:28]([CH3:29])([CH3:30])[CH3:31])C(F)(F)F>>[CH3:1][O:2][C:3]([C@@H:4]([NH:5][C:6](=[O:7])[O:8][Si:25]([CH3:26])([CH3:27])[C:28]([CH3:29])([CH3:30])[CH3:31])[CH:13]([CH3:14])[CH3:15])=[O:16],test

Train

Tokenlized data preprocess:

dataset=dataset_name_pos_pred
python preprocess.py -train_src data/${dataset}/src-train.txt \
                     -train_tgt data/${dataset}/tgt-train.txt \
                     -valid_src data/${dataset}/src-val.txt \
                     -valid_tgt data/${dataset}/tgt-val.txt \
                     -save_data data/${dataset}/${dataset} \
                     -src_seq_length 1000 -tgt_seq_length 1000 \
                     -src_vocab_size 1000 -tgt_vocab_size 1000 -share_vocab

dataset=dataset_name_S2R
python preprocess.py -train_src data/${dataset}/src-train.txt \
                     -train_tgt data/${dataset}/tgt-train.txt \
                     -valid_src data/${dataset}/src-val.txt \
                     -valid_tgt data/${dataset}/tgt-val.txt \
                     -save_data data/${dataset}/${dataset} \
                     -src_seq_length 1000 -tgt_seq_length 1000 \
                     -src_vocab_size 1000 -tgt_vocab_size 1000 -share_vocab

train the two stage model:

dataset=dataset_name_pos_pred
python  train.py -data data/${dataset}/${dataset} \
                   -save_model experiments/checkpoints/${dataset}/${dataset}_model \
                   -seed 42 -gpu_ranks 2 -save_checkpoint_steps 10000 -keep_checkpoint 20 \
                   -train_steps 250000 -param_init 0  -param_init_glorot -max_generator_batches 32 \
                   -batch_size 4096 -batch_type tokens -normalization tokens -max_grad_norm 0  -accum_count 4 \
                   -optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam -warmup_steps 8000  \
                   -learning_rate 2 -label_smoothing 0.0 -report_every 1000 \
                   -layers 13 -rnn_size 500 -word_vec_size 500 -encoder_type transformer -decoder_type transformer \
                   -dropout 0.1 -position_encoding -share_embeddings \
                   -global_attention general -global_attention_function softmax -self_attn_type scaled-dot \
                   -heads 8 -transformer_ff 2048

dataset=dataset_name_S2R
python  train.py -data data/${dataset}/${dataset} \
                   -save_model experiments/checkpoints/${dataset}/${dataset}_model \
                   -seed 42 -gpu_ranks 2 -save_checkpoint_steps 10000 -keep_checkpoint 20 \
                   -train_steps 250000 -param_init 0  -param_init_glorot -max_generator_batches 32 \
                   -batch_size 4096 -batch_type tokens -normalization tokens -max_grad_norm 0  -accum_count 4 \
                   -optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam -warmup_steps 8000  \
                   -learning_rate 2 -label_smoothing 0.0 -report_every 1000 \
                   -layers 13 -rnn_size 500 -word_vec_size 500 -encoder_type transformer -decoder_type transformer \
                   -dropout 0.1 -position_encoding -share_embeddings \
                   -global_attention general -global_attention_function softmax -self_attn_type scaled-dot \
                   -heads 8 -transformer_ff 2048                   

Use model for prediction

First, download the model file we trained https://drive.google.com/file/d/1-715B8jU0rRC3YaY4p6URQcgjcRG2OlV/view?usp=sharing
unzip it and put it under RetroPrime_root/retroprime/transformer_model/experiments/

We provide a template run_example.sh like this:

#!/usr/bin/env bash

input_file=$1
output_dir=$2
cache_dir=${output_dir}/cache
beam_size=$3
core=8
gpu=0
if [ ! -e ${output_dir} ];
then
    mkdir -p ${output_dir}
fi
if [ ! -e ${cache_dir} ];
then
    mkdir -p ${cache_dir}
fi

transformer_root=retroprime/transformer_model
model_save_path=${transformer_root}/experiments/checkpoints
to_stage2_scritp_root=${transformer_root}/script
model_P2S=${model_save_path}/USPTO-50K_pos_pred/USPTO-50K_pos_pred_model.pt
model_S2R=${model_save_path}/USPTO-50K_S2R/USPTO-50K_S2R_model.pt

python ${to_stage2_scritp_root}/smi_tokenizer.py -input $input_file \
                  -output ${output_dir}/canonical_token_for_input.txt
echo Products to Synthons
CUDA_VISIBLE_DEVICES=${gpu} python ${transformer_root}/translate.py -gpu ${gpu} \
                    -model ${model_P2S} \
                    -src ${output_dir}/canonical_token_for_input.txt \
                    -output ${output_dir}/synthon_predicted.txt \
                    -batch_size 64 -replace_unk -max_length 200 -beam_size ${beam_size} -n_best ${beam_size}
python ${to_stage2_scritp_root}/evaluate.py -beam_size ${beam_size}\
              -src_file ${output_dir}/canonical_token_for_input.txt \
              -pre_file ${output_dir}/synthon_predicted.txt \
              -save_rank_results_file ${cache_dir}/pre_synthons_rank.csv \
              -save_top ${cache_dir}/pre_synthons_top_results.csv \
              -write_to_step2  \
              -core ${core} \
              -step2_save_file ${output_dir}/to_synthon_tokenlized.txt
echo Synthons to Reactants
CUDA_VISIBLE_DEVICES=${gpu} python ${transformer_root}/translate.py -gpu ${gpu} \
                    -model ${model_S2R} \
                    -src ${output_dir}/to_synthon_tokenlized.txt \
                    -output ${output_dir}/reactants_predicted.txt \
                    -batch_size 64 -replace_unk -max_length 200 -beam_size ${beam_size} -n_best ${beam_size}
python ${to_stage2_scritp_root}/mix_c2c_top3_after_rerank.py \
                    -pre_file ${output_dir}/reactants_predicted.txt \
                    -mix_save_file  ${output_dir}/reactants_predicted_mix.txt \
                    -beam_size=${beam_size} -core ${core}

You can prepare the input file (TXT format) and specify the path of the trained two-stage model.

input file name: INPUT.txt

...
CCCCCC(=O)OCCCC
c1ccccc1C(=O)OC
...

Then:

./run_example.sh INPUT.txt OUTPUT_FOLDER 10

Then the predicted results are in OUTPUT_FOLDER/reactants_predicted_mix.txt