otori-bird / retrosynthesis


Pretraining on USPTO dataset #12

Closed: shuan4638 closed this issue 15 hours ago

shuan4638 commented 4 months ago

Congrats on the great work! I have a few questions about the model pretraining.

Although this work emphasizes how R-SMILES boosts retrosynthesis prediction performance, I suspect that the main performance gain actually comes from pretraining on the USPTO-full dataset.

I formed this assumption after seeing Figure S2, where the top-1 accuracy without pretraining is ~54% after 30K training steps, slightly higher than AT (53.5%) but lower than the reported accuracy (56.3%). Even Chemformer (pretrained on 100M SMILES) shows only a ~1% improvement in top-1 accuracy but a huge drop (>10%) in top-10 accuracy. With that said, do the authors consider the improvement to come from R-SMILES, or more from the choice of pretraining dataset?

In addition, I couldn't find where the script removes the molecules in the USPTO-50K and USPTO-MIT datasets during model pretraining, as claimed in the paper. If this step is missing, pretraining the model on the ground truth could lead to a serious data leakage issue.
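For clarity, the filtering step I was expecting would look roughly like the following minimal sketch (the file names are hypothetical placeholders; it canonicalizes everything with RDKit and drops every pretraining molecule that also appears in a benchmark set):

```python
from rdkit import Chem

def canonical(smi):
    """Return the canonical SMILES, or None if RDKit cannot parse it."""
    mol = Chem.MolFromSmiles(smi)
    return Chem.MolToSmiles(mol) if mol is not None else None

# Hypothetical file names, one SMILES per line.
held_out = set()
for name in ("uspto_50k.txt", "uspto_mit.txt"):
    with open(name) as f:
        held_out.update(canonical(line.strip()) for line in f)
held_out.discard(None)

with open("pretrain_smiles.txt") as fin, open("pretrain_filtered.txt", "w") as fout:
    for line in fin:
        smi = canonical(line.strip())
        if smi is not None and smi not in held_out:
            fout.write(smi + "\n")
```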

4r33x commented 4 months ago

> In addition, I couldn't find where the script removes the molecules in the USPTO-50K and USPTO-MIT datasets during model pretraining, as claimed in the paper. If this step is missing, pretraining the model on the ground truth could lead to a serious data leakage issue.

I believe this cannot cause a leak, since at the pretraining stage the model is trained not on the retrosynthesis task but on translating masked molecules back into the original ones.
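As a rough illustration of that objective, here is a minimal sketch of how a masked-translation pretraining pair can be built (token-level masking with a `<mask>` token is my assumption; the repo's exact masking scheme may differ):

```python
import random

def make_masked_pair(tokens, mask_rate=0.15, mask_token="<mask>"):
    """Build a (source, target) pretraining pair: the source has a random
    subset of tokens replaced by <mask>; the target is the original SMILES."""
    src = [mask_token if random.random() < mask_rate else t for t in tokens]
    return " ".join(src), " ".join(tokens)

src, tgt = make_masked_pair("C C ( = O ) c 1 c c c c c 1".split())
print(src)  # e.g. C C ( = O ) <mask> 1 c c c <mask> c 1
print(tgt)  # C C ( = O ) c 1 c c c c c 1
```

The model only ever learns to reconstruct a molecule from its corrupted copy, so no product-to-reactant pairs from the test sets are involved.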

shuan4638 commented 4 months ago

If the model is pretrained to generate molecules that are included in the answers, it is more likely to produce SMILES it has already seen when predicting the synthesis route. I assume this is why a model pretrained on 1M molecules from the USPTO dataset works better than one pretrained on 100M molecules from the ZINC database (Chemformer). Of course, this assumption should be verified by further experiments.

Nonetheless, this issue would not arise if the ground-truth molecules were completely removed during pretraining (as claimed in the paper). Since I could not find the line where those molecules are removed, could you tell me where that step is done in this repo?

Thank you.

shuan4638 commented 1 week ago

Thanks for updating the code to exclude the data in the USPTO-50K and USPTO-MIT datasets. I understand that you did exclude the USPTO-50K and USPTO-MIT data when you trained the model, but that this part was mistakenly removed during code cleanup. To check the results again, I tried to reproduce the pretrain-finetune results by following the instructions in the README.md. However, I hit three bugs, which I fixed myself:

  1. No averaged model generated: `FileNotFoundError: [Errno 2] No such file or directory: 'exp/USPTO_50K_PtoR_aug20/average_model_56-60.pt'`. The file `average_model_56-60.pt` was expected but never generated by the code, so I simply replaced this model path with the last checkpoint, `exp/USPTO_50K_PtoR_aug20/finetune_model.product-reactants_step_300000.pt` (see the averaging sketch at the end of this comment).

  2. Typo in "augmentation": `score.py: error: unrecognized arguments: -augmenation 20 ./dataset/USPTO_50K_PtoR_aug20/test/src-test.txt`. The argument `augmentation` was mistyped as `augmenation` in the README.md; I fixed it manually.

  3. Undefined variable in `canonicalize_smiles_clear_map`: `NameError: name 'opt' is not defined`. The call `Chem.MolFromSmiles(smiles, sanitize=not opt.synthon)` fails because `opt` is not defined inside `canonicalize_smiles_clear_map`. I modified the code to `Chem.MolFromSmiles(smiles, sanitize=True)`, since the default value of `opt.synthon` is True (my patched helper is sketched below this list).
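For reference, my patched version of the helper amounts to the following minimal sketch (an illustration of the intended behavior rather than the repo's exact code; the atom-map clearing mirrors what the function name implies):

```python
from rdkit import Chem

def canonicalize_smiles_clear_map(smiles, sanitize=True):
    """Strip atom-map numbers and return the canonical SMILES,
    or '' if RDKit cannot parse the input."""
    mol = Chem.MolFromSmiles(smiles, sanitize=sanitize)
    if mol is None:
        return ""
    for atom in mol.GetAtoms():
        atom.SetAtomMapNum(0)  # remove atom-to-atom mapping labels
    try:
        return Chem.MolToSmiles(mol)
    except Exception:
        return ""  # unsanitized mols can still fail at writing time
```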

Other than these three modifications, I followed your instructions exactly and got the following results after running score.py:

[screenshot of score.py results]

The top-1 accuracy is 16.976% and the top-10 accuracy is 25.384%, which differs substantially from the results shown in the paper. Could you explain the difference?
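As an aside, the checkpoint averaging that the missing `average_model_56-60.pt` was supposed to produce can be reproduced manually. Here is a minimal sketch, assuming OpenNMT-style `.pt` checkpoints that keep their weights under a `model` key (the key and file names are assumptions; adjust them to your setup):

```python
import torch

def average_checkpoints(paths, out_path):
    """Average the 'model' weights of several .pt checkpoints and save
    the result on top of the first checkpoint's metadata."""
    base, avg = None, None
    for path in paths:
        ckpt = torch.load(path, map_location="cpu")
        state = ckpt["model"]
        if avg is None:
            base = ckpt
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    for k in avg:
        avg[k] /= len(paths)
    base["model"] = avg
    torch.save(base, out_path)

# Hypothetical file names for the five checkpoints the script name suggests:
paths = [f"checkpoint_step_{s}.pt" for s in (56, 57, 58, 59, 60)]
# average_checkpoints(paths, "average_model_56-60.pt")
```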

4r33x commented 1 week ago

> Thanks for updating the code to exclude the data in the USPTO-50K and USPTO-MIT datasets. I understand that you did exclude the USPTO-50K and USPTO-MIT data when you trained the model, but that this part was mistakenly removed during code cleanup. To check the results again, I tried to reproduce the pretrain-finetune results by following the instructions in the README.md. However, I hit three bugs, which I fixed myself:
>
> 1. No averaged model generated: `FileNotFoundError: [Errno 2] No such file or directory: 'exp/USPTO_50K_PtoR_aug20/average_model_56-60.pt'`. The file `average_model_56-60.pt` was expected but never generated by the code, so I simply replaced this model path with the last checkpoint, `exp/USPTO_50K_PtoR_aug20/finetune_model.product-reactants_step_300000.pt`.
> 2. Typo in "augmentation": `score.py: error: unrecognized arguments: -augmenation 20 ./dataset/USPTO_50K_PtoR_aug20/test/src-test.txt`. The argument `augmentation` was mistyped as `augmenation` in the README.md; I fixed it manually.
> 3. Undefined variable in `canonicalize_smiles_clear_map`: `NameError: name 'opt' is not defined`. The call `Chem.MolFromSmiles(smiles, sanitize=not opt.synthon)` fails because `opt` is not defined inside `canonicalize_smiles_clear_map`. I modified the code to `Chem.MolFromSmiles(smiles, sanitize=True)`, since the default value of `opt.synthon` is True.
>
> Other than these three modifications, I followed your instructions exactly and got the following results after running score.py:
>
> [screenshot of score.py results]
>
> The top-1 accuracy is 16.976% and the top-10 accuracy is 25.384%, which differs substantially from the results shown in the paper. Could you explain the difference?

I am not one of the authors of the paper, but I reproduced the results from this paper with new code that I wrote myself. I trained the models with Hugging Face libraries without the pretraining step (training from scratch), and the results are quite similar (with proper scoring, slightly better than Chemformer). There is most likely something wrong with your setup, newly introduced bugs in the repository, or your Python dependencies.

shuan4638 commented 23 hours ago

I have pretrained the model again and fine-tuned it for retrosynthesis, strictly following the instructions in the README.md. This time, I averaged the models using the script `pretrain_finetune/finetune/PtoR/PtoR-50K-aug20-average.sh` and scored the predictions with `opt.synthon` set to True. The results are similar to my previous attempts:

[screenshot of score.py results]

@4r33x, I understand that you achieved good results by reproducing the work with your own code. However, the original question I raised in this issue concerns the inconsistency between the data-preparation code before the last update and the claims in the paper. Following the provided instructions did not yield the expected results, so I believe it is best to wait for a reply from the original authors.

otori-bird commented 21 hours ago

> I have pretrained the model again and fine-tuned it for retrosynthesis, strictly following the instructions in the README.md. This time, I averaged the models using the script `pretrain_finetune/finetune/PtoR/PtoR-50K-aug20-average.sh` and scored the predictions with `opt.synthon` set to True. The results are similar to my previous attempts:
>
> [screenshot of score.py results]
>
> @4r33x, I understand that you achieved good results by reproducing the work with your own code. However, the original question I raised in this issue concerns the inconsistency between the data-preparation code before the last update and the claims in the paper. Following the provided instructions did not yield the expected results, so I believe it is best to wait for a reply from the original authors.

Thanks for your attention to our work. As far as I can see, you are training a P2R model, which should only be used to predict the desired reactants from the product. In this case, the option `opt.synthon` should not be used, since you are not predicting synthons. You could score the results with `opt.synthon` set to False.

otori-bird commented 21 hours ago

> Thanks for updating the code to exclude the data in the USPTO-50K and USPTO-MIT datasets. I understand that you did exclude the USPTO-50K and USPTO-MIT data when you trained the model, but that this part was mistakenly removed during code cleanup. To check the results again, I tried to reproduce the pretrain-finetune results by following the instructions in the README.md. However, I hit three bugs, which I fixed myself:
>
> 1. No averaged model generated: `FileNotFoundError: [Errno 2] No such file or directory: 'exp/USPTO_50K_PtoR_aug20/average_model_56-60.pt'`. The file `average_model_56-60.pt` was expected but never generated by the code, so I simply replaced this model path with the last checkpoint, `exp/USPTO_50K_PtoR_aug20/finetune_model.product-reactants_step_300000.pt`.
> 2. Typo in "augmentation": `score.py: error: unrecognized arguments: -augmenation 20 ./dataset/USPTO_50K_PtoR_aug20/test/src-test.txt`. The argument `augmentation` was mistyped as `augmenation` in the README.md; I fixed it manually.
> 3. Undefined variable in `canonicalize_smiles_clear_map`: `NameError: name 'opt' is not defined`. The call `Chem.MolFromSmiles(smiles, sanitize=not opt.synthon)` fails because `opt` is not defined inside `canonicalize_smiles_clear_map`. I modified the code to `Chem.MolFromSmiles(smiles, sanitize=True)`, since the default value of `opt.synthon` is True.
>
> Other than these three modifications, I followed your instructions exactly and got the following results after running score.py:
>
> [screenshot of score.py results]
>
> The top-1 accuracy is 16.976% and the top-10 accuracy is 25.384%, which differs substantially from the results shown in the paper. Could you explain the difference?

Thanks for your detailed report. It is really valuable, and we will fix these problems in the next update. Regarding problem 3, your understanding is basically correct: `Chem.MolFromSmiles(smiles, sanitize=True)` is the default setting for RDKit and should be used in most cases. However, it cannot parse synthons correctly, so `sanitize` should be False when scoring the accuracy of synthons.
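To illustrate, here is a minimal example with a hypothetical synthon-like fragment (real synthons from bond disconnections fail sanitization for similar valence/aromaticity reasons):

```python
from rdkit import Chem

frag = "c1ccc1"  # hypothetical fragment that cannot be kekulized

print(Chem.MolFromSmiles(frag))                  # None: sanitization fails
print(Chem.MolFromSmiles(frag, sanitize=False))  # a Mol object is still returned
```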

shuan4638 commented 17 hours ago

I actually used sanitize=False in my second trial; the previous reply contained a typo. The accuracy shown in the screenshot was obtained with the correct settings.

otori-bird commented 17 hours ago

> I actually used sanitize=False in my second trial; the previous reply contained a typo. The accuracy shown in the screenshot was obtained with the correct settings.

If you score the reactants with the correct settings, i.e. `opt.synthon` set to False and `sanitize=True`, and still cannot get the desired results, here is my advice:

  1. Check your performance without pretraining on the USPTO-full dataset to make sure there is no mistake in the P2R training step. You can use the train-from-scratch scripts.
  2. Your screenshot shows a huge percentage of invalid SMILES, which is quite strange. According to Zheng et al. (Predicting Retrosynthetic Reactions Using Self-Corrected Transformer Neural Networks), the invalid-SMILES rate within the top-10 predictions of a P2R model trained with a vanilla Transformer and canonical SMILES is about 22%. You should check why this happens; a quick way to measure the rate is sketched below.
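A minimal sketch of such a check, assuming one (possibly space-tokenized) predicted SMILES per line in a hypothetical predictions.txt:

```python
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.*")  # silence RDKit's per-molecule parse errors

total = invalid = 0
with open("predictions.txt") as f:  # hypothetical file name
    for line in f:
        smi = line.strip().replace(" ", "")  # drop token spacing if present
        total += 1
        if Chem.MolFromSmiles(smi) is None:
            invalid += 1

print(f"invalid SMILES: {invalid}/{total} ({100 * invalid / total:.1f}%)")
```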

Looking forward to your further results.

shuan4638 commented 15 hours ago

I set `opt.synthon` to False and `sanitize` to True. I also found the reason for the high invalid-SMILES rate: because I built the vocabulary only from the pretraining molecules, to prevent any possible leakage during pretraining, the token `.` was not in the pretrained model's vocabulary. As a result, the predictions look like the following: `C C ( = O ) c 1 c c c 2 [nH] c c c 2 c 1 <unk> C ( = O ) ( O C ( = O ) O C ( C ) ( C ) C ) O C ( C ) ( C ) C`

After replacing `<unk>` with `.`, I got a much better result:

[screenshot of score.py results]

The results now look much closer to those reported in the paper, and it is clear to me that the improvement does not come from pretraining on the test set. Thanks, and congratulations again to the team for the great work. Really impressive.
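For anyone hitting the same vocabulary issue, the workaround amounts to a one-line substitution over the tokenized prediction files (a minimal sketch; pred.txt and pred_fixed.txt are hypothetical file names):

```python
# Replace the out-of-vocabulary <unk> token with the missing '.' separator.
with open("pred.txt") as fin, open("pred_fixed.txt", "w") as fout:
    for line in fin:
        fout.write(line.replace("<unk>", "."))
```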