fiberleif opened this issue 5 days ago
Since the USPTO-Full dataset uses augmentation size 5 and the default beam size is 10, we should ideally get 50 outputs per test sample. I used the command below to evaluate the released checkpoint:
```bash
python score.py \
    -beam_size 10 \
    -n_best 50 \
    -augmentation 5 \
    -targets ./USPTO_full_PtoR_aug5/test/tgt-test.txt \
    -predictions ./USPTO_full_PtoR-translate-results-20240705.txt \
    -process_number 8 \
    -score_alpha 1 \
    -save_file ./full_eval_results.txt \
    -source ./USPTO_full_PtoR_aug5/test/src-test.txt
```
For k = 1, 3, 5, 10 my results are close to the numbers reported in the paper, but for k = 20 and 50 the difference seems large, so I am wondering whether there is some misunderstanding on my side?
The arguments n_best and beam_size are usually the same. You could try using beam_size=50 when predicting and scoring.
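For reference, a minimal sketch of what matching beam_size and n_best at 50 could look like with OpenNMT-py 2.2.0 (the paths, checkpoint name, and generation settings are placeholders, not the authors' exact scripts):

```bash
# Predict with beam_size = n_best = 50 (placeholder paths and settings)
onmt_translate -model USPTO_full_PtoR.pt \
    -src ./USPTO_full_PtoR_aug5/test/src-test.txt \
    -output ./predictions_beam50.txt \
    -beam_size 50 -n_best 50 \
    -gpu 0 -max_length 500

# Score with the same beam_size and n_best
python score.py \
    -beam_size 50 -n_best 50 -augmentation 5 \
    -targets ./USPTO_full_PtoR_aug5/test/tgt-test.txt \
    -predictions ./predictions_beam50.txt \
    -process_number 8 -score_alpha 1 \
    -save_file ./full_eval_results_beam50.txt \
    -source ./USPTO_full_PtoR_aug5/test/src-test.txt
```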
Thanks for your kind reply. I successfully reproduced the top-50 results using beam_size = 50 and your processed test set (96,023 samples, not the 101,311 in raw_test.csv).
Did you use this filtered test set for all the baselines in Table 5, or not?
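A quick line-count sanity check on the processed files (a sketch, assuming the augmented test files hold augmentation × test_size lines, i.e. 5 × 96,023 = 480,115):

```bash
# Should print 480115 if the processed test set has 96023 samples with 5x augmentation
wc -l ./USPTO_full_PtoR_aug5/test/tgt-test.txt
wc -l ./USPTO_full_PtoR_aug5/test/src-test.txt
```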
If the result was implemented by us, it used the same test set.
Thank you! Does "implemented by us" mean "marked with a 'c' symbol" (explained in the paper as "Denotes that the result is implemented by the open-source code with well-tuned hyperparameters")?
In that case, would that be only LocalRetro?
Yes. If you are interested in results on the original dataset, feel free to give it a try.
Dear @otori-bird and @fiberleif, I'm trying to replicate R-SMILES on the USPTO-Full dataset as well. I followed all the instructions mentioned in the paper, but I couldn't achieve the same results. I used a beam size and n_best of 50, and training was conducted on a V100. Do you have any recommendations for improving the results or identifying potential issues in my approach?
Full dataset I was using (from the Google Drive shared by the authors):
Training script I was using:
After training, I also used the checkpoint-averaging script to obtain the final checkpoint for inference and scoring (see the sketch after this list):
Inference script I was using:
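For readers without the attached scripts, here is a sketch of the checkpoint-averaging step using OpenNMT-py's averaging tool (the script path, checkpoint glob, and output name are assumptions, not the exact setup used here):

```bash
# Average the saved checkpoints into a single model for inference
# (assumes OpenNMT-py 2.2.0's onmt/bin/average_models.py and model_step_*.pt naming)
python OpenNMT-py/onmt/bin/average_models.py \
    -models model_step_*.pt \
    -output model_avg.pt
```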
Thank you for your assistance. I will repeat the experiment, as the training was interrupted three times, which might have caused the issue. Interestingly, when I replicated the USPTO-MIT model, I achieved the expected results.
@otori-bird Dear author, I found that the released checkpoint for the USPTO-Full dataset (USPTO_full_PtoR.pt) has 44,529,405 total parameters, but when I use the default train-from-scratch config (https://github.com/otori-bird/retrosynthesis/blob/main/train-from-scratch/PtoR/PtoR-Full-aug5-config.yml), the model has 44,501,739 total parameters.
I used the same OpenNMT-py==2.2.0 as noted in the Readme. Could you explain why the sizes differ, or how the released checkpoint for the USPTO-Full dataset was trained?
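One way to verify such a discrepancy is to count parameters directly in both checkpoints. A minimal sketch, assuming the OpenNMT-py 2.x checkpoint layout with 'model' and 'generator' state dicts (the file name is a placeholder):

```bash
# Sum parameter counts over the encoder/decoder and generator state dicts
python -c "
import torch
ckpt = torch.load('USPTO_full_PtoR.pt', map_location='cpu')
total = sum(p.numel()
            for sd in (ckpt['model'], ckpt['generator'])
            for p in sd.values())
print(total)
"
```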
Dear authors,
Thanks for releasing your code. Regarding the top-20 and top-50 results in the Readme file, could you explain how you obtained them?