knu-lcbc / RetroTRAE

Retrosynthetic prediction with Atom Environments

Dataset comparison - USPTO full vs USPTO MIT #6

Closed · fredhastedt closed this 1 year ago

fredhastedt commented 1 year ago

Hi,

this is a very interesting approach. Well done!

I have one concern regarding the comparison with other existing methods. For evaluation, you made use of the USPTO-MIT dataset, which is commonly used for (forward) reaction prediction. I saw that the works on the self-correcting transformer and AutoSynRoute used the same dataset. However, other retrosynthesis algorithms (GLN, AT, RetroSim, and RetroPrime) were trained and evaluated on a dataset roughly double the size (USPTO-full, curated by Dai et al.). Would you not agree that a comparison here is a bit unfair, as the evaluation on a dataset 2x the size is surely more difficult?

Thank you for clarifying this.

fredhastedt commented 1 year ago

Just to add one thing: in ref 33 (AutoSynRoute), NeuralSym was evaluated at 47.8% on the MIT dataset, whereas Dai et al. evaluated NeuralSym at 35.8% on the full dataset. Given that both groups implemented NeuralSym in the same way, it seems that the full dataset is considerably harder to perform well on.

azpisruh commented 1 year ago

Hello, I appreciate your interest and inquiry regarding the comparison of our approach to other existing methods.

While I understand your concern, I would not fully concur with the assertion that "the evaluation on a dataset 2x the size is surely more difficult." The relationship between dataset size and accuracy is multifaceted, influenced by various factors, including the quality of the additional data and the training procedure employed.

To illustrate, one could argue that evaluating performance on the USPTO-MIT dataset should be considerably more challenging than on the well-curated USPTO-50K dataset, given that the former is approximately ten times larger. However, this argument is not corroborated by the results reported in the very same ref 33, which demonstrated that Segler-Corey achieved 47.8% accuracy on the MIT dataset versus 38.7% on the 50K dataset, and that AutoSynRoute attained 54.1% accuracy on the MIT dataset compared to 43.1% on the 50K dataset.

The discrepancy between the NeuralSym results you mentioned could potentially be explained if the USPTO-full dataset is noisier than the USPTO-MIT dataset.
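As a toy illustration of what "noisier" can mean in practice, one simple proxy is the fraction of raw entries that fail basic sanity checks such as SMILES parsing. A minimal sketch of such a filter, assuming RDKit (the file name and reaction-SMILES layout are hypothetical, not RetroTRAE's actual preprocessing):

```python
# Hypothetical sketch: filter reaction SMILES that fail basic sanity
# checks, one simple proxy for dataset "noise".
from rdkit import Chem

def is_parsable(reaction_smiles):
    """Keep a 'reactants>reagents>products' string only if its reactant
    and product components both parse with RDKit."""
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        return False
    reactants, _, products = parts
    for component in (reactants, products):
        if not component or Chem.MolFromSmiles(component) is None:
            return False
    return True

# Hypothetical input: one raw reaction SMILES per line.
with open("uspto_full_raw.txt") as f:
    reactions = [line.strip() for line in f if line.strip()]

clean = [rxn for rxn in reactions if is_parsable(rxn)]
print(f"kept {len(clean)} of {len(reactions)} reactions")
```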

As you pointed out, the lack of consensus on dataset selection can indeed result in inconsistencies when comparing various studies. We have acknowledged this issue by including a footnote stating that some of the results in the comparison table are derived from either the USPTO-full or USPTO-MIT datasets.

fredhastedt commented 1 year ago

Hello,

Thank you very much for your prompt reply. You are correct that I was wrong to infer that a larger dataset is necessarily harder to perform well on; I should have referred, as you pointed out, to the noisiness/sparsity of the dataset.

In Dai et al., they write:

Despite the noisiness of the full USPTO set relative to the clean USPTO-50k, our method still outperforms the two best baselines in top-k accuracies.

Possibly, this is an indication that this dataset is harder to perform well on (as it was curated by Dai et al. themselves).
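For context on the numbers being compared here: the top-k accuracies these papers report are exact-match rates, i.e. a prediction counts as correct only if the canonicalized ground-truth reactant SMILES appears among the model's k highest-ranked outputs. A minimal sketch of that metric, assuming RDKit for canonicalization (the function names are illustrative, not from any of the cited codebases):

```python
# Illustrative sketch of top-k exact-match accuracy; not taken from
# RetroTRAE or any of the cited codebases.
from rdkit import Chem

def canonicalize(smiles):
    """Return RDKit's canonical SMILES, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def top_k_accuracy(ground_truths, ranked_predictions, k=10):
    """Fraction of test reactions whose true reactant SMILES appears
    among the k highest-ranked predictions, after canonicalization."""
    hits = 0
    for truth, preds in zip(ground_truths, ranked_predictions):
        target = canonicalize(truth)
        top_k = {canonicalize(p) for p in preds[:k]}
        if target is not None and target in top_k:
            hits += 1
    return hits / len(ground_truths)
```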

Regarding the footnote, are you referring to this?

The results are based on either filtered MIT-full [46,47] or MIT-fully atom mapped [15] reaction datasets.

If so, there are no superscripts/subscripts indicating which of the algorithms were evaluated on USPTO-full and which on USPTO-MIT. That is why I got confused, and other readers possibly will, too.

The performance difference between the two datasets becomes more evident in the paper "Root-aligned SMILES: a tight representation for chemical reaction prediction" by Zhong et al., where we can see a performance decrease of 12-15%, which is rather large. Of course, your approach might not exhibit the same drop; however, I still believe that the comparison in Table 3 is misleading for the reader.

azpisruh commented 1 year ago

Hi,

Regarding the footnote, I agree that we should have been more explicit in indicating which methods were evaluated on the USPTO-full dataset and which on USPTO-MIT; the current form of the table may not be sufficiently clear for readers. I hope this GitHub issue helps to compensate for the confusion. Thank you for your valuable feedback.

Additionally, I appreciate the reference to the paper by Zhong et al., which highlights the performance difference between the two datasets.

Best.

fredhastedt commented 1 year ago

Hi,

Thanks for the clarification. Again, well done for this very interesting framework.