Reproducing CANOPUS benchmark

roman-bushuiev commented 1 year ago

Dear Sam,

Very interesting work and it is great that you made the data and code publicly available. I am trying to reproduce the evaluation on the CANOPUS benchmark proposed in section 2.6 and have a few questions. I would appreciate your help in clarifying them:

1. When retrieving candidate molecules, do you consider only a set of isomers with unique fingerprints or all the PubChem formula isomers? I am asking because it significantly impacts the accuracy at k > 1.
2. Do you compute accuracy based on the first 14 characters of the InChIKey?
3. The folder canopus_train_public/retrieval_hdf seems to be empty. Do I understand correctly that I should obtain candidate isomers by simply filtering cid_smiles.txt by the ground truth chemical formula of each sample?
4. Could you please explain the following lines? They probably answer my first question, but I cannot understand their meaning. :) "For all ties, the optimistic lower rank of the tied options is chosen." "Ties are broken by selecting the minimum rank."

Thank you in advance!

Roman

samgoldman97 commented 1 year ago

Hi Roman, Thanks for the interest. Answers below. Feel free to drop me an email if you want to find time to talk further. I'm also hoping to update the code base and make it all a little easier in the coming weeks. These ambiguities are part of the reason I think the fingerprint prediction metric is so useful as a task to test neural network encoder strength. If you can outperform on that metric with the same training data, I think you've built a powerful model!

1. All formula isomers (see the answer to question 4 for how we account for this).
2. No, I used the entire inchikey. That said, I remove stereochemistry information from all the smiles, so I believe it is functionally equivalent to comparing only on the first 14 chars.
3. Yes, you are correct. These files were quite large so I did not post them. See preprocessing/pubchem to wrangle pubchem cid_smiles.txt into the formula file, then preprocessing/canopus_train_public for how I turned that into retrieval hdf files.
4. This was my compromise for point 1. Let's say the model computes equal distances for the rank-ordered inchikeys 1 and 2 in the retrieval library, and inchikey 2 is the true one. In all evaluations, we counted this as correctly retrieved at index 1. This is not necessarily the right way to do it, and I will consider changing it in V2 of the code (I acknowledge you can get high performance by predicting distance = 0 for everything), but at least for the purposes of this work, we applied the same scheme to all methods for comparison. See analysis/retrieval/extract_ranking.py.
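
For anyone following along, the tie handling described above can be sketched roughly as follows. This is only an illustration under my own assumptions, not the code in analysis/retrieval/extract_ranking.py: the names optimistic_rank and top_k_accuracy and the example distances are made up for this sketch.

```python
def optimistic_rank(distances, true_index):
    """1-based rank of the true candidate, with ties resolved in its favor.

    Hypothetical helper: candidates tied with the true one all share the
    minimum rank of the tied group.
    """
    true_dist = distances[true_index]
    # Only candidates strictly closer than the true one push its rank down.
    return 1 + sum(1 for d in distances if d < true_dist)


def top_k_accuracy(ranks, k):
    """Fraction of queries whose true candidate is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Candidates 0 and 1 are tied; the true candidate is number 1.
distances = [0.12, 0.12, 0.30, 0.45]
tied_rank = optimistic_rank(distances, true_index=1)

# Degenerate model that predicts the same distance for every candidate.
constant_rank = optimistic_rank([0.0, 0.0, 0.0, 0.0], true_index=3)

print(tied_rank, constant_rank)  # 1 1
print(top_k_accuracy([tied_rank, constant_rank], k=1))  # 1.0
```

The second call shows the caveat mentioned above: under this scheme a model that outputs the same distance for everything is scored as a perfect top-1 retrieval.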

Best, Sam

roman-bushuiev commented 1 year ago

Thank you for your quick response! Now everything is clear. It would definitely be great to talk further once I properly benchmark my model. Regarding 4., I believe that such "retrieval at index" is an even more reasonable performance measure than the rank ordering of all candidates. Since the predicted query fingerprint is continuous (not rounded to 0s and 1s) and the reference fingerprints are binary, equal cosine distances during retrieval almost certainly imply that the reference fingerprints are identical. But identical 4096-bit Morgan fingerprints for different PubChem compounds are possible (I believe) only for stereoisomers, which are primarily a limitation of MS/MS itself rather than of the downstream ML tools. :)
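
A small sketch of what I mean, under toy assumptions: the 6-bit fingerprints, the query values, and the candidate names below are invented for illustration (the real fingerprints are 4096-bit), and cosine_distance is just a plain-Python helper, not code from the repository.

```python
import math


def cosine_distance(query, reference):
    """Cosine distance between a continuous query and a binary reference."""
    dot = sum(q * r for q, r in zip(query, reference))
    norm_q = math.sqrt(sum(q * q for q in query))
    norm_r = math.sqrt(sum(r * r for r in reference))
    return 1.0 - dot / (norm_q * norm_r)


# Continuous predicted fingerprint (toy values, not rounded to 0s and 1s).
query = [0.91, 0.07, 0.64, 0.12, 0.33, 0.80]

# Binary reference fingerprints (toy values).
references = {
    "candidate_a": [1, 0, 1, 0, 0, 1],
    "candidate_b": [1, 0, 1, 0, 0, 1],  # same bits as candidate_a
    "candidate_c": [1, 0, 0, 0, 1, 1],  # same number of set bits, different positions
}

distances = {name: cosine_distance(query, fp) for name, fp in references.items()}

# Identical reference fingerprints always tie exactly.
tied = distances["candidate_a"] == distances["candidate_b"]
# A different fingerprint ties only if the continuous query happens to line up.
differs = distances["candidate_c"] != distances["candidate_a"]
print(tied, differs)  # True True
```

With a continuous query, two different binary references would tie only by numerical coincidence, so ties in practice point to candidates that share a fingerprint.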

roman-bushuiev commented 1 year ago

Hi, @samgoldman97! One more question. When reporting scores on a test fold, do you retrain the model including a validation fold, or do you simply use the best model after hyperparameter tuning on a validation fold? Thank you!

samgoldman97 commented 1 year ago

Hi Roman, We run hyperparameter optimization once against the validation split, then retrain models separately on the train/val set for each fold before evaluating on the actual test sets.
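
In case it helps, here is a rough sketch of that protocol. It is not the repository's training code: select_hyperparams, evaluate_folds, and the toy train_fn and score_fn are hypothetical stand-ins, and the sketch reads "train/val set" as the union of the train and validation data, which is an assumption.

```python
def select_hyperparams(candidates, train, val, train_fn, score_fn):
    """Pick the candidate that scores best on a single validation split."""
    return max(candidates, key=lambda hp: score_fn(train_fn(train, hp), val))


def evaluate_folds(folds, hp, train_fn, score_fn):
    """Retrain with fixed hyperparameters on each fold, then score its test set."""
    scores = []
    for fold in folds:
        # Assumption: "train/val set" means train and validation data combined.
        model = train_fn(fold["train"] + fold["val"], hp)
        scores.append(score_fn(model, fold["test"]))
    return scores


# Toy stand-ins: the "model" is a scaled mean, the score is negative mean absolute error.
def train_fn(data, hp):
    return hp * sum(data) / len(data)


def score_fn(model, data):
    return -sum(abs(model - y) for y in data) / len(data)


folds = [
    {"train": [1.0, 2.0, 3.0], "val": [2.0], "test": [2.0]},
    {"train": [4.0, 6.0], "val": [5.0], "test": [5.0]},
]

# Hyperparameters are tuned once, here on the first fold's validation split.
best_hp = select_hyperparams([0.5, 1.0], folds[0]["train"], folds[0]["val"], train_fn, score_fn)
scores = evaluate_folds(folds, best_hp, train_fn, score_fn)
print(best_hp)  # 1.0
```

The point of the sketch is only the order of operations: hyperparameters are chosen once, and the test sets are touched only after the per-fold retraining.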

Also, in case you didn't see, I have greatly simplified the pipeline and re-run the results, as can be seen in the notebooks. It no longer relies on SIRIUS for sub-formulae labeling, so it should be much more streamlined for benchmarking. As you recommended, I have switched to "worst case" retrieval, which does reduce the numbers for all methods (more so on the CANOPUS dataset than on CSI/NIST).
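
As a rough illustration of what "worst case" means here (a sketch under my own assumptions, not the updated repository code; worst_case_rank and the example distances are invented for this example), every candidate tied with the true one is counted as ranked ahead of it:

```python
def worst_case_rank(distances, true_index):
    """1-based rank of the true candidate, with ties resolved against it.

    Hypothetical helper: the true candidate takes the maximum rank of its
    tied group.
    """
    true_dist = distances[true_index]
    # Count every candidate at least as close as the true one, including itself.
    return sum(1 for d in distances if d <= true_dist)


# Candidates 0 and 1 are tied; the true candidate is number 1.
print(worst_case_rank([0.12, 0.12, 0.30, 0.45], true_index=1))  # 2

# A model that predicts the same distance for everything now gets the last rank.
print(worst_case_rank([0.0, 0.0, 0.0, 0.0], true_index=3))  # 4

# Without ties, the rank is unaffected by the tie-breaking scheme.
print(worst_case_rank([0.10, 0.20, 0.30], true_index=0))  # 1
```

Under this scheme a tie can only hurt a method, which is consistent with the numbers going down for all methods.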

Out of curiosity, have you been able to replicate some of the results, and how has it been going?

Sam

roman-bushuiev commented 1 year ago

Thanks a lot for the quick reply. That is very nice; I hadn't noticed the updated notebooks before. I am writing you an email regarding my results. :)

samgoldman97 / mist

Reproducing CANOPUS benchmark #8