Hi Roman, Thanks for the interest. Answers below. Feel free to drop me an email if you want to find time to talk further. I'm also hoping to update the code base and make it all a little easier in the coming weeks. These ambiguities are part of the reason I think the fingerprint prediction metric is so useful as a task to test neural network encoder strength. If you can outperform on that metric with the same training data, I think you've built a powerful model!
See preprocessing/pubchem for how to wrangle the PubChem cid_smiles.txt into the formula file, then preprocessing/canopus_train_public for how I turned that into the retrieval HDF files, and analysis/retrieval/extract_ranking.py for the ranking extraction. Best, Sam
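For readers following along, here is a minimal sketch of the kind of formula indexing Sam describes, assuming cid_smiles.txt contains tab-separated CID and SMILES columns; `build_formula_index` is a hypothetical helper, not the repo's actual code:

```python
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem.rdMolDescriptors import CalcMolFormula

def build_formula_index(cid_smiles_path):
    """Group PubChem entries by molecular formula.

    Assumes each line of cid_smiles.txt is "<CID>\t<SMILES>".
    """
    formula_to_entries = defaultdict(list)
    with open(cid_smiles_path) as fh:
        for line in fh:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 2:
                continue
            cid, smiles = parts[0], parts[1]
            mol = Chem.MolFromSmiles(smiles)
            if mol is None:  # skip unparseable SMILES
                continue
            formula_to_entries[CalcMolFormula(mol)].append((cid, smiles))
    return formula_to_entries

# Candidate isomers for a query formula are then a dictionary lookup:
# candidates = build_formula_index("cid_smiles.txt")["C9H11NO2"]
```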
Thank you for your quick response! Now everything is clear. It would definitely be great to talk further once I properly benchmark my model. Regarding 4., I believe that such "retrieval at index" is an even more reasonable performance measure than the rank ordering of all candidates. Since the predicted query fingerprint is continuous (not rounded to 0s and 1s) and the reference fingerprints are binary, equal cosine distances during retrieval almost certainly imply that the tied reference fingerprints are identical. But identical 4096-bit Morgan fingerprints for different PubChem compounds are (I believe) possible only for stereoisomers, which are primarily a bottleneck for MS/MS itself rather than for the downstream ML tools. :)
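A toy illustration of this point (hypothetical code, not from the repo): with a continuous query fingerprint, exact cosine-similarity ties essentially only arise when candidates share an identical binary fingerprint.

```python
import numpy as np

def cosine_sim(query, candidates):
    """Cosine similarity between a continuous query fingerprint and
    rows of a binary candidate fingerprint matrix."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return c @ q

rng = np.random.default_rng(0)
query = rng.random(4096)                                # continuous predicted fingerprint
fps = rng.integers(0, 2, size=(5, 4096)).astype(float)  # binary Morgan fingerprints
fps[3] = fps[1]                                         # e.g. two stereoisomers share a fingerprint
sims = cosine_sim(query, fps)
print(np.round(sims, 8))                                # only the duplicated rows tie exactly
```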
Hi, @samgoldman97! One more question: when reporting scores on a test fold, do you retrain the model including the validation fold, or do you simply use the best model after hyperparameter tuning on the validation fold? Thank you!
Hi Roman, We do one round of hyperparameter optimization against the validation split, then we retrain models separately on the train/val set for each fold before evaluating on the actual test sets.
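In code terms, that protocol might look like the following toy, runnable sketch (scikit-learn stand-ins, not the actual MIST training code):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = rng.random((100, 8)), rng.integers(0, 2, 100)

# Step 1: one round of hyperparameter tuning against a single
# held-out validation split (here: a toy grid over k).
train_idx, val_idx = np.arange(80), np.arange(80, 100)
best_k = max(
    [1, 3, 5],
    key=lambda k: KNeighborsClassifier(n_neighbors=k)
    .fit(X[train_idx], y[train_idx])
    .score(X[val_idx], y[val_idx]),
)

# Step 2: for each fold, retrain from scratch on train + val
# with the fixed hyperparameters, then score the test split.
for trainval_idx, test_idx in KFold(n_splits=3).split(X):
    model = KNeighborsClassifier(n_neighbors=best_k).fit(X[trainval_idx], y[trainval_idx])
    print(model.score(X[test_idx], y[test_idx]))
```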
Also, in case you didn't see, I have simplified the pipeline considerably and re-run the results, as can be seen in the notebooks. The pipeline no longer relies on SIRIUS for subformula labeling, so it should be much more streamlined for benchmarking. As you recommended, I have also switched to "worst case" retrieval, which does reduce the magnitude of the numbers for all methods (more so on the CANOPUS dataset than on CSI/NIST); see the sketch below.
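For concreteness, here is one way to compute both tie-breaking conventions (a hypothetical sketch; the repo's analysis/retrieval/extract_ranking.py may differ):

```python
import numpy as np

def retrieval_rank(sims, true_idx, worst_case=False):
    """Rank (1 = best) of the true candidate among similarity scores.

    Optimistic: minimum rank among ties; worst case: maximum rank.
    """
    s = sims[true_idx]
    better = np.sum(sims > s)
    tied = np.sum(sims == s)  # includes the true candidate itself
    return int(better + tied) if worst_case else int(better + 1)

sims = np.array([0.91, 0.87, 0.91, 0.80, 0.91])
print(retrieval_rank(sims, true_idx=0))                   # optimistic -> 1
print(retrieval_rank(sims, true_idx=0, worst_case=True))  # worst case -> 3
```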
Out of curiosity, may I ask: have you been able to replicate some of the results, and how has it been going?
Sam
Thanks a lot for the quick reply. That is very nice; I hadn't noticed the updated notebooks before. I am writing you an email regarding my results. :)
Dear Sam,
Very interesting work, and it is great that you made the data and code publicly available. I am trying to reproduce the evaluation on the CANOPUS benchmark proposed in Section 2.6 and have a few questions. I would appreciate your help in clarifying them:
- canopus_train_public/retrieval_hdf seems to be empty. Do I understand it correctly that I should obtain the candidate isomers by simply filtering cid_smiles.txt by the ground-truth chemical formula of each sample?
- Regarding the tie-breaking statement "For all ties, the optimistic lower rank of the tied options is chosen": am I right that ties are broken by selecting the minimum rank?
Thank you in advance!
Roman