skinniderlab / CLM


We should be using train0 here #140

Closed vineetbansal closed 3 months ago

vineetbansal commented 4 months ago

https://github.com/skinniderlab/CLM/blob/8cf35da0a08838b1b159227020de6ddf0f1c6ec0/snakemake/Snakefile#L129

vineetbansal commented 4 months ago

@skinnider - initially, this step was taking the testing data from across all folds without augmentation, and the training data from across all folds with augmentation.

We're now saying that write_structural_prior_CV_overall should report the efficacy of unaugmented test SMILES against unaugmented training SMILES across all folds (in addition to PubChem/model). But aren't these two (unaugmented test SMILES / unaugmented training SMILES) the same set when looking across folds? That would mean model=train is expected to report the best top-k results here (in fact, as good as can be expected), and the "real" comparison is between the model and PubChem/Addcarbon.

This makes intuitive sense, but is slightly different from what we were doing before, so just trying to make sure I understand..
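As a toy illustration of the "same set" point above (not CLM code; the SMILES strings and fold count are made up): across a K-fold split, the pooled unaugmented test SMILES and the pooled unaugmented training SMILES coincide, so every test compound has an exact match in the pooled training data.

```python
# Toy sketch: across K folds, the union of held-out test sets equals the
# union of training sets, i.e. the same set of unaugmented SMILES.
from sklearn.model_selection import KFold

smiles = [f"C{'C' * i}O" for i in range(10)]  # placeholder SMILES strings

test_union, train_union = set(), set()
for train_idx, test_idx in KFold(n_splits=3).split(smiles):
    train_union.update(smiles[i] for i in train_idx)
    test_union.update(smiles[i] for i in test_idx)

assert test_union == train_union == set(smiles)
```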

skinnider commented 4 months ago

Good catch, but I think this should be fine if things have worked correctly in process_tabulated_molecules/collect_tabulated_molecules: compounds (inchikeys) that were in the training set should be assigned a frequency of NA, which means they won't contribute to the average frequency across folds. Instead, the frequency for compounds in the training set will be based solely on the one fold that didn't contain this compound in the training set. If I understood right, this is what @anushka255 and I were talking about today. So, by definition, searching against the training set should yield a top-k accuracy of 0% for any value of k. Let me know if this is unclear...
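A minimal pandas sketch of the masking idea described above (the column names, fold structure, and values are assumptions for illustration, not the actual process_tabulated_molecules/collect_tabulated_molecules code):

```python
# Hedged sketch: training-set compounds get a frequency of NA in each fold,
# so the cross-fold average only reflects folds where they were held out.
import numpy as np
import pandas as pd

# Sampled-molecule frequencies per CV fold, keyed by InChIKey (toy values).
folds = {
    0: pd.DataFrame({"inchikey": ["A", "B", "C"], "frequency": [5.0, 2.0, 7.0]}),
    1: pd.DataFrame({"inchikey": ["A", "B", "C"], "frequency": [4.0, 3.0, 6.0]}),
}
# InChIKeys present in each fold's training set (toy assignment).
train_keys = {0: {"A"}, 1: {"B"}}

masked = []
for fold, df in folds.items():
    df = df.copy()
    # Compounds in this fold's training set are set to NA.
    df.loc[df["inchikey"].isin(train_keys[fold]), "frequency"] = np.nan
    masked.append(df)

# mean() skips NA, so "A" and "B" are averaged only over folds where they
# were not in the training set.
overall = pd.concat(masked).groupby("inchikey")["frequency"].mean()
print(overall)
```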

skinnider commented 4 months ago

Sorry, I just realized I misunderstood your point and now see what you're saying. I agree this is a problem. For model="train", is the simplest thing to sum the results over individual CV folds?

vineetbansal commented 4 months ago

ah I think your previous comment was for model=generative_model, not model=train?

For model=train, I wasn't suggesting there's a problem per se: if the results are interpretable (i.e. target_ranks are mostly expected to be 0), I can submit a PR with that change. If by "problem" you mean we'll be doing unnecessary file generation/computation, then I agree; perhaps I can submit a PR now and handle those optimizations as a separate step.

skinnider commented 4 months ago

Yes, that's right. I think you were right that there is a problem. The goal is that model=train produces a top-k accuracy of zero, since by definition, the training dataset will never identify "novel" compounds (although it will often produce reasonably close matches, as quantified by the Tanimoto coefficient, as seen in the attached plot).

[attached plot: Tanimoto coefficients of closest training-set matches]
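For context, a minimal RDKit sketch of the kind of nearest-neighbor Tanimoto comparison behind that plot (the molecules and fingerprint settings are placeholders, not the CLM implementation):

```python
# Hedged sketch: Tanimoto similarity of a held-out molecule to its closest
# training-set match (toy molecules; Morgan fingerprint settings assumed).
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fp(smiles):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=2048)

train_smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]
held_out = "c1ccccc1N"  # not in the training set, but structurally close to phenol

sims = [DataStructs.TanimotoSimilarity(fp(held_out), fp(s)) for s in train_smiles]
print(max(sims))  # nearest-neighbor Tanimoto coefficient
```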

I see a few potential workarounds:

  1. Simply aggregate the results from individual cross-validation folds for model=train (e.g. concatenate together the write_structural_prior_CV outputs; see the sketch after this list). This is fine because, unlike with the generative model, there is no sampling frequency, and each compound should be in the held-out set exactly once.
  2. Read in all test files for the entire CV split and dynamically exclude the entire relevant fold for model=train in write_structural_prior_CV_overall, based on the current row.

What do you think?
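A rough sketch of what option 1 amounts to (the file names below are placeholders, not the actual write_structural_prior_CV outputs):

```python
# Rough sketch of option 1: pool the per-fold model=train results, since
# each compound is held out exactly once across the CV split.
import glob
import pandas as pd

fold_files = sorted(glob.glob("structural_prior_CV_fold*_train.csv"))  # placeholder paths
overall = pd.concat((pd.read_csv(f) for f in fold_files), ignore_index=True)
overall.to_csv("structural_prior_CV_overall_train.csv", index=False)
```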
vineetbansal commented 3 months ago

Thanks - we went with option 1 above, and it's merged in commit 1ff59b0 on the master branch.