skinniderlab / CLM


nearest-neighbor Tc, ever vs. never generated #217

Open skinnider opened 1 week ago

skinnider commented 1 week ago

Writing out the points that @vineetbansal and I talked about earlier this afternoon. The idea is, for each molecule in the test set, to record the Tanimoto coefficient (Tc) between that molecule and its nearest neighbor in the training set. These Tc values are then compared between test set molecules that were generated by the CLM (i.e., rank is not NA/Inf) and those that were never generated (i.e., rank is NA/Inf). In the context of cross-validation, the test set is a single fold (e.g., fold 10) and the training set is the remaining folds (e.g., folds 1-9). The write_nn_tc.py script should therefore be run with two different files as input, one for the test fold and one for the training folds, which I believe is not currently the case.
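To make the intended calculation concrete, here is a minimal sketch of the comparison. It is not the code in write_nn_tc.py: the function names are hypothetical, fingerprints are represented as plain Python sets of on-bit indices as a stand-in for whatever fingerprint the real script uses, and a missing rank is assumed to be encoded as `None`, NaN, or Inf.

```python
import math


def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # assumption: two empty fingerprints are treated as dissimilar
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_neighbor_tc(test_fps, train_fps):
    """For each test fingerprint, return the highest Tc to any training fingerprint."""
    return [max(tanimoto(fp, ref) for ref in train_fps) for fp in test_fps]


def was_generated(rank):
    """A molecule counts as generated if its rank is a finite number (not None/NaN/Inf)."""
    return rank is not None and math.isfinite(rank)


def split_by_generated(nn_tcs, ranks):
    """Split nearest-neighbor Tc values into (ever generated, never generated)."""
    ever = [tc for tc, r in zip(nn_tcs, ranks) if was_generated(r)]
    never = [tc for tc, r in zip(nn_tcs, ranks) if not was_generated(r)]
    return ever, never


# Toy data: the training set (e.g. folds 1-9) and the test set (e.g. fold 10)
# are two separate inputs, which is the point of this issue.
train_fps = [{1, 2, 3, 4}, {10, 11, 12}]
test_fps = [{1, 2, 3}, {10, 20, 30}, {7, 8, 9}]
ranks = [5, float("inf"), None]  # one rank per test molecule

nn_tcs = nearest_neighbor_tc(test_fps, train_fps)
ever, never = split_by_generated(nn_tcs, ranks)
print("ever generated:", ever)
print("never generated:", never)
```

The key design point is that `test_fps` and `train_fps` come from different files, so a test molecule is never compared against itself.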

These analyses are depicted in Fig. 4 of the DarkNPS paper:

[image: Fig. 4 of the DarkNPS paper]

Fig. 4 | Automated structure elucidation of unidentified NPSs. a, Proportion of molecules within the set of 194 NPSs added to the HighResNPS database between October 2020 and April 2021 that appeared at least once within a sample of one billion SMILES strings from the generative model. b, Tanimoto coefficients between held-out NPSs and their nearest neighbour in the training set, for molecules in the held-out set that were generated at least once versus molecules in the held-out set that were never generated.