skinniderlab / CLM

MIT License
0 stars 0 forks source link

Potential bug in write_structural_prior #138

Closed anushka255 closed 4 months ago

anushka255 commented 4 months ago

Based on the following line of code, any smiles generated exactly once by the model will be assigned a frequency of 0. https://github.com/skinniderlab/CLM/blob/8cf35da0a08838b1b159227020de6ddf0f1c6ec0/src/clm/commands/inner_write_structural_prior_CV.py#L135

vineetbansal commented 4 months ago

I looked over the code again and it looks fine to me (at least as far as the concern above goes):

It doesn't matter if the model generates a smile once or multiple times. For each smile in the test set, we match it against a smile from the model (and also PubChem and the training set, but that's not important here..). If we find smile(s) within a small window of the molecular mass of this test smile, then we start assigning ranks (in the target_rank column) based on inchikey match.

The n_candidates=match.shape[0] tells us how many potential matches there were, based purely on molecular mass. The ,0 that we see in most of the rows in test_data/snakemake_output/0/prior/structural_prior/LOTUS_truncated_SMILES_all_freq-avg_CV_ranks_structure.csv is because that particular row did not have a match in terms of molecular mass in the target_source source. In all these cases target_rank will be blank (or np.nan).

So one way to interpret each row in *_ranks_structure.csv would be - look at the n_candidates column. These are the number of smiles in target_source that could potentially match this test smile based on molecular mass. If this is >0, then look at target_rank column, if it is blank, then the test smile is not a match for any of these candidates (based on inchikey). If specified, then it was found at at that rank (between 0 and n_candidates-1).

Let me know if I'm missing your concern here.