Potential bug in write_structural_prior

I looked over the code again and it looks fine to me (at least as far as the concern above goes):

It doesn't matter if the model generates a smile once or multiple times. For each smile in the test set, we match it against a smile from the model (and also PubChem and the training set, but that's not important here..). If we find smile(s) within a small window of the molecular mass of this test smile, then we start assigning ranks (in the target_rank column) based on inchikey match.

The n_candidates=match.shape[0] tells us how many potential matches there were, based purely on molecular mass. The ,0 that we see in most of the rows in test_data/snakemake_output/0/prior/structural_prior/LOTUS_truncated_SMILES_all_freq-avg_CV_ranks_structure.csv is because that particular row did not have a match in terms of molecular mass in the target_source source. In all these cases target_rank will be blank (or np.nan).

So one way to interpret each row in *_ranks_structure.csv would be - look at the n_candidates column. These are the number of smiles in target_source that could potentially match this test smile based on molecular mass. If this is >0, then look at target_rank column, if it is blank, then the test smile is not a match for any of these candidates (based on inchikey). If specified, then it was found at at that rank (between 0 and n_candidates-1).

Let me know if I'm missing your concern here.

skinniderlab / CLM

Potential bug in write_structural_prior #138