skinniderlab / CLM

compare by inchikey, not canonical smiles #131

Open skinnider opened 2 months ago

skinnider commented 2 months ago

Currently, canonical SMILES are being used to ...

(Did I miss any places?)

Occasionally, two or more canonical SMILES can represent the same compound, due to a phenomenon called tautomerism. This can happen either in the training set (i.e. we might have two different canonical SMILES representing the same InChIKey in the training set) or in the model output (i.e. the model might generate multiple different SMILES that all map to the same InChIKey). Therefore, the InChIKey is a more robust key for all of the above operations.
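
As a minimal illustration (not code from this repo): the 2-hydroxypyridine / 2-pyridone tautomer pair round-trips to two different RDKit canonical SMILES, but because standard InChI treats this as mobile-H tautomerism, both should collapse to a single InChIKey:

    from rdkit import Chem

    # The same compound written as two tautomers.
    for smi in ("Oc1ccccn1", "O=C1C=CC=CN1"):
        mol = Chem.MolFromSmiles(smi)
        # Different canonical SMILES, identical InChIKey.
        print(Chem.MolToSmiles(mol, isomericSmiles=False),
              Chem.inchi.MolToInchiKey(mol))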

On the other hand, the InChIKey is not an invertible representation of a molecule (i.e. we can't parse an InChIKey back into a molecule in RDKit the way we can with SMILES), so we do need to keep track of one (representative) canonical SMILES per InChIKey in all of the above steps.
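
A minimal sketch of that bookkeeping (the function name dedupe_by_inchikey is hypothetical, not part of the codebase):

    from rdkit import Chem

    def dedupe_by_inchikey(smiles_list):
        # Map each InChIKey to the first canonical SMILES seen for it,
        # so every deduplicated entry can still be parsed back into a Mol.
        representative = {}
        for smi in smiles_list:
            mol = Chem.MolFromSmiles(smi)
            if mol is None:
                continue  # skip unparseable SMILES
            key = Chem.inchi.MolToInchiKey(mol)
            representative.setdefault(key, Chem.MolToSmiles(mol, isomericSmiles=False))
        return representative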

@vineetbansal let me know if anything here is unclear.

skinnider commented 2 months ago

An example of what this looks like in inner-canonicalize-molecules.py (sent via Slack; analogous to inner_tabulate_molecules.py):

            # roundtrip to get the canonical SMILES
            canonical_smiles = Chem.MolToSmiles(mol, isomericSmiles=False)
            # also compute the canonical InChIKey
            canonical_inchikey = Chem.inchi.MolToInchiKey(mol)

            # last: number of tokens
            tokens = len(vocab.tokenize(smiles))

            # optionally, skip molecules whose InChIKey already appears in the training set
            if canonical_inchikey not in train_inchikey:
                # append to file
                row = "\t".join([loss, smiles, str(tokens),
                                 canonical_smiles, canonical_inchikey,
                                 str(mass), formula])
                _ = f2.write(row + '\n')
                f2.flush()

vineetbansal commented 2 months ago

Thanks @skinnider - the detailed description of the changes made this a lot easier. See PR #133, which addresses all of these issues. I implemented these changes after stepping through the individual steps of the workflow in a debugger. Once I/Anushka modify the test_data so that the tests pass again, we should have more confidence in these changes.

vineetbansal commented 2 months ago

@anushka255 - one way to proceed with the tests here would be to have the code use our clm.utils.inchikey(mol) function instead of calling rdkit directly. The initial implementation can just return the canonical SMILES, which should give you results identical to before (with additional columns at every step). A subsequent commit can then switch it to the correct function, and we take the new files as the baseline.

Just a thought.
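
A rough sketch of what that staged shim could look like (illustrative only; the actual clm.utils.inchikey implementation may differ):

    from rdkit import Chem

    def inchikey(mol):
        # Stage 1: return the canonical SMILES so outputs match the old baseline.
        return Chem.MolToSmiles(mol, isomericSmiles=False)
        # Stage 2 (a later commit): switch to the real InChIKey and re-baseline the test data.
        # return Chem.inchi.MolToInchiKey(mol)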

skinnider commented 2 months ago

Not sure if this is best practice from a software perspective, but would some assert statements be useful? e.g., if any of the following conditions are met, something's gone really wrong, and we might as well stop execution:

Or, if not an assert statement, maybe it's worth checking if any of these conditions are met in some other way?

vineetbansal commented 2 months ago

Thanks @skinnider - these asserts are a great idea.

I wouldn't recommend putting them in the actual code because they would take too much time/memory (and some of them aren't checkable in a distributed setting anyway, e.g. whether a single InChIKey appears in more than one fold), but we can put these checks into the unit tests we already have for the relevant steps, and that will provide an early warning for these kinds of errors.
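
For example, a unit-test-level check for the fold condition mentioned above might look something like this (the column names "inchikey" and "fold" are assumptions, not the actual schema):

    import pandas as pd

    def assert_inchikeys_unique_across_folds(df: pd.DataFrame) -> None:
        # Each InChIKey should be assigned to exactly one cross-validation fold.
        folds_per_key = df.groupby("inchikey")["fold"].nunique()
        leaked = folds_per_key[folds_per_key > 1]
        assert leaked.empty, f"InChIKeys in more than one fold: {list(leaked.index)}"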

vineetbansal commented 2 months ago

After talking to Anushka today, this one seems tricky to verify, since the function does not return the intermediate DataFrame (which holds the frequency across folds), only the aggregation across its columns (folds).

A minor change (introducing an intermediate function) should make this possible - adding this comment here so we remember to do it.
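
A sketch of that refactor (function and column names here are assumptions, not the actual clm API): expose the per-fold frequency table from its own function so tests can inspect it before it is aggregated.

    import pandas as pd

    def frequency_by_fold(df: pd.DataFrame) -> pd.DataFrame:
        # One row per InChIKey, one column per fold, values = sampling frequency.
        return df.pivot_table(index="inchikey", columns="fold",
                              values="frequency", aggfunc="sum", fill_value=0)

    def aggregate_frequencies(df: pd.DataFrame) -> pd.Series:
        # Existing behaviour, expressed in terms of the intermediate table:
        # sum the frequencies across folds (columns).
        return frequency_by_fold(df).sum(axis=1)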