Open skinnider opened 2 months ago
An example of what this looks like in inner-canonicalize-molecules.py
(sent via slack, analogous to inner_tabulate_molecules.py):
# roundtrip to get canonical smiles
canonical_smiles = Chem.MolToSmiles(mol, isomericSmiles=False)
# also canonical inchikey
canonical_inchikey = Chem.inchi.MolToInchiKey(mol)
# last: number of tokens
tokens = len(vocab.tokenize(smiles))
# optionally, skip
if not canonical_inchikey in train_inchikey:
# append to file
row = "\t".join([loss, smiles, str(tokens),
canonical_smiles, canonical_inchikey,
str(mass), formula])
_ = f2.write(row + '\n')
f2.flush()
Thanks @skinnider - the details on the changes made it a lot easier. See PR #133 that addresses all these issues.
I've implemented these changes after stepping through the individual steps of the workflow in a debugger. Once I/Anushka modify the test_data
so that the tests pass again, we should have more confidence in these changes.
@anushka255 - one way to proceed with the tests here would be to have the code use our clm.utils.inchikey(mol)
function instead of rdkit
directly. The initial implementation can just return canonical smiles, which should give you identical results as before (with additional columns at every step). A subsequent commit can then call the correct function and we take the new files as the baseline.
Just a thought.
Not sure if this is best practice from a software perspective, but would some assert
statements be useful? e.g., if any of the following conditions are met, something's gone really wrong, and we might as well stop execution:
Or, if not an assert
statement, maybe it's worth checking if any of these conditions are met in some other way?
Thanks @skinnider - these assert
s are a great idea.
I won't recommend putting them in the actual code because they will take too much time/memory (some of them aren't checkable in a distributed setting anyway, like if a single inchikey appears in more than one fold), but we can put in these individual checks in the relevant unit tests that we have for the relevant steps, and that will provide an early warning for these kinds of errors.
After talking to Anushka today, this one seems like a tricky one to verify, since the intermediate DataFrame (which has frequency across folds), is not returned by the function, but rather the aggregation across columns (folds).
A minor change (introduction of an intermediate function) should allow this - adding this comment here so we remember to do this.
Currently, canonical SMILES are being used to ...
(Did I miss any places?)
Occasionally, two or more canonical SMILES can represent the same compound, due to a phenomenon called tautomerism. This can happen either in the training set (i.e. we might have two different canonical SMILES representing the same InChIkey in the training set) or in the model output (i.e. the model might generate multiple different SMILES that all represent the same InChIkey). Therefore, the inchikey is a more robust way to perform all of the above operations.
On the other hand, the inchikey is not a unique representation of a molecule (i.e. we can't parse an inchikey into a molecule in rdkit in the same way that we can with SMILES), so we do need to keep track of one (representative) canonical SMILES per inchikey in all of the above steps.
@vineetbansal let me know if anything here is unclear.