Open j-wags opened 5 years ago
We can probably solve this problem by adding a convenience method to the ToolkitWrapper
called normalize_smiles()
that takes SMILES, creates an intermediate toolkit molecule, and computes the SMILES again. We can then call this on the SMILES before checking equality.
This looks like an aromaticity perception difference for heteroaromatics. Could this be due to Chem.SanitizeMol(rdmol)
in the to_rdkit function? The default sanitization tinkers with aromaticity (doc string: "Kekulize, check valencies, set aromaticity, conjugation and hybridization").
@j-wags did you see this? I indeed expect sanitization might be an issue (it had seemed overambitious in the past); @bannanc may have more thoughts.
Is comparing the SMILES output really the best option? I think we know from @ChayaSt's work that getting a reproducible canonical SMILES from both RDKit and OpenEye might be difficult. Could we look into making a function that checks for the same molecule based on connectivity/aromaticity/etc? So something like openforcefield_mol.is_equal(test_openforcefield_mol)
?
Although, you're not going to get agreement unless you're using the same aromaticity model and it does look like that is the problem in the examples you show here.
Is comparing the SMILES output really the best option?
No... It's not, but it's what I put in when time was short. Now I have some time to improve it.
Could we look into making a function that checks for the same molecule based on connectivity/aromaticity/etc?
Yes -- I think that's the way to go. I implemented a networkx
molecule matching function for making openMM systems from PDB. I'll go ahead and implement the same thing for molecule equality (atoms match with element+aromaticity, bonds match with order+aromaticity)
This looks like an aromaticity perception difference for heteroaromatics. Could this be due to Chem.SanitizeMol(rdmol) in the to_rdkit function? The default sanitization tinkers with aromaticity (doc string: "Kekulize, check valencies, set aromaticity, conjugation and hybridization").
This is a separate and important issue. I'll check out OFFMol --> RDMol --> OFFMol roundtrips too.
Yes -- I think that's the way to go. I implemented a networkx molecule matching function for making openMM systems from PDB. I'll go ahead and implement the same thing for molecule equality (atoms match with element+aromaticity, bonds match with order+aromaticity)
As long as you have ways to compare atoms and bonds networkx might have a way to see if the graphs overlay. This is something I've been meaning to look into, but haven't quite gotten a chance.
Ok, so molecule equality is now evaluated using networkx (https://github.com/openforcefield/openforcefield/pull/86/commits/247fb61a5fe0c3895c22ba2bf522959bea1c592b)
Still have the issues with RDKit roundtrips. I'll pick back up on that tomorrow.
An update here --
1) My original code to get/set stereochemistry from RDMols was awful, and I've fixed it. Some deeper problems with stereochemistry remain (eg. OpenEye sees stereochemistry around trivalent nitrogens, RDKit doesn't, we'll need a manual fix)
2) I've added a notebook in my most recent PR that loads MiniDrugBank
from mol2 using OpenEye, and then from sdf using RDKit, and tests for molecule equality. 293/346 SMILES comparisons succeed, which is pretty good, but could be better.
A big writeup including further toolkit perception comparisons and results can be found in examples/rdkit_openeye_comparison/compare_rdk_oe_mol_loading.ipynb
in the topology
branch. This would be a good starting point for anyone who wants to improve the ToolkitWrappers.
@j-wags : This is great! What if we added a "Known issues" section to the Release History entry for each release, and noted things like this there?
Note: The linked Issue contains a table showing (I think) the InChI-recognized stereogenic groups, which RDKit uses: https://github.com/rdkit/rdkit/issues/2268
@j-wags, see this issue on cmiles
https://github.com/openforcefield/cmiles/issues/36
For that molecule, a pyramidal nitrogen flipped from planar to pyramidal during a geometry optimization.
A few questions to catch me up on this
to_smiles
/from_smiles
behavior is the wrong approachMolecule. to_smiles
has different behavior, or all of the above (or all of the above and more)?Building off of some of the discussion in slack, it may be useful to define what sort of differences we care about (and what we don't). Defining molecular properties (i.e. stereochemistry) in a way that ensures consistent behavior across domains (MD, electronic structure, cheminformatics) and toolkits seem to be hard, if not impossible. Is there a way to frame this generally, or is this inevitably going to be whack-a-mole?
@jchodera @mattwthompson @davidlmobley @jthorton I have some proposals on how to move forward here:
Molecule.are_isomorphic
by adding a new kwarg ignore_tertiary_nitrogen_stereochemistry=True
, and manually stripping the stereochemistry info from those. (could use the SMARTS [NX3:1](-[*])(-[*])(-[*])
to/from_openeye/rdkit
sets stereochemistry flags, becuase then we'll start failing all sorts of roundtripsAtom
/Bond.properties['stereochemistry_is_short_lived']
property that indicates whether we SHOULD consider specific stereochemistry to be significant.
properties
dict as a sort of "trial" attribute.openforcefield.paranoid_mode
flag that can default to False, and slowly roll it out to parts of the toolkit starting with Molecule.are_isomorphic
. This could replace the allow_undefined_stereo
flags in some places and could allow, for example, generate_conformers
to silently expand unknown stereochemistry.Molecule.are_isomorphic
ignore bond orders@mattwthompson
The flakiness of round-tripping SMILES seems to be an unavoidable issue, correct? As in, planning for a future in which all toolkits have identical to_smiles/from_smiles behavior is the wrong approach
Correct.
Many distinct "graph molecules" could be represented by the same SMILES, and vice versa.
One SMILES could represent different OFFMols because of:
One OFFMol could be represented by different SMILES because of:
C1-C=C-C=C-C=1
or c1ccccc1
and both are valid)We try to avoid many of these by using "isomeric canonical explicit hydrogen SMILES" and forcing the use of a defined aromaticity model, but this still suffers from arbitrary atom orderings. Each cheminformatics toolkit tries to select one atom ordering to be the "canonical" SMILES for a given molecule, but:
@j-wags I think I agree with most of your proposals, and your general line of thought.
I am less confident about these specific issues:
We could implement a global openforcefield.paranoid_mode flag that can default to False, and slowly roll it out to parts of the toolkit starting with Molecule.are_isomorphic. This could replace the allow_undefined_stereo flags in some places and could allow, for example, generate_conformers to silently expand unknown stereochemistry. I'm coming to the realization that bond order is NOT relevant for a graph isomorphism. This is because aromatic molecules could have different kekule structures but actually still be the "same" chemical species. So maybe we should make two changes:
- Make Molecule.are_isomorphic ignore bond orders
- Start enforcing aromaticity perception at defined points in the Molecule/Topology lifecycle, so we know when it's safe to compare molecules based on a specific aromaticity model.
I'm not totally clear as to the effects/implications of these. The first sounds the most reasonable; the second I have more questions about because I get nervous about ignoring bond orders -- though in principle I agree with your point that it's not relevant for graph isomorphism, as long as the "total" bond order (OK, I invented that -- but what I mean by it is the total number of electrons present in bonds) remains the same.
Also, we are eventually going to want to move away from aromaticity models (WBO interpolation support is eventually going to make this unnecessary and we'll have far fewer parameters which explicitly involve bond order) -- which both (I think) reinforces your point and undermines it. Particularly, since we'll want to move away from aromaticity models, you won't want to start enforcing aromaticity model perception in the code. But... yes, we shouldn't be relying on kekulization and isomorphism may not need to rely on bond orders.
So... I hope that helps. Happy to discuss directly if that'd be more helpful.
The code
passes for all molecules in our test set. However, if
toolkit_wrapper=OpenEyeToolkitWrapper()
, it fails for 91 / 365 molecules, showing small differences in SMILES.A few examples are shown below, see
test_molecule.create_from_rdkit
for details:See test_molecule.test_create_rdkit