Detecting and handling molecule perception differences between toolkits

j-wags commented 5 years ago

The code

toolkit_wrapper = RDKitToolkitWrapper()
rdmol = molecule.to_rdkit()
molecule_copy = Molecule(rdmol)
mol_smi = molecule.to_smiles(toolkit_registry=toolkit_wrapper)
mol_copy_smi = molecule_copy.to_smiles(toolkit_registry=toolkit_wrapper)
assert mol_smi == mol_copy_smi:

passes for all molecules in our test set. However, if toolkit_wrapper=OpenEyeToolkitWrapper(), it fails for 91 / 365 molecules, showing small differences in SMILES.

A few examples are shown below, see test_molecule.create_from_rdkit for details:

DrugBank_2800
[H]c1c(c(c2c(c1[H])N=C(S2)SC([H])([H])C([H])([H])C([H])([H])S(=O)(=O)O[H])[H])[H]
[H]c1c(c(c2c(c1[H])nc(s2)SC([H])([H])C([H])([H])C([H])([H])S(=O)(=O)O[H])[H])[H]
DrugBank_104
[H]c1c(c(c(c(c1[H])[H])C([H])([H])[C@@]([H])(/C(=N/[C@]([H])(/C(=N/[C@]([H])([C@]([H])([C@]([H])(C2(C(C2([H])[H])([H])[H])[H])O[H])O[H])C([H])([H])C3(C(C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[H])/O[H])C([H])([H])C4=C(N=C(N4[H])[H])[H])/O[H])C([H])([H])S(=O)(=O)C(C([H])([H])[H])(C([H])([H])[H])C([H])([H])[H])[H])[H]
[H]c1c(c(c(c(c1[H])[H])C([H])([H])[C@@]([H])(/C(=N/[C@]([H])(/C(=N/[C@]([H])([C@]([H])([C@]([H])(C2(C(C2([H])[H])([H])[H])[H])O[H])O[H])C([H])([H])C3(C(C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[H])/O[H])C([H])([H])c4c(nc(n4[H])[H])[H])/O[H])C([H])([H])S(=O)(=O)C(C([H])([H])[H])(C([H])([H])[H])C([H])([H])[H])[H])[H]
DrugBank_2982
[H]c1c(c(c(c(c1C([H])([H])[C@@]([H])(C(=O)N([H])S(=O)(=O)OC([H])([H])[C@]2([C@]([C@@]([C@](O2)([H])N3c4c(c(nc(n4)[H])N([H])[H])N=C3[H])([H])O[H])([H])O[H])[H])N([H])[H])[H])[H])O[H])[H]
[H]c1c(c(c(c(c1C([H])([H])[C@@]([H])(C(=O)N([H])S(=O)(=O)OC([H])([H])[C@]2([C@]([C@@]([C@](O2)([H])n3c(nc4c3nc(nc4N([H])[H])[H])[H])([H])O[H])([H])O[H])[H])N([H])[H])[H])[H])O[H])[H]
DrugBank_246
[H]C1=C(N=C(S1)[N+]([H])([H])[H])/C(=N/OC(C(=O)O[H])(C([H])([H])[H])C([H])([H])[H])/C(=O)N([H])[C@@]2(C(=O)N([C@@]2([H])C([H])([H])[H])S(=O)(=O)[O-])[H]
[H]c1c(nc(s1)[N+]([H])([H])[H])/C(=N/OC(C(=O)O[H])(C([H])([H])[H])C([H])([H])[H])/C(=O)N([H])[C@@]2(C(=O)N([C@@]2([H])C([H])([H])[H])S(=O)(=O)[O-])[H]
DrugBank_5555
[H]c1c(c(c(c2c1N(C(=C2[H])C(=O)N([H])[C@@]3([C@](C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])([H])N([H])C(=O)C4=NC5=C(S4)C([N@@](C(C5([H])[H])([H])[H])C([H])([H])[H])([H])[H])[H])[H])[H])Cl)[H]
[H]c1c(c(c(c2c1n(c(c2[H])C(=O)N([H])[C@@]3([C@](C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])([H])N([H])C(=O)c4nc5c(s4)C([N@@](C(C5([H])[H])([H])[H])C([H])([H])[H])([H])[H])[H])[H])[H])Cl)[H]
DrugBank_3011
[H]c1nc2c(c(n1)N([H])[H])N=C(N2[C@]3([C@]([C@@]([C@](O3)([H])C([H])([H])OS(=O)(=O)N([H])C(=O)[C@@]([H])([C@@]([H])(C([H])([H])[H])O[H])N([H])[H])([H])O[H])([H])O[H])[H])[H]
[H]c1nc2c(c(n1)N([H])[H])nc(n2[C@]3([C@]([C@@]([C@](O3)([H])C([H])([H])OS(=O)(=O)N([H])C(=O)[C@@]([H])([C@@]([H])(C([H])([H])[H])O[H])N([H])[H])([H])O[H])([H])O[H])[H])[H]

See test_molecule.test_create_rdkit

jchodera commented 5 years ago

We can probably solve this problem by adding a convenience method to the ToolkitWrapper called normalize_smiles() that takes SMILES, creates an intermediate toolkit molecule, and computes the SMILES again. We can then call this on the SMILES before checking equality.

hjuinj commented 5 years ago

This looks like an aromaticity perception difference for heteroaromatics. Could this be due to Chem.SanitizeMol(rdmol) in the to_rdkit function? The default sanitization tinkers with aromaticity (doc string: "Kekulize, check valencies, set aromaticity, conjugation and hybridization").

davidlmobley commented 5 years ago

@j-wags did you see this? I indeed expect sanitization might be an issue (it had seemed overambitious in the past); @bannanc may have more thoughts.

bannanc commented 5 years ago

Is comparing the SMILES output really the best option? I think we know from @ChayaSt's work that getting a reproducible canonical SMILES from both RDKit and OpenEye might be difficult. Could we look into making a function that checks for the same molecule based on connectivity/aromaticity/etc? So something like openforcefield_mol.is_equal(test_openforcefield_mol)?

bannanc commented 5 years ago

Although, you're not going to get agreement unless you're using the same aromaticity model and it does look like that is the problem in the examples you show here.

j-wags commented 5 years ago

Is comparing the SMILES output really the best option?

No... It's not, but it's what I put in when time was short. Now I have some time to improve it.

Could we look into making a function that checks for the same molecule based on connectivity/aromaticity/etc?

Yes -- I think that's the way to go. I implemented a networkx molecule matching function for making openMM systems from PDB. I'll go ahead and implement the same thing for molecule equality (atoms match with element+aromaticity, bonds match with order+aromaticity)

This looks like an aromaticity perception difference for heteroaromatics. Could this be due to Chem.SanitizeMol(rdmol) in the to_rdkit function? The default sanitization tinkers with aromaticity (doc string: "Kekulize, check valencies, set aromaticity, conjugation and hybridization").

This is a separate and important issue. I'll check out OFFMol --> RDMol --> OFFMol roundtrips too.

bannanc commented 5 years ago

Yes -- I think that's the way to go. I implemented a networkx molecule matching function for making openMM systems from PDB. I'll go ahead and implement the same thing for molecule equality (atoms match with element+aromaticity, bonds match with order+aromaticity)

As long as you have ways to compare atoms and bonds networkx might have a way to see if the graphs overlay. This is something I've been meaning to look into, but haven't quite gotten a chance.

j-wags commented 5 years ago

https://github.com/openforcefield/openforcefield/blob/topology/openforcefield/topology/topology.py#L1300-L1403

:-)

j-wags commented 5 years ago

Ok, so molecule equality is now evaluated using networkx (https://github.com/openforcefield/openforcefield/pull/86/commits/247fb61a5fe0c3895c22ba2bf522959bea1c592b)

Still have the issues with RDKit roundtrips. I'll pick back up on that tomorrow.

j-wags commented 5 years ago

An update here --

1) My original code to get/set stereochemistry from RDMols was awful, and I've fixed it. Some deeper problems with stereochemistry remain (eg. OpenEye sees stereochemistry around trivalent nitrogens, RDKit doesn't, we'll need a manual fix)

2) I've added a notebook in my most recent PR that loads MiniDrugBank from mol2 using OpenEye, and then from sdf using RDKit, and tests for molecule equality. 293/346 SMILES comparisons succeed, which is pretty good, but could be better.

A big writeup including further toolkit perception comparisons and results can be found in examples/rdkit_openeye_comparison/compare_rdk_oe_mol_loading.ipynb in the topology branch. This would be a good starting point for anyone who wants to improve the ToolkitWrappers.

jchodera commented 5 years ago

@j-wags : This is great! What if we added a "Known issues" section to the Release History entry for each release, and noted things like this there?

j-wags commented 5 years ago

Note: The linked Issue contains a table showing (I think) the InChI-recognized stereogenic groups, which RDKit uses: https://github.com/rdkit/rdkit/issues/2268

ChayaSt commented 4 years ago

@j-wags, see this issue on cmiles https://github.com/openforcefield/cmiles/issues/36 For that molecule, a pyramidal nitrogen flipped from planar to pyramidal during a geometry optimization.

mattwthompson commented 4 years ago

A few questions to catch me up on this

The flakiness of round-tripping SMILES seems to be an unavoidable issue, correct? As in, planning for a future in which all toolkits have identical to_smiles/from_smiles behavior is the wrong approach
Is the problem here that round-tripping a SMILES with a toolkit is unreliable, that Molecule. to_smiles has different behavior, or all of the above (or all of the above and more)?
It was intended for a slightly different purpose, but some of the validation functions built into QCSubmit could at least catch some of these issues, which could enable handling them more gracefully. I don't have an opinion on what should be done after isolating those cases.

Building off of some of the discussion in slack, it may be useful to define what sort of differences we care about (and what we don't). Defining molecular properties (i.e. stereochemistry) in a way that ensures consistent behavior across domains (MD, electronic structure, cheminformatics) and toolkits seem to be hard, if not impossible. Is there a way to frame this generally, or is this inevitably going to be whack-a-mole?

j-wags commented 4 years ago

@jchodera @mattwthompson @davidlmobley @jthorton I have some proposals on how to move forward here:

In the short-term, we could default to not consider nitrogen stereochemistry in Molecule.are_isomorphic by adding a new kwarg ignore_tertiary_nitrogen_stereochemistry=True, and manually stripping the stereochemistry info from those. (could use the SMARTS [NX3:1](-[*])(-[*])(-[*])
I really don't want to touch how to/from_openeye/rdkit sets stereochemistry flags, becuase then we'll start failing all sorts of roundtrips
We experiment with a new Atom/Bond.properties['stereochemistry_is_short_lived'] property that indicates whether we SHOULD consider specific stereochemistry to be significant.
- This flag could be set by the OFF toolkit, and we could base it off a table of SMARTS corresponding to the cases in #146
- We'd treat this property like we currently treat atom maps -- It's not a "spec" attribute of Atoms/Bonds, but we store in in the properties dict as a sort of "trial" attribute.
We could implement a global openforcefield.paranoid_mode flag that can default to False, and slowly roll it out to parts of the toolkit starting with Molecule.are_isomorphic. This could replace the allow_undefined_stereo flags in some places and could allow, for example, generate_conformers to silently expand unknown stereochemistry.
I'm coming to the realization that bond order is NOT relevant for a graph isomorphism. This is because aromatic molecules could have different kekule structures but actually still be the "same" chemical species. So maybe we should make two changes:
- Make Molecule.are_isomorphic ignore bond orders
- Start enforcing aromaticity perception at defined points in the Molecule/Topology lifecycle, so we know when it's safe to compare molecules based on a specific aromaticity model.

j-wags commented 4 years ago

@mattwthompson

The flakiness of round-tripping SMILES seems to be an unavoidable issue, correct? As in, planning for a future in which all toolkits have identical to_smiles/from_smiles behavior is the wrong approach

Correct.

Many distinct "graph molecules" could be represented by the same SMILES, and vice versa.

One SMILES could represent different OFFMols because of:

differences in atom ordering
differences in aromaticity perception

One OFFMol could be represented by different SMILES because of:

differences in whether to include H's explicitly
differences in whether to encode aromaticity using letter cases versus explicit bond orders * (a benzene ring could be written as C1-C=C-C=C-C=1 or c1ccccc1 and both are valid)
differences in "which stereochemistry is actually important"
differences in atom ordering (as in: order of atoms from left to right in the SMILES)

We try to avoid many of these by using "isomeric canonical explicit hydrogen SMILES" and forcing the use of a defined aromaticity model, but this still suffers from arbitrary atom orderings. Each cheminformatics toolkit tries to select one atom ordering to be the "canonical" SMILES for a given molecule, but:

Different cheminformatics toolkits pick different atom orderings from each other
A cheminformatics toolkit's canonicalization scheme can change between versions.

davidlmobley commented 4 years ago

@j-wags I think I agree with most of your proposals, and your general line of thought.

I am less confident about these specific issues:

We could implement a global openforcefield.paranoid_mode flag that can default to False, and slowly roll it out to parts of the toolkit starting with Molecule.are_isomorphic. This could replace the allow_undefined_stereo flags in some places and could allow, for example, generate_conformers to silently expand unknown stereochemistry. I'm coming to the realization that bond order is NOT relevant for a graph isomorphism. This is because aromatic molecules could have different kekule structures but actually still be the "same" chemical species. So maybe we should make two changes:

Make Molecule.are_isomorphic ignore bond orders

Start enforcing aromaticity perception at defined points in the Molecule/Topology lifecycle, so we know when it's safe to compare molecules based on a specific aromaticity model.

I'm not totally clear as to the effects/implications of these. The first sounds the most reasonable; the second I have more questions about because I get nervous about ignoring bond orders -- though in principle I agree with your point that it's not relevant for graph isomorphism, as long as the "total" bond order (OK, I invented that -- but what I mean by it is the total number of electrons present in bonds) remains the same.

Also, we are eventually going to want to move away from aromaticity models (WBO interpolation support is eventually going to make this unnecessary and we'll have far fewer parameters which explicitly involve bond order) -- which both (I think) reinforces your point and undermines it. Particularly, since we'll want to move away from aromaticity models, you won't want to start enforcing aromaticity model perception in the code. But... yes, we shouldn't be relying on kekulization and isomorphism may not need to rely on bond orders.

So... I hope that helps. Happy to discuss directly if that'd be more helpful.

openforcefield / openff-toolkit

Detecting and handling molecule perception differences between toolkits #156