openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
311 stars 91 forks source link

Detecting and handling molecule perception differences between toolkits #156

Open j-wags opened 5 years ago

j-wags commented 5 years ago

The code

toolkit_wrapper = RDKitToolkitWrapper()
rdmol = molecule.to_rdkit()
molecule_copy = Molecule(rdmol)
mol_smi = molecule.to_smiles(toolkit_registry=toolkit_wrapper)
mol_copy_smi = molecule_copy.to_smiles(toolkit_registry=toolkit_wrapper)
assert mol_smi == mol_copy_smi:

passes for all molecules in our test set. However, if toolkit_wrapper=OpenEyeToolkitWrapper(), it fails for 91 / 365 molecules, showing small differences in SMILES.

A few examples are shown below, see test_molecule.create_from_rdkit for details:

DrugBank_2800
[H]c1c(c(c2c(c1[H])N=C(S2)SC([H])([H])C([H])([H])C([H])([H])S(=O)(=O)O[H])[H])[H]
[H]c1c(c(c2c(c1[H])nc(s2)SC([H])([H])C([H])([H])C([H])([H])S(=O)(=O)O[H])[H])[H]
DrugBank_104
[H]c1c(c(c(c(c1[H])[H])C([H])([H])[C@@]([H])(/C(=N/[C@]([H])(/C(=N/[C@]([H])([C@]([H])([C@]([H])(C2(C(C2([H])[H])([H])[H])[H])O[H])O[H])C([H])([H])C3(C(C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[H])/O[H])C([H])([H])C4=C(N=C(N4[H])[H])[H])/O[H])C([H])([H])S(=O)(=O)C(C([H])([H])[H])(C([H])([H])[H])C([H])([H])[H])[H])[H]
[H]c1c(c(c(c(c1[H])[H])C([H])([H])[C@@]([H])(/C(=N/[C@]([H])(/C(=N/[C@]([H])([C@]([H])([C@]([H])(C2(C(C2([H])[H])([H])[H])[H])O[H])O[H])C([H])([H])C3(C(C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[H])/O[H])C([H])([H])c4c(nc(n4[H])[H])[H])/O[H])C([H])([H])S(=O)(=O)C(C([H])([H])[H])(C([H])([H])[H])C([H])([H])[H])[H])[H]
DrugBank_2982
[H]c1c(c(c(c(c1C([H])([H])[C@@]([H])(C(=O)N([H])S(=O)(=O)OC([H])([H])[C@]2([C@]([C@@]([C@](O2)([H])N3c4c(c(nc(n4)[H])N([H])[H])N=C3[H])([H])O[H])([H])O[H])[H])N([H])[H])[H])[H])O[H])[H]
[H]c1c(c(c(c(c1C([H])([H])[C@@]([H])(C(=O)N([H])S(=O)(=O)OC([H])([H])[C@]2([C@]([C@@]([C@](O2)([H])n3c(nc4c3nc(nc4N([H])[H])[H])[H])([H])O[H])([H])O[H])[H])N([H])[H])[H])[H])O[H])[H]
DrugBank_246
[H]C1=C(N=C(S1)[N+]([H])([H])[H])/C(=N/OC(C(=O)O[H])(C([H])([H])[H])C([H])([H])[H])/C(=O)N([H])[C@@]2(C(=O)N([C@@]2([H])C([H])([H])[H])S(=O)(=O)[O-])[H]
[H]c1c(nc(s1)[N+]([H])([H])[H])/C(=N/OC(C(=O)O[H])(C([H])([H])[H])C([H])([H])[H])/C(=O)N([H])[C@@]2(C(=O)N([C@@]2([H])C([H])([H])[H])S(=O)(=O)[O-])[H]
DrugBank_5555
[H]c1c(c(c(c2c1N(C(=C2[H])C(=O)N([H])[C@@]3([C@](C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])([H])N([H])C(=O)C4=NC5=C(S4)C([N@@](C(C5([H])[H])([H])[H])C([H])([H])[H])([H])[H])[H])[H])[H])Cl)[H]
[H]c1c(c(c(c2c1n(c(c2[H])C(=O)N([H])[C@@]3([C@](C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])([H])N([H])C(=O)c4nc5c(s4)C([N@@](C(C5([H])[H])([H])[H])C([H])([H])[H])([H])[H])[H])[H])[H])Cl)[H]
DrugBank_3011
[H]c1nc2c(c(n1)N([H])[H])N=C(N2[C@]3([C@]([C@@]([C@](O3)([H])C([H])([H])OS(=O)(=O)N([H])C(=O)[C@@]([H])([C@@]([H])(C([H])([H])[H])O[H])N([H])[H])([H])O[H])([H])O[H])[H])[H]
[H]c1nc2c(c(n1)N([H])[H])nc(n2[C@]3([C@]([C@@]([C@](O3)([H])C([H])([H])OS(=O)(=O)N([H])C(=O)[C@@]([H])([C@@]([H])(C([H])([H])[H])O[H])N([H])[H])([H])O[H])([H])O[H])[H])[H]

See test_molecule.test_create_rdkit

jchodera commented 5 years ago

We can probably solve this problem by adding a convenience method to the ToolkitWrapper called normalize_smiles() that takes SMILES, creates an intermediate toolkit molecule, and computes the SMILES again. We can then call this on the SMILES before checking equality.

hjuinj commented 5 years ago

This looks like an aromaticity perception difference for heteroaromatics. Could this be due to Chem.SanitizeMol(rdmol) in the to_rdkit function? The default sanitization tinkers with aromaticity (doc string: "Kekulize, check valencies, set aromaticity, conjugation and hybridization").

davidlmobley commented 5 years ago

@j-wags did you see this? I indeed expect sanitization might be an issue (it had seemed overambitious in the past); @bannanc may have more thoughts.

bannanc commented 5 years ago

Is comparing the SMILES output really the best option? I think we know from @ChayaSt's work that getting a reproducible canonical SMILES from both RDKit and OpenEye might be difficult. Could we look into making a function that checks for the same molecule based on connectivity/aromaticity/etc? So something like openforcefield_mol.is_equal(test_openforcefield_mol)?

bannanc commented 5 years ago

Although, you're not going to get agreement unless you're using the same aromaticity model and it does look like that is the problem in the examples you show here.

j-wags commented 5 years ago

Is comparing the SMILES output really the best option?

No... It's not, but it's what I put in when time was short. Now I have some time to improve it.

Could we look into making a function that checks for the same molecule based on connectivity/aromaticity/etc?

Yes -- I think that's the way to go. I implemented a networkx molecule matching function for making openMM systems from PDB. I'll go ahead and implement the same thing for molecule equality (atoms match with element+aromaticity, bonds match with order+aromaticity)

This looks like an aromaticity perception difference for heteroaromatics. Could this be due to Chem.SanitizeMol(rdmol) in the to_rdkit function? The default sanitization tinkers with aromaticity (doc string: "Kekulize, check valencies, set aromaticity, conjugation and hybridization").

This is a separate and important issue. I'll check out OFFMol --> RDMol --> OFFMol roundtrips too.

bannanc commented 5 years ago

Yes -- I think that's the way to go. I implemented a networkx molecule matching function for making openMM systems from PDB. I'll go ahead and implement the same thing for molecule equality (atoms match with element+aromaticity, bonds match with order+aromaticity)

As long as you have ways to compare atoms and bonds networkx might have a way to see if the graphs overlay. This is something I've been meaning to look into, but haven't quite gotten a chance.

j-wags commented 5 years ago

https://github.com/openforcefield/openforcefield/blob/topology/openforcefield/topology/topology.py#L1300-L1403

:-)

j-wags commented 5 years ago

Ok, so molecule equality is now evaluated using networkx (https://github.com/openforcefield/openforcefield/pull/86/commits/247fb61a5fe0c3895c22ba2bf522959bea1c592b)

Still have the issues with RDKit roundtrips. I'll pick back up on that tomorrow.

j-wags commented 5 years ago

An update here --

1) My original code to get/set stereochemistry from RDMols was awful, and I've fixed it. Some deeper problems with stereochemistry remain (eg. OpenEye sees stereochemistry around trivalent nitrogens, RDKit doesn't, we'll need a manual fix)

2) I've added a notebook in my most recent PR that loads MiniDrugBank from mol2 using OpenEye, and then from sdf using RDKit, and tests for molecule equality. 293/346 SMILES comparisons succeed, which is pretty good, but could be better.

A big writeup including further toolkit perception comparisons and results can be found in examples/rdkit_openeye_comparison/compare_rdk_oe_mol_loading.ipynb in the topology branch. This would be a good starting point for anyone who wants to improve the ToolkitWrappers.

jchodera commented 5 years ago

@j-wags : This is great! What if we added a "Known issues" section to the Release History entry for each release, and noted things like this there?

j-wags commented 5 years ago

Note: The linked Issue contains a table showing (I think) the InChI-recognized stereogenic groups, which RDKit uses: https://github.com/rdkit/rdkit/issues/2268

ChayaSt commented 4 years ago

@j-wags, see this issue on cmiles https://github.com/openforcefield/cmiles/issues/36 For that molecule, a pyramidal nitrogen flipped from planar to pyramidal during a geometry optimization.

mattwthompson commented 4 years ago

A few questions to catch me up on this

Building off of some of the discussion in slack, it may be useful to define what sort of differences we care about (and what we don't). Defining molecular properties (i.e. stereochemistry) in a way that ensures consistent behavior across domains (MD, electronic structure, cheminformatics) and toolkits seem to be hard, if not impossible. Is there a way to frame this generally, or is this inevitably going to be whack-a-mole?

j-wags commented 4 years ago

@jchodera @mattwthompson @davidlmobley @jthorton I have some proposals on how to move forward here:

j-wags commented 4 years ago

@mattwthompson

The flakiness of round-tripping SMILES seems to be an unavoidable issue, correct? As in, planning for a future in which all toolkits have identical to_smiles/from_smiles behavior is the wrong approach

Correct.

Many distinct "graph molecules" could be represented by the same SMILES, and vice versa.

One SMILES could represent different OFFMols because of:

One OFFMol could be represented by different SMILES because of:

We try to avoid many of these by using "isomeric canonical explicit hydrogen SMILES" and forcing the use of a defined aromaticity model, but this still suffers from arbitrary atom orderings. Each cheminformatics toolkit tries to select one atom ordering to be the "canonical" SMILES for a given molecule, but:

davidlmobley commented 4 years ago

@j-wags I think I agree with most of your proposals, and your general line of thought.

I am less confident about these specific issues:

We could implement a global openforcefield.paranoid_mode flag that can default to False, and slowly roll it out to parts of the toolkit starting with Molecule.are_isomorphic. This could replace the allow_undefined_stereo flags in some places and could allow, for example, generate_conformers to silently expand unknown stereochemistry. I'm coming to the realization that bond order is NOT relevant for a graph isomorphism. This is because aromatic molecules could have different kekule structures but actually still be the "same" chemical species. So maybe we should make two changes:

  • Make Molecule.are_isomorphic ignore bond orders
  • Start enforcing aromaticity perception at defined points in the Molecule/Topology lifecycle, so we know when it's safe to compare molecules based on a specific aromaticity model.

I'm not totally clear as to the effects/implications of these. The first sounds the most reasonable; the second I have more questions about because I get nervous about ignoring bond orders -- though in principle I agree with your point that it's not relevant for graph isomorphism, as long as the "total" bond order (OK, I invented that -- but what I mean by it is the total number of electrons present in bonds) remains the same.

Also, we are eventually going to want to move away from aromaticity models (WBO interpolation support is eventually going to make this unnecessary and we'll have far fewer parameters which explicitly involve bond order) -- which both (I think) reinforces your point and undermines it. Particularly, since we'll want to move away from aromaticity models, you won't want to start enforcing aromaticity model perception in the code. But... yes, we shouldn't be relying on kekulization and isomorphism may not need to rely on bond orders.

So... I hope that helps. Happy to discuss directly if that'd be more helpful.