openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
313 stars 92 forks source link

Toolkit-dependent behavior of Molecule.from_qcschema for two QCA records #749

Open maxentile opened 3 years ago

maxentile commented 3 years ago

Describe the bug Molecule.from_qcschema(record, toolkit_registry=tk) returns different molecules depending on tk for two records in QCArchive "OpenFF Gen 2 Opt Set 2 Coverage" dataset.

(The two records reported here appear to be the only ones in the OpenFF 1.2 training set with this behavior.)

To Reproduce

import qcportal
from openforcefield.topology import Molecule
from openforcefield.utils.toolkits import RDKitToolkitWrapper, OpenEyeToolkitWrapper

oe = OpenEyeToolkitWrapper()
rd = RDKitToolkitWrapper()

dataset_type = "OptimizationDataset"
dataset_name = "OpenFF Gen 2 Opt Set 2 Coverage"
problem_records = ["c1c[nh+]c(cc1cl)[s@@](=o)[o-]-0", "c1[c@@h]2[c@@h]([c@h]2o)c[nh2+]1-0"]

client = qcportal.FractalClient()
ds = client.get_collection(dataset_type, dataset_name)

for record in problem_records:
    print(f"Querying QCA entry:\n\t{dataset_type}\n\t{dataset_name}\n\t{record}")

    entry = ds.get_entry(record)

    # load using one toolkit backend or the other
    mol_rd = Molecule.from_qcschema(entry, toolkit_registry=rd)
    mol_oe = Molecule.from_qcschema(entry, toolkit_registry=oe)

    # with RDKit backend, reported not isomorphic
    rd_isomorphic = mol_oe.is_isomorphic_with(mol_rd, toolkit_registry=rd)
    print(f'RDKit is_isomorphic: {rd_isomorphic}')

    # with OpenEye backend, error encountered when converting `mol_rd` to OpenEye
    try:
        mol_oe.is_isomorphic_with(mol_rd, toolkit_registry=oe)
    except Exception as e:
        print(f'OpenEye is_isomorphic: {e}')
    print()

Output

Querying QCA entry:
    OptimizationDataset
    OpenFF Gen 2 Opt Set 2 Coverage
    c1c[nh+]c(cc1cl)[s@@](=o)[o-]-0
RDKit is_isomorphic: False
OpenEye is_isomorphic: Programming error: OpenEye atom stereochemistry assumptions failed.

Querying QCA entry:
    OptimizationDataset
    OpenFF Gen 2 Opt Set 2 Coverage
    c1[c@@h]2[c@@h]([c@h]2o)c[nh2+]1-0
RDKit is_isomorphic: False
OpenEye is_isomorphic: Programming error: OpenEye atom stereochemistry assumptions failed.

A bit more detail in this notebook: https://gist.github.com/maxentile/ae2a19943181236fff471855977e963e

image

Computing environment:

j-wags commented 3 years ago

Wow. These are really interesting. Thanks for the report. @trevorgokey and I are checking this out.

c1c[nh+]c(cc1cl)[s@@](=o)[o-]-0

I don't know what should happen here. Neither cheminformatics toolkit sees the S or its bonds as aromatic. Basically we think the S is pyramidal. But without any other stereocenters on the molecule, it's not important (for charge calculation) stereochemistry (since both stereoisomers would have identical electronic structures).

So, I think the RDKit interpretation gets this right (the S DOES have stereochemistry). If I can reproduce this using only OpenEye code, I'll bring it up on the issue tracker there.

Oddly, since OpenEye MADE the canonical SMILES (which DOES have stereochemistry) I'm not sure how it lost track of the stereochemistry here.

"c1[c@@h]2[c@@h]([c@h]2o)c[nh2+]1-0"

Screen Shot 2020-10-23 at 4 04 19 PM

This is crazy interesting because the existence of the third stereocenter (the C-OH one) depends on the stereochemistry of the first two. That is, the rightmost carbon in the 3-membered ring is a stereocenter if it has 4 unique substituents. The two "arms" are certainly unique, but the two other ring carbons may or may not be unique from each other, depending on their stereochemistry. I... really don't know what the "correct" underlying toolkit behavior should be in this case. This makes my head hurt.

trevorgokey commented 3 years ago

Want to point out some patterns here, which may be good to collect as we work through these as we try to generalize our findings. A general pattern is that RD will print wedge bonds if you put them in, no matter what. OE will not show wedges if it detects that the atom is not chiral, even if the SMILES specifies that it is chiral. Therefore RD assumes the input SMILES is correct, wheras OE does not assume it is correct. I think this is the biggest pattern that I am noticing across the board.

I tried to build these manually, using from_smiles and enumerating all possible stereocenters. I see that,

For the first molecule: Removing all chiral tags, both OE and RD report that S is chiral, and both fail to build if allow_undefined_stereo=False.

For the second: RD only complains about the first two centers; OE agrees. This means RD is showing it as a wedge just because the SMILES says so.

The confusing part to me here is that, given that OE doesn't see the third center as chiral, how it even showed up in the SMILES to begin with. These SMILES in the entries were generated using CMILES with OE 2019.4.2, which is what I used here. So I am not understanding how those chiral tags are getting in there if OE is not detecting them as chiral.

To be more explicit, we see that

mol = Molecule.from_smiles('C1[C@H]2[C@H]([C@H]2O)C[NH2+]1', toolkit_registry=rd)
mol2 = Molecule.from_smiles('C1[C@H]2[C@H]([C@H]2O)C[NH2+]1', toolkit_registry=rd)
mol.is_isomorphic_with(mol2, toolkit_registry=rd)
True

# flipping the center of one of them, even though it doesn't perceive it as chiral at all
mol = Molecule.from_smiles('C1[C@H]2[C@H]([C@@H]2O)C[NH2+]1', toolkit_registry=rd)
mol2 = Molecule.from_smiles('C1[C@H]2[C@H]([C@H]2O)C[NH2+]1', toolkit_registry=rd)
mol.is_isomorphic_with(mol2, toolkit_registry=rd)
False

even though Molecule.from_smiles('C1[C@H]2[C@H]([CH]2O)C[NH2+]1', toolkit_registry=rd, allow_undefined_stereo=False) works with no issue (note the third center here has no chirality defined). So, RDkit assumes we know what we are talking about :) In OE, both return True, since it implicitly drops the @ tag. This means you can get things like

mol = Molecule.from_smiles('c1c[nH+]c(cc1Cl)[S@@](=O)[O-]', toolkit_registry=oe)
mol2 = Molecule.from_smiles('c1c[nH+]c(cc1Cl)[S@](=O)[O-]', toolkit_registry=oe)
mol.is_isomorphic_with(mol2, toolkit_registry=oe)
True
mol.is_isomorphic_with(mol2, toolkit_registry=rd)
True
mol = Molecule.from_smiles('c1c[nH+]c(cc1Cl)[S@@](=O)[O-]', toolkit_registry=rd)
mol2 = Molecule.from_smiles('c1c[nH+]c(cc1Cl)[S@](=O)[O-]', toolkit_registry=rd)
mol.is_isomorphic_with(mol2, toolkit_registry=oe)
Exception: Programming error: OpenEye atom stereochemistry assumptions failed.
mol.is_isomorphic_with(mol2, toolkit_registry=rd)
False

I am not sure what to take from this. On one hand, I see RDkits side of saying the input is 100% authoritative and blindly follows the data. On the other hand, I see how one would take OE's approach and "do the right thing," but I can't understand why OE produced the SMILES it did, since it seems to be arguing with itself here.

I think there are two problems:

  1. OE detects the sulfur as chiral, but it a. doesn't show it as chiral in the visualization, and b. says it is not isomorphic with the RD form. OE appears to be taking resonance into account and arriving to the correct result, but only after building the molecule, because it does detect it as chiral at first. The final result should be that the sulfur is not chiral, but it is really hard to judge that because the input SMILES explicitly states it as being chiral.

  2. The third center (in the second molecule) is actually chiral, and OE's sanitation is dropping it from the input SMILES. Both miss that it is not chiral (since they both work with allow_undefined_stereo=False), but since RD respects the input SMILES at all times, it preserves the info.

Both of these problems seem to stem from the fact that the input SMILES is technically incorrect, and we are witnessing the difference in how the two TKs handle problems.

maxentile commented 3 years ago

@j-wags, @trevorgokey , I don't have much to add to the observations you've described, but I want to say thank you x 100 for looking into these, and for such clear and perceptive write-ups about the behavior and possible underlying patterns. When I encountered these I had no idea where to start looking, and I'll keep in mind the patterns you identified in case we encounter similar issues in the future.

sidneypaulymer commented 3 years ago

My two cents: I think OE is correct here in both cases:

  1. The sulfur is chiral, but it will rapidly interconvert between stereo-isomers; there is no way (well, no easy way) to isolate an optically pure sample, so it is better to represent it without wedges (or, perhaps better would be to represent it as a racemate)
  2. The far-right carbon is not chiral; there is a plane of symmetry in the xz-plane relative to the orientation of the molecule as shown in the images above.
j-wags commented 3 years ago

Thanks for chiming in, @sidneypaulymer.

The sulfur is chiral, but it will rapidly interconvert between stereo-isomers

This is helpful to know. Generally, we care about stereochemistry when it is long-lived enough to be a "stable" feature of a molecule, versus "likely to change over the course of an MD simulation". This determination helps us know whether to consider stereochemistry when we're doing charge assignment, since charge methods like AM1-BCC are geometry-dependent.

It seems like all pyramidal N stereochemistry (even tertiary amines) interconverts rapidly (the slowest seem to be around 1/millisecond), so they're not "defining" stereochemistry and we'll have the OpenFF toolkit start ignoring them soon.

Pyrimidal S is more complex, and we're looking for primary literature on the inversion rates so we can know which rules to enforce. We now know that sulfurs like this one shouldn't be considered chiral, but sulfurs like that in nexium should be considered chiral.

Do you know of any good sources for rules on which pyrimidal S-containing substructures have stable stereochemistry at room temperature (say, stable for > 1 second) versus not?

cc #467 #146

The far-right carbon is not chiral

I'm not so sure about this. My current thinking is that the molecule really has 3 meaningful stereoisomers, which are hard to describe using R/S notation.

Screen Shot 2020-11-02 at 4 18 59 PM

The unique stereoisomers would be 1) H(A) is coming out of the screen, H(B) is going into the screen, and it doesn't matter where O(C) is 2) H(A) and H(B) are coming out of the screen, O(C) is also coming out of the screen 3) H(A) and H(B) are coming out of the screen, O(C) is going into the screen

Does this make sense? I may be misinterpreting something fundamental here.