openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
313 stars 92 forks source link

Difference in InChIKey values, from SMILES, when comparing to RDKit #1133

Open mjw99 opened 2 years ago

mjw99 commented 2 years ago

If I run the following with RDKit (2021.09.2):

from rdkit import Chem    

m1 = Chem.MolFromSmiles("N1C[C@@]2(CC[C@H](C)CC2)CC1")  
print(Chem.MolToInchiKey(m1))    

m2 = Chem.MolFromSmiles("N1C[C@]2(CC[C@H](C)CC2)CC1")  
print(Chem.MolToInchiKey(m2)) 

I obtain:

SGJOFPWLAFWQEP-AOOOYVTPSA-N
SGJOFPWLAFWQEP-MGCOHNPYSA-N

If I do something similar with OpenFF's to_inchikey() method (openff-toolkit 0.10.1)

from openff.toolkit.topology import Molecule    

m1 = Molecule.from_smiles("N1C[C@@]2(CC[C@H](C)CC2)CC1")  
print(m1.to_inchikey())      

m2 = Molecule.from_smiles("N1C[C@]2(CC[C@H](C)CC2)CC1")   
print(m2.to_inchikey()) 

I obtain:

SGJOFPWLAFWQEP-UHFFFAOYSA-N
SGJOFPWLAFWQEP-UHFFFAOYSA-N

Any ideas why? I am missing something?

Many thanks,

Mark

j-wags commented 2 years ago

Thanks for the report, @mjw99, and sorry for the delay in responding!

Summary

This is a tough one. The OpenFF toolkit shows the same behavior with both RDKit and OpenEye backends. I think the fundamental problem is that 1,4-disubstituted cyclohexane is impossible to represent using CIP stereochemistry definitions (R/S). And right now the OpenFF molecule class exclusively uses CIP stereochemistry (#139). If this is correct, the fix would unfortunately need to be a pretty fundamental change to our object model.

I'm frankly surprised that the failure case for this stereochemistry model is so simple. I always knew that there were tricky issues with stereo, but I'd expected to have to go bigger than 8 heavy atoms to exercise it. Apologies, I can't really offer a solution or workaround to your issue in the short term, other than recommending use of RDKit to generate the inchis from smiles.

Details

Here's a minimum reproducing example of the issue (this rdkit code is run inside Molecule.from_smiles): image

from rdkit import Chem
rdmol = Chem.MolFromSmiles("[C@@H]1(Cl)CC[C@H](F)CC1")  
rdmol = Chem.AddHs(rdmol, addCoords=True)
Chem.SanitizeMol(
    rdmol,
    (
        Chem.SANITIZE_ALL
        ^ Chem.SANITIZE_SETAROMATICITY
        ^ Chem.SANITIZE_ADJUSTHS
        ^ Chem.SANITIZE_CLEANUPCHIRALITY
        ^ Chem.SANITIZE_KEKULIZE
    ),
)
Chem.Kekulize(rdmol)
Chem.AssignStereochemistry(rdmol, cleanIt=False)
for rda in rdmol.GetAtoms(): 
    #print(rda.GetPropsAsDict()) # Uncomment to get more info
    print(rda.GetChiralTag())
    if rda.HasProp("_CIPCode"):
        print(rda.GetProp("_CIPCode"))

print(Chem.MolToInchiKey(rdmol))

CHI_TETRAHEDRAL_CCW CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_TETRAHEDRAL_CCW CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED ONCMGYJUZHIUAH-OLQVQODUSA-N

The issue is that we rely on "actually important"[1] CHI_TETRAHEDRAL atoms to get populated with CIP codes in the code above. But 1,4-disubstituted cyclohexane doesn't get CIP codes assigned (because I think the CIP model simply doesn't handle it). This would be handled gracefully if the stereochemistry in our object model were defined as CW/CCW at the stereo atoms (directly analagous to @ signs in SMILES), or cis/trans for the relative configurations of its stereocenters.

I'm pretty sure the CIP stereo conversion is the root cause here. If I just shift a substituent so it becomes 1,3-disubsituted cyclohexane, then things are fine again:

image

from rdkit import Chem
rdmol = Chem.MolFromSmiles("[C@@H]1(Cl)C[C@H](F)CCC1")  
rdmol = Chem.AddHs(rdmol, addCoords=True)
Chem.SanitizeMol(
    rdmol,
    (
        Chem.SANITIZE_ALL
        ^ Chem.SANITIZE_SETAROMATICITY
        ^ Chem.SANITIZE_ADJUSTHS
        ^ Chem.SANITIZE_CLEANUPCHIRALITY
        ^ Chem.SANITIZE_KEKULIZE
    ),
)
Chem.Kekulize(rdmol)
Chem.AssignStereochemistry(rdmol, cleanIt=False)
for rda in rdmol.GetAtoms(): 
    #print(rda.GetPropsAsDict()) # Uncomment to get more info
    print(rda.GetChiralTag())
    if rda.HasProp("_CIPCode"):
        print(rda.GetProp("_CIPCode"))

print(Chem.MolToInchiKey(rdmol))

CHI_TETRAHEDRAL_CCW S CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_TETRAHEDRAL_CCW R CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED CHI_UNSPECIFIED GANUWPARZHDFHA-NTSWFWBYSA-N

We see that for each chiral center, a CIP code is also printed.

So unfortunately, I think the real problem here is that we made a bad choice with our object model. This is a real bug, but I can't think of a way to fix it short of a fairly extensive refactor.

[1] You might say "well, why not just require all chiral atoms to be S or R?", but "actually important stereochemistry" is a key term here. We've encountered cases where inputs have overdefined stereo, where nonchiral atoms have stereo assigned for some reason (#146), and misclassifying those can also break downstream machinery. So we let RDKit "take the wheel" to some extent and discard stereo that it doesn't like.

mjw99 commented 2 years ago

Dear Jeff, I thank you for your extensive analysis here and looking into this; I really appreciate the details.

I will attempt the RDKit workaround for the moment; initially I had been using the InChIKey as a unique director name identifier in a workflow.

j-wags commented 2 years ago

So, actually I'm wrong. Some searching turned up that the CIP rules can handle this using a concept called "pseudoasymmetry", it's just that the underlying logic is manifestation of my nightmares. But long story short, CIP stereochem may be able to handle this molecule, it's just that the chiral centers should get lowercase letters instead of upper case (r/s instead of R/S), and maybe not all of them need to be defined. I'll look into whether RDKit/OpenEye can provide these.

j-wags commented 2 years ago

I've started a discussion on the RDKit board here: https://github.com/rdkit/rdkit/discussions/4731

j-wags commented 2 years ago

Per the discussion above, it looks like a new feature in RDKit CAN handle this.

I'll start poking around in OpenEye as well. One hint is the yellow warning in the Omega docs that ring stereo may not be respected.