openforcefield / cmiles

Generate canonical molecule identifiers for quantum chemistry database
https://cmiles.readthedocs.io
MIT License
23 stars 7 forks source link

JSON input molecule #10

Closed ChayaSt closed 5 years ago

ChayaSt commented 5 years ago

Add the ability to generate identifiers from QCSchema.

The minimum fields required to generate a molecular graph are different for RDKit and OpenEye. For OpenEye, the minimum is either symbols and connectivity or symbols and geometry because OpenEye can perceive bonds and bond types from distance. However, the units need to be in Angstroms. @dgasmith, are the units in QCSchema always Bohr? Is there a way to ensure that? If geometry exists we want to use it to make sure we get the stereochemistry right.

For RDKit, the connectivity table is needed. RDKit cannot perceive bonds from distance.

jchodera commented 5 years ago

Be sure to loop in @j-wags who is working on our Open Forcefield serialized representations of molecules and topologies!

dgasmith commented 5 years ago

QCSchema is always in Bohr without exception. If it is not in Bohr it is an error in the input.

ChayaSt commented 5 years ago

Update on this:

I was not able to get OpenEye to perceive connectivity and stereochemistry from JSON geometry. It does perceive it if it gets read in from a file but we don't want to do that. RDKit does a find job perceiving stereochemistry from the JSON geometry but it also need the connectivity table to generate a molecular graph. So currently cmiles uses RDKit and the connectivity table to create a molecule.

Also, cmiles assumes the QCSchema so all units are assumed to be in Bohr. @j-wags, once the Open Forcefield serialized representation is complete, we should add the ability to generate molecular id's from that representation. We will need a way to keep track of units then.

j-wags commented 5 years ago

Thanks for the ping, @ChayaSt. Right now, all quantities in the SMIRNOFF Molecule class are unit-wrapped numpy arrays (simtk.unit.Quantity objects). This should make unit conversion straightforward when we get there.

I was not able to get OpenEye to perceive connectivity and stereochemistry from JSON geometry. It does perceive it if it gets read in from a file but we don't want to do that.

I was wrestling with OE and stereochemistry two weeks ago. One helpful function had been oechem.OE3DToInternalStereo(oemol), though I mucked around with a bunch of other functions. If you're up for it, we could do a screenshare today to look into this.

ChayaSt commented 5 years ago

I was wrestling with OE and stereochemistry two weeks ago. One helpful function had been oechem.OE3DToInternalStereo(oemol), though I mucked around with a bunch of other functions. If you're up for it, we could do a screenshare today to look into this.

Thanks! I first wasn't even able to get Openeye to perceive connectivity from coordinates and found the function oemol.SetDimension(3) that then allows oechem.OEDetermineConnectivity(oemol) to work. I then called a bunch of functions on the molecule:

oechem.OEFindRingAtomsAndBonds(molecule)
oechem.OEPerceiveBondOrders(molecule)
oechem.OEAssignImplicitHydrogens(molecule)
oechem.OEAssignFormalCharges(molecule)
oechem.OEAssignAromaticFlags(molecule)

These functions are always called when openeye reads a molecule from an xyz file. Once I did this the molecule has the stereochemistry information and I don't need to perceive it.

j-wags commented 5 years ago

That's great. Thanks for looking into this!

We had talked about finding undefined stereochemistry the other day, and I wanted to document that somewhere: Basically, I couldn't find a built-in OE function to check for undefined stereochemistry. Here's what I'm using currently in SMIRNOFF:

from openeye import oechem
from openforcefield.topology.molecule import Molecule

oechem.OEPerceiveChiral(oemol)

# Check that all stereo is specified
unspec_chiral = False
unspec_db = False
problematic_atoms = list()
problematic_bonds = list()

for oeatom in oemol.GetAtoms():
    if oeatom.IsChiral():
        if not (oeatom.HasStereoSpecified()):
            unspec_chiral = True
            problematic_atoms.append(oeatom)
for oebond in oemol.GetBonds():
    if oebond.IsChiral():
        if not (oebond.HasStereoSpecified()):
            unspec_db = True
            problematic_bonds.append(oebond)
ChayaSt commented 5 years ago

Thank you!

This is really helpful!

ChayaSt commented 5 years ago

@j-wags, turns out the code snippet above also finds spurious missing steroechemistry. Example SMILES: N[C@@H](CCCNC(N)=N)C(O)=O image

It finds this bond to be missing stereochemistry: [(7, 'C', 9, 'N', 2)] (that's C=N) which is a terminal bond so it's not chiral.

Turns out, atom.HasStereoSpecified(oechem.OEAtomStereo_Tetrahedral) and bond.HasStereoSepcified(oechem.OEBondStereo_CisTrans) will be true for chiral bonds and atoms even though the specific chirality is not defined. You have to check what the handedness is and check if that is undefined.

Here is the code I'm using now.

def is_stereochemistry_defined(molecule):
    unspec_chiral = False
    unspec_db = False
    problematic_atoms = list()
    problematic_bonds = list()
    for atom in molecule.GetAtoms():
        if atom.IsChiral() or atom.HasStereoSpecified(oechem.OEAtomStereo_Tetrahedral):
            # Check if handness is specified
            v = []
            for nbr in atom.GetAtoms():
                v.append(nbr)
            stereo = atom.GetStereo(v, oechem.OEAtomStereo_Tetrahedral)
            if stereo == oechem.OEAtomStereo_Undefined:
                unspec_chiral = True
                problematic_atoms.append((atom.GetIdx(), oechem.OEGetAtomicSymbol(atom.GetAtomicNum())))
    for bond in molecule.GetBonds():
        if bond.IsChiral() or bond.HasStereoSpecified(oechem.OEBondStereo_CisTrans):
            v = []
            for neigh in bond.GetBgn().GetAtoms():
                if neigh != bond.GetEnd():
                    v.append(neigh)
                    break
            for neigh in bond.GetEnd().GetAtoms():
                if neigh != bond.GetBgn():
                    v.append(neigh)
                    break
            stereo = bond.GetStereo(v, oechem.OEBondStereo_CisTrans)
            if stereo == oechem.OEBondStereo_Undefined:
                unspec_db = True
                a1 = bond.GetBgn()
                a2 = bond.GetEnd()
                a1_idx = a1.GetIdx()
                a2_idx = a2.GetIdx()
                a1_s = oechem.OEGetAtomicSymbol(a1.GetAtomicNum())
                a2_s = oechem.OEGetAtomicSymbol(a2.GetAtomicNum())
                bond_order = bond.GetOrder()
                problematic_bonds.append((a1_idx, a1_s, a2_idx, a2_s, bond_order))
    if unspec_chiral or unspec_db:
        raise ValueError("Stereochemistry is unspecified. Problematic atoms {}, problematic bonds {}".format(problematic_atoms,
                                                                                                             problematic_bonds))
    else:
        return True

edit: fixed code snippet.

ChayaSt commented 5 years ago

@j-wags, not sure if you are using that snippet here https://github.com/openforcefield/openforcefield/issues/141#issuecomment-442148411 but this might be picking up spurious problems.

j-wags commented 5 years ago

Very interesting! I was using that snippet there. Thanks for making the connection, I'll make sure it gets passed on to the other PR.

Thanks for doing the research above. I've got a few things to finish up with Topology right now, but I'll try out your code soon.

ChayaSt commented 5 years ago

Addressed by #13