openforcefield / cmiles

Generate canonical molecule identifiers for quantum chemistry database
https://cmiles.readthedocs.io
MIT License
23 stars 7 forks source link

Ambiguous hydrogen counts #12

Open ChayaSt opened 5 years ago

ChayaSt commented 5 years ago

We want to ensure that incoming SMILES have unambiguous protonation / tautomer states. There are several issues to look out for:

  1. SMILES reader (Openeye or RDKit) preserve the protonation state of input molecule This might be an issue for inputs that include geometry and can be resolved with sticking to explicit hydrogen SMILES. https://github.com/choderalab/perses/issues/375
  2. Each SMILES reader is internally consistent with how it assigns hydrogen count for implicit hydrogen SMILES. @wiederm has seen cases where this can happen. Explicit hydrogens should also resolve this.
  3. Both SMILES reader in CMILES treat protonation state of the same SMILES consistently. This is not always the case. The corner cases I found:

Corner case 1 c-1-c-c-c-c-c-1'

image

Using explicit hydrogen SMILES helps here. When reading this SMILES, OE adds one H per carbon, RDKit adds 2 H per carbon.

mol = oechem.OEMol()
oechem.OESmilesToMol(mol, 'c-1-c-c-c-c-c-1')

# count implicit hydrogens
imp_h_oe = [atom.GetImplicitHCount() for atom in mol.GetAtoms()]

# RDKit
mol_rd = Chem.MolFromSmiles('c-1-c-c-c-c-c-1')
imp_h_rd = [atom.GetTotalNumHs(False) for atom in mol_rd.GetAtoms()]

print('From implicit H SMILES')
print('oe implicit H: {}'.format(imp_h_oe))
print('rd implicit H: {}'.format(imp_h_rd)) 

# Add explicit H
oechem.OEAddExplicitHydrogens(mol)
new_sm = oechem.OECreateSmiString(mol, oechem.OESMILESFlag_Hydrogens | oechem.OESMILESFlag_ISOMERIC)

mol_oe = oechem.OEMol()
oechem.OESmilesToMol(mol_oe, new_sm)
imp_h_oe = [atom.GetExplicitHCount() for atom in mol.GetAtoms()]
print('From explicit H SMILES')
print('oe from explict h smiles: {}'.format(imp_h_oe))

mol_rd = Chem.MolFromSmiles(new_sm)
imp_h_rd = [atom.GetTotalNumHs(False) for atom in mol_rd.GetAtoms()]
print('rd from explict h smiles: {}'.format(imp_h_rd))

output:

From implicit H SMILES
oe implicit H: [1, 1, 1, 1, 1, 1]
rd implicit H: [2, 2, 2, 2, 2, 2]
From explicit H SMILES
oe from explict h smiles: [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
rd from explict h smiles: [1, 1, 1, 1, 1, 1]

Corner case 2 'O1CC2=C(I1C)C=CC=C2'

image RDKit adds 2 Hs to I image

Using the same snippet as above, explicit hydrogen SMILES does not resolve the inconsistency. This is probably a bug in RDKit.

Edit: An issue already exists on this https://github.com/rdkit/rdkit/issues/1569. It doesn't look like this will get fixed anytime soon.

From implicit H SMILES
oe implicit H: [0, 2, 0, 0, 0, 3, 1, 1, 1, 1]
rd implicit H: [0, 2, 0, 0, 2, 3, 1, 1, 1, 1]
From explicit H SMILES
oe from explict h smiles: [0, 2, 0, 0, 0, 3, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
rd from explict h smiles: [1, 1, 1, 0, 0, 1, 2, 0, 2, 3]

This issue cannot always be resolved with requiring explicit hydrogen.

Given all these issues, I think it makes sense to require explicit hydrogen SMILES for CMILES. To check if a SMILES has explicit H, I can check if atom.GetImplicitHCount() > 0 since all implicit H are set to zero when explicit hydrogens are added.

However, there is no equivalent in RDKit that I am aware of. When RDKit reads an explicit hydrogen SMILES, it seems to set all hydrogens to implicit. @j-wags, how do you check that SMILES have explicit hydrogens for RDKit?

davidlmobley commented 5 years ago

@ChayaSt - is "corner case 1" really a problem? I agree that the toolkits are inconsistent which is perhaps "annoying" but ... your SMILES string is, er, "nonsense", isn't it? Specifically c-1-c-c-c-c-c-1 is "a ring of aromatic carbons connected by single bonds", which doesn't sound very feasible. (Standard "I am not a chemist" disclaimer applies though.) My first impression would be that if you give toolkits something which doesn't make chemical sense, you might expect that they will deal with it in different ways.

I also feel that way about "corner case 2", but my "I am not a chemist" disclaimer applies even more strongly here. In this case it just comes from the fact that "I expect we will only see iodine acting like a 'normal halogen' and being a terminal atom which always has valence one and a single bond to any other atom." So... I'm not terribly concerned if iodine ends up with weird numbers of hydrogens when it occurs in crazy environments. (Is there some reason this is "less crazy" than I think?)

Thanks.

ChayaSt commented 5 years ago

@davidlmobley, I mostly agree with you that we can ignore these kinds of crazy corner cases, but it is annoying that different toolkits treat them differently.

The bigger concern is the first 2 issues, where protonation states of a molecule with geometry is not conserved and that a toolkit is internally inconsistent or toolkits are inconsistent with where it places H when it is ambiguous. That's why I'm proposing restricting input to explicit H SMILES (it doesn't fully solve the third issue but that is something we can have a note about in the documentation).

The problem is that while there is a way to check openeye molecules that the SMILES used to create the molecule has explicit hydrogen, I did not yet find an equivalent check for RDKit.

davidlmobley commented 5 years ago

I'd ask the RDKit issue tracker.

And I think any cases where a toolkit is internally inconsistent should also be raised as bugs/issues with either toolkit's support folks.

j-wags commented 5 years ago

@ChayaSt So, right now I don't check for explicit hydrogens, and in fact I rely on the AddExplicitHs/AddHs functions in OE and RDK. I think what we're learning here is that this could be a problem moving forward. I'm opening an issue in openforcefield to think about this in the context of that repo's use cases. There's a good chance I'll borrow your code if you find a way to check for all explicit H's in both OE and RDK (even if I end up letting OFF permit it, I'd want to raise a warning).

I agree that restricting to explicit H SMILES is a good idea here.

I think that the corner cases are good to document, and I agree with you and David that it probably isn't critical at this stage to expect proper behavior for these specific ones.