Open ChayaSt opened 5 years ago
@ChayaSt - is "corner case 1" really a problem? I agree that the toolkits are inconsistent which is perhaps "annoying" but ... your SMILES string is, er, "nonsense", isn't it? Specifically c-1-c-c-c-c-c-1
is "a ring of aromatic carbons connected by single bonds", which doesn't sound very feasible. (Standard "I am not a chemist" disclaimer applies though.) My first impression would be that if you give toolkits something which doesn't make chemical sense, you might expect that they will deal with it in different ways.
I also feel that way about "corner case 2", but my "I am not a chemist" disclaimer applies even more strongly here. In this case it just comes from the fact that "I expect we will only see iodine acting like a 'normal halogen' and being a terminal atom which always has valence one and a single bond to any other atom." So... I'm not terribly concerned if iodine ends up with weird numbers of hydrogens when it occurs in crazy environments. (Is there some reason this is "less crazy" than I think?)
Thanks.
@davidlmobley, I mostly agree with you that we can ignore these kinds of crazy corner cases, but it is annoying that different toolkits treat them differently.
The bigger concern is the first 2 issues, where protonation states of a molecule with geometry is not conserved and that a toolkit is internally inconsistent or toolkits are inconsistent with where it places H when it is ambiguous. That's why I'm proposing restricting input to explicit H SMILES (it doesn't fully solve the third issue but that is something we can have a note about in the documentation).
The problem is that while there is a way to check openeye molecules that the SMILES used to create the molecule has explicit hydrogen, I did not yet find an equivalent check for RDKit.
I'd ask the RDKit issue tracker.
And I think any cases where a toolkit is internally inconsistent should also be raised as bugs/issues with either toolkit's support folks.
@ChayaSt So, right now I don't check for explicit hydrogens, and in fact I rely on the AddExplicitHs
/AddHs
functions in OE and RDK. I think what we're learning here is that this could be a problem moving forward. I'm opening an issue in openforcefield to think about this in the context of that repo's use cases. There's a good chance I'll borrow your code if you find a way to check for all explicit H's in both OE and RDK (even if I end up letting OFF permit it, I'd want to raise a warning).
I agree that restricting to explicit H SMILES is a good idea here.
I think that the corner cases are good to document, and I agree with you and David that it probably isn't critical at this stage to expect proper behavior for these specific ones.
We want to ensure that incoming SMILES have unambiguous protonation / tautomer states. There are several issues to look out for:
Corner case 1
c-1-c-c-c-c-c-1'
Using explicit hydrogen SMILES helps here. When reading this SMILES, OE adds one H per carbon, RDKit adds 2 H per carbon.
output:
Corner case 2
'O1CC2=C(I1C)C=CC=C2'
RDKit adds 2 Hs to I
Using the same snippet as above, explicit hydrogen SMILES does not resolve the inconsistency. This is probably a bug in RDKit.
Edit: An issue already exists on this https://github.com/rdkit/rdkit/issues/1569. It doesn't look like this will get fixed anytime soon.
This issue cannot always be resolved with requiring explicit hydrogen.
Given all these issues, I think it makes sense to require explicit hydrogen SMILES for CMILES. To check if a SMILES has explicit H, I can check if
atom.GetImplicitHCount() > 0
since all implicit H are set to zero when explicit hydrogens are added.However, there is no equivalent in RDKit that I am aware of. When RDKit reads an explicit hydrogen SMILES, it seems to set all hydrogens to implicit. @j-wags, how do you check that SMILES have explicit hydrogens for RDKit?