Closed wenyan4work closed 2 years ago
Short answer: Yes
Medium answer: Aromaticity is perceived using the MDL model both in the Molecule
loading layer (which matters when you call Molecule.from_foo
) and the SMIRNOFF force field definition (which matters when you call ForceField.create_openm_system
or ForceField.label_molecules
). In principle these two models could differ, but at the moment only MDL is implemented in either.
For reasons I vaguely understand (and was surprised to learn!) but would do a poor job of reiterating, MDL considers the bonds in thiophene to be non-aromatic:
In [1]: from openeye import oechem
In [2]: from openff.toolkit import Molecule
In [3]: oemol = Molecule.from_smiles("c1ccSc1").to_openeye()
In [4]: oechem.OEAssignAromaticFlags(oemol, oechem.OEAroModel_MDL)
In [5]: for atom in oemol.GetAtoms(): # Same results if you look at bonds
...: print(atom.GetIdx(), atom.IsAromatic())
...:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
so, given this, it's not a surprise that it's using the generic single- and double-bonded sp2 carbon parameters. The trick here is that aromaticity information is stored separately from bond order in Bond.is_aromatic
or Atom.is_aromatic
. The toolkit is informed (via OpenEye or RDKit) that MDL considers thiophene non-aromatic and benzene aromatic and stores this information - even though the bonds themselves are kekulized, as you noticed:
In [2]: [b.is_aromatic for b in Molecule.from_smiles("c1ccSc1").bonds]
Out[2]: [False, False, False, False, False, False, False, False, False]
In [3]: [b.is_aromatic for b in Molecule.from_smiles("c1ccccc1").bonds]
Out[3]: [True, True, True, True, True, True, False, False, False, False, False, False]
In [4]: [b.bond_order for b in Molecule.from_smiles("c1ccSc1").bonds]
Out[4]: [2, 1, 2, 1, 1, 1, 1, 1, 1]
In [5]: [b.bond_order for b in Molecule.from_smiles("c1ccccc1").bonds]
Out[5]: [2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1]
So you're getting the behavior: based on MDL definitions, thiophene should get b4
and b6
but benzene should get b5
.
Longer answer: This is a long-standing point of friction in our infrastructure and related tools, mostly because molecule inputs sometimes contain incomplete information. If you're somehow short on reading material, openforcefield/openff-toolkit#697 includes a thorough summary of how a few corner cases can compound upon each other.
Thank you very much for the detailed explanation! @mattwthompson However, I noticed something strange (possibly due to the differences between openeye and rdkit). You mentioned
MDL considers the bonds in thiophene to be non-aromatic
I tried this in rdkit:
rkmol = Chem.MolFromSmiles('c1cscc1')
Chem.SanitizeMol(rkmol, Chem.SANITIZE_ALL ^ Chem.SANITIZE_CLEANUPCHIRALITY)
Chem.SetAromaticity(rkmol, Chem.AromaticityModel.AROMATICITY_MDL) # same result if using AROMATICITY_RDKIT
print([str(b.GetBondType()) for b in rkmol.GetBonds()])
print([at.GetIsAromatic() for at in rkmol.GetAtoms()])
Chem.Kekulize(rkmol)
print([str(b.GetBondType()) for b in rkmol.GetBonds()])
print([at.GetIsAromatic() for at in rkmol.GetAtoms()])
and get
['AROMATIC', 'AROMATIC', 'AROMATIC', 'AROMATIC', 'AROMATIC']
[True, True, True, True, True]
['SINGLE', 'DOUBLE', 'SINGLE', 'SINGLE', 'DOUBLE']
[True, True, True, True, True]
rdkit does have a note saying the MDL aromaticity model is not well documented: https://www.rdkit.org/docs/RDKit_Book.html#the-mdl-aromaticity-model
Did I do anything wrong in this example or does this mean thiophene is a case where OpenEye and RDKit behaves differently?
I'm not sure why that code is reporting aromaticity, I don't get the same result while going through OpenFF:
In [4]: rdmol = Molecule.from_smiles("c1ccSc1").to_rdkit()
In [5]: [b.GetIsAromatic() for b in rdmol.GetBonds()]
Out[5]: [False, False, False, False, False, False, False, False, False]
the docs you linked include a mention of five-membered rings always being non-aromatic so I suspect some detail is off in your code. Somebody else who is better with RDKit will probably be able to be of more help than me!
Chem.SanitizeMol(rkmol, Chem.SANITIZE_ALL ^ Chem.SANITIZE_CLEANUPCHIRALITY)
I think that the Chem.SetAromaticity call may not be overwriting the previous values, which are set by the SanitizeMol call. If I instead have the sanitize step skip setting aromaticity, it shows what we would expect:
from rdkit import Chem
rkmol = Chem.MolFromSmiles('c1cscc1')
Chem.SanitizeMol(rkmol, Chem.SANITIZE_ALL ^ Chem.SANITIZE_CLEANUPCHIRALITY ^ Chem.SANITIZE_SETAROMATICITY)
#Chem.Kekulize(rkmol, clearAromaticFlags=True) # This is another way to blow away pre-existing aro info
Chem.SetAromaticity(rkmol, Chem.AromaticityModel.AROMATICITY_MDL)
print([str(b.GetBondType()) for b in rkmol.GetBonds()])
print([at.GetIsAromatic() for at in rkmol.GetAtoms()])
Chem.Kekulize(rkmol)
print([str(b.GetBondType()) for b in rkmol.GetBonds()])
print([at.GetIsAromatic() for at in rkmol.GetAtoms()])
['DOUBLE', 'SINGLE', 'SINGLE', 'DOUBLE', 'SINGLE']
[False, False, False, False, False]
['DOUBLE', 'SINGLE', 'SINGLE', 'DOUBLE', 'SINGLE']
[False, False, False, False, False]
Thank you. It is somewhat related to the Kekulize function in rdkit: https://github.com/openforcefield/openff-toolkit/blob/ed6549ebdb0149db326d489f0b5b143b5607631a/openff/toolkit/utils/rdkit_wrapper.py#L447
if I use this, as is being used in openff in the above link
Chem.Kekulize(rkmol, clearAromaticFlags=True)
I got the same result as openff-toolkit.
Excellent. Thanks for opening the high-quality issue @wenyan4work. We've noticed a similar issue in our work to make a protein force field - The MDL aromaticity model doesn't like some amino acid side chains with 5-membered rings. So we try (and at least mostly succeed) to write the SMIRKS in our force fields so that they assign parameters that keep these sorts of rings flat, even if the aromaticity model doesn't cooperate :-P
So I think the answer is "everything is working as intended, it just looks funny". Did this answer your original question?
Thank you @j-wags . Yes my original question has been answered.
I think probably the answer is 'rdkit may benefit from some crowd-funded document and demostration' :)
I was playing with the 'inspect_assigned_parameters' script in example folder, and tried to label a thiophene ring. The only thing I changed is
molecule = Molecule.from_smiles("c1ccSc1")
in the following codeI got bonds:
All CC and CS bonds are either single or double.
Same thing happens for a simpler benzene. I got bonds:
There is actually 'aromatic' bond definition in the offxml file (b5):
What is the correct behavior?