openforcefield / discussions

Contains organisation-wide discussions relating to science, infrastructure, etc.
MIT License
0 stars 0 forks source link

Are aromatic rings kekulized before assigning parameters? #11

Closed wenyan4work closed 2 years ago

wenyan4work commented 2 years ago

I was playing with the 'inspect_assigned_parameters' script in example folder, and tried to label a thiophene ring. The only thing I changed is molecule = Molecule.from_smiles("c1ccSc1") in the following code

from openff.toolkit import ForceField, Molecule, Topology

# Create a simple molecule from SMILES and turn it into a topology.
molecule = Molecule.from_smiles("c1ccSc1")
topology = Topology.from_molecules([molecule])

# Let's label using the Sage force field
forcefield = ForceField("openff_unconstrained-2.0.0.offxml")

# Run the molecule labeling
molecule_force_list = forcefield.label_molecules(topology)

I got bonds:

Bonds:
atoms:   0  1  parameter_id: b4  smirks [#6X3:1]-[#6X3:2]
atoms:   0  4  parameter_id: b6  smirks [#6X3:1]=[#6X3:2]
atoms:   0  5  parameter_id: b85  smirks [#6X3:1]-[#1:2]
atoms:   1  2  parameter_id: b6  smirks [#6X3:1]=[#6X3:2]
atoms:   1  6  parameter_id: b85  smirks [#6X3:1]-[#1:2]
atoms:   2  3  parameter_id: b52  smirks [#16X2,#16X1-1,#16X3+1:1]-[#6X3:2]
atoms:   2  7  parameter_id: b85  smirks [#6X3:1]-[#1:2]
atoms:   3  4  parameter_id: b52  smirks [#16X2,#16X1-1,#16X3+1:1]-[#6X3:2]
atoms:   4  8  parameter_id: b85  smirks [#6X3:1]-[#1:2]

All CC and CS bonds are either single or double.

Same thing happens for a simpler benzene. I got bonds:

Bonds:
atoms:   0  1  parameter_id: b5  smirks [#6X3:1]:[#6X3:2]
atoms:   0  5  parameter_id: b5  smirks [#6X3:1]:[#6X3:2]
atoms:   0  6  parameter_id: b85  smirks [#6X3:1]-[#1:2]
atoms:   1  2  parameter_id: b5  smirks [#6X3:1]:[#6X3:2]
atoms:   1  7  parameter_id: b85  smirks [#6X3:1]-[#1:2]
atoms:   2  3  parameter_id: b5  smirks [#6X3:1]:[#6X3:2]
atoms:   2  8  parameter_id: b85  smirks [#6X3:1]-[#1:2]
atoms:   3  4  parameter_id: b5  smirks [#6X3:1]:[#6X3:2]
atoms:   3  9  parameter_id: b85  smirks [#6X3:1]-[#1:2]
atoms:   4  5  parameter_id: b5  smirks [#6X3:1]:[#6X3:2]
atoms:   4 10  parameter_id: b85  smirks [#6X3:1]-[#1:2]
atoms:   5 11  parameter_id: b85  smirks [#6X3:1]-[#1:2]

There is actually 'aromatic' bond definition in the offxml file (b5):

        <Bond smirks="[#6X3:1]-[#6X3:2]" id="b4" length="1.454628925034 * angstrom" k="558.2701926808 * angstrom**-2 * mole**-1 * kilocalorie"></Bond>
        <Bond smirks="[#6X3:1]:[#6X3:2]" id="b5" length="1.387193227181 * angstrom" k="721.5704889544 * angstrom**-2 * mole**-1 * kilocalorie"></Bond>
        <Bond smirks="[#6X3:1]=[#6X3:2]" id="b6" length="1.371688562017 * angstrom" k="798.318590659 * angstrom**-2 * mole**-1 * kilocalorie"></Bond>

What is the correct behavior?

mattwthompson commented 2 years ago

Short answer: Yes

Medium answer: Aromaticity is perceived using the MDL model both in the Molecule loading layer (which matters when you call Molecule.from_foo) and the SMIRNOFF force field definition (which matters when you call ForceField.create_openm_system or ForceField.label_molecules). In principle these two models could differ, but at the moment only MDL is implemented in either.

For reasons I vaguely understand (and was surprised to learn!) but would do a poor job of reiterating, MDL considers the bonds in thiophene to be non-aromatic:

In [1]: from openeye import oechem

In [2]: from openff.toolkit import Molecule

In [3]: oemol = Molecule.from_smiles("c1ccSc1").to_openeye()

In [4]: oechem.OEAssignAromaticFlags(oemol, oechem.OEAroModel_MDL)

In [5]: for atom in oemol.GetAtoms():  # Same results if you look at bonds
   ...:     print(atom.GetIdx(), atom.IsAromatic())
   ...:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False

so, given this, it's not a surprise that it's using the generic single- and double-bonded sp2 carbon parameters. The trick here is that aromaticity information is stored separately from bond order in Bond.is_aromatic or Atom.is_aromatic. The toolkit is informed (via OpenEye or RDKit) that MDL considers thiophene non-aromatic and benzene aromatic and stores this information - even though the bonds themselves are kekulized, as you noticed:

In [2]: [b.is_aromatic for b in Molecule.from_smiles("c1ccSc1").bonds]
Out[2]: [False, False, False, False, False, False, False, False, False]

In [3]: [b.is_aromatic for b in Molecule.from_smiles("c1ccccc1").bonds]
Out[3]: [True, True, True, True, True, True, False, False, False, False, False, False]

In [4]: [b.bond_order for b in Molecule.from_smiles("c1ccSc1").bonds]
Out[4]: [2, 1, 2, 1, 1, 1, 1, 1, 1]

In [5]: [b.bond_order for b in Molecule.from_smiles("c1ccccc1").bonds]
Out[5]: [2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1]

So you're getting the behavior: based on MDL definitions, thiophene should get b4 and b6 but benzene should get b5.

Longer answer: This is a long-standing point of friction in our infrastructure and related tools, mostly because molecule inputs sometimes contain incomplete information. If you're somehow short on reading material, openforcefield/openff-toolkit#697 includes a thorough summary of how a few corner cases can compound upon each other.

wenyan4work commented 2 years ago

Thank you very much for the detailed explanation! @mattwthompson However, I noticed something strange (possibly due to the differences between openeye and rdkit). You mentioned

MDL considers the bonds in thiophene to be non-aromatic

I tried this in rdkit:

rkmol = Chem.MolFromSmiles('c1cscc1')
Chem.SanitizeMol(rkmol, Chem.SANITIZE_ALL ^ Chem.SANITIZE_CLEANUPCHIRALITY)
Chem.SetAromaticity(rkmol, Chem.AromaticityModel.AROMATICITY_MDL) # same result if using AROMATICITY_RDKIT
print([str(b.GetBondType()) for b in rkmol.GetBonds()])
print([at.GetIsAromatic() for at in rkmol.GetAtoms()])
Chem.Kekulize(rkmol)
print([str(b.GetBondType()) for b in rkmol.GetBonds()])
print([at.GetIsAromatic() for at in rkmol.GetAtoms()])

and get

['AROMATIC', 'AROMATIC', 'AROMATIC', 'AROMATIC', 'AROMATIC']
[True, True, True, True, True]
['SINGLE', 'DOUBLE', 'SINGLE', 'SINGLE', 'DOUBLE']
[True, True, True, True, True]

rdkit does have a note saying the MDL aromaticity model is not well documented: https://www.rdkit.org/docs/RDKit_Book.html#the-mdl-aromaticity-model

Did I do anything wrong in this example or does this mean thiophene is a case where OpenEye and RDKit behaves differently?

mattwthompson commented 2 years ago

I'm not sure why that code is reporting aromaticity, I don't get the same result while going through OpenFF:

In [4]: rdmol = Molecule.from_smiles("c1ccSc1").to_rdkit()

In [5]: [b.GetIsAromatic() for b in rdmol.GetBonds()]
Out[5]: [False, False, False, False, False, False, False, False, False]

the docs you linked include a mention of five-membered rings always being non-aromatic so I suspect some detail is off in your code. Somebody else who is better with RDKit will probably be able to be of more help than me!

j-wags commented 2 years ago
Chem.SanitizeMol(rkmol, Chem.SANITIZE_ALL ^ Chem.SANITIZE_CLEANUPCHIRALITY)

I think that the Chem.SetAromaticity call may not be overwriting the previous values, which are set by the SanitizeMol call. If I instead have the sanitize step skip setting aromaticity, it shows what we would expect:

from rdkit import Chem
rkmol = Chem.MolFromSmiles('c1cscc1')
Chem.SanitizeMol(rkmol, Chem.SANITIZE_ALL ^ Chem.SANITIZE_CLEANUPCHIRALITY ^ Chem.SANITIZE_SETAROMATICITY)
#Chem.Kekulize(rkmol, clearAromaticFlags=True) # This is another way to blow away pre-existing aro info
Chem.SetAromaticity(rkmol, Chem.AromaticityModel.AROMATICITY_MDL)
print([str(b.GetBondType()) for b in rkmol.GetBonds()])
print([at.GetIsAromatic() for at in rkmol.GetAtoms()])
Chem.Kekulize(rkmol)
print([str(b.GetBondType()) for b in rkmol.GetBonds()])
print([at.GetIsAromatic() for at in rkmol.GetAtoms()])
['DOUBLE', 'SINGLE', 'SINGLE', 'DOUBLE', 'SINGLE']
[False, False, False, False, False]
['DOUBLE', 'SINGLE', 'SINGLE', 'DOUBLE', 'SINGLE']
[False, False, False, False, False]
wenyan4work commented 2 years ago

Thank you. It is somewhat related to the Kekulize function in rdkit: https://github.com/openforcefield/openff-toolkit/blob/ed6549ebdb0149db326d489f0b5b143b5607631a/openff/toolkit/utils/rdkit_wrapper.py#L447

if I use this, as is being used in openff in the above link

Chem.Kekulize(rkmol, clearAromaticFlags=True)

I got the same result as openff-toolkit.

j-wags commented 2 years ago

Excellent. Thanks for opening the high-quality issue @wenyan4work. We've noticed a similar issue in our work to make a protein force field - The MDL aromaticity model doesn't like some amino acid side chains with 5-membered rings. So we try (and at least mostly succeed) to write the SMIRKS in our force fields so that they assign parameters that keep these sorts of rings flat, even if the aromaticity model doesn't cooperate :-P

So I think the answer is "everything is working as intended, it just looks funny". Did this answer your original question?

wenyan4work commented 2 years ago

Thank you @j-wags . Yes my original question has been answered.

I think probably the answer is 'rdkit may benefit from some crowd-funded document and demostration' :)