openforcefield / amber-ff-porting

Scratch space for porting amber FFs into SMIRNOFF format
1 stars 3 forks source link

C termini are percieved to have a -3 charge #2

Open j-wags opened 4 years ago

j-wags commented 4 years ago

During SMARTS creation, the C termini seem to have a bond perception problem, yielding things like [H][C@:3]1([C:2]([C:1](C(N1)([H])[H])([H])[H])([H])[H])[C-]([O-])[O-]

This Issue is to figure out why the final three atoms all have negative charges, while an uncapped C terminal should only have one

j-wags commented 4 years ago

I've added a reproducing example to the repo. My first attempt was to try converting the PRO.mol2 file to SDF, to see if openbabel could fix it:

(off-dev) jeffreywagner@JW-MBP$ cat reproduce.py
from openforcefield.topology import Molecule
import os
os.system('obabel -imol2 /Users/jeffreywagner/projects/OpenForceField/openforcefield/examples/temp/2020-05-21-amber-ff-porting/tests/issue_2_c_term_charge/CTerminal/PRO/PRO.mol2 -osdf -O /Users/jeffreywagner/projects/OpenForceField/openforcefield/examples/temp/2020-05-21-amber-ff-porting/tests/issue_2_c_term_charge/CTerminal/PRO/PRO.sdf')

mol = Molecule.from_file('/Users/jeffreywagner/projects/OpenForceField/openforcefield/examples/temp/2020-05-21-amber-ff-porting/tests/issue_2_c_term_charge/CTerminal/PRO/PRO.sdf')
print('sdf smiles', mol.to_smiles())
mol = Molecule.from_file('/Users/jeffreywagner/projects/OpenForceField/openforcefield/examples/temp/2020-05-21-amber-ff-porting/tests/issue_2_c_term_charge/CTerminal/PRO/PRO.mol2')
print('mol2 smiles', mol.to_smiles())
(off-dev) jeffreywagner@JW-MBP$ python reproduce.py
1 molecule converted
sdf smiles [H][C@]1(C(C(C(N1C(=O)C([H])([H])[H])([H])[H])([H])[H])([H])[H])[C]([O-])[O-]
mol2 smiles [H][C@]1(C(C(C(N1C(=O)C([H])([H])[H])([H])[H])([H])[H])([H])[H])[C-]([O-])[O-]

Somehow, converting to SDF removes the negative charge from the carbon, without affecting the oxygens. This should be an impossible kekule structure, which is confusing.

Changing bond orders in the mol2 file

Looking closely at the PRO.mol2 file, I'd expect the C(O)(O) motif to either have one double bond, or two aromatic bonds. But here, it just has two single bonds:

(off-dev) jeffreywagner@JW-MBP$ cat CTerminal/PRO/PRO.mol2 
@<TRIPOS>MOLECULE
default_name
   21    21     1     0     0
SMALL
No Charge or Current Charge

@<TRIPOS>ATOM
      1 H1          26.0920    24.8880    24.0010 H          1 ACE       0.112300
...
     18 HA          29.4170    28.2960    23.1440 H          2 CPR      -0.567900
     19 C           29.5880    28.3690    25.2780 C.2        2 CPR      -0.567900
     20 O           28.9360    28.2350    26.3400 O.co2      2 CPR      -0.567900
     21 OXT         30.8350    28.4020    25.1910 O.co2      2 CPR      -0.567900
@<TRIPOS>BOND
...
    19     8    11 1   
    20     7     8 1   
    21     7    17 1   
@<TRIPOS>SUBSTRUCTURE
     1 ACE         1 TEMP              0 ****  ****    0 ROOT

If I change the last few bonds to type am, I get the following.

(off-dev) jeffreywagner@JW-MBP$ cat CTerminal/PRO/PRO.mol2 
...
    19     8    11 1   
    20     7     8 am   
    21     7    17 am   
@<TRIPOS>SUBSTRUCTURE
...

(off-dev) jeffreywagner@JW-MBP$ python reproduce.py 1 molecule converted sdf smiles [H][C@]1(C(C(C(N1C(=O)C([H])([H])[H])([H])[H])([H])[H])([H])[H])C[O-] mol2 smiles [H][C@]1(C(C(C(N1C(=O)C([H])([H])[H])([H])[H])([H])[H])([H])[H])C-[O-]

which is exactly the same as before. ~The same happens if I change them to order 1 and 2 explicitly.~ (Note from the future: This is incorrect, and setting either of the bonds to order 2 DOES end up solving this issue)

Changing reader formats

I've also tried changing the OE reading function to both MOL2 and MOL2H modes explicitly, with no luck

j-wags commented 4 years ago

This seems to be a consistent problem with carboxylate groups. For example, loading MainChain/ALA_ASP/ALA_ASP.mol2[1] (with capped termini) yields the same -3 charge:

from openeye import oechem
ifs = oechem.oemolistream('/XXX/MainChain/ALA_ASP/ALA_ASP.mol2')
ifs.SetFormat(oechem.OEFormat_MOL2)
mols = list()
oemol = oechem.OEMol()
while oechem.OEReadMolecule(ifs, oemol):
    oechem.OEPerceiveChiral(oemol)
    oechem.OEAssignAromaticFlags(oemol, oechem.OEAroModel_MDL)
    smiles_options = oechem.OESMILESFlag_Canonical | oechem.OESMILESFlag_Isotopes | oechem.OESMILESFlag_RGroups
    smiles_options |= oechem.OESMILESFlag_AtomStereo | oechem.OESMILESFlag_BondStereo
    smiles_options |= oechem.OESMILESFlag_Hydrogens
    oechem.OE3DToInternalStereo(oemol)
    print(oechem.OECreateSmiString(oemol, smiles_options))

[H][C@@](C(=O)N([H])[C@]([H])(C(=O)N([H])C([H])([H])[H])C([H])([H])[C-]([O-])[O-])(C([H])([H])[H])N([H])C(=O)C([H])([H])[H]

[1]

(off-dev) jeffreywagner@JW-MBP$ cat MainChain/ALA_ASP/ALA_ASP.mol2 
@<TRIPOS>MOLECULE
default_name
   34    33     1     0     0
SMALL
No Charge or Current Charge

@<TRIPOS>ATOM
      1 H1          25.8770    25.2200    24.4630 H          1 ACE       0.112300
      2 CH3         25.9500    26.2290    24.0600 C.3        1 ACE      -0.366200
      3 H2          25.3070    26.8990    24.6280 H          1 ACE       0.112300
      4 H3          25.6560    26.2300    23.0130 H          1 ACE       0.112300
      5 C           27.3820    26.6990    24.1750 C.2        1 ACE       0.597200
      6 O           28.2290    25.9620    24.6620 O.2        1 ACE      -0.567900
      7 N           27.6380    27.9240    23.7250 N.am       2 ALA      -0.415700
      8 H           26.8750    28.4830    23.3720 H          2 ALA       0.271900
      9 CA          28.9320    28.6040    23.7970 C.3        2 ALA       0.033700
     10 HA          29.4200    28.3430    24.7380 H          2 ALA       0.082300
     11 CB          29.8160    28.1330    22.6330 C.3        2 ALA      -0.182500
     12 HB1         29.3430    28.3840    21.6820 H          2 ALA       0.060300
     13 HB2         30.7890    28.6220    22.6870 H          2 ALA       0.060300
     14 HB3         29.9610    27.0530    22.6900 H          2 ALA       0.060300
     15 C           28.7340    30.1320    23.7680 C.2        2 ALA       0.597300
     16 O           27.6840    30.6090    23.3400 O.2        2 ALA      -0.567900
     17 N           29.7520    30.8790    24.1980 N.am       3 ASP      -0.516300
     18 H           30.6100    30.4250    24.4750 H          3 ASP       0.293600
     19 CA          29.7940    32.3470    24.2250 C.3        3 ASP       0.038100
     20 HA          29.2090    32.7390    23.3910 H          3 ASP       0.088000
     21 CB          29.1610    32.8520    25.5390 C.3        3 ASP      -0.030300
     22 HB2         28.2090    32.3390    25.6890 H          3 ASP      -0.012200
     23 HB3         29.8170    32.5870    26.3710 H          3 ASP      -0.012200
     24 CG          28.8850    34.3630    25.5750 C.2        3 ASP       0.799400
     25 OD1         28.9520    35.0080    24.5030 O.co2      3 ASP      -0.801400
     26 OD2         28.6260    34.8650    26.6890 O.co2      3 ASP      -0.801400
     27 C           31.2540    32.8290    24.0580 C.2        3 ASP       0.536600
     28 O           32.1890    32.0250    24.1430 O.2        3 ASP      -0.581900
     29 N           31.4590    34.1240    23.8040 N.am       4 NME      -0.415700
     30 H           30.6300    34.7210    23.8310 H          4 NME       0.271900
     31 CH3         32.7660    34.7490    23.6330 C.3        4 NME      -0.149000
     32 HH31        33.3710    34.1520    22.9490 H          4 NME       0.097600
     33 HH32        32.6430    35.7550    23.2310 H          4 NME       0.097600
     34 HH33        33.2690    34.8040    24.6000 H          4 NME       0.097600
@<TRIPOS>BOND
     1     5     6 2   
     2     5     7 am  
     3     2     3 1   
     4     2     4 1   
     5     2     5 1   
     6     1     2 1   
     7    15    16 2   
     8    15    17 am  
     9    11    12 1   
    10    11    13 1   
    11    11    14 1   
    12     9    10 1   
    13     9    11 1   
    14     9    15 1   
    15     7     8 1   
    16     7     9 1   
    17    27    28 2   
    18    27    29 am  
    19    24    25 1   
    20    24    26 1   
    21    21    22 1   
    22    21    23 1   
    23    21    24 1   
    24    19    20 1   
    25    19    21 1   
    26    19    27 1   
    27    17    18 1   
    28    17    19 1   
    29    31    32 1   
    30    31    33 1   
    31    31    34 1   
    32    29    30 1   
    33    29    31 1   
@<TRIPOS>SUBSTRUCTURE
     1 ACE         1 TEMP              0 ****  ****    0 ROOT
j-wags commented 4 years ago

Ah, I've retraced my steps and found that my previous attempt to change bond orders was incorrect -- I didn't modify the correct bonds (instead, it looks like I made some random C-H bonds aromatic).

Manually changing bond orders makes the carboxylate get interpreted correctly

So, taking a single proline with an uncapped C terminal [1], I can change the bond entry section from:

     6     1     2 1
     7    19    20 1
     8    19    21 1
     9    17    18 1

to

     6     1     2 1
     7    19    20 2
     8    19    21 1
     9    17    18 1

which yields the correct behavior:

(off-dev) jeffreywagner@JW-MBP$ python reproduce.py

1 molecule converted sdf smiles [H][C@]1(C(C(C(N1C(=O)C([H])([H])[H])([H])[H])([H])[H])([H])[H])C(=O)[O] mol2 smiles [H][C@]1(C(C(C(N1C(=O)C([H])([H])[H])([H])[H])([H])[H])([H])[H])C(=O)[O-]

(Now the sdf smiles gets mangled somehow, but that could be an OpenBabel issue, so I'm not going to focus on it here)

[1]

(off-dev) jeffreywagner@JW-MBP$ cat CTerminal/PRO/PRO.mol2 
@<TRIPOS>MOLECULE
default_name
   21    21     1     0     0
SMALL
No Charge or Current Charge

@<TRIPOS>ATOM
      1 H1          26.0920    24.8880    24.0010 H          1 ACE       0.112300
      2 CH3         25.9970    25.9720    24.0070 C.3        1 ACE      -0.366200
      3 H2          25.6220    26.2960    24.9760 H          1 ACE       0.112300
      4 H3          25.3290    26.2900    23.2110 H          1 ACE       0.112300
      5 C           27.3720    26.5810    23.7870 C.2        1 ACE       0.597200
      6 O           28.2780    25.8270    23.4420 O.2        1 ACE      -0.567900
      7 N           27.5140    27.9110    23.9390 N.am       2 CPR      -0.567900
      8 CD          26.4410    28.8820    24.0610 C.3        2 CPR      -0.567900
      9 HD2         26.2330    29.0550    25.1180 H          2 CPR      -0.567900
     10 HD3         25.5380    28.5690    23.5390 H          2 CPR      -0.567900
     11 CG          26.9990    30.1490    23.4240 C.3        2 CPR      -0.567900
     12 HG2         26.4950    31.0430    23.7910 H          2 CPR      -0.567900
     13 HG3         26.9210    30.0770    22.3380 H          2 CPR      -0.567900
     14 CB          28.4690    30.1130    23.8390 C.3        2 CPR      -0.567900
     15 HB2         28.5800    30.6140    24.8030 H          2 CPR      -0.567900
     16 HB3         29.1070    30.5910    23.0950 H          2 CPR      -0.567900
     17 CA          28.8000    28.6180    23.9820 C.3        2 CPR      -0.567900
     18 HA          29.4170    28.2960    23.1440 H          2 CPR      -0.567900
     19 C           29.5880    28.3690    25.2780 C.2        2 CPR      -0.567900
     20 O           28.9360    28.2350    26.3400 O.co2      2 CPR      -0.567900
     21 OXT         30.8350    28.4020    25.1910 O.co2      2 CPR      -0.567900
@<TRIPOS>BOND
     1     5     6 2   
     2     5     7 am  
     3     2     3 1   
     4     2     4 1   
     5     2     5 1   
     6     1     2 1   
     7    19    20 2   
     8    19    21 1  
     9    17    18 1   
    10    17    19 1   
    11    14    15 1   
    12    14    16 1   
    13    14    17 1   
    14    11    12 1   
    15    11    13 1   
    16    11    14 1   
    17     8     9 1   
    18     8    10 1   
    19     8    11 1   
    20     7     8 1   
    21     7    17 1   
@<TRIPOS>SUBSTRUCTURE
     1 ACE         1 TEMP              0 ****  ****    0 ROOT
j-wags commented 4 years ago

So at this point, I don't think that the tleap-outputted mol2 file is spec-compliant (since it makes sp2 carbons with three single bonds). Now the question is "how do we get AmberTools to output correct bond orders?".

Previous history

I see a little history on the AMBER mailing list for this question

Some other things I tried

Additional info

The TRIPOS mol2 specification document can be found here: http://chemyang.ccnu.edu.cn/ccb/server/AIMMS/mol2.pdf

j-wags commented 4 years ago

I'm going to reopen this issue, but remove the blocking tag. It would be good to push this upstream into the AMBER stack.

davidlmobley commented 4 years ago

Yes, this has been an issue for eons.

I think nitros might have a problem too? Vague recollection.

dscerutti commented 4 years ago

I've started a thread on the Amber Developers' listserv, we'll see what happens.

Dave

On Tue, May 26, 2020 at 8:14 PM David L. Mobley notifications@github.com wrote:

Yes, this has been an issue for eons.

I think nitros might have a problem too? Vague recollection.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/openforcefield/amber-ff-porting/issues/2#issuecomment-634346905, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEDWD67TUD5W6GUDTCLT6H3RTRLOBANCNFSM4NHHRFNA .

j-wags commented 4 years ago

Good to know. Thanks!

For now I'm going to leave the bond order fix in this repo, but if the tleap --> mol2 --> OFFTK pathway starts getting more traffic, then we can add the fix into OFFTK (we already check for AMBER atom types when reading mol2, so this wouldn't be unprecedented).