openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
305 stars 90 forks source link

Negatively charged cysteines in the main chain are not supported. #1727

Closed pbuslaev closed 10 months ago

pbuslaev commented 10 months ago

Describe the bug I tried to to generate topology from pdb with deprotonated, but not covalently connected to other cysteine, system. Such cysteines are often observed in protein systems, especially if they are close to ions. Toolkit gave me an error, saying that some bonds or atoms in the input could not be identified.

To Reproduce I encountered the with openff-toolkit=0.14.3 (clean installation into empty conda environment).

from openff.toolkit import Molecule, Topology, ForceField
protein = Topology.from_pdb("./cyx_test.pdb")

Output This is the error message I got:

---------------------------------------------------------------------------
UnassignedChemistryInPDBError             Traceback (most recent call last)
Cell In[10], line 1
----> 1 protein = Topology.from_pdb("/home/pbuslaev/projects/gromacs_fep_pipeline/scripts/test_runs/tmp_6/cyx_test.pdb")
      2 protein_intrcg = Interchange.from_smirnoff(force_field=ff14sb, topology=protein)

File /tmp_mnt/filer1/unix/pbuslaev/conda_env/openff_dev/lib/python3.10/site-packages/openff/utilities/utilities.py:80, in requires_package.<locals>.inner_decorator.<locals>.wrapper(*args, **kwargs)
     77 except Exception as e:
     78     raise e
---> 80 return function(*args, **kwargs)

File /tmp_mnt/filer1/unix/pbuslaev/conda_env/openff_dev/lib/python3.10/site-packages/openff/toolkit/topology/topology.py:1758, in Topology.from_pdb(cls, file_path, unique_molecules, toolkit_registry, _custom_substructures, _additional_substructures)
   1752 substructure_dictionary["ADDITIONAL_SUBSTRUCTURE_OVERLAP"] = {}
   1754 coords_angstrom = np.array(
   1755     [[*vec3.value_in_unit(openmm_unit.angstrom)] for vec3 in pdb.getPositions()]
   1756 )
-> 1758 topology = toolkit_registry.call(
   1759     "_polymer_openmm_pdbfile_to_offtop",
   1760     cls,
   1761     pdb,
   1762     substructure_dictionary,
   1763     coords_angstrom,
   1764     _custom_substructures,
   1765 )
   1767 for off_atom, atom in zip([*topology.atoms], pdb.topology.atoms()):
   1768     off_atom.metadata["residue_name"] = atom.residue.name

File /tmp_mnt/filer1/unix/pbuslaev/conda_env/openff_dev/lib/python3.10/site-packages/openff/toolkit/utils/toolkit_registry.py:356, in ToolkitRegistry.call(self, method_name, raise_exception_types, *args, **kwargs)
    354             for exception_type in raise_exception_types:
    355                 if isinstance(e, exception_type):
--> 356                     raise e
    357             errors.append((toolkit, e))
    359 # No toolkit was found to provide the requested capability
    360 # TODO: Can we help developers by providing a check for typos in expected method names?

File /tmp_mnt/filer1/unix/pbuslaev/conda_env/openff_dev/lib/python3.10/site-packages/openff/toolkit/utils/toolkit_registry.py:352, in ToolkitRegistry.call(self, method_name, raise_exception_types, *args, **kwargs)
    350 method = getattr(toolkit, method_name)
    351 try:
--> 352     return method(*args, **kwargs)
    353 except Exception as e:
    354     for exception_type in raise_exception_types:

File /tmp_mnt/filer1/unix/pbuslaev/conda_env/openff_dev/lib/python3.10/site-packages/openff/toolkit/utils/rdkit_wrapper.py:321, in RDKitToolkitWrapper._polymer_openmm_pdbfile_to_offtop(self, topology_class, pdbfile, substructure_dictionary, coords_angstrom, _custom_substructures)
    314 custom_substructure_dictionary = self._prepare_custom_substructures(
    315     _custom_substructures
    316 )
    317 substructure_dictionary.update(
    318     custom_substructure_dictionary
    319 )  # concats both dicts, unique keys are enforced in previous function
--> 321 rdkit_mol = self._polymer_openmm_topology_to_rdmol(
    322     omm_top, substructure_dictionary
    323 )
    325 rdmol_conformer = Chem.Conformer()
    326 for atom_idx in range(rdkit_mol.GetNumAtoms()):

File /tmp_mnt/filer1/unix/pbuslaev/conda_env/openff_dev/lib/python3.10/site-packages/openff/toolkit/utils/rdkit_wrapper.py:818, in RDKitToolkitWrapper._polymer_openmm_topology_to_rdmol(self, omm_top, substructure_library)
    807             symbols = sorted(
    808                 [
    809                     SYMBOLS[atom.GetAtomicNum()]
   (...)
    812                 ]
    813             )
    814             resname_to_symbols_and_atomnames[resname].append(
    815                 (symbols, atom_names)
    816             )
--> 818     raise UnassignedChemistryInPDBError(
    819         substructure_library=resname_to_symbols_and_atomnames,
    820         omm_top=omm_top,
    821         unassigned_atoms=unassigned_atoms,
    822         unassigned_bonds=unassigned_bonds,
    823         matches=matches,
    824     )
    826 # set some properties to later remember what matches were made
    827 for atom in mol.GetAtoms():

UnassignedChemistryInPDBError: Some bonds or atoms in the input could not be identified.

Hint: The following residues were assigned names that do not match the residue name in the input, or could not be assigned residue names at all. This may indicate that atoms are missing from the input or some other error. The OpenFF Toolkit requires all atoms, including hydrogens, to be explicit in the input to avoid ambiguities in protonation state or bond order:
    Input residue  :CYS#0002 contains atoms matching substructures {'No match', 'PEPTIDE_BOND'}

Error: The following 5 atoms exist in the input but could not be assigned chemical information from the substructure library:
    Atom     9 (HA) in residue  :CYS#0002
    Atom    10 (CB) in residue  :CYS#0002
    Atom    11 (HB2) in residue  :CYS#0002
    Atom    12 (HB3) in residue  :CYS#0002
    Atom    13 (SG) in residue  :CYS#0002

Computing environment (please complete the following information):

Additional context As far as I understand, it is the consequence of the fact, that negatively charged cysteines which are not terminal, but located in the main chain are not present in data/proteins/aa_residues_substructures_explicit_bond_orders_with_caps_explicit_connectivity.json file.

Files to reproduce cyx_test.pdb

CRYST1   48.000   48.000   48.000  90.00  90.00  90.00 P 1           1
ATOM      1  H1  ACE     1      28.573  24.032  26.099  1.00  0.00           H  
ATOM      2  CH3 ACE     1      28.403  25.075  26.364  1.00  0.00           C  
ATOM      3  H2  ACE     1      29.047  25.351  27.196  1.00  0.00           H  
ATOM      4  H3  ACE     1      27.360  25.215  26.643  1.00  0.00           H  
ATOM      5  C   ACE     1      28.726  25.945  25.173  1.00  0.00           C  
ATOM      6  O   ACE     1      29.129  25.442  24.132  1.00  0.00           O  
ATOM      7  N   CYX     2      28.533  27.252  25.319  1.00  0.00           N  
ATOM      8  H   CYX     2      28.195  27.601  26.205  1.00  0.00           H  
ATOM      9  CA  CYX     2      28.788  28.255  24.284  1.00  0.00           C  
ATOM     10  HA  CYX     2      29.726  28.000  23.788  1.00  0.00           H  
ATOM     11  CB  CYX     2      27.665  28.212  23.233  1.00  0.00           C  
ATOM     12  HB2 CYX     2      27.542  27.183  22.893  1.00  0.00           H  
ATOM     13  HB3 CYX     2      27.997  28.784  22.367  1.00  0.00           H  
ATOM     14  SG  CYX     2      26.035  28.864  23.703  1.00  0.00           S  
ATOM     15  C   CYX     2      28.948  29.658  24.899  1.00  0.00           C  
ATOM     16  O   CYX     2      28.616  29.871  26.065  1.00  0.00           O  
ATOM     17  N   NME     3      29.464  30.614  24.115  1.00  0.00           N  
ATOM     18  H   NME     3      29.717  30.370  23.171  1.00  0.00           H  
ATOM     19  CH3 NME     3      29.676  31.993  24.550  1.00  0.00           C  
ATOM     20 HH31 NME     3      28.717  32.449  24.805  1.00  0.00           H  
ATOM     21 HH32 NME     3      30.312  32.007  25.437  1.00  0.00           H  
ATOM     22 HH33 NME     3      30.151  32.572  23.757  1.00  0.00           H  
TER      23      NME     3 
END
pbuslaev commented 10 months ago

While the patch does solve the issue with the Topology, the next problem occurs when I am trying to create an interchange:

ff14sb = ForceField("ff14sb_off_impropers_0.0.3.offxml")
protein_intrcg = Interchange.from_smirnoff(force_field=ff14sb, topology=protein)

This is due to the fact the negatively charged cysteine is not a part of force field. It seems that it can be easily ported from amber, since AMBER14SB supports CYM, which is negatively charged cysteine. Thus, I think that by adding CYM to residue list in generate and convert scripts of amber-ff-porting should solve the issue. Unfortunately, I can not do it myself, since I do not have license for openeye. I wonder, if it is worth trying to rework convert tools with say rdkit. If I read the code correctly, the main dependency in amber-ff-porting on openeye, is with SMARTS generation, which should be possible to do with rdkit. In general, I think it can be useful to make porting independent of openeye, since one might need to port new (e.g. modified) residues to the amber forcefield.

j-wags commented 10 months ago

Hi @pbuslaev, thanks for the really great report and PR. It turns out that the substructure loading library DOES have CYM, but for some reason it's not accepting this reasonable-looking input.

For the PDB-loading-substructure stuff - we try to maintain a processing pipeline from the RCSB chemical component dictionary to our substructures. So instead of modifying our big substructure data file directly, I'll point out on that PR where we can modify the processing script to add the patch.

For the AMBER ff14sb port changes, I can't recall why we didn't port CYM initially. I think we just assumed it was so rare that it wasn't worth covering. Getting that pipeline running again will take some time that I don't currently have. I agree it's unfortunate that it's an OpenEye-dependent workflow - IIRC it was because OE had some functionality that we couldn't easily find in RDKit. I'll open an issue on that repo to track the request but I unfortunately don't anticipate having time to action it any time soon. I'll ask around internally to see if anyone can do it though!