Add phosphorylated residues SEP and PTR (pSer and pTyr)

sukritsingh commented 1 year ago

This PR provides the following changes in reference to #259 :

Adds two new residue templates SEP and PTR (pSer and pTyr respectively). Note that Hydrogens were removed from the downloaded structures to be consistent with the other Template files.
The script pdbfixer/pdbfixer.py is updated to add these residues as well.
Updates devtools/createSoftForcefield.py for the current state of atomType objects
Wrote out an updated pdbfixer/soft.xml file using the newly updated script devtools/createSoftForcefield.py

peastman commented 1 year ago

Lots of tests are failing with errors like this:

Exception: Multiple non-identical matching templates found for residue 4 (ASP): ASH, ASP.

With the original version, we manually removed the duplicates in #132 (see also #131). We could try to add logic to the script to automatically remove duplicates, but just doing it by hand again is probably simplest.

sukritsingh commented 1 year ago

I looked at the failure log and it looks like there were only four residues recognized as duplicates are causing the failures: LYN, CYM, GLH, and ASH.

I've deleted these four residue templates like was done in #132 (but in #132 a different set of residues were deleted than here, interestingly).

Hopefully this fixes things!

peastman commented 1 year ago

There are a few others that need to be removed too: CYX, HID, HIE. Also rename HIP to HIS. (HID, HIE, and HIP are all variants of histidine which are identical after removing hydrogens.) Then for each residue you've removed, there are separate templates corresponding to N- and C- terminal residues. Their names are the same but with N or C at the beginning. The same changes need to be made to them. If you're not sure about something, take a look at the diff of #132.

sukritsingh commented 1 year ago

Ah I see! Didn't want to mess around and delete established residues so I wasn't sure. I've gone ahead and deleted CYX, HID, HIE, and renamed HIP->HIS. I then did the same but for N- and C-terminal versions of any of these residues (or the ones I deleted in the previous commit).

I went through the diff at #132 and I think all the residues deleted there are now correspondingly removed from soft.xml here. I can't think of anything else that would cause issues?

peastman commented 1 year ago

This all looks good now. I assume you've tested it on a file containing those residues to make sure it works? I would suggest adding one to test_build_and_simulate.py, but it would just replace them with standard residues since that's still the default behavior for them.

sukritsingh commented 1 year ago

Oh I haven't yet! I do have a structure of a kinase (attached below) and I'll test it on that...the only test I can think of at the moment is to load it in, protonate it, and then simulate it...anything else I should try out?

pser-ripk2-confs.zip

peastman commented 1 year ago

That's fine. Just to make sure it's able to build missing residues of the new types correctly.

sukritsingh commented 1 year ago

Hm, looks like protonation is an issue...

I just ran:

fixer = pdbfixer.PDBFixer("/Users/singhs15/Downloads/pser-ripk2-conf1.pdb")
fixer.findMissingAtoms()
fixer.addMissingAtoms()
        pH = 7.0
        ionic = 50.0 * unit.millimolar
        box = 10.0 * unit.angstrom
        positiveIon = 'Na+'
        negativeIon = 'Cl-'
fixer.addMissingHydrogens(pH)
app.PDBFile.writeFile(fixer.topology, fixer.positions)

and it wrote out a PDB file without complaint, but SEP isn't protonated (every other residue IS protonated though). In retrospect this isn't surprising since I don't think I ever specified anything about protonation of SEP residues?

Should the template of SEP I include just have the protons included by default? I didn't include hydrogens in SEP.pdb because I thought none of the template PDB files should be protonated?

peastman commented 1 year ago

Adding hydrogens is done by OpenMM's Modeller class. It has its own internal specification of what hydrogens to add to what residues, and of course it doesn't include these two. You can call loadHydrogenDefinitions() to load a file with additional residues that you create yourself. Here's some relevant documentation.

http://docs.openmm.org/latest/userguide/application/03_model_building_editing.html#adding-hydrogens http://docs.openmm.org/latest/api-python/generated/openmm.app.modeller.Modeller.html#openmm.app.modeller.Modeller.loadHydrogenDefinitions

And here's the file with the built in definitions. The format should hopefully be clear.

https://github.com/openmm/openmm/blob/master/wrappers/python/openmm/app/data/hydrogens.xml

This does raise an issue: this feature can't be fully implemented just by changes to PDBFixer. If you just need it for your own scripts, of course, it's fine to put a call to loadHydrogenDefinitions() at the start. But it will have limited value for other people.

sukritsingh commented 1 year ago

Ah! I see! Makes sense --- let me try using it and see if I can get it working using either the amber/phosaa14 forcefield or if I need to make a custom .xml file. Good to know at the very least that it is possible to use!

This does raise an issue: this feature can't be fully implemented just by changes to PDBFixer. If you just need it for your own scripts, of course, it's fine to put a call to loadHydrogenDefinitions() at the start. But it will have limited value for other people.

Hrmm the intention of this whole addition was to have it be useful for some other folks besides me as well. Also, it is rather nice to be able to auto-build in the heavy atoms of phosphorylated residues into protein chains rather than using other closed-source tools like Maestro where atom names/orders may be incompatible, so I see no reason not to at least be able to include these new residues.

How about a middle ground solution: Rather than modifying OpenMM perhaps we can just include an appropriate .xml file and some instructions in one of the directories (maybe just in pdbfixer) for how to include and use PTM residues? Happy to try creating/testing a set out and including a script + README.

jchodera commented 1 year ago

If I understand correctly, @sukritsingh's goal here was to permit users to use pdbfixer to introduce phosphorylations---a very common form of post-translational modification---via the mutation capability. To support that, we should be able to simply have pdbfixer load the appropriate hydrogen definitions as needed for the mutation residues supported.

Would pdbfixer write out the CONECT records for these residues since they are non-standard? If so, since amber/phosaa14 supports these residues, we wouldn't need to change anything in OpenMM---would we?

More generally, we might want to make pdbfixer aware of all the chemical components available in the PDB via their three-letter codes using the Chemical Component Dictionary, which we could automatically process and use to (1) fill in missing bonds, (2) fill in missing atoms, and (3) allow user-specified mutations to any of these known three-letter codes.

peastman commented 1 year ago

we should be able to simply have pdbfixer load the appropriate hydrogen definitions as needed for the mutation residues supported.

True, we could make make it automatically load its own definitions.

sukritsingh commented 1 year ago

Update: I've gotten it to to load the hydrogen definitions (I think) ok, but I'm encountering an issue about bonds being different?

Mutation from SER to SEP works when I run the following:

import pdbfixer.pdbfixer as pdbf
fixer = pdbf.PDBFixer("test-ser-noPhos.pdb")
fixer.applyMutations(["SER-173-SEP"], "A")
history
fixer.findMissingResidues()
fixer.findMissingAtoms()
fixer.addMissingAtoms()

I then load in a sep-hydrogens.xml file that seems to load up without complaint:

from openmm import app
model = app.modeller.Modeller(fixer.topology, fixer.positions)
model.loadHydrogenDefinitions("sep-hydrogens.xml")
forcefield = app.ForceField('amber/ff14SB.xml', 'tip3p.xml', 'amber/phosaa14SB.xml')
model.addHydrogens(forcefield)

However, when I run the final model.addHydrogens(forcefield) line above, the following error occurs:

ValueError: No template found for residue 173 (SEP).  The set of atoms matches SEP, but the bonds are different.

Correct me if I'm wrong, but does this mean that the SEP listed in the amber/phosaa14SB.xml force field does not match the template I downloaded from online? If so should I just copy over the SEP listed there as my template? Edit: Actually that doesn't make sense right - I generated the soft.xml using that SEP template, why would it complain about bonds being different now?

sukritsingh commented 1 year ago

Ah! I think it's related to this comment above:

Would pdbfixer write out the CONECT records for these residues since they are non-standard? If so, since amber/phosaa14 supports these residues, we wouldn't need to change anything in OpenMM---would we?

TLDR: Any advice on what to change so PDBFixer writes out the full CONECT records?

In the example file I'm using for all this (attached below), if you you mutate SER->SEP using:

fixer = pdbf.PDBFixer("test-ser-noPhos.pdb")
fixer.applyMutations(["SER-173-SEP"], "A")
fixer.findMissingResidues()
fixer.findMissingAtoms()
fixer.addMissingAtoms()

It does NOT write out the complete CONECT records for the residue. What it writes out is (found in mutate-ser-sep-wrong-conect.pdb:

CONECT 1392 1398
CONECT 1398 1392 1399
CONECT 1399 1400 1402 1398
CONECT 1400 1408 1399 1401 1408
CONECT 1401 1400
CONECT 1402 1399 1403
CONECT 1403 1402
CONECT 1408 1400 1400

What it should write out is (Edit: Found in mutate-ser-sep-correct-conect.pdb):

CONECT 1392 1398
CONECT 1398 1392 1399
CONECT 1399 1398 1400 1402
CONECT 1400 1399 1401 1408
CONECT 1401 1400
CONECT 1402 1399 1403
CONECT 1403 1402 1404
CONECT 1404 1403 1405 1406 1407
CONECT 1405 1404
CONECT 1406 1404
CONECT 1407 1404
CONECT 1408 1400

I think because it's non-standard it's only writing out the conect records for the standard part of the SEP residue.

On the bright side, as soon as you use the correct CONECT records, mutation + protonation becomes trivial (no need for a separate .xml file)

Any insight on what changes to make to allow the correct CONECT records to be written?

pdbfixer-ser-sep-test-files.zip

peastman commented 1 year ago

PDBFile decides which bonds to write based on the name of the residue:

for atom1, atom2 in topology.bonds():
    if atom1.residue.name not in PDBFile._standardResidues or atom2.residue.name not in PDBFile._standardResidues:
        conectBonds.append((atom1, atom2))

If those bonds aren't being written, that suggests they aren't present in the topology.

sukritsingh commented 1 year ago

Ah ok! Just to clarify the code snippet, it seems like anything outside the standard residue set (ie things involving HETATOMS) would be written out as CONECT records right? Based on what's being written it seems like the phosphate group's CONECT records aren't being written out, even though the phosphate atoms are being written out in the .pdb file....

Would the topology information be contained with the residue record for SEP in soft.xml? I'm looking at the residue record now for SEP and the bonds seem to connect correctly....

  <Residue name="SEP">
   <Atom name="N" type="N"/>
   <Atom name="CA" type="CX"/>
   <Atom name="CB" type="2C"/>
   <Atom name="OG" type="OZ"/>
   <Atom name="P" type="P"/>
   <Atom name="O1P" type="OX"/>
   <Atom name="O2P" type="OX"/>
   <Atom name="O3P" type="OX"/>
   <Atom name="C" type="C"/>
   <Atom name="O" type="O"/>
   <Bond from="0" to="1"/>
   <Bond from="1" to="8"/>
   <Bond from="1" to="2"/>
   <Bond from="2" to="3"/>
   <Bond from="3" to="4"/>
   <Bond from="4" to="5"/>
   <Bond from="4" to="7"/>
   <Bond from="4" to="6"/>
   <Bond from="8" to="9"/>
   <ExternalBond from="0"/>
   <ExternalBond from="8"/>
  </Residue>

Could there be confusion somehow because half the SEP topology is a standard residue, and then there's a few more atoms attached, or that the topology isn't being updated properly? Behaviorally, it's almost like it created CONECT records for residues that are standard, but ignored the bonds that occur at the post-translational modification...

peastman commented 1 year ago

There's no such thing as a "half standard residue". A residue is either a heterogen or it isn't. Here's what the spec says about CONECT records:

CONECT records are present for:

Intra-residue connectivity within non-standard (HET) residues (excluding water).

Inter-residue connectivity of HET groups to standard groups (including water) or to other HET groups.

Disulfide bridges specified in the SSBOND records have corresponding records.

Since SEP is not a standard amino acid, the entire residue is a heterogen and all bonds in it need to be listed. PDBFile decides what to write based on the residue name. Since 'SEP' is not contained in PDBFile._standardResidues, it's a heterogen and all bonds within it will be written out. I think the problem must be at an earlier point. The Topology being written to the file doesn't contain those bonds.

sukritsingh commented 1 year ago

Ahh ok thanks for clarifying! Sorry for all the questions - just trying to parse what's going on....

It looks like, just as you mentioned, the entire residue (pre-protonation) is being written out as heterogens (with HETATM labels)

HETATM 1398  N   SEP A 173     -66.676  -3.794 -15.914  1.00  0.00           N  
HETATM 1399  CA  SEP A 173     -68.120  -3.850 -16.128  1.00  0.00           C  
HETATM 1400  C   SEP A 173     -68.716  -2.462 -16.158  1.00  0.00           C  
HETATM 1401  O   SEP A 173     -69.923  -2.345 -16.568  1.00  0.00           O  
HETATM 1402  CB  SEP A 173     -68.801  -4.739 -15.058  1.00  0.00           C  
HETATM 1403  OG  SEP A 173     -68.721  -4.186 -13.740  1.00  0.00           O  
HETATM 1404  P   SEP A 173     -68.939  -5.268 -12.397  1.00  0.00           P  
HETATM 1405  O1P SEP A 173     -70.229  -5.944 -12.659  1.00  0.00           O  
HETATM 1406  O2P SEP A 173     -68.036  -6.189 -11.434  1.00  0.00           O  
HETATM 1407  O3P SEP A 173     -69.213  -3.851 -11.684  1.00  0.00           O

It looks like all the atoms do get added, but the bonds aren't rebuilt, which would be happening in _addAtomsToTopology right? I guess I should first check where it's drawing the bond information from - Does it draw the bond connectivity from the template PDB file? or soft.xml?

peastman commented 1 year ago

Does it draw the bond connectivity from the template PDB file?

It's from the template. That's the problem: your template doesn't have CONECT records. The existing templates don't need them because they're standard residues. SEP and PTR aren't, so they do need them.

sukritsingh commented 1 year ago

Ok great! Let me try adding the correct CONECT records into the template

Thanks so much for all your help!

sukritsingh commented 1 year ago

Update: I added in CONECT records to the template for SEP.pdb, including explicit CONECT records for the phosphate group:

HETATM 1398  N   SEP A   1     -66.676  -3.794 -15.914  1.00  0.00           N  
HETATM 1399  CA  SEP A   1     -68.120  -3.850 -16.128  1.00  0.00           C  
HETATM 1400  C   SEP A   1     -68.716  -2.462 -16.158  1.00  0.00           C  
HETATM 1401  O   SEP A   1     -69.923  -2.345 -16.568  1.00  0.00           O  
HETATM 1402  CB  SEP A   1     -68.801  -4.739 -15.058  1.00  0.00           C  
HETATM 1403  OG  SEP A   1     -68.721  -4.186 -13.740  1.00  0.00           O  
HETATM 1404  P   SEP A   1     -68.939  -5.268 -12.397  1.00  0.00           P  
HETATM 1405  O1P SEP A   1     -70.229  -5.944 -12.659  1.00  0.00           O  
HETATM 1406  O2P SEP A   1     -68.036  -6.189 -11.434  1.00  0.00           O  
HETATM 1407  O3P SEP A   1     -69.213  -3.851 -11.684  1.00  0.00           O  
TER    1408      SEP A   1
CONECT 1398 1392 1399
CONECT 1399 1398 1400 1402
CONECT 1400 1399 1401 1408
CONECT 1401 1400
CONECT 1402 1399 1403
CONECT 1403 1402 1404
CONECT 1404 1403 1405 1406 1407
CONECT 1405 1404
CONECT 1406 1404
CONECT 1407 1404
END

and it looks good when I open it up in pymol:

However, if I run the following commands to introduce a mutation:

import pdbfixer.pdbfixer as pdbf
fixer = pdbf.PDBFixer("../sep-testing/test-ser-noPhos.pdb")
fixer.applyMutations(["SER-173-SEP"], "A")
fixer.findMissingResidues()
fixer.findMissingAtoms()
fixer.addMissingAtoms()
from openmm import app
app.PDBFile.writeFile(fixer.topology, fixer.positions, open("./test-mutate.pdb", "w"))

then the same behavior shows up where the atoms are added in but the bonds are for the phosphate aren't written out:

I felt like I'm missing some simple thing? I'm trying to run through addMissingAtoms and _addAtomsToTopology to see what I may be missing that's causing this.

peastman commented 1 year ago

You can print out all the bonds involving the residue with something like this:

print([(a1, a2) for a1, a2 in fixer.topology.bonds() if 'SEP' in [a1.residue.name, a2.residue.name]])

Print that at each point in the script and see whether the residue contains the bonds you expect it to at that point.

sukritsingh commented 1 year ago

Finally returning to this (so sorry for the delay - I got buried in K99 grant writing right after that last post and it was my first time writing an NIH grant!)

So I went through and ran through the above commands line by line and then ran: print([(a1, a2) for a1, a2 in fixer.topology.bonds() if 'SEP' in [a1.residue.name, a2.residue.name]])

to write out the bonds after each step. The problem bonds are missing occurred right after I run:

fixer.applyMutations(["SER-173-SEP"], "A")

I've pasted out the bonds below for completeness, but to summarize, it's only got the bonds of the original SER residue that was in that position. It does not seem to have any record of the added bonds when I run applyMutations, even though the template for SEP (pasted above) contains the appropriate CONECT records.

Should the bonds be recognized right after applying the mutation? I had assumed so but in looking at applyMutations now within pdbfixer.pyall it does remap the residues based on the mutation list, and then deletes unneeded atoms. I then figured that bonds would be added in fixer.addMissingAtoms, but nothing appears to alter the output of the print statement to append the appropriate bonds.

Any advice as to what to next try?

Bonds output from the print statement pasted above (same across all lines after running applyMutations):

[(<Atom 1391 (C) of chain 0 residue 171 (LEU)>, <Atom 1397 (N) of chain 0 residue 172 (SEP)>), (<Atom 1399 (C) of chain 0 residue 172 (SEP)>, <Atom 1398 (CA) of chain 0 residue 172 (SEP)>), (<Atom 1399 (C) of chain 0 residue 172 (SEP)>, <Atom 1400 (O) of chain 0 residue 172 (SEP)>), (<Atom 1398 (CA) of chain 0 residue 172 (SEP)>, <Atom 1401 (CB) of chain 0 residue 172 (SEP)>), (<Atom 1398 (CA) of chain 0 residue 172 (SEP)>, <Atom 1397 (N) of chain 0 residue 172 (SEP)>), (<Atom 1401 (CB) of chain 0 residue 172 (SEP)>, <Atom 1402 (OG) of chain 0 residue 172 (SEP)>), (<Atom 1399 (C) of chain 0 residue 172 (SEP)>, <Atom 1403 (N) of chain 0 residue 173 (GLN)>)]

This is the bonds output from running addMissingAtoms() - it also does not add any of the sidechain bonds, but weirdly shuffles around the order in which the bonds are printed?

[(<Atom 1399 (C) of chain 0 residue 172 (SEP)>, <Atom 1407 (N) of chain 0 residue 173 (GLN)>), (<Atom 1391 (C) of chain 0 residue 171 (LEU)>, <Atom 1397 (N) of chain 0 residue 172 (SEP)>), (<Atom 1399 (C) of chain 0 residue 172 (SEP)>, <Atom 1398 (CA) of chain 0 residue 172 (SEP)>), (<Atom 1399 (C) of chain 0 residue 172 (SEP)>, <Atom 1400 (O) of chain 0 residue 172 (SEP)>), (<Atom 1398 (CA) of chain 0 residue 172 (SEP)>, <Atom 1401 (CB) of chain 0 residue 172 (SEP)>), (<Atom 1398 (CA) of chain 0 residue 172 (SEP)>, <Atom 1397 (N) of chain 0 residue 172 (SEP)>), (<Atom 1401 (CB) of chain 0 residue 172 (SEP)>, <Atom 1402 (OG) of chain 0 residue 172 (SEP)>), (<Atom 1399 (C) of chain 0 residue 172 (SEP)>, <Atom 1407 (N) of chain 0 residue 173 (GLN)>)]

sukritsingh commented 1 year ago

Small update: Dug around a bit more and noticed that if I run: newTopology, newPositions, newAtoms, existingAtomMap = fixer._addAtomsToTopology(True, True)

Then the newToplogy object also does not contain the SEP bonds, even though I can verify the bonds are in the templates using

template = fixer.templates["SEP"]
print([(a1, a2) for a1, a2 in template.topology.bonds() if 'SEP' in [a1.residue.name, a2.residue.name]])

Interestingly, I also noticed that the existingAtomMap created by _addAtomToTopology does not contain any Atoms mapped from the template (although I would expect them to based on the docstring)

The section for residue SEP in existingAtomMap looks like:

<Atom 1397 (N) of chain 0 residue 172 (SEP)>: <Atom 1397 (N) of chain 0 residue 172 (SEP)>, 
<Atom 1398 (CA) of chain 0 residue 172 (SEP)>: <Atom 1398 (CA) of chain 0 residue 172 (SEP)>, 
<Atom 1399 (C) of chain 0 residue 172 (SEP)>: <Atom 1399 (C) of chain 0 residue 172 (SEP)>, 
<Atom 1400 (O) of chain 0 residue 172 (SEP)>: <Atom 1400 (O) of chain 0 residue 172 (SEP)>, 
<Atom 1401 (CB) of chain 0 residue 172 (SEP)>: <Atom 1401 (CB) of chain 0 residue 172 (SEP)>, 
<Atom 1402 (OG) of chain 0 residue 172 (SEP)>: <Atom 1402 (OG) of chain 0 residue 172 (SEP)>, 
<Atom 1403 (N) of chain 0 residue 173 (GLN)>: <Atom 1407 (N) of chain 0 residue 173 (GLN)>,

with no mention of the new atoms. Given that existingAtomMap is used for the creation of new bonds in _addAtomsToTopology - could this be why it's never building the CONECT records?

openmm / pdbfixer

Add phosphorylated residues SEP and PTR (pSer and pTyr) #261