More rigorous implementation of findMissingResidues needed (e.g., doesn't work for antibodies)

nitroamos commented 6 years ago

While writing up my response to https://github.com/pandegroup/openmm/issues/2087 I went poking around in findMissingResidues to figure out how it's been implemented, and I found a problem.

First, to recap that discussion: 1) You can't rely on residue numbers alone to determine where gaps in the structures are. Residues numbers are often chosen on the basis of an alignment with a family of proteins and as such can jump up or down. If you want to take an approach like this, you should use r.index instead of r.id. Example.

2) You can't rely on SEQRES to find gaps in the structure. For example, if SEQRES says the sequence is AAA and the coordinate section says AA, there are 3 places you could put the missing A.

3) I'm not a PDB historian so I don't know why it's in a REMARK section, but REMARK 465 is supposed to be where you look to find where the missing residues are and given the template they provide, it is intended to be machine readable.

I went looking in the code for findMissingResidues to understand it. As far as I can tell, it makes these assumptions: 1) it does not use REMARK 465. This means it could be fooled by SEQRES resulting in misthreaded structures. 2) By not using REMARK 465, it doesn't know if the missing residues are already named (i.e., id and insertionCode) 3) it assumes residue numbers are sequential 4) It doesn't provide any warnings if something looks suspicious.

Take a look at this line:

        for chain in chains:
            minResidue = min(int(r.id) for r in chain.residues())
            maxResidue = max(int(r.id) for r in chain.residues())
            residues = [None]*(maxResidue-minResidue+1)
            for r in chain.residues():
                residues[int(r.id)-minResidue] = r.name
            chainWithGaps[chain] = residues

the key used for residues int(r.id)-minResidue is not the correct index into SEQRES when the structure has insertion codes or when jumps in residue numbers are intended. For example, this code won't work for any antibody structures where the complementarity determining regions (CDRs) all have insertion codes. Take for example 1hzh, one of the PDB-101 Molecule of the Month structures.

In its chain K, it has many missing residues:

REMARK 465                                                                      
REMARK 465 MISSING RESIDUES                                                     
REMARK 465 THE FOLLOWING RESIDUES WERE NOT LOCATED IN THE                       
REMARK 465 EXPERIMENT. (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN               
REMARK 465 IDENTIFIER; SSSEQ=SEQUENCE NUMBER; I=INSERTION CODE.)                
REMARK 465                                                                      
REMARK 465   M RES C SSSEQI                                                     
REMARK 465     SER K   130                                                      
REMARK 465     LYS K   131                                                      
REMARK 465     SER K   132                                                      
REMARK 465     THR K   133                                                      
REMARK 465     SER K   134                                                      
REMARK 465     GLY K   135                                                      
REMARK 465     GLY K   136                                                      
REMARK 465     THR K   236                                                      
REMARK 465     HIS K   237                                                      
REMARK 465     THR K   238                                                      
REMARK 465     PRO K   476                                                      
REMARK 465     GLY K   477                                                      
REMARK 465     LYS K   478

ATOM   4662  OG  SER K 127      83.919 104.754 131.307  1.00132.00           O  
ATOM   4663  N   THR K 137      74.196 104.907 127.378  1.00142.90           N

ATOM   5282  NZ  LYS K 235      76.314  95.310 146.125  1.00150.47           N  
ATOM   5283  N   CYS K 239      69.797 106.841 141.221  1.00162.32           N

ATOM   7020  OG  SER K 475      12.250 142.463 112.137  1.00198.77           O  
TER    7021      SER K 475

but findMissingResidues doesn't find any of them.

I wouldn't know for sure until trying, but it seems like addressing this could work like this if the PDB has a REMARK 465 section: 1) start saving the REMARK 465 section of the PDB file when read in 2) the number of residues in the structure + number of missing residues should be equal to the length of SEQRES 3) insert None into the structure's sequence at the index specified by SSSEQ. At least in this example, it looks like SSSEQ is indexed in residue number space. Perhaps we must assume sane residue numbering near these gaps. In this example, it looks like we'll still be missing residues K128 and K129. 4) it looks like SSSEQ and the provided insertion code are supposed to be used in the new residues. 5) compare the structure's sequence with SEQRES and throw an exception or at least a warning if they don't match.

If it doesn't have a REMARK 465 section, then at the very least, findMissingResidues should be refactored to use r.index (i.e., the index in the chain) rather than r.id.

nitroamos commented 6 years ago

Ooops, just saw someone else wrote up something similar https://github.com/pandegroup/pdbfixer/issues/154

peastman commented 6 years ago

Thanks, that's a good writeup. There's one more important complication to add: chain IDs are usually not unique. Pick any PDB file, and you'll probably find it has more than one chain with the same ID. So you can't rely on them for matching up ATOM records to SEQRES records or REMARKs. Currently it does a sequence alignment to try to guess which chain each record actually corresponds to.

nitroamos commented 6 years ago

Hi @peastman , here's my implementation. It looks like the easiest approach for me was just to modify pdbfixer.py like this:

572c541
<     def findMissingResidues(self,chainWithGapsOverride=None):
---
>     def findMissingResidues(self):
593,596d557
<             if chainWithGapsOverride and chain.id in chainWithGapsOverride:
<                 chainWithGaps[chain] = chainWithGapsOverride[chain.id]
<                 continue

so that basically, I could substitute my own value for chainWithGaps with PDBFixer's. I compute it in my own code like this:

def getChainWithGapsFromRemark465(filename,fixer,summary):
   if not filename.endswith(".pdb"):
      return None
   remark465 = None
   with open(filename,'r') as f:
      for l in f:
         if not l.startswith("REMARK 465"): continue
         if "M RES C SSSEQI" in l:
            remark465 = {}
            continue
         if remark465 is None: continue
         m = l[13]
         res = l[15:18]
         c = l[19]
         ssseq = l[21:26]
         i = l[26]
         #print(l,end='')
         #print(["%s"%e for e in [m,res,c,ssseq,i]])
         if c not in remark465:
            remark465[c] = []
         remark465[c].append({
            "res":res,
            "ssseq":int(ssseq),
            "i":i
         })

   if not remark465:
      summary["usingRemark465"] = False
      return None

   summary["usingRemark465"] = True
   chains = [c for c in fixer.topology.chains() if len(list(c.residues())) > 0]
   chainWithGaps = {}

   for chain in chains:
      sequence = None
      for s in fixer.sequences:
         if chain.id == s.chainId:
            sequence = s
            break

      if sequence and chain.id in remark465:
         missing = remark465[chain.id]
         summary[chain.id] = {}
         summary[chain.id]["numMissing"] = len(missing)
         summary[chain.id]["numResidues"] = len([r for r in chain.residues()])

         print("numMissing:",len(missing),"numRes:",len([r for r in chain.residues()]))
         idx = 0
         residues = []
         for r in chain.residues():
            hasIC = r.insertionCode not in ['',None]
            while idx < len(missing) and missing[idx]["ssseq"] < int(r.id) and \
                  (not hasIC or missing[idx]["i"] <= r.insertionCode):
               print("missing:",idx,missing[idx])
               residues.append(None)
               idx += 1
            print("residue:",r.id,r.name,"'%s'"%r.insertionCode,r)
            if idx < len(missing) and \
               missing[idx]["ssseq"] == int(r.id) and (not hasIC or missing[idx]["i"] == r.insertionCode):
               #residue isn't actually missing
               idx += 1
            residues.append(r.name)
         residues.extend([None]*(len(missing)-idx))

         removeRestypes = []
         for restype in set(residues):
            if restype is None: continue
            if restype not in sequence.residues:
               removeRestypes.append(restype)
         print("Removing restypes",removeRestypes,"not found in SEQRES (assuming they're ligands)")
         residues = [res for res in residues if res not in removeRestypes]

         if len(residues) == len(sequence.residues):
            chainWithGaps[chain.id] = residues
         else:
            summary["hasWarning"] = True
            print("Length mismatch ATOM+REMARK 465 vs SEQRES! %i vs %i"%(len(residues),len(sequence.residues)))

         print("chainWithGaps:",chain,len(chainWithGaps[chain.id]),len(sequence.residues),chainWithGaps[chain.id])
      else:
         chainWithGaps[chain.id] = []

   return chainWithGaps

It makes the following assumptions:

I didn't work out a way to name the new residues with the residues based on the resnum and insertion code provided by REMARK 465. That would be nice since in one example I tested with, the missing residues include a jump in resnums.
It assumes that the missing residues can be placed by comparing resnums. This assumes that resnums are monotonic in both and that the merged list of resnums is also monotonic.
If a restype in the chain is not in SEQRES, then it shouldn't be present during the alignment in findMissingResidues. E.g., ACE/NME residues. I don't remember but based on my comment it seems like I was getting ligands mixed in with my residues as well.
It allows repeat calls by not assuming everything in REMARK 465 is actually missing. In my case, I use this to build missing residues at one point in my workflow and in a later point use this same code to add my ACE/NME caps.

If you'd support it, I could prepare this as a Pull Request. Regardless, if you're interested at least in some suggestions:

It would be nice if OpenMM preserved PDB headers somewhere so they could be referenced for purposes like this.
If you don't want PDBFixer to read REMARK 465, you might at least consider making the change I did to findMissingResidues to enable users to use different assumptions that PDBFixer's implementation.

nitroamos commented 5 years ago

Just found another bug. findMissingResidues assumes here that SEQRES includes everything the PDB does, which isn't usually the case for waters/ligands.

zhang-ivy commented 5 years ago

I am also experiencing issues with how PDBFixer's findMissingResidues handles inserted residues in antibodies. I am working with 5UDC (https://www.rcsb.org/structure/5UDC), and findMissingResidues fails to return missing residues in chains with inserted residues.

For example, chain H has the following missing residues:

REMARK 465     GLN H     1                                                      
REMARK 465     SER H   128                                                      
REMARK 465     LYS H   129                                                      
REMARK 465     SER H   130                                                      
REMARK 465     THR H   131                                                      
REMARK 465     SER H   132                                                      
REMARK 465     GLY H   133                                                      
REMARK 465     GLU H   212                                                      
REMARK 465     PRO H   213                                                      
REMARK 465     LYS H   214                                                      
REMARK 465     SER H   215

But none of them are returned by findMissingResidues.

Can we fix this bug? Using @nitroamos's code, I am able to identify the correct missing residues, but a bit more work is needed for the missing residues to actually be added. Also, it doesn't seem like his changes #178 were ever merged.

@jchodera

peastman commented 5 years ago

Thanks! I can reproduce the problem. I'm looking into it now.

peastman commented 5 years ago

This looks to me like an error in the PDB file. Either that, or I don't understand what's going on. Here's the relevant series of residues:

ATOM    360  N   ILE H  51     -57.342  15.070 -55.184  1.00 92.79           N  
ANISOU  360  N   ILE H  51     7474  14291  13492   -662   1084  -4079       N  
ATOM    361  CA  ILE H  51     -57.391  15.591 -56.537  1.00 94.26           C  
ANISOU  361  CA  ILE H  51     7542  14987  13286   -833   1075  -4126       C  
ATOM    362  C   ILE H  51     -56.782  16.981 -56.528  1.00 94.34           C  
ANISOU  362  C   ILE H  51     7484  15201  13161   -826    981  -3727       C  
ATOM    363  O   ILE H  51     -56.991  17.756 -55.590  1.00 96.96           O  
ANISOU  363  O   ILE H  51     7891  15306  13644   -762    862  -3411       O  
ATOM    364  CB  ILE H  51     -58.842  15.613 -57.059  1.00 96.53           C  
ANISOU  364  CB  ILE H  51     7825  15433  13419  -1045    930  -4143       C  
ATOM    365  CG1 ILE H  51     -58.888  16.103 -58.502  1.00108.14           C  
ANISOU  365  CG1 ILE H  51     9198  17473  14417  -1282    861  -4148       C  
ATOM    366  CG2 ILE H  51     -59.736  16.433 -56.145  1.00 87.25           C  
ANISOU  366  CG2 ILE H  51     6662  14055  12436   -998    782  -3764       C  
ATOM    367  CD1 ILE H  51     -60.180  15.759 -59.183  1.00115.85           C  
ANISOU  367  CD1 ILE H  51    10151  18641  15227  -1544    692  -4253       C  
ATOM    368  N   ILE H  52     -56.001  17.289 -57.550  1.00 91.02           N  
ANISOU  368  N   ILE H  52     6940  15190  12452   -913   1064  -3757       N  
ATOM    369  CA  ILE H  52     -55.369  18.601 -57.613  1.00 96.97           C  
ANISOU  369  CA  ILE H  52     7621  16136  13088   -953    972  -3355       C  
ATOM    370  C   ILE H  52     -55.852  19.268 -58.893  1.00107.77           C  
ANISOU  370  C   ILE H  52     8925  18004  14017  -1206    878  -3224       C  
ATOM    371  O   ILE H  52     -55.136  19.263 -59.905  1.00114.67           O  
ANISOU  371  O   ILE H  52     9706  19301  14563  -1340   1021  -3319       O  
ATOM    372  CB  ILE H  52     -53.842  18.479 -57.576  1.00 96.28           C  
ANISOU  372  CB  ILE H  52     7395  16101  13087   -855   1146  -3388       C  
ATOM    373  CG1 ILE H  52     -53.433  17.543 -56.448  1.00 87.14           C  
ANISOU  373  CG1 ILE H  52     6274  14467  12367   -621   1198  -3530       C  
ATOM    374  CG2 ILE H  52     -53.232  19.829 -57.330  1.00 96.00           C  
ANISOU  374  CG2 ILE H  52     7311  16140  13024   -918   1002  -2934       C  
ATOM    375  CD1 ILE H  52     -51.993  17.274 -56.428  1.00 89.11           C  
ANISOU  375  CD1 ILE H  52     6306  14750  12801   -496   1352  -3555       C  
ATOM    376  N   PRO H  52A    -57.043  19.865 -58.886  1.00109.86           N  
ANISOU  376  N   PRO H  52A    9220  18252  14271  -1285    644  -2976       N  
ATOM    377  CA  PRO H  52A    -57.650  20.302 -60.151  1.00113.75           C  
ANISOU  377  CA  PRO H  52A    9638  19229  14353  -1555    483  -2831       C  
ATOM    378  C   PRO H  52A    -56.789  21.255 -60.948  1.00116.27           C  
ANISOU  378  C   PRO H  52A    9884  19950  14343  -1714    472  -2531       C  
ATOM    379  O   PRO H  52A    -56.854  21.242 -62.185  1.00120.73           O  
ANISOU  379  O   PRO H  52A   10418  21033  14423  -1990    447  -2548       O  
ATOM    380  CB  PRO H  52A    -58.951  20.972 -59.690  1.00113.15           C  
ANISOU  380  CB  PRO H  52A    9534  18935  14524  -1518    221  -2491       C  
ATOM    381  CG  PRO H  52A    -59.292  20.278 -58.435  1.00109.67           C  
ANISOU  381  CG  PRO H  52A    9186  17982  14502  -1297    336  -2707       C  
ATOM    382  CD  PRO H  52A    -57.972  20.008 -57.758  1.00109.59           C  
ANISOU  382  CD  PRO H  52A    9268  17766  14606  -1142    538  -2851       C  
ATOM    383  N   VAL H  53     -55.979  22.081 -60.283  1.00110.59           N  
ANISOU  383  N   VAL H  53     9154  19024  13840  -1597    483  -2248       N  
ATOM    384  CA  VAL H  53     -55.158  23.027 -61.031  1.00104.28           C  
ANISOU  384  CA  VAL H  53     8270  18595  12756  -1788    468  -1915       C  
ATOM    385  C   VAL H  53     -54.109  22.305 -61.857  1.00105.93           C  
ANISOU  385  C   VAL H  53     8390  19227  12630  -1896    790  -2251       C  
ATOM    386  O   VAL H  53     -53.665  22.823 -62.884  1.00111.98           O  
ANISOU  386  O   VAL H  53     9087  20488  12971  -2156    829  -2062       O  
ATOM    387  CB  VAL H  53     -54.534  24.075 -60.092  1.00100.97           C  
ANISOU  387  CB  VAL H  53     7872  17818  12673  -1685    387  -1551       C  
ATOM    388  CG1 VAL H  53     -53.433  23.469 -59.278  1.00 97.99           C  
ANISOU  388  CG1 VAL H  53     7469  17223  12538  -1518    591  -1802       C  
ATOM    389  CG2 VAL H  53     -54.001  25.245 -60.898  1.00 95.05           C  
ANISOU  389  CG2 VAL H  53     7042  17404  11667  -1936    288  -1086       C

Notice there are two alternatives for residue 52. So the sequence of this span could be either ILE ILE VAL or else ILE PRO VAL. Now look at the corresponding SEQRES records. These residues span the end of line 4 and the start of line 5:

SEQRES   4 H  228  ALA PRO GLY GLN GLY PRO GLU TRP MET GLY GLY ILE ILE          
SEQRES   5 H  228  PRO VAL LEU GLY THR VAL HIS TYR GLY PRO LYS PHE GLN

Here it lists the sequence as ILE ILE PRO VAL. It includes both alternatives for residue 52 as if they were two separate residues in the sequence. This leads PDBFixer to conclude the sequences don't match and it doesn't try to add missing residues for that chain.

zhang-ivy commented 5 years ago

Thanks for looking into this! I was interpreting 52A as an insertion, not an alternative to residue 52. Would it be possible to implement changes in PDBFixer that handles insertion codes better, i.e. renumber the residues if insertion codes exist and then output a dictionary mapping old to new residue numbers? Or instead of a dictionary, just returning a PDB with the same residue numbers as the original but with missing residues added?

peastman commented 5 years ago

Ok, I see. So given the set of residues for which there are ATOM records, and the residue indices and insertion codes for them, how do I determine the index into the SEQRES sequence that each one corresponds to? The documentation is very unclear about it. I assume the correct algorithm is something like this: for every residue that has an insertion code, add 1 to the index of that residue and every other residue that comes after it?

zhang-ivy commented 5 years ago

Yes, I think you would add 1 to the index of the residue with the insertion code and every other residue after it, but I think you also need to maintain the gaps in subsequent residue numbers.

I used the following code to renumber the residues in mdtraj before loading into PDBFixer:

for chain in traj.topology.chains:
    previous_res_cur = 0
    previous_res_old = 0
    for res in chain.residues:
        res_old = res.resSeq
        if res.resSeq == previous_res_cur:
            res.resSeq += 1
        elif res.resSeq == previous_res_old:
            res.resSeq = previous_res_cur + 1
        elif res.resSeq < previous_res_cur:
            res.resSeq = previous_res_cur + (res_old - previous_res_old)
        previous_res_cur = res.resSeq
        previous_res_old = res_old

peastman commented 5 years ago

@ivyzhang999 the fix for your issue is in #184. Could you try it and make sure it works for you?

zhang-ivy commented 5 years ago

@peastman The fix works for me, thanks!

peastman commented 5 years ago

Thanks! I've merged it.

openmm / pdbfixer

More rigorous implementation of findMissingResidues needed (e.g., doesn't work for antibodies) #170