Open nitroamos opened 6 years ago
Ooops, just saw someone else wrote up something similar https://github.com/pandegroup/pdbfixer/issues/154
Thanks, that's a good writeup. There's one more important complication to add: chain IDs are usually not unique. Pick any PDB file, and you'll probably find it has more than one chain with the same ID. So you can't rely on them for matching up ATOM records to SEQRES records or REMARKs. Currently it does a sequence alignment to try to guess which chain each record actually corresponds to.
Hi @peastman , here's my implementation. It looks like the easiest approach for me was just to modify pdbfixer.py
like this:
572c541
< def findMissingResidues(self,chainWithGapsOverride=None):
---
> def findMissingResidues(self):
593,596d557
< if chainWithGapsOverride and chain.id in chainWithGapsOverride:
< chainWithGaps[chain] = chainWithGapsOverride[chain.id]
< continue
so that basically, I could substitute my own value for chainWithGaps
with PDBFixer's. I compute it in my own code like this:
def getChainWithGapsFromRemark465(filename,fixer,summary):
if not filename.endswith(".pdb"):
return None
remark465 = None
with open(filename,'r') as f:
for l in f:
if not l.startswith("REMARK 465"): continue
if "M RES C SSSEQI" in l:
remark465 = {}
continue
if remark465 is None: continue
m = l[13]
res = l[15:18]
c = l[19]
ssseq = l[21:26]
i = l[26]
#print(l,end='')
#print(["%s"%e for e in [m,res,c,ssseq,i]])
if c not in remark465:
remark465[c] = []
remark465[c].append({
"res":res,
"ssseq":int(ssseq),
"i":i
})
if not remark465:
summary["usingRemark465"] = False
return None
summary["usingRemark465"] = True
chains = [c for c in fixer.topology.chains() if len(list(c.residues())) > 0]
chainWithGaps = {}
for chain in chains:
sequence = None
for s in fixer.sequences:
if chain.id == s.chainId:
sequence = s
break
if sequence and chain.id in remark465:
missing = remark465[chain.id]
summary[chain.id] = {}
summary[chain.id]["numMissing"] = len(missing)
summary[chain.id]["numResidues"] = len([r for r in chain.residues()])
print("numMissing:",len(missing),"numRes:",len([r for r in chain.residues()]))
idx = 0
residues = []
for r in chain.residues():
hasIC = r.insertionCode not in ['',None]
while idx < len(missing) and missing[idx]["ssseq"] < int(r.id) and \
(not hasIC or missing[idx]["i"] <= r.insertionCode):
print("missing:",idx,missing[idx])
residues.append(None)
idx += 1
print("residue:",r.id,r.name,"'%s'"%r.insertionCode,r)
if idx < len(missing) and \
missing[idx]["ssseq"] == int(r.id) and (not hasIC or missing[idx]["i"] == r.insertionCode):
#residue isn't actually missing
idx += 1
residues.append(r.name)
residues.extend([None]*(len(missing)-idx))
removeRestypes = []
for restype in set(residues):
if restype is None: continue
if restype not in sequence.residues:
removeRestypes.append(restype)
print("Removing restypes",removeRestypes,"not found in SEQRES (assuming they're ligands)")
residues = [res for res in residues if res not in removeRestypes]
if len(residues) == len(sequence.residues):
chainWithGaps[chain.id] = residues
else:
summary["hasWarning"] = True
print("Length mismatch ATOM+REMARK 465 vs SEQRES! %i vs %i"%(len(residues),len(sequence.residues)))
print("chainWithGaps:",chain,len(chainWithGaps[chain.id]),len(sequence.residues),chainWithGaps[chain.id])
else:
chainWithGaps[chain.id] = []
return chainWithGaps
It makes the following assumptions:
findMissingResidues
. E.g., ACE/NME residues. I don't remember but based on my comment it seems like I was getting ligands mixed in with my residues as well.If you'd support it, I could prepare this as a Pull Request. Regardless, if you're interested at least in some suggestions:
findMissingResidues
to enable users to use different assumptions that PDBFixer's implementation.Just found another bug. findMissingResidues
assumes here that SEQRES
includes everything the PDB does, which isn't usually the case for waters/ligands.
I am also experiencing issues with how PDBFixer's findMissingResidues
handles inserted residues in antibodies. I am working with 5UDC (https://www.rcsb.org/structure/5UDC), and findMissingResidues
fails to return missing residues in chains with inserted residues.
For example, chain H has the following missing residues:
REMARK 465 GLN H 1
REMARK 465 SER H 128
REMARK 465 LYS H 129
REMARK 465 SER H 130
REMARK 465 THR H 131
REMARK 465 SER H 132
REMARK 465 GLY H 133
REMARK 465 GLU H 212
REMARK 465 PRO H 213
REMARK 465 LYS H 214
REMARK 465 SER H 215
But none of them are returned by findMissingResidues
.
Can we fix this bug? Using @nitroamos's code, I am able to identify the correct missing residues, but a bit more work is needed for the missing residues to actually be added. Also, it doesn't seem like his changes #178 were ever merged.
@jchodera
Thanks! I can reproduce the problem. I'm looking into it now.
This looks to me like an error in the PDB file. Either that, or I don't understand what's going on. Here's the relevant series of residues:
ATOM 360 N ILE H 51 -57.342 15.070 -55.184 1.00 92.79 N
ANISOU 360 N ILE H 51 7474 14291 13492 -662 1084 -4079 N
ATOM 361 CA ILE H 51 -57.391 15.591 -56.537 1.00 94.26 C
ANISOU 361 CA ILE H 51 7542 14987 13286 -833 1075 -4126 C
ATOM 362 C ILE H 51 -56.782 16.981 -56.528 1.00 94.34 C
ANISOU 362 C ILE H 51 7484 15201 13161 -826 981 -3727 C
ATOM 363 O ILE H 51 -56.991 17.756 -55.590 1.00 96.96 O
ANISOU 363 O ILE H 51 7891 15306 13644 -762 862 -3411 O
ATOM 364 CB ILE H 51 -58.842 15.613 -57.059 1.00 96.53 C
ANISOU 364 CB ILE H 51 7825 15433 13419 -1045 930 -4143 C
ATOM 365 CG1 ILE H 51 -58.888 16.103 -58.502 1.00108.14 C
ANISOU 365 CG1 ILE H 51 9198 17473 14417 -1282 861 -4148 C
ATOM 366 CG2 ILE H 51 -59.736 16.433 -56.145 1.00 87.25 C
ANISOU 366 CG2 ILE H 51 6662 14055 12436 -998 782 -3764 C
ATOM 367 CD1 ILE H 51 -60.180 15.759 -59.183 1.00115.85 C
ANISOU 367 CD1 ILE H 51 10151 18641 15227 -1544 692 -4253 C
ATOM 368 N ILE H 52 -56.001 17.289 -57.550 1.00 91.02 N
ANISOU 368 N ILE H 52 6940 15190 12452 -913 1064 -3757 N
ATOM 369 CA ILE H 52 -55.369 18.601 -57.613 1.00 96.97 C
ANISOU 369 CA ILE H 52 7621 16136 13088 -953 972 -3355 C
ATOM 370 C ILE H 52 -55.852 19.268 -58.893 1.00107.77 C
ANISOU 370 C ILE H 52 8925 18004 14017 -1206 878 -3224 C
ATOM 371 O ILE H 52 -55.136 19.263 -59.905 1.00114.67 O
ANISOU 371 O ILE H 52 9706 19301 14563 -1340 1021 -3319 O
ATOM 372 CB ILE H 52 -53.842 18.479 -57.576 1.00 96.28 C
ANISOU 372 CB ILE H 52 7395 16101 13087 -855 1146 -3388 C
ATOM 373 CG1 ILE H 52 -53.433 17.543 -56.448 1.00 87.14 C
ANISOU 373 CG1 ILE H 52 6274 14467 12367 -621 1198 -3530 C
ATOM 374 CG2 ILE H 52 -53.232 19.829 -57.330 1.00 96.00 C
ANISOU 374 CG2 ILE H 52 7311 16140 13024 -918 1002 -2934 C
ATOM 375 CD1 ILE H 52 -51.993 17.274 -56.428 1.00 89.11 C
ANISOU 375 CD1 ILE H 52 6306 14750 12801 -496 1352 -3555 C
ATOM 376 N PRO H 52A -57.043 19.865 -58.886 1.00109.86 N
ANISOU 376 N PRO H 52A 9220 18252 14271 -1285 644 -2976 N
ATOM 377 CA PRO H 52A -57.650 20.302 -60.151 1.00113.75 C
ANISOU 377 CA PRO H 52A 9638 19229 14353 -1555 483 -2831 C
ATOM 378 C PRO H 52A -56.789 21.255 -60.948 1.00116.27 C
ANISOU 378 C PRO H 52A 9884 19950 14343 -1714 472 -2531 C
ATOM 379 O PRO H 52A -56.854 21.242 -62.185 1.00120.73 O
ANISOU 379 O PRO H 52A 10418 21033 14423 -1990 447 -2548 O
ATOM 380 CB PRO H 52A -58.951 20.972 -59.690 1.00113.15 C
ANISOU 380 CB PRO H 52A 9534 18935 14524 -1518 221 -2491 C
ATOM 381 CG PRO H 52A -59.292 20.278 -58.435 1.00109.67 C
ANISOU 381 CG PRO H 52A 9186 17982 14502 -1297 336 -2707 C
ATOM 382 CD PRO H 52A -57.972 20.008 -57.758 1.00109.59 C
ANISOU 382 CD PRO H 52A 9268 17766 14606 -1142 538 -2851 C
ATOM 383 N VAL H 53 -55.979 22.081 -60.283 1.00110.59 N
ANISOU 383 N VAL H 53 9154 19024 13840 -1597 483 -2248 N
ATOM 384 CA VAL H 53 -55.158 23.027 -61.031 1.00104.28 C
ANISOU 384 CA VAL H 53 8270 18595 12756 -1788 468 -1915 C
ATOM 385 C VAL H 53 -54.109 22.305 -61.857 1.00105.93 C
ANISOU 385 C VAL H 53 8390 19227 12630 -1896 790 -2251 C
ATOM 386 O VAL H 53 -53.665 22.823 -62.884 1.00111.98 O
ANISOU 386 O VAL H 53 9087 20488 12971 -2156 829 -2062 O
ATOM 387 CB VAL H 53 -54.534 24.075 -60.092 1.00100.97 C
ANISOU 387 CB VAL H 53 7872 17818 12673 -1685 387 -1551 C
ATOM 388 CG1 VAL H 53 -53.433 23.469 -59.278 1.00 97.99 C
ANISOU 388 CG1 VAL H 53 7469 17223 12538 -1518 591 -1802 C
ATOM 389 CG2 VAL H 53 -54.001 25.245 -60.898 1.00 95.05 C
ANISOU 389 CG2 VAL H 53 7042 17404 11667 -1936 288 -1086 C
Notice there are two alternatives for residue 52. So the sequence of this span could be either ILE ILE VAL or else ILE PRO VAL. Now look at the corresponding SEQRES records. These residues span the end of line 4 and the start of line 5:
SEQRES 4 H 228 ALA PRO GLY GLN GLY PRO GLU TRP MET GLY GLY ILE ILE
SEQRES 5 H 228 PRO VAL LEU GLY THR VAL HIS TYR GLY PRO LYS PHE GLN
Here it lists the sequence as ILE ILE PRO VAL. It includes both alternatives for residue 52 as if they were two separate residues in the sequence. This leads PDBFixer to conclude the sequences don't match and it doesn't try to add missing residues for that chain.
Thanks for looking into this! I was interpreting 52A as an insertion, not an alternative to residue 52. Would it be possible to implement changes in PDBFixer that handles insertion codes better, i.e. renumber the residues if insertion codes exist and then output a dictionary mapping old to new residue numbers? Or instead of a dictionary, just returning a PDB with the same residue numbers as the original but with missing residues added?
Ok, I see. So given the set of residues for which there are ATOM records, and the residue indices and insertion codes for them, how do I determine the index into the SEQRES sequence that each one corresponds to? The documentation is very unclear about it. I assume the correct algorithm is something like this: for every residue that has an insertion code, add 1 to the index of that residue and every other residue that comes after it?
Yes, I think you would add 1 to the index of the residue with the insertion code and every other residue after it, but I think you also need to maintain the gaps in subsequent residue numbers.
I used the following code to renumber the residues in mdtraj before loading into PDBFixer:
for chain in traj.topology.chains:
previous_res_cur = 0
previous_res_old = 0
for res in chain.residues:
res_old = res.resSeq
if res.resSeq == previous_res_cur:
res.resSeq += 1
elif res.resSeq == previous_res_old:
res.resSeq = previous_res_cur + 1
elif res.resSeq < previous_res_cur:
res.resSeq = previous_res_cur + (res_old - previous_res_old)
previous_res_cur = res.resSeq
previous_res_old = res_old
@ivyzhang999 the fix for your issue is in #184. Could you try it and make sure it works for you?
@peastman The fix works for me, thanks!
Thanks! I've merged it.
While writing up my response to https://github.com/pandegroup/openmm/issues/2087 I went poking around in
findMissingResidues
to figure out how it's been implemented, and I found a problem.First, to recap that discussion: 1) You can't rely on residue numbers alone to determine where gaps in the structures are. Residues numbers are often chosen on the basis of an alignment with a family of proteins and as such can jump up or down. If you want to take an approach like this, you should use
r.index
instead ofr.id
. Example.2) You can't rely on
SEQRES
to find gaps in the structure. For example, ifSEQRES
says the sequence isAAA
and the coordinate section saysAA
, there are 3 places you could put the missing A.3) I'm not a PDB historian so I don't know why it's in a REMARK section, but REMARK 465 is supposed to be where you look to find where the missing residues are and given the template they provide, it is intended to be machine readable.
I went looking in the code for
findMissingResidues
to understand it. As far as I can tell, it makes these assumptions: 1) it does not useREMARK 465
. This means it could be fooled bySEQRES
resulting in misthreaded structures. 2) By not usingREMARK 465
, it doesn't know if the missing residues are already named (i.e., id and insertionCode) 3) it assumes residue numbers are sequential 4) It doesn't provide any warnings if something looks suspicious.Take a look at this line:
the key used for residues
int(r.id)-minResidue
is not the correct index intoSEQRES
when the structure has insertion codes or when jumps in residue numbers are intended. For example, this code won't work for any antibody structures where the complementarity determining regions (CDRs) all have insertion codes. Take for example 1hzh, one of the PDB-101 Molecule of the Month structures.In its chain K, it has many missing residues:
but
findMissingResidues
doesn't find any of them.I wouldn't know for sure until trying, but it seems like addressing this could work like this if the PDB has a
REMARK 465
section: 1) start saving theREMARK 465
section of the PDB file when read in 2) the number of residues in the structure + number of missing residues should be equal to the length ofSEQRES
3) insert None into the structure's sequence at the index specified bySSSEQ
. At least in this example, it looks likeSSSEQ
is indexed in residue number space. Perhaps we must assume sane residue numbering near these gaps. In this example, it looks like we'll still be missing residues K128 and K129. 4) it looks likeSSSEQ
and the provided insertion code are supposed to be used in the new residues. 5) compare the structure's sequence withSEQRES
and throw an exception or at least a warning if they don't match.If it doesn't have a
REMARK 465
section, then at the very least,findMissingResidues
should be refactored to user.index
(i.e., the index in the chain) rather thanr.id
.