prody / ProDy

A Python Package for Protein Dynamics Analysis
http://prody.csb.pitt.edu
Other
421 stars 154 forks source link

parsePDB parses extra residues on 5FXN chain A #1045

Closed jonathanking closed 4 years ago

jonathanking commented 4 years ago

RCSB sequence page for 5FXN

According to the above webpage, the end of the sequence of 5FXN should be '...VGVK', where 'K' is not present in the structure. However, ProDy mistakenly parses '...VGVVK' from the file. This can be explained because there are extra 'V' and 'K' residues on different chains and segments beyond the end of the sequence. These extra residues are unfortunately labeled as being chain A in the file, even though a TER record separates them like so: '...VGV [TER] VK' and the last two residues are located very far away.

If you open up this file in PyMol, it may be more clear what is going on. I'm not sure if this is something that should be solved by ProDy, or if the fault lies with the creator of the PDB file.

Either way, it may be possible to fix if ProDy can effectively recognize and handle this in the same way that PyMol is able to recognize that the final 'VK' does not belong on the same chain.

jamesmkrieger commented 4 years ago

The way to deal with this is to select the part that you want and only get the sequence for that e.g. ag.ca.select('resnum 1 to 315').getSequence(). You can see which resnums you need by using ag.ca.getResnums() first and in this case it ends with 312, 313, 314, 315, 1318, 1319])

ProDy is only going to provide the information that is in the columns of the PDB file, which in this case aren't helpful enough. If we start trying to be more clever, we risk doing things that other users aren't expecting so there are no plans to do that.

One other thing you could do in this case is parse the mmCIF file instead, which has segments in it that PyMOL and ProDy can recognise and segment A is the part that you want:

In [1]: from prody import * 
   ...: import numpy as np 
   ...: import matplotlib.pyplot as plt 
   ...: plt.ion()                                                               

In [2]: ag = parseCIF('5fxn')                                                   
@> 2719 atoms and 1 coordinate set(s) were parsed in 0.07s.

In [3]: list(ag.getHierView())                                                  
Out[3]: 
[<Chain: A from Segment A from 5fxn (315 residues, 2414 atoms)>,
 <Chain: A from Segment B from 5fxn (1 residues, 7 atoms)>,
 <Chain: A from Segment C from 5fxn (1 residues, 10 atoms)>,
 <Chain: A from Segment D from 5fxn (1 residues, 1 atoms)>,
 <Chain: A from Segment E from 5fxn (1 residues, 1 atoms)>,
 <Chain: A from Segment F from 5fxn (1 residues, 1 atoms)>,
 <Chain: A from Segment G from 5fxn (1 residues, 1 atoms)>,
 <Chain: A from Segment H from 5fxn (1 residues, 1 atoms)>,
 <Chain: A from Segment I from 5fxn (283 residues, 283 atoms)>]

In [4]: ag['A'].ca.getSequence()                                                
Out[4]: 'TGTSTVGVGRGVLGDQKNINTTYSTYYYLQDNTRGDGIFTYDAKYRTTLPGSLWADADNQFFASYDAPAVDAHYYAGVTYDYYKNVHNRLSYDGNNAAIRSSVHYSQGYNNAFWNGSEMVYGDGDGQTFIPLSGGIDVVAHELTHAVTDYTAGLIYQNESGAINEAISDIFGTLVEFYANKNPDWEIGEDVYTPGISGDSLRSMSDPAKYGDPDHYSKRYTGTQDNGGVHINSGIINKAAYLISQGGTHYGVSVVGIGRDKLGKIFYRALTQYLTPTSNFSQLRAAAVQSATDLYGSTSQEVASVKQAFDAVGV'
jonathanking commented 4 years ago

Thanks for the explanation, James. I totally understand that this is an edge case, and I'll use one of your suggestions instead.

jamesmkrieger commented 4 years ago

You're welcome