prody / ProDy

A Python Package for Protein Dynamics Analysis
http://prody.csb.pitt.edu
Other
427 stars 155 forks source link

Failure to correctly parse all ligand chains in complex PDB file #1840

Closed amorehead closed 6 months ago

amorehead commented 7 months ago

Hello.

It seems that with Prody version 2.4.1 (same behavior with the latest changes in master) not all unique ligand chains can be parsed for the attached complex (protein-ligand-nucleic acid) PDB file using the following code.

pdb = parsePDB(complex_filepath)
protein = pdb.select("protein")
ligand = pdb.select("not (protein or nucleic or water)")

In particular, chain M amongst the ligand chains is ignored by this parsing and selection logic. Would you happen to know what might be causing this parsing failure? I've tried viewing ligand.getChids() as well as list(ligand.getHierView().iterChains()), but chain M always seems to be missing. I also tried calling ligand.setSegnames(ligand.getChids()), but this didn't solve the issue either. Might this be related to some kind of misalignment between segment IDs and chain IDs (even though I can't spot such a misalignment)?

Complex.pdb.txt

jamesmkrieger commented 7 months ago

Can you give us a pdb id to check this? Is it definitely a pdb file or could it be an mmcif?

there’s an option unite_chains for parsing mmcif that you could try and see if that helps

amorehead commented 6 months ago

It's definitely a PDB file. This complex.pdb file corresponds to the protein-ligand-nucleic acid complex structure which is denoted as target H1171v1 in CASP15 (and corresponds to PDB ID: 7PBL (https://www.rcsb.org/structure/7pbl)).

jamesmkrieger commented 6 months ago

ok, thanks

amorehead commented 6 months ago

Note that, using BioPython, I was able to parse all of the six ligand chains successfully. With ProDy, it's possible that chain M having a different residue name from all the other ligand chain's residue names is somehow throwing off ProDy.

jamesmkrieger commented 6 months ago

Can you provide the file please? I don't get a chain M when parsing with the pdb id and displaying the pdb file on the rcsb pdb doesn't show one either.

jamesmkrieger commented 6 months ago

I also don't get a chain M when using Biopython from the file downloaded from the pdb by prody:

In [15]: ag = parsePDB('7PBL', compressed=False)

In [16]: struct = PDBParser().get_structure('7pbl', file='7pbl.pdb')
/home/jkrieger/software/miniconda/envs/scipion3/lib/python3.8/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 32028.
  warnings.warn(
/home/jkrieger/software/miniconda/envs/scipion3/lib/python3.8/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 32074.
  warnings.warn(
/home/jkrieger/software/miniconda/envs/scipion3/lib/python3.8/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 32120.
  warnings.warn(
/home/jkrieger/software/miniconda/envs/scipion3/lib/python3.8/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain D is discontinuous at line 32166.
  warnings.warn(
/home/jkrieger/software/miniconda/envs/scipion3/lib/python3.8/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain E is discontinuous at line 32205.
  warnings.warn(
/home/jkrieger/software/miniconda/envs/scipion3/lib/python3.8/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain F is discontinuous at line 32244.
  warnings.warn(

In [17]: struct.get_chains()
Out[17]: <generator object Structure.get_chains at 0x7efcc6133350>

In [18]: list(struct.get_chains())
Out[18]: 
[<Chain id=A>,
 <Chain id=B>,
 <Chain id=C>,
 <Chain id=D>,
 <Chain id=E>,
 <Chain id=F>,
 <Chain id=G>,
 <Chain id=U>,
 <Chain id=V>]
amorehead commented 6 months ago

Apologies, I should clarify, the file was uploaded in my initial message (i.e., here: https://github.com/prody/ProDy/files/14579373/Complex.pdb.txt). You can just rename it to Complex.pdb.

jamesmkrieger commented 6 months ago

Thank you. Sorry I missed it

jamesmkrieger commented 6 months ago

ok, I think the problem is that the ligand in chain M is ADP and prody is treating that as nucleic. You can select nucleoside separately to get it back:

In [2]: pdb = parsePDB('complex.pdb')

In [3]: ligand = pdb.select("not (protein or nucleic or water) or nucleoside")

In [4]: list(ligand.getHierView())
Out[4]: 
[<Chain: I from Segment H from complex (1 residues, 45 atoms)>,
 <Chain: J from Segment I from complex (1 residues, 45 atoms)>,
 <Chain: K from Segment J from complex (1 residues, 45 atoms)>,
 <Chain: L from Segment K from complex (1 residues, 45 atoms)>,
 <Chain: M from Segment L from complex (1 residues, 39 atoms)>,
 <Chain: N from Segment M from complex (1 residues, 45 atoms)>,
 <Chain: O from Segment N from complex (1 residues, 1 atoms)>,
 <Chain: O from Segment O from complex (1 residues, 1 atoms)>,
 <Chain: O from Segment P from complex (1 residues, 1 atoms)>,
 <Chain: O from Segment Q from complex (1 residues, 1 atoms)>,
 <Chain: O from Segment R from complex (1 residues, 1 atoms)>]

I'll make a pull request adding a new nucleic_poly flag to make this easier. I also noticed we were missing a few nucleosides so I'll add them too.

jamesmkrieger commented 6 months ago

Oh, it also looks like ProDy is completely wrong/confusing about what nucleosides and nucleotides are. It uses nucleoside for "nucleoside derivatives", which should then be nucleotides. Then it uses nucleotide for those incorporated into DNA or RNA.

In that case, you can use not nucleotide instead of not nucleic to get the behaviour you want:

In [1]: from prody import *

In [2]: pdb = parsePDB('complex.pdb')

In [3]: ligand = pdb.select("not (protein or nucleotide or water)")

In [4]: list(ligand.getHierView())
Out[4]: 
[<Chain: I from Segment H from complex (1 residues, 45 atoms)>,
 <Chain: J from Segment I from complex (1 residues, 45 atoms)>,
 <Chain: K from Segment J from complex (1 residues, 45 atoms)>,
 <Chain: L from Segment K from complex (1 residues, 45 atoms)>,
 <Chain: M from Segment L from complex (1 residues, 39 atoms)>,
 <Chain: N from Segment M from complex (1 residues, 45 atoms)>,
 <Chain: O from Segment N from complex (1 residues, 1 atoms)>,
 <Chain: O from Segment O from complex (1 residues, 1 atoms)>,
 <Chain: O from Segment P from complex (1 residues, 1 atoms)>,
 <Chain: O from Segment Q from complex (1 residues, 1 atoms)>,
 <Chain: O from Segment R from complex (1 residues, 1 atoms)>]

In [5]: set(ligand.getResnames())
Out[5]: {'ADP', 'AGS', 'MG'}
jamesmkrieger commented 6 months ago

It's also worth noting that protein points to any amino acid and not just those incorporated in polymers, and thus causes problems with ligand selection too.

amorehead commented 6 months ago

Thanks @jamesmkrieger for figuring this out! Is there anything else you need from me on my end to resolve this issue?

jamesmkrieger commented 6 months ago

You’re welcome. I don’t think so.