ihmwg / python-ihm

Python package for handling IHM mmCIF and BinaryCIF files
MIT License

Error while parsing sequence information in entry PDBDEV_00000030 #140

Closed aozalevsky closed 1 month ago

aozalevsky commented 1 month ago

python-ihm fails while retrieving sequence information from entry PDBDEV_00000030. Here is the error message:

File ~/miniconda3/envs/python_ihm_latest/lib/python3.12/site-packages/ihm/__init__.py:1295, in Residue._get_comp(self)
   1294 def _get_comp(self):
-> 1295     return self.entity.sequence[self.seq_id - 1]

IndexError: tuple index out of range

And the minimal code to reproduce:


import ihm
import ihm.reader
import ihm.restraint

fname = 'PDBDEV_00000030.cif'
encoding = 'utf-8'

with open(fname, encoding=encoding) as fh:
    system, = ihm.reader.read(fh)

# Iterate over all restraints datasets
for restr_ in system.restraints:
    # We are interested only in Chemical crosslinks
    if not isinstance(restr_, ihm.restraint.CrossLinkRestraint):
        continue

    # Iterate over all crosslinks in the dataset
    for xl in restr_.cross_links:

        # Get the corresponding experimental crosslink
        exl = xl.experimental_cross_link

        # Extract the residue name (component ID) of the first linked residue;
        # this is the line that raises the IndexError
        r1n = exl.residue1.comp.id

I traced the behavior back to version 0.38, when the residue.comp attribute was introduced.

benmwebb commented 1 month ago

This looks like incorrect data in the mmCIF file to me. The experimentally-identified cross link with ihm_cross_link_list.id=9 says that one end is at residue 182 in entity 2. But entity 2 contains only 174 residues. So an IndexError is the correct response here.
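To illustrate the point, here is a minimal sketch of the same lookup with the numbers from this entry. It does not use python-ihm itself: `StandInEntity`, `StandInResidue`, and `seq_id_in_range` are hypothetical names invented for this example, and only the `sequence[seq_id - 1]` indexing is taken from the traceback above.

```python
from dataclasses import dataclass


@dataclass
class StandInEntity:
    """Hypothetical stand-in for an entity; it only holds a sequence tuple."""
    sequence: tuple


@dataclass
class StandInResidue:
    """Hypothetical stand-in for a residue with a 1-based seq_id."""
    entity: StandInEntity
    seq_id: int

    def get_comp(self):
        # Same indexing as the line shown in the traceback
        return self.entity.sequence[self.seq_id - 1]


def seq_id_in_range(residue):
    """Return True if the 1-based seq_id falls inside the entity sequence."""
    return 1 <= residue.seq_id <= len(residue.entity.sequence)


# Entity 2 has 174 residues (placeholder names); the cross link points at 182
entity2 = StandInEntity(sequence=tuple(["ALA"] * 174))
good = StandInResidue(entity2, 174)
bad = StandInResidue(entity2, 182)

print(seq_id_in_range(good))  # True
print(seq_id_in_range(bad))   # False

try:
    bad.get_comp()
except IndexError as err:
    print("IndexError:", err)
```

An explicit range check like this also rejects a `seq_id` below 1, which plain tuple indexing would silently accept as a negative index.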

aozalevsky commented 1 month ago

@benmwebb Oh, sorry, I missed that. What would be your suggestion? Should it be fixed in the entry, or should I capture this in my code? Shouldn't there be a self-consistency data check during the deposition?

benmwebb commented 1 month ago

IMHO if there's incorrect data in the file it should be fixed in the entry by the authors or by PDB-Dev folks such as @brindakv. I thought this was checked at deposition but maybe it's easy to miss since the file still validates.
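Until the entry is corrected, calling code could skip the bad record and note it for reporting. The following is only a sketch of that idea: `comp_id_or_none` and `StandInResidue` are hypothetical helpers written for this example, and the stand-in class merely mimics the `comp` lookup from the traceback so that the snippet runs without the CIF file.

```python
from types import SimpleNamespace


class StandInResidue:
    """Hypothetical stand-in that mimics the comp lookup from the traceback."""

    def __init__(self, sequence, seq_id):
        self.entity = SimpleNamespace(sequence=sequence)
        self.seq_id = seq_id

    @property
    def comp(self):
        return self.entity.sequence[self.seq_id - 1]


def comp_id_or_none(residue):
    """Return residue.comp.id, or None if seq_id is past the sequence end."""
    try:
        return residue.comp.id
    except IndexError:
        return None


# A 174-residue sequence of placeholder components, as in entity 2
seq = tuple(SimpleNamespace(id="LYS") for _ in range(174))
residues = [StandInResidue(seq, 10), StandInResidue(seq, 182)]

names = [comp_id_or_none(r) for r in residues]
# Keep the seq_ids that could not be resolved so they can be reported
skipped = [r.seq_id for r, name in zip(residues, names) if name is None]

print(names)    # ['LYS', None]
print(skipped)  # [182]
```

In the reproduction script above, the same wrapper would be applied to `exl.residue1` in place of the direct `exl.residue1.comp.id` access. Catching the error only hides the symptom, so the collected `seq_id` values are worth passing on to whoever maintains the entry.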