ihmwg / python-ihm

Python package for handling IHM mmCIF and BinaryCIF files
MIT License

Better handle seq_id for waters, branched entities #130

Closed benmwebb closed 4 months ago

benmwebb commented 6 months ago

For any given atom or residue, python-ihm needs to know the 1-based index ("residue number") so that it can map to the chemical component object, which is stored in the Entity in a simple Python list (Entity.sequence). We also need a way for the user to address a specific residue in an Entity programmatically. For polymers, this is simple: the index is just seq_id, and so we use seq_id internally for all entity types, even though PDB itself only defines it for polymers. For non-polymers, it is also simple: the index is always 1.
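A minimal sketch of this indexing convention. `ToyEntity` and `comp_for_seq_id` are hypothetical names used for illustration only; they are not the python-ihm API.

```python
class ToyEntity:
    """Hypothetical stand-in for an entity; not the python-ihm Entity class."""

    def __init__(self, sequence):
        # Chemical components kept in a plain list, like Entity.sequence
        self.sequence = list(sequence)

    def comp_for_seq_id(self, seq_id):
        # seq_id is 1-based, so subtract 1 to index the 0-based list
        if not 1 <= seq_id <= len(self.sequence):
            raise IndexError(f"seq_id {seq_id} out of range")
        return self.sequence[seq_id - 1]


polymer = ToyEntity(["ALA", "HIS", "CYS"])
heme = ToyEntity(["HEM"])  # non-polymer: the only valid seq_id is 1

print(polymer.comp_for_seq_id(2))  # HIS
print(heme.comp_for_seq_id(1))     # HEM
```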

However, when reading atom_site it is more difficult to determine this index for waters and branched entities, both of which have no label_seq_id. Currently we use pdbx_nonpoly_scheme.ndb_seq_num for waters and pdbx_branch_scheme.num for branched entities, so when reading atom_site we can map auth_seq_id to the corresponding number. But this requires that the scheme tables are present and that they precede atom_site in the file (and that ndb_seq_num always counts from 1, which may not be true). (Atom or Sphere objects are streamed by python-ihm to the calling application; we cannot go back later and reassign their ids. The ids are also integers, so even if we relaxed the "must index the Entity.sequence array" rule, we cannot use author-provided data, which may be strings.)
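The ordering dependence can be illustrated with a small sketch. This is not the python-ihm reader; the function name and the tuple-based row format are assumptions made for the example. Each atom must be given an id as soon as it is read, so only scheme rows already seen can be used.

```python
def seq_ids_from_scheme(rows):
    """Resolve seq_id for streamed atoms using only scheme rows seen so far.

    rows: tuples in file order, either
          ("scheme", asym_id, auth_seq_id, seq_num) or
          ("atom", asym_id, auth_seq_id).
    """
    scheme = {}
    resolved = []
    for row in rows:
        if row[0] == "scheme":
            _, asym_id, auth_seq_id, seq_num = row
            scheme[(asym_id, auth_seq_id)] = seq_num
        else:
            _, asym_id, auth_seq_id = row
            # The atom is handed to the caller immediately; None marks an
            # atom whose scheme row has not been read (or does not exist)
            resolved.append(scheme.get((asym_id, auth_seq_id)))
    return resolved


scheme_rows = [("scheme", "W", "101", 1), ("scheme", "W", "205", 2)]
atom_rows = [("atom", "W", "205"), ("atom", "W", "101")]

print(seq_ids_from_scheme(scheme_rows + atom_rows))  # [2, 1]
print(seq_ids_from_scheme(atom_rows + scheme_rows))  # [None, None]
```

With the scheme table first, every atom resolves; with atom_site first (or the scheme table absent), none do.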

One possible solution would be to treat atom_site as ground truth and, for any atom read in with a missing seq_id, assign a sequential seq_id which maps to the given auth_seq_id. Any information from pdbx_nonpoly_scheme (e.g. for missing atoms, or for the pdb_seq_num to auth_seq_num mapping) can then be added at finalize time (essentially, use pdb_seq_num as the key into this table rather than ndb_seq_num). This complicates handling pdbx_branch_scheme a little, since our internal seq_id may now differ from pdbx_branch_scheme.num, so we would need to map one to the other.
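A rough sketch of how this proposal might look, assuming scheme rows can be reduced to (asym_id, pdb_seq_num, num) tuples. `SeqIdAssigner` and its methods are hypothetical names, not existing python-ihm code.

```python
class SeqIdAssigner:
    """Hypothetical helper: assigns sequential seq_ids from atom_site alone."""

    def __init__(self):
        self._seq_ids = {}  # (asym_id, auth_seq_id) -> internal seq_id
        self._next = {}     # asym_id -> next unused seq_id

    def seq_id_for(self, asym_id, auth_seq_id):
        # First sighting of an auth_seq_id gets the next integer (from 1);
        # later atoms of the same residue reuse it
        key = (asym_id, auth_seq_id)
        if key not in self._seq_ids:
            n = self._next.get(asym_id, 1)
            self._seq_ids[key] = n
            self._next[asym_id] = n + 1
        return self._seq_ids[key]

    def finalize(self, scheme_rows):
        """Map each internal seq_id to its scheme number after reading.

        scheme_rows: iterable of (asym_id, pdb_seq_num, num) tuples.
        Residues that had no atoms still receive a fresh seq_id here.
        """
        return {(asym_id, self.seq_id_for(asym_id, pdb_seq_num)): num
                for asym_id, pdb_seq_num, num in scheme_rows}


assigner = SeqIdAssigner()
# auth_seq_id values may be strings and need not start at 1 or be ordered
streamed = [assigner.seq_id_for("B", a) for a in ["5", "5", "3", "4"]]
num_for = assigner.finalize(
    [("B", "3", 1), ("B", "4", 2), ("B", "5", 3), ("B", "6", 4)])

print(streamed)            # [1, 1, 2, 3]
print(num_for[("B", 1)])   # 3
```

Here internal seq_id 1 corresponds to scheme number 3, which is the kind of mismatch that would need an explicit map for branched entities. Residue "6" has no atoms, so it only receives an id at finalize time.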

aozalevsky commented 6 months ago

@brindakv Ben actually created the issue about seq_ids, so maybe you can describe the rules for auth_seq_id here. Btw, I also found a paragraph about numbering in the python-ihm docs. It might also require a revision.

benmwebb commented 6 months ago

> @brindakv Ben actually created the issue about seq_ids, so maybe you can describe the rules for auth_seq_id here.

This issue is a little different as it pertains to how we assign an internal ID when we read branched entities (or waters, but that is less complex). Currently we rely on the scheme tables to map author-provided to internal IDs, but they may be missing or incomplete.

@brindakv was asking about assigning auth_seq_id to waters to match PDB practice. To my mind this would happen on write, not read.

> Btw, I also found a paragraph about numbering in the python-ihm docs. It might also require a revision.

Yes, I updated that after our conversation last time to try to explain how the two "author-provided" IDs differ. But sure, could possibly be done better.