Closed by benmwebb 4 months ago
@brindakv Ben actually created the issue about seq_ids, so maybe you can describe the rules for `auth_seq_id` here. Btw, I also found a paragraph about numbering in the python-ihm docs. It might also require a revision.
> @brindakv Ben actually created the issue about seq_ids, so maybe you can describe the rules for `auth_seq_id` here.
This issue is a little different as it pertains to how we assign an internal ID when we read branched entities (or waters, but that is less complex). Currently we rely on the scheme tables to map author-provided to internal IDs, but they may be missing or incomplete.
@brindakv was asking about assigning `auth_seq_id` to waters to match PDB practice. To my mind this would happen on write, not read.
> Btw, I also found a paragraph about numbering in the python-ihm docs. It might also require a revision.
Yes, I updated that after our conversation last time to try to explain how the two "author-provided" IDs differ. But sure, could possibly be done better.
For any given atom or residue, python-ihm needs to know the 1-based index ("residue number") so that it can map to the chemical component object, which is stored in the Entity in a simple Python list (`Entity.sequence`). We also need a way for the user to address a specific residue in an Entity programmatically. For polymers, this is simple (it is just `seq_id`, and so we use `seq_id` internally for all entity types, even though PDB itself only defines it for polymers). For non-polymers, it is also simple (it is always 1).

However, when reading `atom_site` it is more difficult to determine this index for waters and branched entities, both of which have no `label_seq_id`. Currently we use `pdbx_nonpoly_scheme.ndb_seq_num` for waters and `pdbx_branch_scheme.num` for branched entities, so when reading `atom_site` we can map `auth_seq_id` to the corresponding number. But this requires that the scheme tables are present and that they precede `atom_site` in the file (and that `ndb_seq_num` always counts from 1, which may not be true). (`Atom` or `Sphere` objects are streamed by python-ihm to the calling application; we cannot go back later and reassign their ids. The ids are also integers, so even if we relaxed the "must index the `Entity.sequence` array" rule, we cannot use author-provided data, which may be strings.)

One possible solution would be to treat `atom_site` as ground truth and, for any atom read in with a missing `seq_id`, assign a sequential `seq_id` which maps to the given `auth_seq_id`. Any information from `pdbx_nonpoly_scheme` (e.g. for missing atoms, or for the `pdb_seq_num` to `auth_seq_num` mapping) can then be added at finalize time (essentially, use `pdb_seq_num` as the key into this table rather than `ndb_seq_num`). This complicates handling `pdbx_branch_scheme` a little since our internal `seq_id` may then no longer match `pdbx_branch_scheme.num`, so we would need to map one to the other.
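
To make the proposal concrete, here is a rough sketch of the bookkeeping in plain Python. None of this is python-ihm code: `SeqIdAssigner`, `get_seq_id`, `lookup` and `map_scheme_num` are hypothetical names made up for illustration. It assumes that a missing `label_seq_id` shows up as `None`, `.` or `?`, and that `pdb_seq_num` in the scheme table matches the `auth_seq_id` seen in `atom_site`. Indices are handed out in order of first appearance in `atom_site`, which for branched entities is exactly why they may disagree with `pdbx_branch_scheme.num`.

```python
class SeqIdAssigner:
    """Hypothetical helper (not part of python-ihm) that hands out
    1-based integer seq_ids while atom_site rows are streamed."""

    def __init__(self):
        # asym_id -> {auth_seq_id (string): assigned seq_id (int)}
        self._by_asym = {}

    def get_seq_id(self, asym_id, label_seq_id, auth_seq_id):
        """Return an integer seq_id for a single atom_site row."""
        # Polymers already carry a usable label_seq_id; keep it.
        if label_seq_id not in (None, '.', '?'):
            return int(label_seq_id)
        seen = self._by_asym.setdefault(asym_id, {})
        # The first atom of a residue gets the next free index;
        # later atoms with the same auth_seq_id reuse it.
        if auth_seq_id not in seen:
            seen[auth_seq_id] = len(seen) + 1
        return seen[auth_seq_id]

    def lookup(self, asym_id, auth_seq_id):
        """Return the seq_id assigned earlier, or None if no atom
        with this auth_seq_id was read for this asym."""
        return self._by_asym.get(asym_id, {}).get(auth_seq_id)


def map_scheme_num(assigner, scheme_rows):
    """At finalize time, relate scheme-table numbering to our ids.

    scheme_rows is an iterable of (asym_id, num, pdb_seq_num) tuples
    (an assumed, simplified view of a scheme table). Returns a dict
    mapping (asym_id, num) to the internal seq_id, or to None for
    residues that had no atoms in atom_site."""
    return {(asym_id, int(num)): assigner.lookup(asym_id, pdb_seq_num)
            for asym_id, num, pdb_seq_num in scheme_rows}
```

Because the assignment depends only on rows already seen, it works whether the scheme tables come before `atom_site`, after it, or not at all. The `None` entries from `map_scheme_num` are the residues that exist in the scheme table but have no atoms, which would still need an index assigned at finalize time.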