project-gemmi / gemmi

macromolecular crystallography library and utilities
https://project-gemmi.github.io/
Mozilla Public License 2.0

pdb/cif entity subchain comparisons #309

Closed rimmartin closed 2 months ago

rimmartin commented 2 months ago

Hi @wojdyr ,

I've been running 8r4q.pdb and 8r4q.cif through alignment to find sequence gaps. For the cif file the entity names are the _entity.id values, while for the pdb file they are the chain IDs. So I started using Entity::subchains to find the chain IDs.

cif gives subchain letters like

entity 1
  subchain A,C,E,G,I,K
entity 2
  subchain B,D,F,H,J,L

while pdb yields

entity A
  subchain Axp
entity B
  subchain Bxp
entity C
  subchain Cxp
  ...

Can I trust single capital first letter and ignore the xp or other postscripts such as x1 x2? Or is there a better parse of these subchain strings?

wojdyr commented 2 months ago

If you run setup_entities() after reading the file, the entities in both mmCIF and PDB should be arranged similarly (apart from different names for subchains).

>>> st = gemmi.read_structure('/data/structures/divided/pdb/r4/pdb8r4q.ent.gz')
>>> st.setup_entities()
>>> st.entities[0]
<gemmi.Entity 'A' polymer polypeptide(L) object at 0x55a73eacce80>
>>> _.subchains
['Axp', 'Cxp', 'Exp', 'Gxp', 'Ixp', 'Kxp']

> Can I trust single capital first letter and ignore the xp or other postscripts such as x1 x2?

It's better to check Entity::subchains as you did.
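
For readers who want to get from entities back to chain IDs without parsing names like 'Axp', here is a minimal sketch of one way to do it. It assumes, as I understand the gemmi Python API, that each residue exposes a `subchain` attribute and each chain a `name`. The helper names `chains_per_entity` and `load_and_map` are hypothetical, not part of gemmi.

```python
def chains_per_entity(structure):
    """Map each entity name to the names of chains holding its subchains.

    `structure` is expected to behave like a gemmi.Structure on which
    setup_entities() has already been called. Only the first model is used.
    """
    model = structure[0]

    # Every residue records which subchain it belongs to, so build a
    # subchain -> chain-name lookup instead of parsing strings like 'Axp'.
    subchain_to_chain = {}
    for chain in model:
        for residue in chain:
            subchain_to_chain.setdefault(residue.subchain, chain.name)

    # Translate each entity's subchain list into chain names,
    # skipping subchains that are not present in this model.
    return {
        entity.name: [subchain_to_chain[s]
                      for s in entity.subchains
                      if s in subchain_to_chain]
        for entity in structure.entities
    }


def load_and_map(path):
    """Read a PDB or mmCIF file with gemmi and return the entity -> chains map."""
    import gemmi  # imported here so the helper above has no hard dependency

    st = gemmi.read_structure(path)
    st.setup_entities()
    return chains_per_entity(st)
```

Calling `load_and_map()` on both the pdb and the cif file gives two dictionaries whose values can be compared directly, even though the entity names (the keys) and the subchain names differ between the two formats.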