Missing _struct_ref_seq in cif generated from PDB without optional TER.

xvlaurent commented 4 months ago

I have a PDB file without TER records between protein chains, like this one: 2po6_wo_ter.zip

When I run the following commands in Python:

import gemmi
structure = gemmi.read_pdb('2po6_wo_ter.pdb')
header = structure.make_mmcif_headers()
header.get_mmcif_category('_struct_ref_seq')

it returns:

{'align_id': ['1'], 'ref_id': ['8'], 'pdbx_strand_id': ['H'], 'seq_align_beg': [None], 'seq_align_end': [None], 'db_align_beg': ['48'], 'db_align_end': ['264'], 'pdbx_auth_seq_align_beg': ['29'], 'pdbx_seq_align_beg_ins_code': [None], 'pdbx_auth_seq_align_end': ['247'], 'pdbx_seq_align_end_ins_code': [None]}

I was expecting _struct_ref_seq entries for chains A to H, which is what I get when I run the same commands on the official 2po6 RCSB PDB file, which has TER records.

Thanks a lot for your work!

wojdyr commented 4 months ago

I tried it and I got:

{'align_id': ['1', '2', '3', '4', '5', '6'], 'ref_id': ['2', '3', '4', '6', '7', '8'], 'pdbx_strand_id': ['E', 'B', 'F', 'D', 'G', 'H'], …

Anyway, it still doesn't include chain A.

In the absence of the TER record, use:

structure.setup_entities()

to get the same result (in maybe 99.9% of cases) as you would with TER records.
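As a minimal sketch of where that call fits in the workflow from the original report: the helper name ref_seq_strands is hypothetical, and it uses only the gemmi calls already shown in this thread. It assumes get_mmcif_category() returns a dict, as in the output above.

```python
def ref_seq_strands(structure):
    """Return the chain IDs listed in _struct_ref_seq for a gemmi Structure.

    Hypothetical helper, not part of gemmi. It only chains together the
    calls that already appear in this thread.
    """
    # Without TER records the polymer ends are not assigned on reading,
    # so assign entities explicitly before generating the headers.
    structure.setup_entities()
    block = structure.make_mmcif_headers()
    category = block.get_mmcif_category('_struct_ref_seq')
    # Fall back to an empty list if the category or column is missing.
    return category.get('pdbx_strand_id', [])
```

It would be called as ref_seq_strands(gemmi.read_pdb('2po6_wo_ter.pdb')), with setup_entities() running before make_mmcif_headers().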

This is probably the most confusing part of gemmi – it comes up quite often. The end of a polymer is not determined automatically because sometimes it's ambiguous. In particular, a non-standard monomer at the end of a chain can be either the last residue in the chain or the first ligand. Maybe I'll change it at some point (by adding a mandatory flag to read_pdb() and similar functions that would specify how to determine the end of a polymer). But this will be backward incompatible.
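To illustrate the ambiguity, here is a toy sketch in plain Python. It is not gemmi's algorithm: the function name, the residue table and the rule of stopping at water are all assumptions made for this example. It lists every position where a chain's polymer could plausibly end when there is no TER record.

```python
# Illustrative only: this is NOT how gemmi decides where a polymer ends.
STANDARD_AA = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}

def polymer_end_candidates(residue_names):
    """Return every plausible polymer end as an exclusive index into the chain."""
    # The polymer extends at least to just past the last standard amino acid.
    last_std = max(
        (i for i, name in enumerate(residue_names) if name in STANDARD_AA),
        default=-1,
    )
    candidates = [last_std + 1]
    # Each non-standard, non-water residue that follows could be either
    # a modified terminal residue or the first ligand.
    i = last_std + 1
    while i < len(residue_names) and residue_names[i] != "HOH":
        i += 1
        candidates.append(i)
    return candidates
```

More than one candidate means the end cannot be decided from residue names alone. For example, a chain ending in GLY, ALA, MSE followed by water has two candidates, because MSE could be the last residue or a ligand. A TER record removes the need to guess.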

xvlaurent commented 4 months ago

I understand. I did not think setup_entities() could also fix this type of issue.

Thanks for your answer!

Missing _struct_ref_seq in cif generated from PDB without optional TER. #299