ihmwg / python-ihm

Python package for handling IHM mmCIF and BinaryCIF files
MIT License

`struct_ref` alignment error #64

Closed brindakv closed 2 years ago

brindakv commented 2 years ago

While using make-mmcif.py to read a complete mmCIF file and write it out, it gives an alignment error with the reference sequence:

Traceback (most recent call last):
  File "~/make-mmcif.py", line 43, in <module>
    ihm.dumper.write(fhout,
  File "/usr/local/lib64/python3.9/site-packages/ihm/dumper.py", line 3184, in write
    d.dump(system, writer)
  File "/usr/local/lib64/python3.9/site-packages/ihm/dumper.py", line 447, in dump
    self._check_reference_sequence(e, r)
  File "/usr/local/lib64/python3.9/site-packages/ihm/dumper.py", line 404, in _check_reference_sequence
    self._check_alignment(entity, ref, align)
  File "/usr/local/lib64/python3.9/site-packages/ihm/dumper.py", line 420, in _check_alignment
    check_rng(db_rng, ref.sequence, "db_begin,db_end", ref)
  File "/usr/local/lib64/python3.9/site-packages/ihm/dumper.py", line 415, in check_rng
    raise IndexError("Alignment.%s for %s is (%d-%d), "
IndexError: Alignment.db_begin,db_end for <ihm.reference.UniProtSequence(Q6A037)> is (850-893), out of range 1-44

The struct_ref and struct_ref_seq tables are as follows:

#
loop_
_struct_ref.id
_struct_ref.db_name
_struct_ref.db_code
_struct_ref.pdbx_db_accession
_struct_ref.entity_id
_struct_ref.pdbx_seq_one_letter_code
_struct_ref.pdbx_align_begin
1 UNP N4BP1_MOUSE Q6A037 1
;ETSELREALLKIFPDSEQKLKIDQILAAHPYMKDLNALSALVLD
;
850
2 UNP UBC_HUMAN P0CG48 2
;MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYN
IQKESTLHLVLRLRGG
;
1
#
loop_
_struct_ref_seq.align_id
_struct_ref_seq.ref_id
_struct_ref_seq.seq_align_beg
_struct_ref_seq.seq_align_end
_struct_ref_seq.db_align_beg
_struct_ref_seq.db_align_end
1 1 4 47 850 893
2 2 1 76   1  76
#
benmwebb commented 2 years ago

This looks right to me. You've said that N4BP1_MOUSE in UniProt is a total of 44 residues long: "ETSELREALLKIFPDSEQKLKIDQILAAHPYMKDLNALSALVLD" and then you've said your model starts at index 850. Obviously 850>44 so that can't be right.

https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Items/_struct_ref.pdbx_seq_one_letter_code.html states that this should be the "Database chemical sequence". python-ihm interprets this as meaning the full database sequence (and deposited IMP models are made this way, although maybe it's not an issue as usually the modeled sequence is the entire database sequence). Are you saying that it can actually be just some subset of the sequence? Or maybe it has to be a subset starting at pdbx_align_begin (although it looks like the end of the alignment isn't fixed in that case)? If so, we can change python-ihm accordingly. (Either way, the description in the pdbx dictionary could do with being fleshed out a bit.)
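The "full database sequence" interpretation described above can be illustrated with a small standalone sketch. This is not python-ihm's actual code: `check_range_full` is a hypothetical helper, and it assumes that the sequence given in `struct_ref` is the complete database sequence, numbered from 1. Under that assumption the range 850-893 from the issue cannot fit in a 44-residue sequence.

```python
def check_range_full(db_begin, db_end, sequence):
    """Hypothetical helper (not part of python-ihm): validate a database
    range, assuming `sequence` is the complete database sequence and that
    residues are numbered starting at 1."""
    n = len(sequence)
    if not (1 <= db_begin <= db_end <= n):
        raise IndexError("db_begin,db_end is (%d-%d), out of range 1-%d"
                         % (db_begin, db_end, n))


# Sequence and range taken from the struct_ref / struct_ref_seq tables above
seq = "ETSELREALLKIFPDSEQKLKIDQILAAHPYMKDLNALSALVLD"

try:
    check_range_full(850, 893, seq)
    error = None
except IndexError as exc:
    # The range lies entirely outside 1..len(seq), so the check fails
    error = str(exc)

print(error)
```

Here the failure comes purely from treating the 44 residues as the whole UniProt entry; the range itself spans exactly 44 residues.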

benmwebb commented 2 years ago

I picked a random example from PDB where pdbx_align_begin is non-zero and it certainly looks like a subset can be provided, but the align begin/end seem totally wrong, so either I'm very confused (maybe there is another offset somewhere to be factored in?) or this is not enforced. 1NSO says it is 163-269 from UniProt P04024, but pdbx_seq_one_letter_code looks like 761-867 to me (and to the sequence view at https://www.rcsb.org/structure/1NSO ).

brindakv commented 2 years ago

> This looks right to me. You've said that N4BP1_MOUSE in UniProt is a total of 44 residues long: "ETSELREALLKIFPDSEQKLKIDQILAAHPYMKDLNALSALVLD" and then you've said your model starts at index 850. Obviously 850>44 so that can't be right.

The length of the sequence is 44 but the index is not required to start at 1. In this case, it starts at 850 and ends at 893, which is 44 residues long.

> https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Items/_struct_ref.pdbx_seq_one_letter_code.html states that this should be the "Database chemical sequence". python-ihm interprets this as meaning the full database sequence (and deposited IMP models are made this way, although maybe it's not an issue as usually the modeled sequence is the entire database sequence). Are you saying that it can actually be just some subset of the sequence?

Yes.

> Or maybe it has to be a subset starting at pdbx_align_begin (although it looks like the end of the alignment isn't fixed in that case)?

Yes, it can be a subset of the reference sequence starting at pdbx_align_begin. The complete sequence is not required. The begin and end of the alignment are handled in struct_ref_seq.
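As a rough sketch of this subset interpretation (an illustration only, not python-ihm's implementation; `check_range_subset` is a hypothetical helper), the database range can be validated against the window that starts at pdbx_align_begin and has the length of the provided sequence:

```python
def check_range_subset(db_begin, db_end, sequence, align_begin):
    """Hypothetical helper (not part of python-ihm): validate a database
    range, assuming `sequence` is a subset of the database sequence whose
    first residue has database index `align_begin`."""
    first = align_begin
    last = align_begin + len(sequence) - 1
    if not (first <= db_begin <= db_end <= last):
        raise IndexError("db_begin,db_end is (%d-%d), out of range %d-%d"
                         % (db_begin, db_end, first, last))
    # Return the part of the provided sequence covered by the range
    return sequence[db_begin - first:db_end - first + 1]


# Values from the struct_ref / struct_ref_seq tables in this issue
seq = "ETSELREALLKIFPDSEQKLKIDQILAAHPYMKDLNALSALVLD"
matched = check_range_subset(850, 893, seq, align_begin=850)

# The entity range (4-47) and the database range (850-893) have equal length
same_length = (47 - 4) == (893 - 850)
print(len(matched), same_length)
```

With pdbx_align_begin set to 850, the 44 provided residues cover database positions 850-893, so the range in struct_ref_seq is accepted.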

> If so, we can change python-ihm accordingly. (Either way, the description in the pdbx dictionary could do with being fleshed out a bit.)

brindakv commented 2 years ago

> I picked a random example from PDB where pdbx_align_begin is non-zero and it certainly looks like a subset can be provided, but the align begin/end seem totally wrong, so either I'm very confused (maybe there is another offset somewhere to be factored in?) or this is not enforced. 1NSO says it is 163-269 from UniProt P04024, but pdbx_seq_one_letter_code looks like 761-867 to me (and to the sequence view at https://www.rcsb.org/structure/1NSO ).

I think the struct_ref data in 1NSO is incorrect. A newer structure, 7OKP, is a better example.