Closed brindakv closed 2 years ago
This looks right to me. You've said that N4BP1_MOUSE in UniProt is a total of 44 residues long: "ETSELREALLKIFPDSEQKLKIDQILAAHPYMKDLNALSALVLD" and then you've said your model starts at index 850. Obviously 850>44 so that can't be right.
https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Items/_struct_ref.pdbx_seq_one_letter_code.html states that this should be the "Database chemical sequence". python-ihm interprets this as meaning the full database sequence (and deposited IMP models are made this way, although maybe it's not an issue as usually the modeled sequence is the entire database sequence). Are you saying that it can actually be just some subset of the sequence? Or maybe it has to be a subset starting at pdbx_align_begin (although it looks like the end of the alignment isn't fixed in that case)? If so, we can change python-ihm accordingly. (Either way, the description in the pdbx dictionary could do with being fleshed out a bit.)
I picked a random example from PDB where pdbx_align_begin is non-zero and it certainly looks like a subset can be provided, but the align begin/end seem totally wrong, so either I'm very confused (maybe there is another offset somewhere to be factored in?) or this is not enforced. 1NSO says it is 163-269 from UniProt P04024, but pdbx_seq_one_letter_code looks like 761-867 to me (and to the sequence view at https://www.rcsb.org/structure/1NSO ).
This looks right to me. You've said that N4BP1_MOUSE in UniProt is a total of 44 residues long: "ETSELREALLKIFPDSEQKLKIDQILAAHPYMKDLNALSALVLD" and then you've said your model starts at index 850. Obviously 850>44 so that can't be right.
The length of the sequence is 44 but the index is not required to start at 1. In this case, it starts at 850 and ends at 893, which is 44 residues long.
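As a quick check of that arithmetic, here is a minimal Python sketch. The names (align_begin, residue_at) are illustrative only and are not part of python-ihm; it simply assumes the stored string is a subset of the database sequence whose first residue has database index 850.

```python
# Subset of the UniProt sequence, as stored in pdbx_seq_one_letter_code
seq = "ETSELREALLKIFPDSEQKLKIDQILAAHPYMKDLNALSALVLD"

# Assumption: database index of seq[0], i.e. the pdbx_align_begin value
align_begin = 850

# Database index of the last residue in the subset (inclusive)
align_end = align_begin + len(seq) - 1


def residue_at(db_index):
    """Return the residue at a 1-based database index (hypothetical helper)."""
    offset = db_index - align_begin
    if not 0 <= offset < len(seq):
        raise IndexError("database index %d is outside %d-%d"
                         % (db_index, align_begin, align_end))
    return seq[offset]


print(len(seq), align_begin, align_end)
```

With these values the subset covers database indices 850 through 893 inclusive, so an index such as 1 or 894 is rejected even though the string itself has only 44 characters.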
https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Items/_struct_ref.pdbx_seq_one_letter_code.html states that this should be the "Database chemical sequence". python-ihm interprets this as meaning the full database sequence (and deposited IMP models are made this way, although maybe it's not an issue as usually the modeled sequence is the entire database sequence). Are you saying that it can actually be just some subset of the sequence?
Yes.
Or maybe it has to be a subset starting at pdbx_align_begin (although it looks like the end of the alignment isn't fixed in that case)?
Yes, it can be a subset of the reference sequence starting at pdbx_align_begin. The complete sequence is not required. The begin and end of the alignment are handled in struct_ref_seq.
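To make the two interpretations concrete, here is a rough sketch of a range check under each. This is not python-ihm's actual code; check_db_range and its parameters are hypothetical, and it assumes the stored sequence is contiguous and begins at database index align_begin.

```python
def check_db_range(db_begin, db_end, seq_len, align_begin=1):
    """Return True if db_begin..db_end (inclusive, database numbering)
    lies within a stored sequence of length seq_len whose first residue
    has database index align_begin. Hypothetical helper."""
    first = align_begin
    last = align_begin + seq_len - 1
    return first <= db_begin <= db_end <= last


# Full-sequence interpretation: stored sequence assumed to start at index 1
print(check_db_range(850, 893, 44))                   # False

# Subset interpretation: stored sequence starts at pdbx_align_begin
print(check_db_range(850, 893, 44, align_begin=850))  # True
```

Under the first interpretation the 850-893 alignment is out of range for a 44-residue sequence; under the second it is exactly the stored subset.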
If so, we can change python-ihm accordingly. (Either way, the description in the pdbx dictionary could do with being fleshed out a bit.)
I picked a random example from PDB where pdbx_align_begin is non-zero and it certainly looks like a subset can be provided, but the align begin/end seem totally wrong, so either I'm very confused (maybe there is another offset somewhere to be factored in?) or this is not enforced. 1NSO says it is 163-269 from UniProt P04024, but pdbx_seq_one_letter_code looks like 761-867 to me (and to the sequence view at https://www.rcsb.org/structure/1NSO ).
I think the struct_ref data in 1NSO is incorrect. A newer structure, 7OKP, is a better example.
While using make-mmcif.py to read a complete mmCIF file and write it out, it gives an alignment error with the reference sequence.

Traceback (most recent call last):
  File "~/make-mmcif.py", line 43, in <module>
    ihm.dumper.write(fhout,
  File "/usr/local/lib64/python3.9/site-packages/ihm/dumper.py", line 3184, in write
    d.dump(system, writer)
  File "/usr/local/lib64/python3.9/site-packages/ihm/dumper.py", line 447, in dump
    self._check_reference_sequence(e, r)
  File "/usr/local/lib64/python3.9/site-packages/ihm/dumper.py", line 404, in _check_reference_sequence
    self._check_alignment(entity, ref, align)
  File "/usr/local/lib64/python3.9/site-packages/ihm/dumper.py", line 420, in _check_alignment
    check_rng(db_rng, ref.sequence, "db_begin,db_end", ref)
  File "/usr/local/lib64/python3.9/site-packages/ihm/dumper.py", line 415, in check_rng
    raise IndexError("Alignment.%s for %s is (%d-%d), "
IndexError: Alignment.db_begin,db_end for <ihm.reference.UniProtSequence(Q6A037)> is (850-893), out of range 1-44
The struct_ref and struct_ref_seq tables are as follows: