Closed mtrellet closed 6 years ago
Maybe I don't understand you completely - could you come up with a self-contained example (perhaps as a gist) that reproduces the problem? You can't have gaps in seq_id
by construction - in mmCIF seq_id
always starts at 1 and is contiguous, and indexes into the entity sequence. If there are gaps in the PDB numbering you either need to add the missing residues to the entity sequence (and then just not provide coordinates for them) or map the non-contiguous PDB numbering to contiguous mmCIF numbering, using auth_seq_id_map
- see https://python-ihm.readthedocs.io/en/latest/usage.html#residue-numbering
Thanks for your reply Ben!
You were right, the issue comes from the mapping indeed, I've just put the starting shift in the auth_seq_id_map
and not the full mapping. See: https://gist.github.com/mtrellet/5926bb29e8f3952acf3eace7676af44f/919c4fd98f98f45011a20522d494501343c7c21a (1st revision)
I've just tested with a list containing the numbering gap instead: See: https://gist.github.com/mtrellet/5926bb29e8f3952acf3eace7676af44f/b5b0c8ef4ada3265eed7cbf6936b76b21f3f823e (2nd revision)
The list used as auth_seq_id_map
(gap in bold):
[5, 6, 7, 8, 9, 10, 11 ... 1383, 1384, 1385, 1506, 1507 ... 1532, 1533, 1534]
However, two issues at first glance:
_pdbx_poly_seq_scheme.pdb_seq_num
field of my mmCIF file that should start at 5
(see list above).
_pdbx_poly_seq_scheme.asym_id
_pdbx_poly_seq_scheme.entity_id
_pdbx_poly_seq_scheme.seq_id
_pdbx_poly_seq_scheme.mon_id
_pdbx_poly_seq_scheme.pdb_seq_num
_pdbx_poly_seq_scheme.auth_seq_num
_pdbx_poly_seq_scheme.pdb_mon_id
_pdbx_poly_seq_scheme.auth_mon_id
_pdbx_poly_seq_scheme.pdb_strand_id
A 1 1 U 6 6 U U A
A 1 2 G 7 7 G G A
A 1 3 A 8 8 A A A
A 1 4 A 9 9 A A A
If I traceback the writer function (_PolySeqSchemeDumper.dump
) I can see that the auth_seq_num
(where the error occurs in the mmCIF file) is set with the following line:
auth_seq_num = asym._get_auth_seq_id(num+1)
Removing the +1
outputs the proper _pdbx_poly_seq_scheme.pdb_seq_num
. However, this also outputs 0
as first pdb_seq_num
when no auth_seq_id_map
is provided... So there must be a better solution..!
I've printed the seq_id_range
of my AsymUnits after initializing them with the auth_seq_id_map
and it seems fine. My guess is that the error I get when dumping the system is that the get_atoms
function of my MyModel
class is overwriting the seq_id
and then introduces the gap in seq_id
. And it later on leads to the error I reported in my first post. So I guess I need to modify the get_atoms()
function to not reassign the seq_id
field. However, since it is supposed to be a contiguous sequence starting at 1, why not setting it up as an iterable counter? This should not be possible to overwrite it, or am I getting that wrong again?
Thanks again for your time!
In both cases, you need to use seq_id
(not auth_seq_id
or "PDB" numbering, e.g. r.id[1]
) when referencing any IHM object - e.g. when you create your AsymUnitRange
or Atom
objects. One way to achieve this would be to keep a reverse mapping.
e.g. in your second case replace
seqA_ids = []
for r in s.get_residues():
if r.get_parent().id == "A" and r.resname != "HOH":
seqA.append(r.resname.strip())
seqA_ids.append(r.id[1])
with
seq_id_from_pdb_A = {}
pdb_from_seq_id_A = {}
seq_id = 0
for r in s.get_residues():
if r.get_parent().id == "A" and r.resname != "HOH":
seq_id += 1
seqA.append(r.resname.strip())
seq_id_from_from_pdb_A[r.id(1)] = seq_id
pdb_from_seq_id_A[seq_id] = r.id(1)
Then use seq_id_from_pdb_A[r.id(1)]
when making your Atom
or AsymUnitRange
objects, and pass pdb_from_seq_id_A
as auth_seq_id_map
to the AsymUnit
constructor.
The list used as auth_seq_id_map (gap in bold):
[5, 6, 7, 8, 9, 10, 11 ... 1383, 1384, 1385, 1506, 1507 ... 1532, 1533, 1534]
auth_seq_id_map
is used as a map, i.e. auth_seq_id = auth_seq_id_map[seq_id]
. Since Python numbering starts at 0 while mmCIF numbering starts at 1, if you want to use a list
or tuple
here you'll want to add a null entry at the start of the list (since there will never be a seq_id = 0
used to index that value). That's why your numbering ends up off by 1. I'll add some clarification to the API docs here.
My guess is that the error I get when dumping the system is that the
get_atoms
function of myMyModel
class is overwriting theseq_id
and then introduces the gap inseq_id
Right, it's because you're using the PDB numbering, not seq_id, here:
atomsA.append((c.id, r.id[1], a.name[0], a.name, a.coord[0], a.coord[1], a.coord[2]))
This should be
atomsA.append((c.id, seq_id_from_pdb_A[[r.id[1]], a.name[0], a.name, a.coord[0], a.coord[1], a.coord[2]))
However, since it is supposed to be a contiguous sequence starting at 1, why not setting it up as an iterable counter?
That wouldn't work because seq_id
is per-residue, and get_atoms()
could return one, multiple, or no atoms for each residue (your sequence has to be contiguous, but you don't have to have coordinates for every residue).
auth_seq_id_map is used as a map, i.e. auth_seq_id = auth_seq_id_map[seq_id]. Since Python numbering starts at 0 while mmCIF numbering starts at 1, if you want to use a list or tuple here you'll want to add a null entry at the start of the list (since there will never be a seq_id = 0 used to index that value). That's why your numbering ends up off by 1. I'll add some clarification to the API docs here.
Yes indeed, that what I was tempted to do (start with a null in my list) I just was not sure it was a "good practice" and that a neater way could exist. And actually a dictionary is way better (also for genericity purposes).
That wouldn't work because seq_id is per-residue, and get_atoms() could return one, multiple, or no atoms for each residue (your sequence has to be contiguous, but you don't have to have coordinates for every residue).
Very true, forgot this "detail". Thanks for the explanation!
Just to summarize, everything worked fine with the new auth_seq_id_map
, thanks a lot for your complete answer! I updated the gist in case someone is interested...
And another (almost) related question: Do we plan to output the _atom_site.auth_seq_id
records or this would be too much redundant with the _pdbx_poly_seq_scheme.auth_seq_num
? I know we've done it in the model we deposited on PDB-dev for instance.
Do we plan to output the
_atom_site.auth_seq_id
records
I don't have the library output this because (as you say) it's redundant, and I prefer not to make the files any larger than they have to be. But it would be easy enough to add if needed.
I have spotted an issue when I try to output an mmCIF file from a PDB file following the steps desribed in
examples/simple-docking.py
.If the PDB presents a gap in the residue IDs sequence, the dumper module triggers the following error:
As reported in the error, the faulty line is
comp = atom.asym_unit.entity.sequence[atom.seq_id-1]
that looks for an object stored in the listatom.asym_unit.entity.sequence
by taking theatom.seq_id
value. However, in the case where the model has gaps, this seq_id can have non-sequential values leading to an out of range error. I fixed this issue by adding a checking step in the loop iterating over the model (indumper.py
line 821):The output is really similar to what I would expect and what is currently deposited in the PDB-dev DB for this particular structure. However, there could be a better/smarter way to handle this...