ihmwg / python-ihm

Python package for handling IHM mmCIF and BinaryCIF files
MIT License
14 stars 7 forks source link

Error when gap(s) present in the input structure #15

Closed mtrellet closed 6 years ago

mtrellet commented 6 years ago

I have spotted an issue when I try to output an mmCIF file from a PDB file following the steps desribed in examples/simple-docking.py.

If the PDB presents a gap in the residue IDs sequence, the dumper module triggers the following error:

Traceback (most recent call last):
  File "16srna_ksga.py", line 170, in <module>
    ihm.dumper.write(fh, [system])
  File "/Users/mtrellet/Dev/python-ihm/ihm/dumper.py", line 1436, in write
    d.dump(system, writer)
  File "/Users/mtrellet/Dev/python-ihm/ihm/dumper.py", line 771, in dump
    seen_types = self.dump_atoms(system, writer)
  File "/Users/mtrellet/Dev/python-ihm/ihm/dumper.py", line 825, in dump_atoms
    comp = atom.asym_unit.entity.sequence[atom.seq_id-1]
IndexError: tuple index out of range

As reported in the error, the faulty line is comp = atom.asym_unit.entity.sequence[atom.seq_id-1] that looks for an object stored in the list atom.asym_unit.entity.sequence by taking the atom.seq_id value. However, in the case where the model has gaps, this seq_id can have non-sequential values leading to an out of range error. I fixed this issue by adding a checking step in the loop iterating over the model (in dumper.py line 821):

for group, model in system._all_models():
    c = 0
    prev = None
    for atom in model.get_atoms():
        # First iteration
        if c == 0 and not prev:
            prev = (atom.seq_id, atom.asym_unit.id)
        # Switching to another asymetric unit (assuming all AsymUnit have different chain ids)
        elif atom.asym_unit.id != prev[1]:
            prev = (atom.seq_id, atom.asym_unit.id)
            c = 0
        # Switching to another residue
        elif atom.seq_id != prev[0]:
            c += 1
            prev = (atom.seq_id, atom.asym_unit.id)
        comp = atom.asym_unit.entity.sequence[c]

The output is really similar to what I would expect and what is currently deposited in the PDB-dev DB for this particular structure. However, there could be a better/smarter way to handle this...

benmwebb commented 6 years ago

Maybe I don't understand you completely - could you come up with a self-contained example (perhaps as a gist) that reproduces the problem? You can't have gaps in seq_id by construction - in mmCIF seq_id always starts at 1 and is contiguous, and indexes into the entity sequence. If there are gaps in the PDB numbering you either need to add the missing residues to the entity sequence (and then just not provide coordinates for them) or map the non-contiguous PDB numbering to contiguous mmCIF numbering, using auth_seq_id_map - see https://python-ihm.readthedocs.io/en/latest/usage.html#residue-numbering

mtrellet commented 6 years ago

Thanks for your reply Ben!

You were right, the issue comes from the mapping indeed, I've just put the starting shift in the auth_seq_id_map and not the full mapping. See: https://gist.github.com/mtrellet/5926bb29e8f3952acf3eace7676af44f/919c4fd98f98f45011a20522d494501343c7c21a (1st revision)

I've just tested with a list containing the numbering gap instead: See: https://gist.github.com/mtrellet/5926bb29e8f3952acf3eace7676af44f/b5b0c8ef4ada3265eed7cbf6936b76b21f3f823e (2nd revision)

The list used as auth_seq_id_map (gap in bold):

[5, 6, 7, 8, 9, 10, 11 ... 1383, 1384, 1385, 1506, 1507 ... 1532, 1533, 1534]

However, two issues at first glance:

  1. I get a shift in the _pdbx_poly_seq_scheme.pdb_seq_num field of my mmCIF file that should start at 5 (see list above).
    _pdbx_poly_seq_scheme.asym_id
    _pdbx_poly_seq_scheme.entity_id
    _pdbx_poly_seq_scheme.seq_id
    _pdbx_poly_seq_scheme.mon_id
    _pdbx_poly_seq_scheme.pdb_seq_num
    _pdbx_poly_seq_scheme.auth_seq_num
    _pdbx_poly_seq_scheme.pdb_mon_id
    _pdbx_poly_seq_scheme.auth_mon_id
    _pdbx_poly_seq_scheme.pdb_strand_id
    A 1 1 U 6 6 U U A
    A 1 2 G 7 7 G G A
    A 1 3 A 8 8 A A A
    A 1 4 A 9 9 A A A

If I traceback the writer function (_PolySeqSchemeDumper.dump) I can see that the auth_seq_num (where the error occurs in the mmCIF file) is set with the following line: auth_seq_num = asym._get_auth_seq_id(num+1) Removing the +1 outputs the proper _pdbx_poly_seq_scheme.pdb_seq_num. However, this also outputs 0 as first pdb_seq_num when no auth_seq_id_map is provided... So there must be a better solution..!

  1. I still get the index out of range error at the same step than before.

I've printed the seq_id_range of my AsymUnits after initializing them with the auth_seq_id_map and it seems fine. My guess is that the error I get when dumping the system is that the get_atoms function of my MyModel class is overwriting the seq_id and then introduces the gap in seq_id. And it later on leads to the error I reported in my first post. So I guess I need to modify the get_atoms() function to not reassign the seq_id field. However, since it is supposed to be a contiguous sequence starting at 1, why not setting it up as an iterable counter? This should not be possible to overwrite it, or am I getting that wrong again?

Thanks again for your time!

benmwebb commented 6 years ago

In both cases, you need to use seq_id (not auth_seq_id or "PDB" numbering, e.g. r.id[1]) when referencing any IHM object - e.g. when you create your AsymUnitRange or Atom objects. One way to achieve this would be to keep a reverse mapping.

e.g. in your second case replace

seqA_ids = []

for r in s.get_residues():
    if r.get_parent().id == "A" and r.resname != "HOH":
        seqA.append(r.resname.strip())
        seqA_ids.append(r.id[1])

with

seq_id_from_pdb_A = {}
pdb_from_seq_id_A = {}

seq_id = 0
for r in s.get_residues():
    if r.get_parent().id == "A" and r.resname != "HOH":
        seq_id += 1
        seqA.append(r.resname.strip())
        seq_id_from_from_pdb_A[r.id(1)] = seq_id
        pdb_from_seq_id_A[seq_id] = r.id(1)

Then use seq_id_from_pdb_A[r.id(1)] when making your Atom or AsymUnitRange objects, and pass pdb_from_seq_id_A as auth_seq_id_map to the AsymUnit constructor.

The list used as auth_seq_id_map (gap in bold):

[5, 6, 7, 8, 9, 10, 11 ... 1383, 1384, 1385, 1506, 1507 ... 1532, 1533, 1534]

auth_seq_id_map is used as a map, i.e. auth_seq_id = auth_seq_id_map[seq_id]. Since Python numbering starts at 0 while mmCIF numbering starts at 1, if you want to use a list or tuple here you'll want to add a null entry at the start of the list (since there will never be a seq_id = 0 used to index that value). That's why your numbering ends up off by 1. I'll add some clarification to the API docs here.

My guess is that the error I get when dumping the system is that the get_atoms function of my MyModel class is overwriting the seq_id and then introduces the gap in seq_id

Right, it's because you're using the PDB numbering, not seq_id, here:

atomsA.append((c.id, r.id[1], a.name[0], a.name, a.coord[0], a.coord[1], a.coord[2]))

This should be

atomsA.append((c.id, seq_id_from_pdb_A[[r.id[1]], a.name[0], a.name, a.coord[0], a.coord[1], a.coord[2]))

However, since it is supposed to be a contiguous sequence starting at 1, why not setting it up as an iterable counter?

That wouldn't work because seq_id is per-residue, and get_atoms() could return one, multiple, or no atoms for each residue (your sequence has to be contiguous, but you don't have to have coordinates for every residue).

mtrellet commented 6 years ago

auth_seq_id_map is used as a map, i.e. auth_seq_id = auth_seq_id_map[seq_id]. Since Python numbering starts at 0 while mmCIF numbering starts at 1, if you want to use a list or tuple here you'll want to add a null entry at the start of the list (since there will never be a seq_id = 0 used to index that value). That's why your numbering ends up off by 1. I'll add some clarification to the API docs here.

Yes indeed, that what I was tempted to do (start with a null in my list) I just was not sure it was a "good practice" and that a neater way could exist. And actually a dictionary is way better (also for genericity purposes).

That wouldn't work because seq_id is per-residue, and get_atoms() could return one, multiple, or no atoms for each residue (your sequence has to be contiguous, but you don't have to have coordinates for every residue).

Very true, forgot this "detail". Thanks for the explanation!

Just to summarize, everything worked fine with the new auth_seq_id_map, thanks a lot for your complete answer! I updated the gist in case someone is interested...

And another (almost) related question: Do we plan to output the _atom_site.auth_seq_id records or this would be too much redundant with the _pdbx_poly_seq_scheme.auth_seq_num? I know we've done it in the model we deposited on PDB-dev for instance.

benmwebb commented 6 years ago

Do we plan to output the _atom_site.auth_seq_id records

I don't have the library output this because (as you say) it's redundant, and I prefer not to make the files any larger than they have to be. But it would be easy enough to add if needed.