Closed bienchen closed 1 year ago
So the issue here is that in python-ihm the auth_seq_id
is set in the auth_seq_id_map
argument to AsymUnit
. This maps a sequence index (1,2,3,..., which for polymers is the same as seq_id
) to auth_seq_id
. The problem is that when we come to handle Atom
objects for nonpolymers, it can't tell what the sequence index is because they all have seq_id = None
. Currently it just assumes 1 which works OK for ligands, where you generally only have one per asym, but not when you have multiple waters.
One solution here would be to require that an integer seq_id
be assigned to water Atom
objects at construction time, just as is done for polymer atoms. This would make things unambiguous. To avoid breaking existing code, seq_id=None
could be treated as equivalent to seq_id=1
. The downside to this approach is that the dictionary uses .
for nonpolymer seq_id of course, so we'd have to discard it when we output the table.
An alternative would be to have the atom_site
dumper ignore the Atom
seq_id for nonpolymers and assign them sequential IDs as it sees them. This wouldn't work of course if you defined an entity containing three waters and then had Model.get_atoms
return [third_water, first_water, second_water]
or even [first_water, third_water]
but maybe nobody would be crazy enough to do that.
Thinking of template based modelling, using templates from the PDB, the author assigned seq_id's for water are not always sequential starting at 1. E.g. in entry 5BS8, it looks like water molecules are together in chains with the polymers by author annotation (auth_asym_id
, auth_seq_id
). So numbering of water molecules starts at any number. Also, while modelling a user may decide to only keep water that mediates ligand interactions and discard all other water. So I think beside having working seq_id's for water, they also should work with auth_seq_id_map
. Otherwise discussions of models referring to the original template may get pretty awkward. That makes a solution where I need to set seq_id
for water and seq_id=None
becomes 1
look like the best option to me atm, if that also helps to get auth_seq_id_map
to work.
Thanks a lot! I tested the fix, mmCIF files produced now do pass CifCheck
.
When dealing with multiple water molecules in a single mmCIF file,
auth_seq_id
items seem off in the_atom_site
category. For_pdbx_nonpoly_scheme.auth_seq_num
, water molecules are numbered 1-n, which seems to be correct. For_atom_site.auth_seq_id
, water molecules are always 1 while it should be the corresponding value in_pdbx_nonpoly_scheme.auth_seq_num
. The RCSB validation toolCifCheck
reports this as an error in the file.Here is a script, producing a mmCIF file with this issue:
From the produced file
water_only.cif
, categroies_pdbx_nonpoly_scheme
and_atom_site
are the interesting ones._pdbx_nonpoly_scheme.auth_seq_num
has values 1, 2, 3 while_atom_site.auth_seq_id
is 1 for everyHOH
but should reflect the 1, 2, 3 as in_pdbx_nonpoly_scheme.auth_seq_num
.