ihmwg / python-ihm

Python package for handling IHM mmCIF and BinaryCIF files
MIT License
14 stars 7 forks source link

Water molecules, _pdbx_nonpoly_scheme.auth_seq_num vs. _atom_site.auth_seq_id #98

Closed bienchen closed 1 year ago

bienchen commented 1 year ago

When dealing with multiple water molecules in a single mmCIF file, auth_seq_id items seem off in the _atom_site category. For _pdbx_nonpoly_scheme.auth_seq_num, water molecules are numbered 1-n, which seems to be correct. For _atom_site.auth_seq_id, water molecules are always 1 while it should be the corresponding value in _pdbx_nonpoly_scheme.auth_seq_num. The RCSB validation tool CifCheck reports this as an error in the file.

Here is a script, producing a mmCIF file with this issue:

import ihm.dumper
import ihm.model
import ihm.protocol
import ihm.representation

system = ihm.System()

entity_h2o = ihm.Entity(
    [ihm.WaterChemComp(), ihm.WaterChemComp(), ihm.WaterChemComp()],
    description="Water",
)
system.entities.append(entity_h2o)

asym_h2o = ihm.AsymUnit(entity_h2o, details="Waters")
system.asym_units.append(asym_h2o)

assembly = ihm.Assembly((asym_h2o,), "Water assembly")

class WaterModel(ihm.model.Model):
    other_details = None

    def get_atoms(self):
        yield ihm.model.Atom(
            asym_unit=asym_h2o,
            type_symbol="O",
            het=True,
            seq_id=None,
            atom_id="O",
            x=10.0,
            y=10.0,
            z=10.0,
        )
        yield ihm.model.Atom(
            asym_unit=asym_h2o,
            type_symbol="O",
            het=True,
            seq_id=None,
            atom_id="O",
            x=20.0,
            y=20.0,
            z=20.0,
        )
        yield ihm.model.Atom(
            asym_unit=asym_h2o,
            type_symbol="O",
            het=True,
            seq_id=None,
            atom_id="O",
            x=30.0,
            y=30.0,
            z=30.0,
        )

rep = ihm.representation.Representation(
    [ihm.representation.AtomicSegment(asym_h2o, rigid=False)]
)
protocol = ihm.protocol.Protocol(name="Modelling")

model = WaterModel(
    assembly=assembly, protocol=protocol, name="Water model", representation=rep
)
model_group = ihm.model.ModelGroup([model], name="All models")
state = ihm.model.State([model_group])
system.state_groups.append(ihm.model.StateGroup([state]))

with open("water_only.cif", "w", encoding="ascii") as mmcif_fh:
    ihm.dumper.write(mmcif_fh, [system])

From the produced file water_only.cif, categroies _pdbx_nonpoly_scheme and _atom_site are the interesting ones. _pdbx_nonpoly_scheme.auth_seq_num has values 1, 2, 3 while _atom_site.auth_seq_id is 1 for every HOH but should reflect the 1, 2, 3 as in _pdbx_nonpoly_scheme.auth_seq_num.

benmwebb commented 1 year ago

So the issue here is that in python-ihm the auth_seq_id is set in the auth_seq_id_map argument to AsymUnit. This maps a sequence index (1,2,3,..., which for polymers is the same as seq_id) to auth_seq_id. The problem is that when we come to handle Atom objects for nonpolymers, it can't tell what the sequence index is because they all have seq_id = None. Currently it just assumes 1 which works OK for ligands, where you generally only have one per asym, but not when you have multiple waters.

One solution here would be to require that an integer seq_id be assigned to water Atom objects at construction time, just as is done for polymer atoms. This would make things unambiguous. To avoid breaking existing code, seq_id=None could be treated as equivalent to seq_id=1. The downside to this approach is that the dictionary uses . for nonpolymer seq_id of course, so we'd have to discard it when we output the table.

An alternative would be to have the atom_site dumper ignore the Atom seq_id for nonpolymers and assign them sequential IDs as it sees them. This wouldn't work of course if you defined an entity containing three waters and then had Model.get_atoms return [third_water, first_water, second_water] or even [first_water, third_water] but maybe nobody would be crazy enough to do that.

bienchen commented 1 year ago

Thinking of template based modelling, using templates from the PDB, the author assigned seq_id's for water are not always sequential starting at 1. E.g. in entry 5BS8, it looks like water molecules are together in chains with the polymers by author annotation (auth_asym_id, auth_seq_id). So numbering of water molecules starts at any number. Also, while modelling a user may decide to only keep water that mediates ligand interactions and discard all other water. So I think beside having working seq_id's for water, they also should work with auth_seq_id_map. Otherwise discussions of models referring to the original template may get pretty awkward. That makes a solution where I need to set seq_id for water and seq_id=None becomes 1 look like the best option to me atm, if that also helps to get auth_seq_id_map to work.

bienchen commented 1 year ago

Thanks a lot! I tested the fix, mmCIF files produced now do pass CifCheck.