ihmwg / python-ihm

Python package for handling IHM mmCIF and BinaryCIF files
MIT License

Add a new mmCIF-compatible binary format, e.g. for trajectories #31

Open benmwebb opened 5 years ago

benmwebb commented 5 years ago

Add support for read/write of a file format that (ideally)

  1. supports arbitrary mmCIF key:value pairs (so that we can convert it loss-free to or from mmCIF, including with custom data as per #30)
  2. is fast to parse
  3. is compact
  4. is seekable (so that we can easily add extra frames in a trajectory, or quickly access frames from the middle of an existing trajectory)

Such a format would allow IMP folks to ditch the old RMF format as the "working format", and easily convert the resulting models to mmCIF. This may also be a practical solution for those who need to deposit huge files, bigger than is really practical with current mmCIF.

Conventional mmCIF fails points 2-4. We can't easily add an extra model to an mmCIF file without rewriting the entire file (since various IDs would have to be updated, and the data is gathered per-table rather than per-model). We also can't read a single trajectory frame without scanning the whole file.

MMTF (#11) has its own data model, so fails point 1. Both MMTF and BinaryCIF support a number of encoding mechanisms to generate compact files (point 3), but these render the file non-seekable (e.g. run-length encoding of the ATOM/HETATM field in the atom_site table necessitates reading and unpacking the entire thing to determine whether a given atom is ATOM or HETATM).
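To make the seekability problem concrete, here is a minimal sketch (not BinaryCIF's actual codec) of run-length encoding a column such as group_PDB: looking up the value at index i requires walking the runs from the start, so there is no O(1) random access.

```python
def rle_encode(values):
    """Run-length encode a column into a list of (value, count) runs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def rle_lookup(runs, i):
    """Find the value at original index i; must scan runs from the beginning."""
    for value, count in runs:
        if i < count:
            return value
        i -= count
    raise IndexError(i)

column = ["ATOM"] * 5 + ["HETATM"] * 2 + ["ATOM"] * 3
runs = rle_encode(column)
print(runs)                  # [('ATOM', 5), ('HETATM', 2), ('ATOM', 3)]
print(rle_lookup(runs, 6))   # HETATM
```

The encoded form is much smaller for long homogeneous columns, which is exactly why the compact encodings trade away seekability.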

Fast parsing probably necessitates a binary format.

Proposal: use HDF5 as a binary container format. Each mmCIF category would map to an HDF5 table, which should be trivially seekable and extendable. This won't be as compact as BinaryCIF or MMTF (although HDF5 does support compression); to address this, we can combine approaches, as described below.

benmwebb commented 5 years ago

We can split the file in two: a "static" section and a "dynamic" section. The static section can use a BinaryCIF-like encoding (binary, compact, stuffed into a MessagePack blob). This would cover categories such as entity, and would be expected to be read completely into memory when the file is parsed. The dynamic section would be a set of variable-size vectors of fixed-size records containing the fields that are omitted from the static section (the set of fields, such as Cartn_x, Cartn_y, Cartn_z, would be defined at runtime in a separate table in the static section).

To read a file (for example), atoms would first be constructed with dummy coordinates by reading the atom_site table in the static section. Then a given model_id would be looked up in the dynamic section, and coordinates streamed in. The former step may only need to be done once; the latter step should be fast because the records are of fixed size (natoms * 3 floats) and the data is contiguous.
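The seek arithmetic this enables can be sketched as follows (names are hypothetical; the sketch assumes each frame is natoms contiguous float32 x/y/z triplets, as described above):

```python
import struct

# One fixed-size record per atom: Cartn_x, Cartn_y, Cartn_z as little-endian float32
RECORD = struct.Struct("<3f")

def read_frame(buf, model_index, natoms):
    """Read one model's coordinates; fixed-size frames make the offset pure
    arithmetic, with no scanning of earlier frames."""
    frame_size = natoms * RECORD.size
    offset = model_index * frame_size
    return [RECORD.unpack_from(buf, offset + i * RECORD.size)
            for i in range(natoms)]

# Two 4-atom frames packed back to back, standing in for the dynamic section.
coords = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0), (10.0, 11.0, 12.0),
          (11.0, 12.0, 13.0), (14.0, 15.0, 16.0), (17.0, 18.0, 19.0), (20.0, 21.0, 22.0)]
buf = b"".join(RECORD.pack(*c) for c in coords)
print(read_frame(buf, 1, 4)[0])  # (11.0, 12.0, 13.0)
```

Appending a frame is equally cheap: new records are written at the end of the vector without touching earlier frames.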

If model compositions can vary (e.g. in multistate modeling) multiple such static-dynamic pairs (one per composition) will be needed.

benmwebb commented 1 year ago

Here is a proof of concept for storing mmCIF tables in an HDF5 file. This could be used to store an entire entry, or it could be used to store just the trajectory information (atom_site and/or ihm_sphere_obj_site) as is currently done for DCD. The file format is fairly simple:

HDF5 groups allow for combining "static" and "dynamic" information so that trajectories can be efficiently stored. For example, the following creates an HDF5 file containing the atom_site table. Model x/y/z coordinates for two models are stored separately from the topology information (which is the same for each model):

```python
import h5py

f = h5py.File("test.hdf5", "w")
g = f.create_group("atom_site")

# Per-atom topology, identical in every model ("static" information)
constant_items = g.create_dataset(
    "constant_items", (4,),
    dtype=[('group_PDB', 'S6'),
           ('type_symbol', 'S2'),
           ('label_atom_id', 'S4'),
           ('label_alt_id', 'S4'),
           ('label_comp_id', 'S4'),
           ('label_asym_id', 'S4'),
           ('label_seq_id', '<i4'),
           ('auth_seq_id', '<i4'),
           ('pdbx_PDB_ins_code', 'S4'),
           ('occupancy', '<f4'),
           ('cif_omitted_unknown', 'S10')])
constant_items[0] = ('ATOM', 'N', 'N', '', 'GLY', 'A', 1, 1, '', 1.0, '   .    ? ')
constant_items[1] = ('ATOM', 'C', 'CA', '', 'GLY', 'A', 1, 1, '', 1.0, '   .    ? ')
constant_items[2] = ('ATOM', 'C', 'C', '', 'GLY', 'A', 1, 1, '', 1.0, '   .    ? ')
constant_items[3] = ('ATOM', 'O', 'O', '', 'GLY', 'A', 1, 1, '', 1.0, '   .    ? ')

# Per-model x/y/z coordinates ("dynamic" information): 2 models, 4 atoms each
loop_items = g.create_dataset(
    "loop_items", (2, 4),
    dtype=[('Cartn_x', '<f4'),
           ('Cartn_y', '<f4'),
           ('Cartn_z', '<f4')])
# The first model gets pdbx_PDB_model_num=10; later models increment from there
loop_items.attrs.create(name='loops', dtype=[('pdbx_PDB_model_num', '<i4')],
                        data=[(10,)])
loop_items[0] = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)]
loop_items[1] = [(11, 12, 13), (14, 15, 16), (17, 18, 19), (20, 21, 22)]

f.close()
```

To construct the equivalent mmCIF representation, the loop_items dataset is iterated over (2 models, 4 atoms per model). Every atom in the first model is assigned pdbx_PDB_model_num=10, and every atom in the second model, 11. For each model, the constant_items dataset is also iterated over to provide the topology information. That is, for atom B in model A, atom_site.Cartn_x is taken from loop_items[A][B], atom_site.group_PDB is taken from constant_items[B], and atom_site.pdbx_PDB_model_num is loop_items.attrs['loops'][0] + A.
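That reconstruction loop can be sketched as follows (plain lists stand in for the HDF5 datasets above, and only a few topology fields are shown):

```python
# Stand-ins for the constant_items / loop_items datasets in the example above.
constant_items = [("ATOM", "N", "N"), ("ATOM", "C", "CA"),
                  ("ATOM", "C", "C"), ("ATOM", "O", "O")]
loop_items = [[(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)],
              [(11, 12, 13), (14, 15, 16), (17, 18, 19), (20, 21, 22)]]
first_model_num = 10  # loop_items.attrs['loops'][0]

def make_rows():
    """Expand the static/dynamic pair into per-atom mmCIF-style rows.

    For atom B in model A: topology from constant_items[B], coordinates
    from loop_items[A][B], and pdbx_PDB_model_num = first_model_num + A.
    """
    rows = []
    for a, frame in enumerate(loop_items):
        for b, (x, y, z) in enumerate(frame):
            group_pdb, symbol, atom_id = constant_items[b]
            rows.append((group_pdb, atom_id, x, y, z, first_model_num + a))
    return rows

rows = make_rows()
print(rows[4])  # ('ATOM', 'N', 11, 12, 13, 11)
```

As noted below, atom_site.id is not stored; a writer would assign it sequentially while emitting the rows.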

Note that this does not fill in atom_site.id.

If multiple models of differing composition are provided in the file, they could in turn be placed in separate HDF5 nested groups.

benmwebb commented 1 year ago

> If multiple models of differing composition are provided in the file, they could in turn be placed in separate HDF5 nested groups.

Also, there is no reason why the tables couldn't overlap and cover the same model numbers, e.g. if some subset of the system optimizes just x,y,z coordinates and some other subset also varies the radii. What this current proposal does not handle efficiently, though, is models where some subset of the atoms remains fixed (and thus has duplicated x,y,z coordinates in each model).

benmwebb commented 1 year ago

This could be made more flexible by modifying the loops attribute to specify a maximum value and a dimension (and perhaps renaming it to something more descriptive, like auto_increment).

HDF5 supports enumerated types. These could be used to store strings like ATOM/HETATM or residue/atom/element names more efficiently (using 1- or 2-byte integers per data item), and could also cover the mmCIF ?/. special values.
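A sketch of this with h5py's enum support (the dataset name and enum members are hypothetical; an in-memory file is used so nothing is left on disk):

```python
import h5py

# Hypothetical enumerated type for atom_site.group_PDB: one byte per row,
# with the mmCIF "omitted" (.) and "unknown" (?) special values as extra members.
group_pdb_t = h5py.enum_dtype({"ATOM": 0, "HETATM": 1, ".": 2, "?": 3},
                              basetype="i1")

f = h5py.File("enum.hdf5", "w", driver="core", backing_store=False)
ds = f.create_dataset("group_PDB", (4,), dtype=group_pdb_t)
ds[:] = [0, 0, 1, 3]  # ATOM, ATOM, HETATM, ?

# The name<->code mapping travels with the dataset's dtype.
mapping = h5py.check_enum_dtype(ds.dtype)
names = {code: name for name, code in mapping.items()}
decoded = [names[code] for code in ds[:]]
f.close()
print(decoded)  # ['ATOM', 'ATOM', 'HETATM', '?']
```

Because the enum definition is stored in the file itself, readers in other languages can recover the string values without any out-of-band dictionary.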

Since the intention is to provide only a subset of mmCIF tables in HDF5 (e.g. atom_site) and link back to a main mmCIF file, only enough information needs to be stored in HDF5 to map back unambiguously. Thus, for atom_site, author-provided numbers and insertion codes are not required, so in most cases support for the ?/. special values is probably not needed either.

@tomgoddard suggests that we also consider Zarr as an HDF5 alternative (although its support for C++ and other non-Python languages is still minimal). It would also probably be a good idea to get buy-in from other producers of ensembles (e.g. Rosetta, Haddock) and visualizers (e.g. Mol*).