ihmwg / python-ihm

Python package for handling IHM mmCIF and BinaryCIF files
MIT License

Add a new mmCIF-compatible binary format, e.g. for trajectories #31

Open benmwebb opened 5 years ago

benmwebb commented 5 years ago

Add support for read/write of a file format that (ideally)

  1. supports arbitrary mmCIF key:value pairs (so that we can convert it loss-free to or from mmCIF, including with custom data as per #30)
  2. is fast to parse
  3. is compact
  4. is seekable (so that we can easily add extra frames in a trajectory, or quickly access frames from the middle of an existing trajectory)

Such a format would allow IMP folks to ditch the old RMF format as the "working format", and easily convert the resulting models to mmCIF. This may also be a practical solution for those who need to deposit huge files, bigger than is really practical with current mmCIF.

Conventional mmCIF fails points 2-4. We can't easily add an extra model to an mmCIF file without rewriting the entire file (since various IDs would have to be updated, and the data is gathered per-table rather than per-model). We also can't read a single trajectory frame without scanning the whole file.

MMTF (#11) has its own data model, so fails point 1. Both MMTF and BinaryCIF support a number of encoding mechanisms to generate compact files (point 3), but these render the file non-seekable (e.g. run-length encoding of the ATOM/HETATM field in the atom_site table necessitates reading and unpacking the entire thing to determine whether a given atom is ATOM or HETATM).
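To make the seekability problem concrete, here is a minimal sketch (not BinaryCIF's actual codec) of run-length encoding a column such as group_PDB: looking up the value at index i requires walking the runs from the start, so there is no O(1) random access.

```python
def rle_encode(values):
    """Run-length encode a column into a list of (value, count) runs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def rle_lookup(runs, i):
    """Find the value at original index i; must scan runs from the beginning."""
    for value, count in runs:
        if i < count:
            return value
        i -= count
    raise IndexError(i)

column = ["ATOM"] * 5 + ["HETATM"] * 2 + ["ATOM"] * 3
runs = rle_encode(column)
print(runs)                  # [('ATOM', 5), ('HETATM', 2), ('ATOM', 3)]
print(rle_lookup(runs, 6))   # HETATM
```

The encoded form is much smaller for long homogeneous columns, which is exactly why the compact encodings trade away seekability.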

Fast parsing probably necessitates a binary format.

Proposal: use HDF5 as a binary container format. Each mmCIF category would map to an HDF5 table, which should be trivially seekable and extendable. This won't be as compact as BinaryCIF or MMTF (although HDF5 does support compression); to address this, we can combine approaches, as described below.

benmwebb commented 5 years ago

We can split the file in two: a "static" section and a "dynamic" section. The static section can use a BinaryCIF-like encoding (binary, compact, stuffed into a MessagePack blob). This would cover categories such as entity, and would be expected to be read completely into memory when the file is parsed. The dynamic section would be a set of variable-size vectors of fixed-size records containing the fields that are omitted from the static section (the set of fields, such as Cartn_x, Cartn_y, Cartn_z, would be defined at runtime in a separate table in the static section).

To read a file (for example), atoms would first be constructed with dummy coordinates by reading the atom_site table in the static section. Then a given model_id would be looked up in the dynamic section, and coordinates streamed in. The former step may only need to be done once; the latter step should be fast because the records are of fixed size (natoms * 3 floats) and the data is contiguous.
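The seek arithmetic this enables can be sketched as follows (names are hypothetical; the sketch assumes each frame is natoms contiguous float32 x/y/z triplets, as described above):

```python
import struct

# One fixed-size record per atom: Cartn_x, Cartn_y, Cartn_z as little-endian float32
RECORD = struct.Struct("<3f")

def read_frame(buf, model_index, natoms):
    """Read one model's coordinates; fixed-size frames make the offset pure
    arithmetic, with no scanning of earlier frames."""
    frame_size = natoms * RECORD.size
    offset = model_index * frame_size
    return [RECORD.unpack_from(buf, offset + i * RECORD.size)
            for i in range(natoms)]

# Two 4-atom frames packed back to back, standing in for the dynamic section.
coords = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0), (10.0, 11.0, 12.0),
          (11.0, 12.0, 13.0), (14.0, 15.0, 16.0), (17.0, 18.0, 19.0), (20.0, 21.0, 22.0)]
buf = b"".join(RECORD.pack(*c) for c in coords)
print(read_frame(buf, 1, 4)[0])  # (11.0, 12.0, 13.0)
```

Appending a frame is equally cheap: new records are written at the end of the vector without touching earlier frames.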

If model compositions can vary (e.g. in multistate modeling) multiple such static-dynamic pairs (one per composition) will be needed.

benmwebb commented 1 year ago

Here is a proof of concept for storing mmCIF tables in an HDF5 file. This could be used to store an entire entry, or it could be used to store just the trajectory information (atom_site and/or ihm_sphere_obj_site) as is currently done for DCD. The file format is fairly simple:

HDF5 groups allow for combining "static" and "dynamic" information so that trajectories can be efficiently stored. For example, the following creates an HDF5 file containing the atom_site table. Model x/y/z coordinates for two models are stored separately from the topology information (which is the same for each model):

```python
import h5py

f = h5py.File("test.hdf5", "w")
g = f.create_group("atom_site")

# Per-atom topology, identical in every model ("static" information)
constant_items = g.create_dataset(
    "constant_items", (4,),
    dtype=[('group_PDB', 'S6'),
           ('type_symbol', 'S2'),
           ('label_atom_id', 'S4'),
           ('label_alt_id', 'S4'),
           ('label_comp_id', 'S4'),
           ('label_asym_id', 'S4'),
           ('label_seq_id', '<i4'),
           ('auth_seq_id', '<i4'),
           ('pdbx_PDB_ins_code', 'S4'),
           ('occupancy', '<f4'),
           ('cif_omitted_unknown', 'S10')])
constant_items[0] = ('ATOM', 'N', 'N', '', 'GLY', 'A', 1, 1, '', 1.0, '   .    ? ')
constant_items[1] = ('ATOM', 'C', 'CA', '', 'GLY', 'A', 1, 1, '', 1.0, '   .    ? ')
constant_items[2] = ('ATOM', 'C', 'C', '', 'GLY', 'A', 1, 1, '', 1.0, '   .    ? ')
constant_items[3] = ('ATOM', 'O', 'O', '', 'GLY', 'A', 1, 1, '', 1.0, '   .    ? ')

# Per-model x/y/z coordinates ("dynamic" information): 2 models, 4 atoms each
loop_items = g.create_dataset(
    "loop_items", (2, 4),
    dtype=[('Cartn_x', '<f4'),
           ('Cartn_y', '<f4'),
           ('Cartn_z', '<f4')])
# The first model gets pdbx_PDB_model_num=10; later models increment from there
loop_items.attrs.create(name='loops', dtype=[('pdbx_PDB_model_num', '<i4')],
                        data=[(10,)])
loop_items[0] = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)]
loop_items[1] = [(11, 12, 13), (14, 15, 16), (17, 18, 19), (20, 21, 22)]

f.close()
```

To construct the equivalent mmCIF representation, the loop_items dataset is iterated over (2 models, 4 atoms per model). Every atom in the first model is assigned pdbx_PDB_model_num=10, and every atom in the second model, 11. For each model, the constant_items dataset is also iterated over to provide the topology information. That is, for atom B in model A, atom_site.Cartn_x is taken from loop_items[A][B], atom_site.group_PDB is taken from constant_items[B], and atom_site.pdbx_PDB_model_num is loop_items.attrs['loops'][0] + A.
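That reconstruction loop can be sketched as follows (plain lists stand in for the HDF5 datasets above, and only a few topology fields are shown):

```python
# Stand-ins for the constant_items / loop_items datasets in the example above.
constant_items = [("ATOM", "N", "N"), ("ATOM", "C", "CA"),
                  ("ATOM", "C", "C"), ("ATOM", "O", "O")]
loop_items = [[(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)],
              [(11, 12, 13), (14, 15, 16), (17, 18, 19), (20, 21, 22)]]
first_model_num = 10  # loop_items.attrs['loops'][0]

def make_rows():
    """Expand the static/dynamic pair into per-atom mmCIF-style rows.

    For atom B in model A: topology from constant_items[B], coordinates
    from loop_items[A][B], and pdbx_PDB_model_num = first_model_num + A.
    """
    rows = []
    for a, frame in enumerate(loop_items):
        for b, (x, y, z) in enumerate(frame):
            group_pdb, symbol, atom_id = constant_items[b]
            rows.append((group_pdb, atom_id, x, y, z, first_model_num + a))
    return rows

rows = make_rows()
print(rows[4])  # ('ATOM', 'N', 11, 12, 13, 11)
```

As noted below, atom_site.id is not stored; a writer would assign it sequentially while emitting the rows.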

Note that this does not fill in atom_site.id.

If multiple models of differing composition are provided in the file, they could in turn be placed in separate HDF5 nested groups.

benmwebb commented 1 year ago

> If multiple models of differing composition are provided in the file, they could in turn be placed in separate HDF5 nested groups.

Also, there is no reason why the tables couldn't overlap and cover the same model numbers, e.g. if some subset of the system optimizes just x,y,z coordinates and some other subset also varies the radii. What this current proposal does not handle efficiently, though, is models where some subset of the atoms remains fixed (and thus has duplicated x,y,z coordinates in each model).

benmwebb commented 1 year ago

This could be made more flexible by modifying the loops attribute to specify a maximum value and a dimension (and perhaps renaming it to something more descriptive, like auto_increment).

HDF5 supports enumerated types. These could be used to store strings like ATOM/HETATM or residue/atom/element names more efficiently (using 1- or 2-byte integers per data item), and could also cover the mmCIF ?/. special values.
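A sketch of this with h5py's enum support (the dataset name and enum members are hypothetical; an in-memory file is used so nothing is left on disk):

```python
import h5py

# Hypothetical enumerated type for atom_site.group_PDB: one byte per row,
# with the mmCIF "omitted" (.) and "unknown" (?) special values as extra members.
group_pdb_t = h5py.enum_dtype({"ATOM": 0, "HETATM": 1, ".": 2, "?": 3},
                              basetype="i1")

f = h5py.File("enum.hdf5", "w", driver="core", backing_store=False)
ds = f.create_dataset("group_PDB", (4,), dtype=group_pdb_t)
ds[:] = [0, 0, 1, 3]  # ATOM, ATOM, HETATM, ?

# The name<->code mapping travels with the dataset's dtype.
mapping = h5py.check_enum_dtype(ds.dtype)
names = {code: name for name, code in mapping.items()}
decoded = [names[code] for code in ds[:]]
f.close()
print(decoded)  # ['ATOM', 'ATOM', 'HETATM', '?']
```

Because the enum definition is stored in the file itself, readers in other languages can recover the string values without any out-of-band dictionary.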

Since the intention is to provide only a subset of mmCIF tables in HDF5 (e.g. atom_site) and link back to a main mmCIF file, only enough information needs to be stored in HDF5 to map back unambiguously. Thus, for atom_site, author-provided numbers and insertion codes are not required, so in most cases support for the ?/. special values is probably not needed either.

@tomgoddard suggests that we also consider Zarr as an HDF5 alternative (although its support for C++ and other non-Python languages is still minimal). It would also probably be a good idea to get buy-in from other producers of ensembles (e.g. Rosetta, Haddock) and visualizers (e.g. Mol*).