Open benmwebb opened 5 years ago
We can split the file in two - a "static" section and a "dynamic" section. The static section can use a BinaryCIF-like encoding (binary, compact, stuffed into a MessagePack blob). This would cover categories such as Entity and would be expected to be read completely into memory when the file is parsed. The dynamic section would be a set of variable-size vectors of fixed-size records containing fields that are omitted from the static section (the set of fields, such as Cartn_x, Cartn_y, Cartn_z, would be defined at runtime in a separate table in the static section).
To read a file (for example), atoms would first be constructed with dummy coordinates by reading the atom_site table in the static section. Then a given model_id would be looked up in the dynamic section, and the coordinates streamed in. The former step may only need to be done once; the latter step should be fast because the records are of fixed size (natoms * 3 floats) and the data is contiguous.
If model compositions can vary (e.g. in multistate modeling) multiple such static-dynamic pairs (one per composition) will be needed.
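The fixed-size-record layout above is what makes the dynamic section seekable. A minimal sketch of the idea, using plain `struct` packing in memory (the exact record layout here is an illustrative assumption, not a defined format):

```python
import io
import struct

# Sketch: each model is a fixed-size record of natoms * 3 little-endian
# 32-bit floats, so model i starts at byte offset i * record.size and can
# be read without parsing any of the preceding models.
natoms = 4
record = struct.Struct("<%df" % (natoms * 3))

buf = io.BytesIO()
for model in range(3):
    coords = [float(model * 100 + i) for i in range(natoms * 3)]
    buf.write(record.pack(*coords))

def read_model(fh, i):
    fh.seek(i * record.size)          # jump straight to model i
    return record.unpack(fh.read(record.size))

xyz = read_model(buf, 2)
print(xyz[0], xyz[-1])   # 200.0 211.0
```

Because every record has the same size, the lookup is a single seek plus one contiguous read, regardless of how many models precede it.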
Here is a proof of concept for storing mmCIF tables in an HDF5 file. This could be used to store an entire entry, or just the trajectory information (atom_site and/or ihm_sphere_obj_site) as is currently done for DCD. The file format is fairly simple. Each loop_ construct is mapped to a 1D HDF5 table of a compound data type; the name of the table matches the name of the category, and each row in the table maps to a row in the loop. The compound data type includes fields for each data item of a suitable type (int, float, string). Fixed-length HDF5 strings should be used where possible (the length will be that of the longest string data item, null-padded).
To handle mmCIF . or ? values, the compound data type can also include a cif_omitted_unknown field, which is an array of bytes. If the nth byte in this array is . or ?, the value for the nth data item is ignored and instead treated as mmCIF omitted or unknown, respectively. (A bitfield, as is done in BinaryCIF, may be more efficient but less readable.)
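The decoding step can be sketched in a few lines of Python (the helper name `apply_omitted_unknown` is hypothetical, introduced here for illustration):

```python
# Hypothetical helper: apply a cif_omitted_unknown byte string to a row of
# values, substituting the mmCIF markers at positions flagged '.' or '?'.
def apply_omitted_unknown(values, flags):
    out = []
    for i, v in enumerate(values):
        flag = flags[i:i + 1] if i < len(flags) else b" "
        if flag == b".":
            out.append(".")      # mmCIF "omitted"
        elif flag == b"?":
            out.append("?")      # mmCIF "unknown"
        else:
            out.append(v)        # stored value is used as-is
    return out

row = ["ATOM", "N", "N", "", "GLY", "A", 1, 1, "", 1.0]
flags = b"   .    ? "            # byte 3: label_alt_id, byte 8: ins_code
print(apply_omitted_unknown(row, flags))
# → ['ATOM', 'N', 'N', '.', 'GLY', 'A', 1, 1, '?', 1.0]
```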
A loops attribute, which is a compound datatype containing N-1 integer elements, can be attached to a table to describe extra loop dimensions. Each element should be named after a data item, and its value should be the minimum value of the loop counter for that data item. All tables of lower dimension are enumerated multiple times (they must match the shape of the higher-dimensional tables); the names of the tables are unimportant.

HDF5 groups allow for combining "static" and "dynamic" information so that trajectories can be efficiently stored. For example, the following creates an HDF5 file containing the atom_site table. Model x/y/z coordinates for two models are stored separately from the topology information (which is the same for each model):
```python
import h5py

f = h5py.File("test.hdf5", "w")
g = f.create_group("atom_site")

# Per-atom items that are the same in every model ("topology")
constant_items = g.create_dataset(
    "constant_items", (4,),
    dtype=[('group_PDB', 'S6'),
           ('type_symbol', 'S2'),
           ('label_atom_id', 'S4'),
           ('label_alt_id', 'S4'),
           ('label_comp_id', 'S4'),
           ('label_asym_id', 'S4'),
           ('label_seq_id', '<i4'),
           ('auth_seq_id', '<i4'),
           ('pdbx_PDB_ins_code', 'S4'),
           ('occupancy', '<f4'),
           ('cif_omitted_unknown', 'S10')])
# '.' in byte 3 marks label_alt_id as omitted; '?' in byte 8 marks
# pdbx_PDB_ins_code as unknown
constant_items[0] = ('ATOM', 'N', 'N', '', 'GLY', 'A', 1, 1, '', 1.0, '   .    ? ')
constant_items[1] = ('ATOM', 'C', 'CA', '', 'GLY', 'A', 1, 1, '', 1.0, '   .    ? ')
constant_items[2] = ('ATOM', 'C', 'C', '', 'GLY', 'A', 1, 1, '', 1.0, '   .    ? ')
constant_items[3] = ('ATOM', 'O', 'O', '', 'GLY', 'A', 1, 1, '', 1.0, '   .    ? ')

# Per-model coordinates: one row per model, one column per atom
loop_items = g.create_dataset(
    "loop_items", (2, 4),
    dtype=[('Cartn_x', '<f4'),
           ('Cartn_y', '<f4'),
           ('Cartn_z', '<f4')])
# pdbx_PDB_model_num starts at 10 and increments along the first dimension
loop_items.attrs.create(name='loops', dtype=[('pdbx_PDB_model_num', '<i4')],
                        data=[(10,)])
loop_items[0] = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)]
loop_items[1] = [(11, 12, 13), (14, 15, 16), (17, 18, 19), (20, 21, 22)]
f.close()
```
To construct the equivalent mmCIF representation, the loop_items dataset is iterated over 2 models and 4 atoms in each model. Every atom in the first model is assigned pdbx_PDB_model_num=10, and every atom in the second model, 11. For each model, the constant_items dataset is also iterated over to provide the topology information, i.e. for atom B in model A, atom_site.Cartn_x is taken from loop_items[A][B], atom_site.group_PDB is taken from constant_items[B], and atom_site.pdbx_PDB_model_num is loop_items.attrs['loops'][0] + A. Note that this does not fill in atom_site.id.
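The reconstruction loop can be sketched with NumPy arrays standing in for the two HDF5 datasets (trimmed to just two of the fields, so the sketch stays self-contained; the field names match the example above):

```python
import numpy as np

# NumPy stand-ins for the constant_items/loop_items HDF5 datasets
constant_items = np.array(
    [("ATOM", "N"), ("ATOM", "C")],
    dtype=[("group_PDB", "S6"), ("type_symbol", "S2")])
loop_items = np.array(
    [[(1., 2., 3.), (4., 5., 6.)],
     [(11., 12., 13.), (14., 15., 16.)]],
    dtype=[("Cartn_x", "<f4"), ("Cartn_y", "<f4"), ("Cartn_z", "<f4")])
first_model_num = 10                       # the 'loops' attribute value

rows = []
nmodels, natoms = loop_items.shape
for a in range(nmodels):                   # outer loop: models
    for b in range(natoms):                # inner loop: atoms
        rows.append((constant_items[b]["group_PDB"].decode(),   # topology
                     float(loop_items[a][b]["Cartn_x"]),        # per-model
                     first_model_num + a))  # auto-incremented model number
print(rows[2])   # ('ATOM', 11.0, 11)
```

Each emitted row combines the per-atom constant data (indexed by atom only) with the per-model coordinates (indexed by model and atom), exactly as described above.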
If multiple models of differing composition are provided in the file, they could in turn be placed in separate HDF5 nested groups.
Also, no reason why the tables couldn't overlap and cover the same model numbers, e.g. if some subset of the system optimizes just x,y,z coordinates and some other subset also varies the radii. What this current proposal does not (efficiently) handle though, would be models where some subset of the atoms remain fixed (and thus have duplicated x,y,z coordinates in each model).
This could be made more flexible by modifying the loops attribute to specify a maximum value and a dimension (and perhaps renaming it to something more descriptive, like auto_increment), so that multiple auto-incremented data items (e.g. both ihm_model_id and pdbx_PDB_model_num for atom_site) can be generated.

HDF5 supports enumerated types. These could be used to more efficiently (using 1- or 2-byte integers per data item) store strings like ATOM/HETATM or residue/atom/element names, and could also cover the mmCIF ?/. special values.
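The space saving can be sketched with plain NumPy integer codes (h5py exposes real HDF5 enums via `h5py.enum_dtype`, which works the same way; the code table below is an illustrative assumption):

```python
import numpy as np

# Sketch of the enum idea: store ATOM/HETATM as 1-byte codes instead of
# fixed-length strings.
codes = {"ATOM": 0, "HETATM": 1}
names = {v: k for k, v in codes.items()}

group_pdb = ["ATOM", "ATOM", "HETATM", "ATOM"]
packed = np.array([codes[g] for g in group_pdb], dtype="i1")

print(packed.nbytes)                       # 4 bytes, vs 24 for 'S6' strings
print([names[c] for c in packed])          # round-trips to the original
```

Since each row stays fixed-size, the table remains seekable; only the per-item width shrinks.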
Since the intention is to provide only a subset of mmCIF tables in HDF5 (e.g. atom_site) and link back to a main mmCIF file, only enough information needs to be provided in HDF5 to unambiguously map back. Thus, for atom_site, author-provided numbers and insertion codes are not required, so in most cases support for the ?/. special values is probably not needed.
@tomgoddard suggests that we also consider Zarr as an HDF5 alternative (although their support for C++ and other non-Python languages is still minimal). Also probably a good idea to get buy-in from other producers of ensembles (e.g. Rosetta, Haddock) and visualizers (e.g. Mol*).
Add support for read/write of a file format that (ideally):

1. uses the same data model as mmCIF;
2. allows extra models to be added without rewriting the entire file;
3. is compact;
4. is seekable, so a single model can be read without parsing the whole file.
Such a format would allow IMP folks to ditch the old RMF format as the "working format", and easily convert the resulting models to mmCIF. This may also be a practical solution for those that need to deposit huge files, bigger than is really practical for current mmCIF.
Conventional mmCIF fails points 2-4. We can't easily add an extra model to an mmCIF file without rewriting the entire file (since various IDs would have to be updated, and the data is gathered per-table rather than per-model). We also can't read a single trajectory frame without scanning the whole file.
MMTF (#11) has its own data model, so fails point 1. Both MMTF and BinaryCIF support a number of encoding mechanisms to generate compact files (point 3), but these render the file non-seekable (e.g. run-length encoding of the ATOM/HETATM field in the atom_site table necessitates reading and unpacking the entire thing to determine whether a given atom is ATOM or HETATM). Fast parsing probably necessitates a binary format.
Proposal: use HDF5 as a binary container format. Each mmCIF category would map to an HDF5 table, which should be trivially seekable and extendable. This won't be as compact as BinaryCIF or MMTF (although HDF5 does support compression). To address this we can use HDF5 enumerated types to replace each string drawn from a small set (e.g. ATOM or HETATM) with a suitably small integer (just a 0 or 1 bool type in this case).