Add support for ensemble subsamples

In IMP we often construct a model ensemble by combining data from multiple independent runs. Comparing the runs lets us determine whether sampling was sufficient (if sampling is not complete, each run will sample a different part of the conformational space). Unfortunately, the existing ihm_ensemble_info category only allows a single file (ensemble_file_id) to be referenced. We could combine our independent runs and deposit a single DCD file, but that would lose data. We would rather deposit a separate DCD file for each run, so that the model can be validated by rerunning the sampling convergence test.

An example where we already do this is PDB-Dev 37. There, ensemble_id 1 references external file 4 (Ensemble_DCD/A_CSN.dcd), but that file contains only half of the ensemble structures (subsample A). We also deposited Ensemble_DCD/B_CSN.dcd as external file 79 (subsample B), but nothing in the ensemble table references it.

Proposal: add a subsample and a subsample-group table. The subsample table contains:

- the name of the subsample
- the subsample group to which it belongs
- the number of structures in the subsample
- a model_group_id for the structures (optional, not used for IMP structures)
- an external file_id

The subsample group table contains:

- the name of the group
- the ensemble to which it belongs (an ensemble could contain multiple groups)
- the type of the grouping, an enumeration:
  - RANDOM: subsamples were generated by randomly partitioning all structures in the group
  - INDEPENDENT: each subsample was generated in the same fashion (the ensemble's post_process_id) but in independent simulations
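
To make the two proposed tables concrete, here is a minimal Python sketch of how they might relate. The class and field names (`Subsample`, `SubsampleGroup`, `GroupingType`, `num_structures`, and so on) are hypothetical and are not part of the dictionary or of any existing library. The file IDs 4 and 79 come from the PDB-Dev 37 example above; the structure counts, the group name, and the choice of INDEPENDENT are placeholders for illustration only.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class GroupingType(Enum):
    """How the subsamples in a group were generated (the proposed enumeration)."""
    RANDOM = "RANDOM"            # random partition of all structures in the group
    INDEPENDENT = "INDEPENDENT"  # same protocol, independent simulations


@dataclass
class Subsample:
    """One row of the proposed subsample table (hypothetical field names)."""
    name: str
    num_structures: int
    file_id: int                          # external file holding the structures
    model_group_id: Optional[int] = None  # optional; not used for IMP structures


@dataclass
class SubsampleGroup:
    """One row of the proposed subsample-group table, plus its subsamples."""
    name: str
    ensemble_id: int                      # an ensemble may own several groups
    grouping_type: GroupingType
    subsamples: List[Subsample] = field(default_factory=list)

    def total_structures(self) -> int:
        """Sum of the structure counts over all subsamples in this group."""
        return sum(s.num_structures for s in self.subsamples)

    def file_ids(self) -> List[int]:
        """External file IDs referenced by this group, in subsample order."""
        return [s.file_id for s in self.subsamples]


# PDB-Dev 37: ensemble 1, with subsample A in file 4 and subsample B in file 79.
# The counts (100), the group name, and the grouping type are placeholders.
group = SubsampleGroup(
    name="Sampling runs",
    ensemble_id=1,
    grouping_type=GroupingType.INDEPENDENT,
    subsamples=[
        Subsample(name="A", num_structures=100, file_id=4),
        Subsample(name="B", num_structures=100, file_id=79),
    ],
)
```

With this layout, both DCD files are reachable from the ensemble through `group.file_ids()`, which is what the current single ensemble_file_id cannot express.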

Visualization in ChimeraX would likely proceed by adding the subsamples as child nodes of the ensemble and having separate coordsets for each one.
