In IMP we often construct a model ensemble by combining data from multiple independent runs. Comparing the runs lets us determine whether sampling was sufficient (if sampling is not complete, each run will sample a different part of the conformational space). Unfortunately, the existing ihm_ensemble_info category allows only a single file (ensemble_file_id) to be referenced. We could combine our independent runs and deposit a single DCD file, but that would lose data. We would rather deposit a separate DCD file for each run, so that the model can be validated by rerunning the sampling convergence test.
An example of where this is currently done is PDB-Dev 37. There, ensemble_id 1 references external file 4, which is Ensemble_DCD/A_CSN.dcd, but this file contains only half of the ensemble structures (subsample A). We also deposited Ensemble_DCD/B_CSN.dcd as external file 79 (subsample B), but that file is not referenced from the ensemble table.
Proposal: add a subsample table and a subsample group table. The subsample table contains:
the name of the subsample
the subsample group to which it belongs
the number of structures in the subsample
a model_group_id for the structures (optional, not used for IMP structures)
an external file_id
The subsample group table contains:
the name of the group
the ensemble to which it belongs (an ensemble could contain multiple groups)
the type of the grouping, an enumeration with two values:
RANDOM: subsamples were generated by randomly partitioning all structures in the group
INDEPENDENT: each subsample was generated in the same fashion (the ensemble's post_process_id) but in independent simulations
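The two proposed tables can be sketched as plain Python data classes. This is only an illustration of the relationships described above, not dictionary syntax: the class and field names (Subsample, SubsampleGroup, GroupingType, num_structures, and so on) are hypothetical, and group membership is modelled by nesting subsamples inside their group rather than by a group id column. The ensemble id (1) and file ids (4 and 79) echo the PDB-Dev 37 example; the group name, the structure counts, and the choice of INDEPENDENT are placeholders, not values taken from that deposition.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class GroupingType(Enum):
    """How the subsamples in a group were produced (the proposed enumeration)."""
    RANDOM = "RANDOM"            # random partition of all structures in the group
    INDEPENDENT = "INDEPENDENT"  # same protocol, independent simulations


@dataclass
class Subsample:
    """One row of the hypothetical subsample table."""
    name: str
    num_structures: int
    file_id: int                          # external file holding the structures
    model_group_id: Optional[int] = None  # optional; not used for IMP structures


@dataclass
class SubsampleGroup:
    """One row of the hypothetical subsample group table."""
    name: str
    ensemble_id: int                      # an ensemble may own several groups
    grouping_type: GroupingType
    subsamples: List[Subsample] = field(default_factory=list)

    def total_structures(self) -> int:
        # Sum of the per-subsample structure counts.
        return sum(s.num_structures for s in self.subsamples)

    def file_ids(self) -> List[int]:
        # Every external file referenced through this group.
        return [s.file_id for s in self.subsamples]


# Placeholder values: the counts and grouping type are NOT from PDB-Dev 37.
group = SubsampleGroup(
    name="example group",
    ensemble_id=1,
    grouping_type=GroupingType.INDEPENDENT,
    subsamples=[
        Subsample(name="A", num_structures=100, file_id=4),
        Subsample(name="B", num_structures=100, file_id=79),
    ],
)
```

With a layout like this, both DCD files become reachable from the ensemble (group.file_ids() lists files 4 and 79), which is what a rerun of the sampling convergence test would need.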
Visualization in ChimeraX would likely proceed by adding the subsamples as child nodes of the ensemble and having separate coordsets for each one.