salilab / imp

The Integrative Modeling Platform
https://integrativemodeling.org
GNU General Public License v3.0

Track provenance information for all modeling inputs #976

Open benmwebb opened 7 years ago

benmwebb commented 7 years ago

IMP currently accepts input files in a variety of formats, but doesn't record where those files originate. This becomes a problem when we come to publish a modeling study and deposit the files (e.g. at PDB-Dev): it's a lot of work to backtrack and figure out where each file came from. It would be much simpler if IMP tracked this information from day one, reading it in some standardized way from the files themselves (or the Python script), storing it in the Model, and also storing it in RMF files.

Since this is prerequisite information for outputting mmCIF files, solving this issue would be a step towards addressing #968. Much of this information is currently stored outside of the Model, mostly in PMI 1 data structures, so outputting mmCIF currently requires PMI 1.

Only input atomic models are explicitly considered here, but similar considerations should apply to restraints (e.g. where an EM map comes from), sequences (e.g. UniProt identifier), etc. (More generally, any transformation of the model, such as sampling, filtering or clustering, should also be recorded.)
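To make the idea of recording each transformation concrete, here is a minimal sketch in plain Python. It is not IMP code: `ProvenanceRecord`, `history`, the `kind` strings, the detail keys and the file name are all hypothetical, chosen only to illustrate one possible shape for the data. Each record points at the step that came before it, so walking the chain from the newest record recovers the full history back to the input file.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class ProvenanceRecord:
    """One step in a model's history (hypothetical, not an IMP class)."""
    kind: str                      # e.g. "structure", "sample", "filter"
    details: dict                  # free-form metadata for this step
    previous: Optional["ProvenanceRecord"] = None  # the step before this one


def history(latest: Optional[ProvenanceRecord]) -> Iterator[ProvenanceRecord]:
    """Yield records from the most recent step back to the original input."""
    record = latest
    while record is not None:
        yield record
        record = record.previous


# Illustrative values only; none of these come from a real study.
structure = ProvenanceRecord("structure", {"filename": "input.pdb", "chain": "A"})
sampled = ProvenanceRecord("sample", {"method": "Monte Carlo", "frames": 1000},
                           previous=structure)
filtered = ProvenanceRecord("filter", {"score_threshold": 100.0},
                            previous=sampled)

kinds = [record.kind for record in history(filtered)]
```

A linked chain like this keeps each step self-describing, and appending a new transformation never requires touching the records that came before it.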

Input files

Storage in Model

Storage in RMF

benmwebb commented 6 years ago

To track the provenance of most experimental information, some additional information needs to be stored in the RMF file: the set of restraints, which particles they act on, and which restraints were used in each sampling step.

Proposal: RMF already stores basic information about decomposed restraints. Make each set of decomposed restraints children of the 'real' restraint, which holds serialized information on the restraint itself (e.g. the filename the EM map was read from, cross-correlation information, total score). The SampleProvenance decorator then contains an RMF Alias child node for each restraint used in that sampling step. This information is already stored in IMP (partly in the Model, and partly in the ScoringFunction).
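The proposed layout could be sketched roughly as follows. This is plain Python rather than the RMF or IMP API: `Node`, `Alias` and `SampleStep` are hypothetical stand-ins for an RMF node, an RMF Alias node and the SampleProvenance decorator, and the file names and score are made up. The point is only that the parent restraint carries the serialized information, the decomposed restraints hang beneath it, and a sampling step refers to restraints by alias instead of copying them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Stand-in for an RMF node (hypothetical, not the RMF API)."""
    name: str
    info: dict = field(default_factory=dict)            # serialized restraint info
    children: List["Node"] = field(default_factory=list)


@dataclass
class Alias:
    """Stand-in for an RMF Alias node: a reference to a node, not a copy."""
    target: Node


@dataclass
class SampleStep:
    """Stand-in for SampleProvenance: lists the restraints used by alias."""
    method: str
    restraints: List[Alias] = field(default_factory=list)


# The 'real' restraint holds the serialized information...
em = Node("EM fit", info={"filename": "map.mrc", "total_score": 12.5})
# ...and its decomposed restraints become its children.
em.children = [Node("EM fit: particle 0"), Node("EM fit: particle 1")]

# A second restraint that exists in the file but is not used in this step.
crosslinks = Node("crosslinks", info={"filename": "xlinks.csv"})

# This sampling step used only the EM restraint.
step = SampleStep("Monte Carlo", restraints=[Alias(em)])
used = [alias.target.name for alias in step.restraints]
```

Because the aliases point at the same node objects, information such as the EM map filename is stored once and remains reachable from every sampling step that used that restraint.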