Indicate how a dataset is used (as restraint, for validation, etc.)

benmwebb commented 6 years ago

Information can be used in integrative/hybrid modeling in several ways:

- Restraint: for example, a crosslink can be represented as a harmonic score term that is minimized during computational sampling.
- Filter: for example, a pool of generated models can be narrowed by keeping only those that are consistent with one or more experiments.
- Validation: the data can be withheld from the modeling and used only to validate the final set of structures.
- Representation: the nature of the data can determine how the system is represented. For example, an available crystal structure of a subunit or interface can be used as a starting model, and perhaps treated as a rigid body.
- Sampling: for example, a Monte Carlo move set can be chosen to explore only conformations that are consistent with the data.
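
To make the first two roles concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of the dictionary or of any modeling package: the `DatasetUsage` names, the `harmonic_score` and `filter_models` helpers, and the `xl_distance` key are all hypothetical.

```python
from enum import Enum


class DatasetUsage(Enum):
    """The five roles listed above (hypothetical names, not dictionary terms)."""
    RESTRAINT = "restraint"
    FILTER = "filter"
    VALIDATION = "validation"
    REPRESENTATION = "representation"
    SAMPLING = "sampling"


def harmonic_score(distance, target, k=1.0):
    """Harmonic score term for one crosslink: zero at the target distance."""
    return 0.5 * k * (distance - target) ** 2


def filter_models(models, max_distance):
    """Keep only models whose crosslinked distance is within max_distance."""
    return [m for m in models if m["xl_distance"] <= max_distance]


# Two toy models, each carrying one precomputed crosslink distance.
models = [{"name": "m1", "xl_distance": 18.0},
          {"name": "m2", "xl_distance": 41.0}]
kept = filter_models(models, max_distance=30.0)
```

Used as a restraint, the crosslink contributes `harmonic_score` to the function being minimized; used as a filter, the same data only decides which finished models are kept.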

Currently, if we list a dataset in the mmCIF file, the unwritten assumption is that most experimental data are used as restraints, while starting models are used to choose a representation. It may be useful to state explicitly which datasets were used for validation, for example, since that information could help validate the mmCIF model itself.

benmwebb commented 6 years ago

One proposal: add optional struct_assembly_id and dataset_group_id items to the ihm_modeling_post_process category, so that we can see what is being filtered or rescored (the assembly) and against which data (the dataset group). Also add validation to the ihm_modeling_post_process.type enumeration, or perhaps add a separate validation table. Which datasets are used for filtering or validation is then clear from where they appear in this table. Thoughts?
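
A rough Python sketch of how a reader of such a table could recover dataset usage. This is only an illustration of the proposal: the `PostProcessStep` class, the `dataset_group_usage` helper, and the example `type` values are assumptions, not the dictionary's actual definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PostProcessStep:
    """One row of a hypothetical extended ihm_modeling_post_process table."""
    id: int
    type: str  # illustrative values: "cluster", "filter", proposed "validation"
    struct_assembly_id: Optional[int] = None  # proposed: what is filtered or rescored
    dataset_group_id: Optional[int] = None    # proposed: which data it is scored against


def dataset_group_usage(steps):
    """Map each referenced dataset group to the set of step types that use it."""
    usage = {}
    for step in steps:
        if step.dataset_group_id is not None:
            usage.setdefault(step.dataset_group_id, set()).add(step.type)
    return usage


steps = [
    PostProcessStep(id=1, type="cluster"),
    PostProcessStep(id=2, type="filter", struct_assembly_id=1, dataset_group_id=2),
    PostProcessStep(id=3, type="validation", struct_assembly_id=1, dataset_group_id=3),
]
usage = dataset_group_usage(steps)
```

Here dataset group 2 is identified as filtering data and group 3 as validation data purely from where they appear; the clustering step references no dataset group and so contributes nothing.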

amjjbonvin commented 6 years ago

Can you give a specific example of what you are meaning by this?

In HADDOCK terms, for example, would this describe our clustering process? That is, at the end we consider only models that cluster (and typically discard isolated ones).

benmwebb commented 6 years ago

> Can you give a specific example of what you are meaning by this?

I thought I'd done that when I opened the issue, but I'll rephrase: given N input experiments, one could convert each of them into restraints and optimize a scoring function that's the sum of all N. This is perfectly captured by the current dictionary. Alternatively, one could convert N-x inputs to restraints and optimize, and rescore the final models against the left-out x inputs, the idea being that if the model satisfies data that wasn't used in its construction, it is more likely to be correct. Put another way, the dictionary doesn't currently state which data we used for the training set and which for the test set. It may be useful for PDB to know this in order to facilitate their own validation of the models.
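
A toy Python sketch of that training/test split, under loud simplifying assumptions: the "model" is a single scalar, each experiment is reduced to a hypothetical `(target, weight)` harmonic restraint, and `total_score` and `optimize` are made-up helpers, not functions from any modeling package.

```python
def total_score(x, restraints):
    """Sum of harmonic terms, one per (target, weight) restraint."""
    return sum(0.5 * k * (x - target) ** 2 for target, k in restraints)


def optimize(restraints):
    """Minimize the summed harmonic score; for a scalar this is the weighted mean."""
    total_weight = sum(k for _, k in restraints)
    return sum(k * target for target, k in restraints) / total_weight


# N = 4 toy "experiments", each reduced to a (target, weight) restraint.
inputs = [(10.0, 1.0), (12.0, 1.0), (11.0, 2.0), (11.5, 1.0)]
training, held_out = inputs[:3], inputs[3:]  # N-x used for modeling, x withheld

model = optimize(training)                       # built from the training set only
validation_score = total_score(model, held_out)  # rescored against left-out data
```

A low `validation_score` means the model agrees with data that played no part in building it. Nothing in the current file format would tell a reader that the last input was held out rather than used as a restraint.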

> In HADDOCK terms for example, would this describe our clustering process?

AFAIK the existing ihm_modeling_post_process category would handle that just fine.

brindakv commented 6 years ago

@benmwebb This has been addressed in the latest dictionary update.

ihmwg / IHMCIF

Indicate how a dataset is used (as restraint, for validation, etc.) #56