ihmwg / IHMCIF

📖 mmCIF support for hybrid/integrative models
https://pdb-dev.wwpdb.org
Creative Commons Zero v1.0 Universal

Indicate how a dataset is used (as restraint, for validation, etc.) #56

Closed by benmwebb 6 years ago

benmwebb commented 6 years ago

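Information can be used in integrative/hybrid modeling in several ways: for example, to choose a representation, as a restraint, to filter or rescore models, or for validation.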

Currently, if we list a dataset in the mmCIF file, the unwritten assumption is that most experimental data is used as a restraint, while starting models are used to choose a representation. It may be useful to state explicitly which datasets were used for validation, for example, since that information could help validate the mmCIF model itself.

benmwebb commented 6 years ago

One proposal: add to the ihm_modeling_post_process category optional struct_assembly_id and dataset_group_id so we can see what is being filtered or rescored and how, respectively. Also add validation to the ihm_modeling_post_process.type enumeration, or maybe add a separate validation table. Datasets for filtering or validation are then clear based on where they are used in this table. Thoughts?
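As a rough illustration of how a reader of the file could recover dataset usage from such a table, here is a minimal Python sketch. The `PostProcessStep` class and the `dataset_usage` helper are hypothetical names invented for this example, not part of the dictionary or of any library, and the rows are made-up data; only the item names `type`, `struct_assembly_id`, and `dataset_group_id` come from the proposal above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PostProcessStep:
    """Hypothetical stand-in for one row of ihm_modeling_post_process."""
    id: int
    type: str  # e.g. "filter", "cluster", "rescore", "validation"
    struct_assembly_id: Optional[int] = None  # proposed: what is processed
    dataset_group_id: Optional[int] = None    # proposed: which data is used


def dataset_usage(steps):
    """Map each dataset group ID to the sorted step types that reference it."""
    usage = {}
    for step in steps:
        if step.dataset_group_id is not None:
            usage.setdefault(step.dataset_group_id, set()).add(step.type)
    return {group: sorted(types) for group, types in usage.items()}


# Made-up example rows: clustering uses no dataset, filtering uses
# dataset group 2, and validation uses dataset group 3.
steps = [
    PostProcessStep(id=1, type="cluster"),
    PostProcessStep(id=2, type="filter", struct_assembly_id=1, dataset_group_id=2),
    PostProcessStep(id=3, type="validation", struct_assembly_id=1, dataset_group_id=3),
]

usage = dataset_usage(steps)
```

With both items optional, steps such as clustering that use no dataset simply leave `dataset_group_id` unset and drop out of the mapping.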

amjjbonvin commented 6 years ago

Can you give a specific example of what you mean by this?

In HADDOCK terms for example, would this describe our clustering process? That is, at the end we consider only models that cluster (and typically discard isolated ones).

benmwebb commented 6 years ago

> Can you give a specific example of what you mean by this?

I thought I'd done that when I opened the issue, but I'll rephrase: given N input experiments, one could convert each of them into restraints and optimize a scoring function that's the sum of all N. This is perfectly captured by the current dictionary. Alternatively, one could convert N-x inputs to restraints and optimize, and rescore the final models against the left-out x inputs, the idea being that if the model satisfies data that wasn't used in its construction, it is more likely to be correct. Put another way, the dictionary doesn't currently state which data we used for the training set and which for the test set. It may be useful for PDB to know this in order to facilitate their own validation of the models.
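The training/test split described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any modeling package actually works: a "model" is a single number, each "experiment" scores a model by its squared deviation from a target value, and the dataset names, target values, and helper functions (`split_datasets`, `total_score`) are all invented for illustration.

```python
def split_datasets(datasets, held_out):
    """Partition dataset names into (restraints, validation) lists."""
    held = set(held_out)
    unknown = held - set(datasets)
    if unknown:
        raise ValueError(f"unknown datasets: {sorted(unknown)}")
    restraints = [d for d in datasets if d not in held]
    validation = [d for d in datasets if d in held]
    return restraints, validation


def total_score(model, datasets, score_fns):
    """Sum the per-dataset scores of a model (lower is better)."""
    return sum(score_fns[d](model) for d in datasets)


# Toy "experiments": each one prefers a model close to its target value.
targets = {"xl_ms": 1.0, "em_map": 1.2, "saxs": 1.1}
score_fns = {name: (lambda m, t=t: (m - t) ** 2) for name, t in targets.items()}

# N = 3 inputs, x = 1 left out: optimize against N-x, rescore against x.
restraints, validation = split_datasets(list(targets), held_out=["saxs"])

candidates = [0.0, 0.5, 1.0, 1.5]
best = min(candidates, key=lambda m: total_score(m, restraints, score_fns))

# A low score on the left-out data suggests the model generalizes.
validation_score = total_score(best, validation, score_fns)
```

The point of recording the split in the file is that a third party could repeat the last step: knowing which datasets were left out, they can rescore the deposited model against data that played no part in building it.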

> In HADDOCK terms for example, would this describe our clustering process?

AFAIK the existing ihm_modeling_post_process category would handle that just fine.

brindakv commented 6 years ago

@benmwebb This has been addressed in the latest dictionary update.