benmwebb closed this issue 6 years ago
One proposal: add optional struct_assembly_id and dataset_group_id to the ihm_modeling_post_process category, so we can see what is being filtered or rescored and how, respectively. Also add validation to the ihm_modeling_post_process.type enumeration, or maybe add a separate validation table. Which datasets are used for filtering or validation is then clear from where they appear in this table. Thoughts?
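To make the proposal concrete, here is a minimal Python sketch of what post-processing rows might look like with the two optional IDs and a validation type. The class name `PostProcessStep`, the helper `validation_dataset_groups`, and all ID values are hypothetical, chosen only for illustration; this is not the dictionary's actual definition.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PostProcessStep:
    """One illustrative row of a post-processing table (hypothetical layout)."""
    step_id: int
    type: str  # e.g. "filter", "cluster", "rescore", or the proposed "validation"
    struct_assembly_id: Optional[int] = None  # proposed: what is filtered/rescored
    dataset_group_id: Optional[int] = None    # proposed: which data were used to do it


def validation_dataset_groups(steps: List[PostProcessStep]) -> List[int]:
    """Return the dataset group IDs referenced by validation steps."""
    groups = {
        s.dataset_group_id
        for s in steps
        if s.type == "validation" and s.dataset_group_id is not None
    }
    return sorted(groups)


# Made-up example rows: a clustering step with no extra IDs,
# a filter step, and a validation step against a different dataset group.
steps = [
    PostProcessStep(1, "cluster"),
    PostProcessStep(2, "filter", struct_assembly_id=1, dataset_group_id=2),
    PostProcessStep(3, "validation", struct_assembly_id=1, dataset_group_id=3),
]

print(validation_dataset_groups(steps))
```

With rows like these, a reader of the file could tell which dataset groups were held back for validation simply by looking at where they are referenced.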
Can you give a specific example of what you are meaning by this?
In HADDOCK terms for example, would this describe our clustering process? That is, at the end we only consider models that cluster (and typically discard isolated ones).
> Can you give a specific example of what you are meaning by this?
I thought I'd done that when I opened the issue, but I'll rephrase: given N input experiments, one could convert each of them into restraints and optimize a scoring function that's the sum of all N. This is perfectly captured by the current dictionary. Alternatively, one could convert N-x inputs to restraints and optimize, and rescore the final models against the left-out x inputs, the idea being that if the model satisfies data that wasn't used in its construction, it is more likely to be correct. Put another way, the dictionary doesn't currently state which data we used for the training set and which for the test set. It may be useful for PDB to know this in order to facilitate their own validation of the models.
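The train/test idea above can be sketched in a few lines of Python. This is a toy illustration only: the "optimization" is just picking the best of a handful of candidate models, each model is a single number, and the restraint names (`xl`, `em`, `saxs`) and scoring functions are invented for the example rather than taken from any real modeling package.

```python
from typing import Callable, Dict, Iterable, Sequence, Set, Tuple

Model = float  # toy stand-in: a "model" is a single coordinate
ScoreFn = Callable[[Model], float]


def total_score(model: Model, restraints: Dict[str, ScoreFn],
                names: Iterable[str]) -> float:
    """Sum the scores of the named restraints for one model (lower is better)."""
    return sum(restraints[name](model) for name in names)


def optimize_and_rescore(
    candidates: Sequence[Model],
    restraints: Dict[str, ScoreFn],
    held_out: Set[str],
) -> Tuple[Model, float, float]:
    """Pick the best candidate using N-x restraints, then rescore it on the
    x held-out ones that played no part in the selection."""
    used = [name for name in restraints if name not in held_out]
    best = min(candidates, key=lambda m: total_score(m, restraints, used))
    train = total_score(best, restraints, used)
    test = total_score(best, restraints, held_out)
    return best, train, test


# Hypothetical restraints derived from three input experiments (N = 3).
restraints = {
    "xl": lambda x: (x - 1.0) ** 2,
    "em": lambda x: (x - 1.2) ** 2,
    "saxs": lambda x: (x - 1.1) ** 2,
}

# Leave one dataset out (x = 1) and use it only for validation.
best, train_score, test_score = optimize_and_rescore(
    [0.0, 1.0, 2.0], restraints, held_out={"saxs"}
)
print(best, train_score, test_score)
```

A low score on the held-out restraint suggests the selected model also agrees with data that was not used to build it, which is exactly the distinction the dictionary would need to record.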
> In HADDOCK terms for example, would this describe our clustering process?
AFAIK the existing ihm_modeling_post_process category would handle that just fine.
@benmwebb This has been addressed in the latest dictionary update.
Information can be used in integrative/hybrid modeling in several ways:
Currently if we list a dataset in the mmCIF file, the unwritten assumption is that for most experimental data, the information is used as a restraint, while for starting models, the information is used to choose a representation. It may be useful to explicitly say which datasets were used for validation, for example, as that information may be useful to validate the mmCIF model itself.