Extension of ModelCIF for AF3 quality estimates

Related to #20 and the issues mentioned there, I suggest extending ModelCIF to capture all new types of quality estimates introduced with AlphaFold 3 (AF3). I also had a look at RoseTTAFold-AllAtom, and the suggestions below would capture anything needed there as well. I also believe that this should cover anything needed for https://github.com/chaidiscovery/chai-lab/issues/52. Here are my suggested additions:

Extend _ma_qa_metric.type to include:
- "pLDDT to polymer" with detailed description "confidence score predicting accuracy according to lDDT with distances from each atom to CA or C1' of nearby polymer residues in [0,100]"
- "boolean" with detailed description "0 or 1 depending on whether a check passed (1) or not (0)."
Extend _ma_qa_metric.mode to include "per-chain", "per-chain-pairwise", "per-atom" and "per-atom-pairwise" (and yes I know it's a bit unfortunate that we used "local" for "per-residue" but ok...)
New _ma_qa_metric_per_chain same as _ma_qa_metric_local but without label_comp_id and label_seq_id
New _ma_qa_metric_per_chain_pairwise same as _ma_qa_metric_local_pairwise but without label_comp_id* and label_seq_id*
New _ma_qa_metric_per_atom same as _ma_qa_metric_local but using atom_id (linked to _atom_site.id) instead of model_id and label_*
New _ma_qa_metric_per_atom_pairwise same as _ma_qa_metric_local_pairwise but using atom_id_1 and atom_id_2 (linked to _atom_site.id) instead of model_id and label_*

Concretely for AF3 output (e.g. looking at the JSON files in one of their examples) here is how each of the scores would map to a _ma_qa_metric.mode and .type:

fraction_disordered: "global", "normalized score"
has_clash: "global", "boolean"
iptm: "global", "ipTM"
ptm: "global", "pTM"
ranking_score: "global", "normalized score"
chain_ptm: "per-chain", "pTM"
chain_iptm: "per-chain", "ipTM"
chain_pair_iptm: "per-chain-pairwise", "ipTM"
chain_pair_pae_min: "per-chain-pairwise", "PAE"
atom_plddts: "per-atom", "pLDDT to polymer"
contact_probs: "per-atom-pairwise", "contact probability"
pae: "per-atom-pairwise", "PAE"
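
To make the mapping above easier to check against real output, here is a minimal Python sketch of it as a lookup table. The keys, modes and types are copied from the list above; the names `AF3_SCORE_MAP` and `classify_scores` are made up for illustration and are not part of AF3 or ModelCIF.

```python
# Proposed mapping: AF3 JSON key -> (_ma_qa_metric.mode, _ma_qa_metric.type).
# Values are taken from the list in this comment; nothing here is an official API.
AF3_SCORE_MAP = {
    "fraction_disordered": ("global", "normalized score"),
    "has_clash": ("global", "boolean"),
    "iptm": ("global", "ipTM"),
    "ptm": ("global", "pTM"),
    "ranking_score": ("global", "normalized score"),
    "chain_ptm": ("per-chain", "pTM"),
    "chain_iptm": ("per-chain", "ipTM"),
    "chain_pair_iptm": ("per-chain-pairwise", "ipTM"),
    "chain_pair_pae_min": ("per-chain-pairwise", "PAE"),
    "atom_plddts": ("per-atom", "pLDDT to polymer"),
    "contact_probs": ("per-atom-pairwise", "contact probability"),
    "pae": ("per-atom-pairwise", "PAE"),
}


def classify_scores(keys):
    """Split JSON keys into those covered by the mapping and those that are not."""
    mapped = {k: AF3_SCORE_MAP[k] for k in keys if k in AF3_SCORE_MAP}
    unmapped = sorted(k for k in keys if k not in AF3_SCORE_MAP)
    return mapped, unmapped
```

Calling `classify_scores` on the keys of a parsed AF3 JSON file would list any score that the proposal does not cover yet.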

Some caveats to consider:

contact_probs and pae above are defined per "token" pair, where a token is either a full residue (for standard amino and nucleic acids) or a single atom otherwise. In AF3, the per-residue tokens have a well defined "token centre atom" (CA for standard amino acids, C1' for standard nucleotides) which could be used in per-atom scores but this may be confusing.
The "per-chain" scores also apply to non-polymers, so the name may be confusing. Technically, "per-asym-id" is more correct, although it may only be understandable to mmCIF experts.
For future applications in physics-based docking tools, we need to make sure that local scores can identify individual water molecules. In PDB, those share a label_asym_id and do not have a label_seq_id; one could fix this in ModelCIF by giving each water molecule a separate label_asym_id.
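
To illustrate the token caveat, here is a rough Python sketch of how per-token scores could be tied back to atoms via the "token centre atom". It assumes each atom is given as a dict with hypothetical keys `id`, `asym`, `seq`, `name` and a `standard` flag (true for standard amino and nucleic acid residues), and that tokens follow the order of first appearance in atom_site. This is not how AF3 itself does it, only a reading of the description above.

```python
# Centre atom names as described above: CA for amino acids, C1' for nucleotides.
CENTRE_ATOM_NAMES = ("CA", "C1'")


def tokens_from_atoms(atoms):
    """Group atoms into tokens: one per standard residue, one per atom otherwise."""
    tokens = []
    residue_index = {}  # (asym, seq) -> position of the residue token in `tokens`
    for atom in atoms:
        if atom["standard"]:
            key = (atom["asym"], atom["seq"])
            if key not in residue_index:
                residue_index[key] = len(tokens)
                tokens.append({"kind": "residue", "asym": atom["asym"],
                               "seq": atom["seq"], "centre_atom_id": None})
            if atom["name"] in CENTRE_ATOM_NAMES:
                tokens[residue_index[key]]["centre_atom_id"] = atom["id"]
        else:
            # Ligands, ions and modified residues: every atom is its own token.
            tokens.append({"kind": "atom", "asym": atom["asym"],
                           "seq": atom["seq"], "centre_atom_id": atom["id"]})
    return tokens
```

With such a list, entry (i, j) of the pae or contact_probs matrix could be reported for the pair of `centre_atom_id` values of tokens i and j, which is exactly the part that may be confusing for residue tokens.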

Alternative to the above (which simplifies some things and handles the per token scores):

Extend _ma_qa_metric_local and _ma_qa_metric_local_pairwise to include label_atom_id (linked to _atom_site.label_atom_id) which can be set to '.' for per-residue scores.
One could also handle per-chain scores by allowing label_comp_id and label_seq_id to be set to '.'.
With appropriate updates to the category and item descriptions, all types of local scores could be handled by the _ma_qa_metric_local and _ma_qa_metric_local_pairwise tables and no additional tables or _ma_qa_metric.mode values would be necessary.
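
As a quick illustration of this alternative, the sketch below formats rows of an extended _ma_qa_metric_local where unused identifiers are set to '.'. The column order (ordinal_id, model_id, label_asym_id, label_comp_id, label_seq_id, label_atom_id, metric_id, metric_value) and the helper name `local_row` are assumptions for this example, and the metric values are made up.

```python
def local_row(ordinal_id, model_id, asym_id, metric_id, value,
              comp_id=None, seq_id=None, atom_id=None):
    """Format one data row; identifiers that do not apply are written as '.'."""
    fields = [ordinal_id, model_id, asym_id, comp_id, seq_id, atom_id,
              metric_id, value]
    return " ".join("." if f is None else str(f) for f in fields)


# Per-residue score: label_atom_id is '.'.
print(local_row(1, 1, "A", 1, 87.5, comp_id="ALA", seq_id=3))
# Per-atom score for a non-polymer: label_seq_id is '.'.
print(local_row(2, 1, "F", 1, 64.2, comp_id="MG", atom_id="MG"))
# Per-chain score: comp, seq and atom are all '.'.
print(local_row(3, 1, "C", 2, 0.61))
```

The sketch does no CIF quoting, so atom names that need quotes in a real file would have to be handled separately.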

@brindakv what are your thoughts on this?

Notes from discussions with @benmwebb , @brindakv and @aozalevsky (on Oct. 16):

Not good to add link to _atom_site.label_atom_id to _ma_qa_metric_local and _ma_qa_metric_local_pairwise as it overloads the tables and still doesn't enable clean handling of non-polymers (which is critical for AF3)
Alternative discarded suggestion was to link to _atom_site.id with a flag for granularity (atom, residue or chain). Pro: easy to use and look at. Con: cannot generalize to other features (e.g. residue ranges, domains, ...) and ambiguous on how to define (e.g. which atom to pick).
Preferred solution is to use features as in IHM's _ihm_feature_list

Example AF3 output (cut to only include one model instead of 5): fold_test_fold_job_number_one_cut.zip. Info on content:

fold_test_fold_job_number_one_job_request.json is input to AF3 (can be uploaded to the AF-Server)
fold_test_fold_job_number_one_model_0.cif is a (not 100% compliant) ModelCIF file. Note that copies of the same molecule (HEM, MG, and NA in this example) are handled with multiple identical molecular entities (instead of a single entity with multiple instances).
fold_test_fold_job_number_one_summary_confidences_0.json contains global, per-chain and per-chain-pair scores (see "Summary outputs" in AF-server-FAQ). Note that some values can be "null".
fold_test_fold_job_number_one_full_data_0.json contains the per-atom pLDDT and per-token-pair PAE and contact probabilities (see "Full array outputs" in AF-server-FAQ). Tokens are either a full residue (for standard amino and nucleic acids) or a single atom otherwise. Order of values is implicit according to order in atom_site of .cif file.
Chains in the model:
- A: polymer (polypeptide; seq: "PREACHINGS"), residues 1 and 5 modified (HY3, P1L)
- B: polymer (polypeptide; seq: "REACHER")
- C: non-polymer (ATP)
- D: non-polymer (HEM)
- E: non-polymer (HEM)
- F: non-polymer (MG)
- G: non-polymer (MG)
- H: non-polymer (NA)
- I: non-polymer (NA)
- J: non-polymer (NA)
- K: polymer (polydeoxyribonucleotide; seq: "GATTACA"), residues 1 and 2 modified (6OG, 6MA)
- L: polymer (polydeoxyribonucleotide; seq: "TGTAATC")
- M: polymer (polyribonucleotide; seq: "GUAC"), residues 1 and 4 modified (2MG, 5MC)
- N: branched (NAG-NAG-BMA)
- O: branched (BMA)
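
For anyone who wants to poke at these files, here is a small Python sketch of the two points that are easy to trip over: the implicit ordering of atom_plddts and the "null" values in the summary. It assumes the JSON has already been parsed and that the list of _atom_site.id values was read from the .cif file in file order by some other means; the function names are made up for this example.

```python
import json


def per_atom_plddt(full_data, atom_site_ids):
    """Pair atom_plddts with _atom_site.id values, relying on atom_site order."""
    plddts = full_data["atom_plddts"]
    if len(plddts) != len(atom_site_ids):
        raise ValueError("atom_plddts and atom_site have different lengths")
    return dict(zip(atom_site_ids, plddts))


def non_null_scores(summary):
    """Drop top-level scores that are null in the JSON (None after parsing)."""
    return {key: value for key, value in summary.items() if value is not None}


# Tiny stand-ins for the real files; the numbers are invented.
full_data = json.loads('{"atom_plddts": [90.1, 45.0, 72.3]}')
summary = json.loads('{"ptm": 0.5, "iptm": null}')
print(per_atom_plddt(full_data, [1, 2, 3]))
print(non_null_scores(summary))
```

Note that `non_null_scores` only looks at top-level values; nulls nested inside per-chain or per-chain-pair lists would need their own handling.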

Suggested ModelCIF extension:

Extend _ma_qa_metric.type as in first comment
Extend _ma_qa_metric.mode to include "per-feature" and "per-feature-pair"
New _ma_feature_list exactly like _ihm_feature_list, except that "branched" is added to entity_type and that feature_type uses the following controlled vocabulary:
- atom: "feature is an atom or a set of atoms for any entity type"
- residue: "feature is a residue or a set of residues from a polymeric entity"
- asym_id: "feature is an instance of a molecular entity"
New _ma_atom_feature category:
- Description: "Data items in this category provide the definitions required to select specific atoms independently of entity type."
- Items:
  - ordinal_id (key): "A unique identifier for the category."
  - feature_id (mandatory): "An identifier for the selected feature. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
  - atom_id (mandatory): "The identifier of the atom. This data item is a pointer to _atom_site.id in the ATOM_SITE category."
New _ma_poly_residue_feature category:
- Description: "Data items in this category provide the definitions required to select specific polymer residues."
- Items (similar to _ma_qa_metric_local):
  - ordinal_id (key): "A unique identifier for the category."
  - feature_id (mandatory): "An identifier for the selected feature. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
  - label_asym_id (mandatory): "The identifier for the asym id of the residue in the structural model. This data item is a pointer to _atom_site.label_asym_id in the ATOM_SITE category."
  - label_comp_id (mandatory): "The component identifier for the residue in the structural model. This data item is a pointer to _atom_site.label_comp_id in the ATOM_SITE category."
  - label_seq_id (mandatory): "The identifier for the sequence index of the residue in the structural model. This data item is a pointer to _atom_site.label_seq_id in the ATOM_SITE category."
New _ma_asym_id_feature category:
- Description: "Data items in this category provide the definitions required to select specific instances of a molecular entity independently of entity type (e.g. a polymer chain or a copy of a non-polymer)."
- Items (similar to _ma_poly_residue_feature):
  - ordinal_id (key): "A unique identifier for the category."
  - feature_id (mandatory): "An identifier for the selected feature. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
  - label_asym_id (mandatory): "The identifier for the asym id of the instance of the molecular entity in the structural model. This data item is a pointer to _atom_site.label_asym_id in the ATOM_SITE category."
New _ma_qa_metric_feature category (similar to _ma_qa_metric_local):
- Description: "Data items in this category capture local QA metrics calculated per feature (as defined in _ma_feature_list)."
- Items:
  - ordinal_id (key), metric_id, metric_value, model_id (all mandatory) exactly as in _ma_qa_metric_local
  - feature_id (mandatory): "The identifier for the feature, for which local QA metric is provided. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
New _ma_qa_metric_feature_pairwise category (similar to _ma_qa_metric_local_pairwise):
- Description: "Data items in this category capture local QA metrics calculated per pair of features (as defined in _ma_feature_list)."
- Items:
  - ordinal_id (key), metric_id, metric_value, model_id (all mandatory) exactly as in _ma_qa_metric_local_pairwise
  - feature_id_1 (mandatory): "The identifier for the first feature in the pair, for which local QA metric is provided. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
  - feature_id_2 (mandatory): "The identifier for the second feature in the pair, for which local QA metric is provided. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
Note: if it is preferred to use something else instead of "asym_id" in the category name and feature_type, that's also ok...
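
To give a feel for how the suggested categories fit together, here is a rough Python sketch that writes per-chain scores (e.g. chain_ptm) as one "asym_id" feature per chain. The category and item names follow the suggestion above, but the item order, the helper names and all values are assumptions for illustration; none of this exists in the ModelCIF dictionary yet.

```python
def cif_loop(category, items, rows):
    """Render a minimal mmCIF loop (no quoting of values containing spaces)."""
    lines = ["loop_"]
    lines += [f"_{category}.{item}" for item in items]
    lines += [" ".join(str(value) for value in row) for row in rows]
    return "\n".join(lines)


def build_chain_feature_loops(chains, chain_scores, model_id, metric_id):
    """chains: list of (label_asym_id, entity_type); chain_scores: one value per chain."""
    feature_rows, asym_rows, metric_rows = [], [], []
    for i, ((asym_id, entity_type), score) in enumerate(zip(chains, chain_scores), 1):
        feature_rows.append((i, "asym_id", entity_type))
        asym_rows.append((i, i, asym_id))  # ordinal_id, feature_id, label_asym_id
        metric_rows.append((i, model_id, i, metric_id, score))
    return (
        cif_loop("ma_feature_list",
                 ["feature_id", "feature_type", "entity_type"], feature_rows),
        cif_loop("ma_asym_id_feature",
                 ["ordinal_id", "feature_id", "label_asym_id"], asym_rows),
        cif_loop("ma_qa_metric_feature",
                 ["ordinal_id", "model_id", "feature_id", "metric_id", "metric_value"],
                 metric_rows),
    )


# Chains A (polymer) and C (non-polymer, ATP) from the example; scores are invented.
for loop in build_chain_feature_loops(
        [("A", "polymer"), ("C", "non-polymer")], [0.71, 0.35],
        model_id=1, metric_id=1):
    print(loop)
    print("#")
```

Per-chain-pair scores such as chain_pair_iptm would reuse the same features and only add rows to _ma_qa_metric_feature_pairwise with feature_id_1 and feature_id_2.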
