Open gtauriello opened 3 weeks ago
Notes from discussions with @benmwebb, @brindakv and @aozalevsky (on Oct. 16):
- Do not add `_atom_site.label_atom_id` to `_ma_qa_metric_local` and `_ma_qa_metric_local_pairwise`, as it overloads the tables and still doesn't enable clean handling of non-polymers (which is critical for AF3).
- `_atom_site.id` with a flag for granularity (atom, residue or chain). Pro: easy to use and look at. Con: cannot generalize to other features (e.g. residue ranges, domains, ...) and is ambiguous to define (e.g. which atom to pick).
- `_ihm_feature_list`

Example AF3 output (cut to only include one model instead of 5): fold_test_fold_job_number_one_cut.zip. Info on content:
- `fold_test_fold_job_number_one_job_request.json` is the input to AF3 (can be uploaded to the AF-Server).
- `fold_test_fold_job_number_one_model_0.cif` is a (not 100% compliant) ModelCIF file. Note that copies of the same molecule (HEM, MG, and NA in this example) are handled with multiple identical molecular entities (instead of a single entity with multiple instances).
- `fold_test_fold_job_number_one_summary_confidences_0.json` contains global, per-chain and per-chain-pair scores (see "Summary outputs" in AF-server-FAQ). Note that some values can be "null".
- `fold_test_fold_job_number_one_full_data_0.json` contains the per-atom pLDDT and the per-token-pair PAE and contact probabilities (see "Full array outputs" in AF-server-FAQ). Tokens are either a full residue (for standard amino and nucleic acids) or a single atom otherwise. The order of values is implicit and follows the order in `atom_site` of the .cif file.

Chains in the example:
- `A`: polymer (polypeptide; seq: "PREACHINGS"), residues 1 and 5 modified (HY3, P1L)
- `B`: polymer (polypeptide; seq: "REACHER")
- `C`: non-polymer (ATP)
- `D`: non-polymer (HEM)
- `E`: non-polymer (HEM)
- `F`: non-polymer (MG)
- `G`: non-polymer (MG)
- `H`: non-polymer (NA)
- `I`: non-polymer (NA)
- `J`: non-polymer (NA)
- `K`: polymer (polydeoxyribonucleotide; seq: "GATTACA"), residues 1 and 2 modified (6OG, 6MA)
- `L`: polymer (polydeoxyribonucleotide; seq: "TGTAATC")
- `M`: polymer (polyribonucleotide; seq: "GUAC"), residues 1 and 4 modified (2MG, 5MC)
- `N`: branched (NAG-NAG-BMA)
- `O`: branched (BMA)

Suggested ModelCIF extension:
- `_ma_feature_list` exactly like `_ihm_feature_list` except "branched" added to `entity_type` and `feature_type` which should include the following controlled vocabulary:
- `_ma_atom_feature` category:
- `_ma_poly_residue_feature` category:
- `_ma_asym_id_feature` category:
- `_ma_qa_metric_feature` category (similar to `_ma_qa_metric_local`):
- `_ma_qa_metric_feature_pairwise` category (similar to `_ma_qa_metric_local_pairwise`):

Related to #20 and the issues mentioned in there, I suggest extending ModelCIF to capture all new types of quality estimates introduced with AlphaFold 3 (AF3). I also had a look at RoseTTAFold-AllAtom, and the suggestions below would also capture anything needed there. I also believe that this should cover anything needed for https://github.com/chaidiscovery/chai-lab/issues/52. Here are my suggested additions:
- Extend `_ma_qa_metric.type` to include:
- Extend `_ma_qa_metric.mode` to include "per-chain", "per-chain-pairwise", "per-atom" and "per-atom-pairwise" (and yes, I know it's a bit unfortunate that we used "local" for "per-residue" but ok...)
- `_ma_qa_metric_per_chain`: same as `_ma_qa_metric_local` but without `label_comp_id` and `label_seq_id`
- `_ma_qa_metric_per_chain_pairwise`: same as `_ma_qa_metric_local_pairwise` but without `label_comp_id*` and `label_seq_id*`
- `_ma_qa_metric_per_atom`: same as `_ma_qa_metric_local` but using `atom_id` (linked to `_atom_site.id`) instead of `model_id` and `label_*`
- `_ma_qa_metric_per_atom_pairwise`: same as `_ma_qa_metric_local_pairwise` but using `atom_id_1` and `atom_id_2` (linked to `_atom_site.id`) instead of `model_id` and `label_*`

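To make the proposed table layouts easier to compare, here is a rough Python sketch that derives their columns from the two existing categories by applying the rules in the list above. The column lists for `_ma_qa_metric_local` and `_ma_qa_metric_local_pairwise` are written from memory of the ModelCIF dictionary and should be checked against it; the helper names (`per_chain`, `per_atom`) and the column order of the derived tables are assumptions for illustration only, not dictionary definitions.

```python
# Columns of the existing categories (as recalled from the ModelCIF
# dictionary; verify against the dictionary before relying on them).
LOCAL = [
    "ordinal_id", "model_id", "label_asym_id", "label_seq_id",
    "label_comp_id", "metric_id", "metric_value",
]
LOCAL_PAIRWISE = [
    "ordinal_id", "model_id",
    "label_asym_id_1", "label_seq_id_1", "label_comp_id_1",
    "label_asym_id_2", "label_seq_id_2", "label_comp_id_2",
    "metric_id", "metric_value",
]


def per_chain(columns):
    """Drop the residue-level identifiers (label_comp_id*, label_seq_id*)."""
    residue_level = ("label_comp_id", "label_seq_id")
    return [c for c in columns if not c.startswith(residue_level)]


def per_atom(columns):
    """Replace model_id and label_* with atom_id (or atom_id_1/atom_id_2)."""
    pairwise = any(c.endswith("_1") for c in columns)
    atom_cols = ["atom_id_1", "atom_id_2"] if pairwise else ["atom_id"]
    kept = [c for c in columns
            if c != "model_id" and not c.startswith("label_")]
    # Hypothetical ordering: atom references directly after ordinal_id.
    return kept[:1] + atom_cols + kept[1:]


PROPOSED = {
    "_ma_qa_metric_per_chain": per_chain(LOCAL),
    "_ma_qa_metric_per_chain_pairwise": per_chain(LOCAL_PAIRWISE),
    "_ma_qa_metric_per_atom": per_atom(LOCAL),
    "_ma_qa_metric_per_atom_pairwise": per_atom(LOCAL_PAIRWISE),
}

if __name__ == "__main__":
    for category, columns in PROPOSED.items():
        print(category, columns)
```

Since `_atom_site.id` already identifies the model and the atom, the per-atom variants need no `model_id` or `label_*` columns, which is what the sketch reflects.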
Concretely for AF3 output (e.g. looking at the JSON files in one of their examples), here is how each of the scores would map to a `_ma_qa_metric.mode` and `.type`:
- `fraction_disordered`: "global", "normalized score"
- `has_clash`: "global", "boolean"
- `iptm`: "global", "ipTM"
- `ptm`: "global", "pTM"
- `ranking_score`: "global", "normalized score"
- `chain_ptm`: "per-chain", "pTM"
- `chain_iptm`: "per-chain", "ipTM"
- `chain_pair_iptm`: "per-chain-pairwise", "ipTM"
- `chain_pair_pae_min`: "per-chain-pairwise", "PAE"
- `atom_plddts`: "per-atom", "pLDDT to polymer"
- `contact_probs`: "per-atom-pairwise", "contact probability"
- `pae`: "per-atom-pairwise", "PAE"

Some caveats to consider:
- `contact_probs` and `pae` above are defined per "token" pair, where a token is either a full residue (for standard amino and nucleic acids) or a single atom otherwise. In AF3, the per-residue tokens have a well-defined "token centre atom" (CA for standard amino acids, C1' for standard nucleotides) which could be used in per-atom scores, but this may be confusing.
- Non-polymers may share a `label_asym_id` and do not have a `label_seq_id`; one could also change that by giving them separate `label_asym_id` values in ModelCIF to fix this.

Alternative to the above (which simplifies some things and handles the per-token scores):
- Extend `_ma_qa_metric_local` and `_ma_qa_metric_local_pairwise` to include `label_atom_id` (linked to `_atom_site.label_atom_id`), which can be set to '.' for per-residue scores.
- Allow `label_comp_id` and `label_seq_id` to be set to '.'.
- Only the `_ma_qa_metric_local` and `_ma_qa_metric_local_pairwise` tables would be used; no additional tables or `_ma_qa_metric.mode` values would be necessary.

@brindakv what are your thoughts on this?
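
As a rough illustration of the alternative above, the following Python sketch builds rows for an extended `_ma_qa_metric_local` in which unused identifiers are written as '.'. The function names (`local_row`, `granularity`), the reading that '.' in both `label_comp_id` and `label_seq_id` marks a per-chain score, and all metric values are assumptions made for this example; they are not taken from the ModelCIF dictionary or from real AF3 output.

```python
def local_row(metric_id, value, model_id, asym_id,
              seq_id=None, comp_id=None, atom_id=None):
    """Build one row of the extended table; missing identifiers become '.'."""
    def dot(x):
        return "." if x is None else str(x)

    return {
        "model_id": model_id,
        "label_asym_id": asym_id,
        "label_seq_id": dot(seq_id),
        "label_comp_id": dot(comp_id),
        "label_atom_id": dot(atom_id),  # proposed new column
        "metric_id": metric_id,
        "metric_value": value,
    }


def granularity(row):
    """Infer the level a row refers to from which identifiers are set."""
    if row["label_atom_id"] != ".":
        return "atom"
    if row["label_seq_id"] != "." or row["label_comp_id"] != ".":
        return "residue"
    return "chain"


# Placeholder values, chosen only to show the three cases.
rows = [
    # per-atom score on the ATP ligand (chain C): no label_seq_id
    local_row(1, 90.0, 1, "C", comp_id="ATP", atom_id="PG"),
    # per-residue score on chain B, residue 1 (ARG): label_atom_id is '.'
    local_row(1, 80.0, 1, "B", seq_id=1, comp_id="ARG"),
    # per-chain score on chain A: comp, seq and atom identifiers are '.'
    local_row(2, 0.5, 1, "A"),
]

if __name__ == "__main__":
    for row in rows:
        print(granularity(row), row)
```

One consequence visible in the sketch is that the level of a score is no longer declared in `_ma_qa_metric.mode` alone but has to be inferred per row from which identifiers are '.'.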