Open gtauriello opened 2 months ago
Notes from discussions with @benmwebb, @brindakv and @aozalevsky (on Oct. 16):
- `_atom_site.label_atom_id` to `_ma_qa_metric_local` and `_ma_qa_metric_local_pairwise` as it overloads the tables and still doesn't enable clean handling of non-polymers (which is critical for AF3)
- `_atom_site.id` with a flag for granularity (atom, residue or chain). Pro: easy to use and look at. Con: cannot generalize to other features (e.g. residue ranges, domains, ...) and ambiguous on how to define (e.g. which atom to pick).
- `_ihm_feature_list`
Example AF3 output (cut to only include one model instead of 5): fold_test_fold_job_number_one_cut.zip. Info on content:
- `fold_test_fold_job_number_one_job_request.json` is input to AF3 (can be uploaded to the AF-Server)
- `fold_test_fold_job_number_one_model_0.cif` is a (not 100% compliant) ModelCIF file. Note that copies of the same molecule (HEM, MG, and NA in this example) are handled with multiple identical molecular entities (instead of a single entity with multiple instances).
- `fold_test_fold_job_number_one_summary_confidences_0.json` contains global, per-chain and per-chain-pair scores (see "Summary outputs" in AF-server-FAQ). Note that some values can be "null".
- `fold_test_fold_job_number_one_full_data_0.json` contains the per-atom pLDDT and per-token-pair PAE and contact probabilities (see "Full array outputs" in AF-server-FAQ). Tokens are either a full residue (for standard amino and nucleic acids) or a single atom otherwise. Order of values is implicit according to order in atom_site of .cif file.
- `A`: polymer (polypeptide; seq: "PREACHINGS"), residues 1 and 5 modified (HY3, P1L)
- `B`: polymer (polypeptide; seq: "REACHER")
- `C`: non-polymer (ATP)
- `D`: non-polymer (HEM)
- `E`: non-polymer (HEM)
- `F`: non-polymer (MG)
- `G`: non-polymer (MG)
- `H`: non-polymer (NA)
- `I`: non-polymer (NA)
- `J`: non-polymer (NA)
- `K`: polymer (polydeoxyribonucleotide; seq: "GATTACA"), residues 1 and 2 modified (6OG, 6MA)
- `L`: polymer (polydeoxyribonucleotide; seq: "TGTAATC")
- `M`: polymer (polyribonucleotide; seq: "GUAC"), residues 1 and 4 modified (2MG, 5MC)
- `N`: branched (NAG-NAG-BMA)
- `O`: branched (BMA)
Suggested ModelCIF extension:
- `_ma_feature_list` exactly like `_ihm_feature_list` except "branched" added to `entity_type` and `feature_type` which should include the following controlled vocabulary:
- `_ma_atom_feature` category:
- `_ma_poly_residue_feature` category:
- `_ma_asym_id_feature` category:
- `_ma_qa_metric_feature` category (similar to `ma_qa_metric_local`):
- `_ma_qa_metric_feature_pairwise` category (similar to `ma_qa_metric_local_pairwise`):
@gtauriello, I just wanted to follow up on this. With AF3 code and weights being released and with the recent addition of restraints to Chai-1, we can expect rapid growth in the number of deposited models. Would be nice to have the scores in those models.
I agree. @brindakv was waiting for me to decide on a separate issue that we wanted to address in the same ModelCIF update, and I have now added that here as issue #23. Hence, I think that she can now do the updates according to the open issues here.
Afterwards, we can try to suggest changes in alphafold3/model/mmcif_metadata.py to include this (and check if other things are invalid in their files).
@gtauriello please clarify my questions below.
1. Do we need `_ma_poly_residue_feature` considering that `ma_qa_metric_local` sort of already handles it? The difference would be the ability to assign multiple residues to a feature. If this is a use case, then we can add it.
2. Do we want `ma_feature_list.feature_type` to support contiguous residue ranges? If yes, then `_ma_poly_residue_feature` can have begin and end data items for `seq_id` and `comp_id`.
3. What is the use case for `ma_qa_metric.type` = `boolean`? Should this be a separate data item elsewhere rather than an enumeration of `ma_qa_metric.type`?
1. Do we need `_ma_poly_residue_feature` considering that `ma_qa_metric_local` sort of already handles it? The difference would be the ability to assign multiple residues to a feature. If this is a use case, then we can add it.
The main use case for it is to be able to handle pairs between an atom and a residue in `ma_qa_metric_feature_pairwise` (needed for AF3's PAE matrix). We would not be able to do it in any other way.
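The atom-residue pairing described above can be sketched in Python. This is only an illustration under stated assumptions, not the ModelCIF schema: the `Atom` tuple, `tokens_to_features` and `pairwise_rows` are hypothetical names, and the `standard` flag stands in for however a writer decides that a residue is a standard amino or nucleic acid. The sketch builds one feature per token (a residue for standard residues, a single atom otherwise) and then addresses pairwise values by feature pair.

```python
from typing import NamedTuple, Optional


class Atom(NamedTuple):
    """Hypothetical, minimal stand-in for one atom_site row."""
    asym_id: str
    seq_id: Optional[int]  # None for non-polymers
    comp_id: str
    atom_id: str
    standard: bool         # True for standard amino/nucleic acid residues


def tokens_to_features(atoms):
    """Return one feature per token, following the order of the atoms."""
    features = []
    seen_residues = set()
    for atom in atoms:
        if atom.standard:
            # All atoms of a standard residue share a single residue feature.
            key = (atom.asym_id, atom.seq_id)
            if key not in seen_residues:
                seen_residues.add(key)
                features.append(("residue", atom.asym_id, atom.seq_id, atom.comp_id))
        else:
            # Anything else (ligands, ions, modified residues) is one feature per atom.
            features.append(("atom", atom.asym_id, atom.comp_id, atom.atom_id))
    return features


def pairwise_rows(features, matrix):
    """Yield (feature_id_1, feature_id_2, value) using 1-based feature ids."""
    if len(matrix) != len(features):
        raise ValueError("matrix size must match the number of features")
    for i, row in enumerate(matrix, start=1):
        for j, value in enumerate(row, start=1):
            yield i, j, value


# Made-up example: one standard residue (two atoms) and one MG ion.
atoms = [
    Atom("B", 1, "ARG", "N", True),
    Atom("B", 1, "ARG", "CA", True),
    Atom("F", None, "MG", "MG", False),
]
features = tokens_to_features(atoms)
pae = [[0.5, 4.0], [3.5, 0.2]]  # made-up values, one row/column per token
rows = list(pairwise_rows(features, pae))
```

Here the second row pairs the residue feature of chain B with the atom feature of the MG ion, which is the case that a purely per-residue or purely per-atom table cannot express.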
2. Do we want `ma_feature_list.feature_type` to support contiguous residue ranges? If yes, then `_ma_poly_residue_feature` can have begin and end data items for `seq_id` and `comp_id`.
This would make the main existing use case in AF3 more verbose than necessary (we need a feature for each polymer residue to handle the PAE matrix) while I currently do not have a use case for contiguous residue ranges. If we need those ranges in the future, I would prefer to have them in a separate table.
3. What is the use case for `ma_qa_metric.type` = `boolean`? Should this be a separate data item elsewhere rather than an enumeration of `ma_qa_metric.type`?
The default ranking score in AF3 is calculated as 0.8 × ipTM + 0.2 × pTM + 0.5 × disorder − 100 × has_clash. I would like to be able to properly store all components of that and `has_clash` is a boolean pass/fail score (1 = pass, 0 = fail).
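As a small worked example, the formula quoted above can be written as a Python function. The function name and argument names are hypothetical, and `has_clash` is simply taken as the 0/1 value that enters the formula; this is a sketch of the arithmetic, not AF3's own code.

```python
def af3_ranking_score(iptm, ptm, disorder, has_clash):
    """Combine the four components with the weights quoted in this thread."""
    return 0.8 * iptm + 0.2 * ptm + 0.5 * disorder - 100.0 * float(has_clash)


# Made-up component values for illustration only.
score_without_penalty = af3_ranking_score(iptm=0.5, ptm=0.5, disorder=0.0, has_clash=0)
score_with_penalty = af3_ranking_score(iptm=0.5, ptm=0.5, disorder=0.0, has_clash=1)
```

With these inputs the two scores differ by exactly 100, which shows why the boolean component needs to be stored alongside ipTM, pTM and the disorder fraction to reproduce the ranking.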
Thanks for clarifying @gtauriello.
The default ranking score in AF3 is calculated as 0.8 × ipTM + 0.2 × pTM + 0.5 × disorder − 100 × has_clash. I would like to be able to properly store all components of that and has_clash is a boolean pass/fail score (1 = pass, 0 = fail).
Should the enumeration for `ma_qa_metric.type` be `has_clash` or `boolean`?
Never mind. Boolean is good.
@gtauriello I suggest we add enumerations to `_ma_associated_archive_file_details.file_content` and `_ma_entry_associated_files.file_content`.
It can be generic (`QA metrics`) or specific (`feature-based QA scores`).
For `ma_qa_metric.type`: yes for `boolean` as you concluded already.
For `file_content`: I had not noticed that one but it is an excellent point. I would go for the generic (`QA metrics`) option and add a note for `local pairwise QA scores` that this is deprecated in favor of `QA metrics`.
Thanks @gtauriello. Updates have been committed, please see https://github.com/ihmwg/ModelCIF/pull/25.
Related to #20 and the issues mentioned in there, I would suggest extending ModelCIF to capture all new types of quality estimates introduced with AlphaFold 3 (AF3). I also had a look at RoseTTAFold-AllAtom and the suggestions below would also capture anything needed there. I also believe that this should cover anything needed for https://github.com/chaidiscovery/chai-lab/issues/52. Here are my suggested additions:
- `_ma_qa_metric.type` to include:
- `_ma_qa_metric.mode` to include "per-chain", "per-chain-pairwise", "per-atom" and "per-atom-pairwise" (and yes I know it's a bit unfortunate that we used "local" for "per-residue" but ok...)
- `_ma_qa_metric_per_chain` same as `_ma_qa_metric_local` but without `label_comp_id` and `label_seq_id`
- `_ma_qa_metric_per_chain_pairwise` same as `_ma_qa_metric_local_pairwise` but without `label_comp_id*` and `label_seq_id*`
- `_ma_qa_metric_per_atom` same as `_ma_qa_metric_local` but using atom_id (linked to `_atom_site.id`) instead of `model_id` and `label_*`
- `_ma_qa_metric_per_atom_pairwise` same as `_ma_qa_metric_local_pairwise` but using atom_id_1 and atom_id_2 (linked to `_atom_site.id`) instead of `model_id` and `label_*`
Concretely for AF3 output (e.g. looking at the JSON files in one of their examples) here is how each of the scores would map to a `_ma_qa_metric.mode` and `.type`:
- `fraction_disordered`: "global", "normalized score"
- `has_clash`: "global", "boolean"
- `iptm`: "global", "ipTM"
- `ptm`: "global", "pTM"
- `ranking_score`: "global", "normalized score"
- `chain_ptm`: "per-chain", "pTM"
- `chain_iptm`: "per-chain", "ipTM"
- `chain_pair_iptm`: "per-chain-pairwise", "ipTM"
- `chain_pair_pae_min`: "per-chain-pairwise", "PAE"
- `atom_plddts`: "per-atom", "pLDDT to polymer"
- `contact_probs`: "per-atom-pairwise", "contact probability"
- `pae`: "per-atom-pairwise", "PAE"
Some caveats to consider:
- `contact_probs` and `pae` above are defined per "token" pair, where a token is either a full residue (for standard amino and nucleic acids) or a single atom otherwise. In AF3, the per-residue tokens have a well defined "token centre atom" (CA for standard amino acids, C1' for standard nucleotides) which could be used in per-atom scores but this may be confusing.
- `label_asym_id` and do not have a `label_seq_id` and one could also change that to giving them separate `label_asym_id` in ModelCIF to fix this.
Alternative to the above (which simplifies some things and handles the per token scores):
- `_ma_qa_metric_local` and `_ma_qa_metric_local_pairwise` to include `label_atom_id` (linked to `_atom_site.label_atom_id`) which can be set to '.' for per-residue scores.
- `label_comp_id` and `label_seq_id` to be set to '.'.
- `_ma_qa_metric_local` and `_ma_qa_metric_local_pairwise` tables and no additional tables or `_ma_qa_metric.mode` values would be necessary.
@brindakv what are your thoughts on this?
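For anyone writing a converter, the AF3 score mapping proposed in this issue can be kept as a simple lookup table. The sketch below is an assumption-laden illustration: `AF3_SCORE_MAPPING` and `scores_by_mode` are hypothetical names, and the "per-chain", "per-chain-pairwise", "per-atom" and "per-atom-pairwise" modes are the values suggested here, not necessarily part of any released ModelCIF dictionary.

```python
from collections import defaultdict

# AF3 JSON key -> (proposed _ma_qa_metric.mode, proposed _ma_qa_metric.type)
AF3_SCORE_MAPPING = {
    "fraction_disordered": ("global", "normalized score"),
    "has_clash": ("global", "boolean"),
    "iptm": ("global", "ipTM"),
    "ptm": ("global", "pTM"),
    "ranking_score": ("global", "normalized score"),
    "chain_ptm": ("per-chain", "pTM"),
    "chain_iptm": ("per-chain", "ipTM"),
    "chain_pair_iptm": ("per-chain-pairwise", "ipTM"),
    "chain_pair_pae_min": ("per-chain-pairwise", "PAE"),
    "atom_plddts": ("per-atom", "pLDDT to polymer"),
    "contact_probs": ("per-atom-pairwise", "contact probability"),
    "pae": ("per-atom-pairwise", "PAE"),
}


def scores_by_mode(mapping):
    """Group score names by their proposed mode, keeping insertion order."""
    grouped = defaultdict(list)
    for score_name, (mode, _metric_type) in mapping.items():
        grouped[mode].append(score_name)
    return dict(grouped)


grouped_scores = scores_by_mode(AF3_SCORE_MAPPING)
```

Grouping by mode makes it easy to see which table each score would land in under the first proposal (one table per mode).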