ihmwg / ModelCIF

mmCIF-based extension dictionary for computed structure models
Creative Commons Zero v1.0 Universal

Extension of ModelCIF for AF3 quality estimates #21

Open gtauriello opened 2 months ago

gtauriello commented 2 months ago

Related to #20 and the issues mentioned there, I suggest extending ModelCIF to capture all new types of quality estimates introduced with AlphaFold 3 (AF3). I also had a look at RoseTTAFold-AllAtom, and the suggestions below would capture anything needed there as well. I also believe that this should cover anything needed for https://github.com/chaidiscovery/chai-lab/issues/52. Here are my suggested additions:

  1. Extend _ma_qa_metric.type to include:
    • "pLDDT to polymer" with detailed description "confidence score predicting accuracy according to lDDT with distances from each atom to CA or C1' of nearby polymer residues in [0,100]"
    • "boolean" with detailed description "0 or 1 depending on whether a check passed (1) or not (0)."
  2. Extend _ma_qa_metric.mode to include "per-chain", "per-chain-pairwise", "per-atom" and "per-atom-pairwise" (and yes I know it's a bit unfortunate that we used "local" for "per-residue" but ok...)
  3. New _ma_qa_metric_per_chain same as _ma_qa_metric_local but without label_comp_id and label_seq_id
  4. New _ma_qa_metric_per_chain_pairwise same as _ma_qa_metric_local_pairwise but without label_comp_id* and label_seq_id*
  5. New _ma_qa_metric_per_atom same as _ma_qa_metric_local but using atom_id (linked to _atom_site.id) instead of model_id and label_*
  6. New _ma_qa_metric_per_atom_pairwise same as _ma_qa_metric_local_pairwise but using atom_id_1 and atom_id_2 (linked to _atom_site.id) instead of model_id and label_*
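
To make items 3 to 6 concrete, here is a rough sketch of how the item lists of the four proposed categories would follow from the existing ones. The item names listed for _ma_qa_metric_local and _ma_qa_metric_local_pairwise are my reading of the current dictionary and should be checked against it; the helper functions are purely illustrative and not part of any ModelCIF tooling.

```python
# Item names of the existing categories (assumption: taken from my reading
# of the current ModelCIF dictionary; verify before relying on them).
LOCAL = [
    "ordinal_id", "model_id",
    "label_asym_id", "label_seq_id", "label_comp_id",
    "metric_id", "metric_value",
]
LOCAL_PAIRWISE = [
    "ordinal_id", "model_id",
    "label_asym_id_1", "label_seq_id_1", "label_comp_id_1",
    "label_asym_id_2", "label_seq_id_2", "label_comp_id_2",
    "metric_id", "metric_value",
]


def drop_residue_items(items):
    """Items 3 and 4: remove label_comp_id* and label_seq_id*."""
    return [
        item for item in items
        if not item.startswith(("label_comp_id", "label_seq_id"))
    ]


def use_atom_ids(items, atom_items):
    """Items 5 and 6: replace model_id and all label_* items by atom id(s)."""
    result = []
    inserted = False
    for item in items:
        if item == "model_id" or item.startswith("label_"):
            if not inserted:
                # Put the atom id item(s) where the replaced items started.
                result.extend(atom_items)
                inserted = True
            continue
        result.append(item)
    return result


PER_CHAIN = drop_residue_items(LOCAL)
PER_CHAIN_PAIRWISE = drop_residue_items(LOCAL_PAIRWISE)
PER_ATOM = use_atom_ids(LOCAL, ["atom_id"])
PER_ATOM_PAIRWISE = use_atom_ids(LOCAL_PAIRWISE, ["atom_id_1", "atom_id_2"])
```

Under these assumptions the per-chain tables keep model_id and label_asym_id*, while the per-atom tables reduce to ordinal_id, the atom id(s), metric_id and metric_value.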

Concretely for AF3 output (e.g. looking at the JSON files in one of their examples) here is how each of the scores would map to a _ma_qa_metric.mode and .type:

Some caveats to consider:

Alternative to the above (which simplifies some things and handles the per token scores):

@brindakv what are your thoughts on this?

gtauriello commented 1 month ago

Notes from discussions with @benmwebb, @brindakv and @aozalevsky (on Oct. 16):

Example AF3 output (cut to only include one model instead of 5): fold_test_fold_job_number_one_cut.zip. Info on content:

Suggested ModelCIF extension:

aozalevsky commented 2 weeks ago

@gtauriello, I just wanted to follow up on this. With AF3 code and weights being released and with the recent addition of restraints to Chai-1, we can expect rapid growth in the number of deposited models. Would be nice to have the scores in those models.

gtauriello commented 1 week ago

I agree. @brindakv was waiting for me to decide on a separate issue that we wanted to address in the same ModelCIF update, and I have now added that here as issue #23. Hence, I think that she can now do the updates according to the open issues here.

Afterwards, we can try to suggest changes in alphafold3/model/mmcif_metadata.py to include this (and check if other things are invalid in their files).

brindakv commented 1 day ago

@gtauriello please clarify my questions below.

  1. Do we need _ma_poly_residue_feature considering that ma_qa_metric_local sort of already handles it? The difference would be the ability to assign multiple residues to a feature. If this is a use case, then we can add it.
  2. Do we want ma_feature_list.feature_type to support contiguous residue ranges? If yes, then _ma_poly_residue_feature can have begin and end data items for seq_id and comp_id.
  3. What is the use case for ma_qa_metric.type = boolean? Should this be a separate data item elsewhere rather than an enumeration of ma_qa_metric.type?

gtauriello commented 1 day ago

> 1. Do we need `_ma_poly_residue_feature` considering that `ma_qa_metric_local` sort of already handles it? The difference would be the ability to assign multiple residues to a feature. If this is a use case, then we can add it.

The main use case for it is to be able to handle pairs between an atom and a residue in ma_qa_metric_feature_pairwise (needed for AF3's PAE matrix). We would not be able to do it in any other way.

> 2. Do we want `ma_feature_list.feature_type` to support contiguous residue ranges? If yes, then `_ma_poly_residue_feature` can have begin and end data items for `seq_id` and `comp_id`.

This would make the main existing use case in AF3 more verbose than necessary (we need a feature for each polymer residue to handle the PAE matrix) while I currently do not have a use case for contiguous residue ranges. If we need those ranges in the future, I would prefer to have them in a separate table.

> 3. What is the use case for `ma_qa_metric.type` = `boolean`? Should this be a separate data item elsewhere rather than an enumeration of `ma_qa_metric.type`?

The default ranking score in AF3 is calculated as 0.8 × ipTM + 0.2 × pTM + 0.5 × disorder − 100 × has_clash. I would like to be able to properly store all components of that and has_clash is a boolean pass/fail score (1 = pass, 0 = fail).
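
As a minimal sketch of the formula above: the function and parameter names below are hypothetical and not taken from the AF3 code; only the weights come from the formula as quoted here.

```python
def ranking_score(iptm: float, ptm: float, disorder: float, has_clash: bool) -> float:
    """Combine the four components using the weights quoted above.

    All names are illustrative; has_clash is stored as 0/1, which is why a
    "boolean" metric type is needed to keep every component in the file.
    """
    return 0.8 * iptm + 0.2 * ptm + 0.5 * disorder - 100.0 * float(has_clash)
```

With each component stored as its own QA metric, a reader of the file could recompute the ranking score instead of having to trust a single aggregated value.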

brindakv commented 1 day ago

Thanks for clarifying @gtauriello.

> The default ranking score in AF3 is calculated as 0.8 × ipTM + 0.2 × pTM + 0.5 × disorder − 100 × has_clash. I would like to be able to properly store all components of that and has_clash is a boolean pass/fail score (1 = pass, 0 = fail).

Should the enumeration for ma_qa_metric.type be has_clash or boolean?

Never mind. Boolean is good.

brindakv commented 1 day ago

@gtauriello I suggest we add enumerations to _ma_associated_archive_file_details.file_content and _ma_entry_associated_files.file_content.

It can be generic (QA metrics) or specific (feature-based QA scores).

gtauriello commented 20 hours ago

For ma_qa_metric.type: yes for boolean as you concluded already.

For file_content: I had not noticed that one but it is an excellent point. I would go for the generic (QA metrics) option and add a note for local pairwise QA scores that this is deprecated in favor of QA metrics.

brindakv commented 16 hours ago

Thanks @gtauriello. Updates have been committed, please see https://github.com/ihmwg/ModelCIF/pull/25.