google-deepmind / alphafold3

AlphaFold 3 inference pipeline.

Empty MSA #67

Closed tony9664 closed 1 day ago

tony9664 commented 4 days ago

I want to run AF3 without an MSA. From the documentation (https://github.com/google-deepmind/alphafold3/blob/main/docs/input.md) I learned that you can set the unpaired MSA to an empty string. I used the input file below:

{
  "name": "2PV7-nomsa",
  "sequences": [
    {
      "protein": {
        "id": ["A"],
        "sequence": "GMRESYANENQFGFKTINSDIHKIVIVGGYGKLGGLFARYLRASGYPISILDREDWAVAESILANADVVIVSVPINLTLETIERLKPYLTENMLLADLTSVKREPLAKMLEVHTGAVLGLHPMFGADIASMAKQVVVRCDGRFPERYEWLLEQIQIWGAKIYQTNATEHDHNMTYIQALRHFSTFANGLHLSKQPINLANLLALSSPIYRLELAMIGRLFAQDAELYADIIMDKSENLAVIETLKQTYDEALTFFENNDRQGFIDAFHKVRDWFGDYSEQFLKESRQLLQQANDLKQG",
        "unpairedMsa": "",
        "pairedMsa": "",
        "templates": []
      }
    }
  ],
  "modelSeeds": [1],
  "dialect": "alphafold3",
  "version": 1
}

However, the code will still try to search for MSA and templates.

I am also confused by the documentation. In one place it says:

Note that if you set the unpairedMsa field for a particular protein entity, you will also have to explicitly set the pairedMsa field (typically to empty string) and templates (either to a list of templates, or an empty list to run template-free).

Later it says:

When setting unpairedMsa manually, the pairedMsa must be left unset (i.e. the pairedMsa key should not be present in the JSON).

what should I do with the pairedMSA?

My third question is, for a hetero-oligomer prediction, if I want to manually set the MSA, should I put the same MSA under each protein entity?

Augustin-Zidek commented 4 days ago

> However, the code will still try to search for MSA and templates.

This is weird, the input is a correct MSA-free and templates-free input. Doesn't it say in the logs that it is skipping MSA/template search?

> what should I do with the pairedMSA?

Set it to an empty string.

> My third question is, for a hetero-oligomer prediction, if I want to manually set the MSA, should I put the same MSA under each protein entity?

If the MSA is the same for all chains (for example, 4 identical chains), you can use the multi-chain ID trick: "id": ["A", "B", "C", "D"], and then set the MSA just once for this multi-chain entity.

If the MSA is different for each chain in the oligomer, then you will have to set it separately.
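The two cases above can be sketched in Python as follows. This is only an illustration: the helper name `make_protein_entity`, the second sequence, and the query-only placeholder MSAs are hypothetical, and the field names are the ones used in the input JSON earlier in this thread.

```python
import json


def make_protein_entity(chain_ids, sequence, unpaired_msa, paired_msa=""):
    """Build one 'protein' entry; every ID in chain_ids shares the same MSA."""
    return {
        "protein": {
            "id": list(chain_ids),
            "sequence": sequence,
            "unpairedMsa": unpaired_msa,
            "pairedMsa": paired_msa,
            "templates": [],  # template-free
        }
    }


# Case 1: four identical chains, so the MSA is set once for one entity.
seq = "GMRESYAN"
shared_msa = f">query\n{seq}"  # placeholder MSA, for illustration only
tetramer = [make_protein_entity(["A", "B", "C", "D"], seq, shared_msa)]

# Case 2: chains with different MSAs, so one entity (and one MSA) per chain.
other_seq = "MKTAYIAK"  # hypothetical second sequence
dimer = [
    make_protein_entity(["A"], seq, f">query\n{seq}"),
    make_protein_entity(["B"], other_seq, f">query\n{other_seq}"),
]

print(json.dumps(tetramer, indent=2))
```

Either list would go under the top-level "sequences" key of the input file.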

tony9664 commented 3 days ago

> This is weird, the input is a correct MSA-free and templates-free input. Doesn't it say in the logs that it is skipping MSA/template search?

I tried again and the code still searches for MSAs. However, when I set a custom non-empty MSA, the log says it is skipping the search.

nzrandol commented 3 days ago

This seems to work for me if I set run_data_pipeline to False.

Augustin-Zidek commented 3 days ago

I checked the code and it is a bug, I will send a fix soon (likely tomorrow). Thank you very much for reporting!

There are two possible workarounds for the time being:

  1. Skipping the data pipeline completely by setting --run_data_pipeline=false, as suggested by @nzrandol. However, this could be undesirable if, for instance, you have a dimer and want to run the data pipeline for one chain but not for the other. In that case you should use option 2.
  2. Providing MSAs with just the query sequence and empty templates, e.g. for a query GMRESYAN, you would set:

    "id": ["A"],
    "sequence": "GMRESYAN",
    "unpairedMsa": ">query\nGMRESYAN",
    "pairedMsa": ">query\nGMRESYAN",
    "templates": []
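
As a rough sketch of option 2 for the dimer case mentioned in option 1, the Python below builds an input where chain A leaves the MSA and template keys unset (on the assumption, taken from the input documentation, that unset keys are filled in by the data pipeline) and chain B gets a query-only MSA with empty templates. The helper names, the job name, and the second sequence are hypothetical.

```python
import json


def query_only_msa(sequence):
    """Return an MSA string that contains only the query sequence."""
    return f">query\n{sequence}"


def protein_without_search(chain_id, sequence):
    """Entity with a query-only MSA and no templates (workaround 2)."""
    msa = query_only_msa(sequence)
    return {
        "protein": {
            "id": [chain_id],
            "sequence": sequence,
            "unpairedMsa": msa,
            "pairedMsa": msa,
            "templates": [],
        }
    }


def protein_with_search(chain_id, sequence):
    """Entity with MSA/template keys unset, left for the data pipeline."""
    return {"protein": {"id": [chain_id], "sequence": sequence}}


fold_input = {
    "name": "dimer-mixed-msa",  # hypothetical job name
    "sequences": [
        protein_with_search("A", "GMRESYAN"),
        protein_without_search("B", "MKTAYIAK"),  # hypothetical sequence
    ],
    "modelSeeds": [1],
    "dialect": "alphafold3",
    "version": 1,
}

print(json.dumps(fold_input, indent=2))
```
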

zzhangzzhang commented 2 days ago

I'd like to run it with an MSA but without templates. How do I know whether it is still using templates when I set "templates": []?

The log says "Filtering protein templates took 0.00 seconds for sequence". Does this indicate that no templates were used, or is there a better way to double-check?

Thanks a lot!

tony9664 commented 2 days ago

Thank you!

zzhangzzhang commented 1 day ago

I think it is still using templates even though I set "templates": []. The log shows 'Filtering protein templates for sequence' and 'Filtering protein templates took 0.01 seconds for sequence', and the predicted structure is exactly the same as the one predicted without templates.
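
One possible way to double-check, sketched here under assumptions, is to count the templates recorded per chain in a JSON that has already been through the data pipeline. The sketch assumes that the run writes such a file (named along the lines of `<job_name>_data.json`) and that it uses the same "sequences"/"protein"/"templates" layout as the input file. The function name `template_counts` is hypothetical.

```python
def template_counts(data):
    """Map each protein chain ID to the number of templates stored for it."""
    counts = {}
    for entry in data.get("sequences", []):
        protein = entry.get("protein")
        if protein is None:
            continue  # skip non-protein entities
        chain_ids = protein["id"]
        if isinstance(chain_ids, str):
            chain_ids = [chain_ids]  # "id" may be a single string
        n_templates = len(protein.get("templates") or [])
        for chain_id in chain_ids:
            counts[chain_id] = n_templates
    return counts


# In-memory example standing in for a parsed data JSON.
example = {
    "sequences": [
        {
            "protein": {
                "id": ["A"],
                "sequence": "GMRESYAN",
                "unpairedMsa": ">query\nGMRESYAN",
                "pairedMsa": "",
                "templates": [],
            }
        }
    ]
}

print(template_counts(example))
```

For a real run, the dictionary would come from `json.load` on the data JSON in the output directory; a count of 0 for every chain would suggest that no templates were passed to the model.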

Augustin-Zidek commented 1 day ago

Fixed in https://github.com/google-deepmind/alphafold3/commit/f2579c94952ea38e7e5b47156e105fe9e3ed99bb.

I am also planning to add a separate option that skips only the template search while still searching for MSAs; that is tracked in https://github.com/google-deepmind/alphafold3/issues/88.