MSA format for "unpairedMsa" in fold_input.json

smg3d commented 4 hours ago

Thanks for providing the AF3 source. It is really appreciated.

I could not find the format to use in order to provide our own MSA in the input json file.

The input documentation mentions "If the unpairedMsa field is set to a custom A3M string, AlphaFold 3 will use the provided MSA instead of building one as part of the data pipeline. This is considered an expert option.". But what is the format of the "custom A3M string"?

The doc provides the following two examples, but neither shows the string or list format for unpairedMsa:

{
  "protein": {
    "id": "A",
    "sequence": "PVLSCGEWQL",
    "modifications": [
      {"ptmType": "HY3", "ptmPosition": 1},
      {"ptmType": "P1L", "ptmPosition": 5}
    ],
    "unpairedMsa": ...,
    "pairedMsa": ...,
    "templates": [...]
  }
}

and

{
  "protein": {
    "id": "A",
    "sequence": ...,
    "unpairedMsa": "The A3M you want to run with",
    "pairedMsa": "",
    "templates": []
  }
}

For "unpairedMsa", I tried a filename and various list formats, but none of them work.

Hanziwww commented 4 hours ago

Here is my suggestion:

Prepare Your MSA: Format your MSA in A3M, which is similar to FASTA but can include lowercase letters for insertions.
Embed MSA Content in JSON: Place your MSA content in the "unpairedMsa" field of the input JSON file. Ensure newline characters are correctly handled with \n.

Example:

{
  "protein": {
    "id": "A",
    "sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF",
    "unpairedMsa": ">seq1\\nMVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF\\n>seq2\\nMVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTFFPHF",
    "pairedMsa": "",
    "templates": []
  }
}

Considerations:

Handling Newlines: In JSON strings, newlines should be represented by \\n (in the actual JSON file, it’s \n, but needs escaping in strings).
Direct Embedding: The "unpairedMsa" field should contain the actual MSA content string, not a filename or path.
Validate JSON Format: Make sure your JSON file is correctly formatted. You might want to use an online JSON validator for checking.

smg3d commented 2 hours ago

Thanks @Hanziwww .

Does that input.json work for you? For me, it does not recognize the first sequence of the MSA (looks like it reads an empty sequence):

    raise ValueError(
ValueError: First MSA sequence  is not the query_sequence='MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF'

Hanziwww commented 2 hours ago

Hi @smg3d,

You're absolutely right, I made a mistake in my previous response. The newline character in JSON strings should be represented as \n, not \\n. With \\n, the JSON parser produces a literal backslash followed by the letter n instead of a newline, so the MSA is never split into separate lines, which leads to errors like the one you encountered.
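
To make the difference concrete, here is a minimal Python sketch that decodes both variants with the standard json module. Note that parse_a3m is a hypothetical, simplified reader written only for this illustration; it is not AlphaFold 3's own parser, although the empty first sequence it produces is consistent with the error above.

```python
import json

# The same MSA written two ways inside a JSON document.
# Raw strings keep the backslashes exactly as they would appear in the file.
wrong = r'{"unpairedMsa": ">seq1\\nPVLSCGEWQL\\n>seq2\\nPILSCADWQ-"}'
right = r'{"unpairedMsa": ">seq1\nPVLSCGEWQL\n>seq2\nPILSCADWQ-"}'


def parse_a3m(text):
    """Hypothetical minimal A3M reader: returns a list of (header, sequence)."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new record starts at every header line.
            records.append([line[1:], ""])
        elif records:
            # Any other line extends the sequence of the current record.
            records[-1][1] += line.strip()
    return [tuple(record) for record in records]


wrong_records = parse_a3m(json.loads(wrong)["unpairedMsa"])
right_records = parse_a3m(json.loads(right)["unpairedMsa"])

# With \\n everything collapses into one header line with an empty sequence.
print(len(wrong_records), repr(wrong_records[0][1]))  # 1 ''
# With \n the two records are split as intended.
print(len(right_records), repr(right_records[0][1]))  # 2 'PVLSCGEWQL'
```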

Here's the corrected JSON input:

{
  "name": "My AlphaFold Job",
  "modelSeeds": [1],
  "sequences": [
    {
      "protein": {
        "id": "A",
        "sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF",
        "unpairedMsa": ">seq1\nMVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF\n>seq2\nMVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTFFPHF",
        "pairedMsa": "",
        "templates": []
      }
    }
  ],
  "dialect": "alphafold3",
  "version": 1
}
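
If you would rather not escape the newlines by hand, one option is to generate the file with a JSON library, which escapes them for you. The sketch below assumes the same top-level fields as the JSON above; build_fold_input is a hypothetical helper name, and in practice the A3M text would be read from your own alignment file rather than written inline.

```python
import json


def build_fold_input(name, chain_id, query, a3m_text, seeds=(1,)):
    """Hypothetical helper: fold input dict with a custom unpaired MSA."""
    return {
        "name": name,
        "modelSeeds": list(seeds),
        "sequences": [
            {
                "protein": {
                    "id": chain_id,
                    "sequence": query,
                    "unpairedMsa": a3m_text,  # the A3M content itself, not a path
                    "pairedMsa": "",
                    "templates": [],
                }
            }
        ],
        "dialect": "alphafold3",
        "version": 1,
    }


query = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF"
# In Python source, "\n" is a real newline character.
a3m_text = (
    ">seq1\n"
    "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF\n"
    ">seq2\n"
    "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTFFPHF\n"
)

fold_input = build_fold_input("My AlphaFold Job", "A", query, a3m_text)

# json.dumps writes each real newline as the two characters \n in the output.
json_text = json.dumps(fold_input, indent=2)
print(json_text)
```

Writing json_text to fold_input.json then gives a file whose "unpairedMsa" value contains single-backslash \n escapes, as in the corrected example.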

Here's how you can run AlphaFold using Docker with the corrected JSON:

docker run -it \
  --volume /home/mars/disk3/af3input:/root/af_input \
  --volume /home/mars/disk3/af3output:/root/af_output \
  --volume /home/mars/disk3/af3md:/root/models \
  --volume /home/mars/disk3/af3db:/root/public_databases \
  --gpus all alphafold3 \
  python run_alphafold.py \
  --json_path=/root/af_input/fold_input.json \
  --model_dir=/root/models \
  --output_dir=/root/af_output

output cif: my_alphafold_job_model.zip

Sorry for the misleading answer.

smg3d commented 1 hour ago

Thanks @Hanziwww .

It works now.

I think it might be a good idea to show such an example in the input doc:


{
  "protein": {
    "id": "A",
    "sequence": "PVLSCGEWQL",
    "modifications": [
      {"ptmType": "HY3", "ptmPosition": 1},
      {"ptmType": "P1L", "ptmPosition": 5}
    ],
    "unpairedMsa": ">seq1\nPVLSCGEWQL\n>seq2\nPILSCADWQ-",
    "pairedMsa": ...,
    "templates": [...]
  }
}
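
For anyone preparing their own MSA, a quick sanity check before launching a run might look like the sketch below. The first check mirrors the ValueError reported earlier in this thread; the column-count check is my own assumption about what a well-formed A3M should satisfy (lowercase letters are insertions and do not count as aligned columns), and I have not confirmed that AlphaFold 3 enforces it. check_unpaired_msa is a hypothetical helper name.

```python
def check_unpaired_msa(query, a3m_text):
    """Hypothetical pre-flight check; returns a list of problem descriptions."""
    sequences = []
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            sequences.append("")  # start a new record at each header
        elif sequences:
            sequences[-1] += line

    problems = []
    if not sequences or sequences[0] != query:
        problems.append("first MSA sequence is not the query sequence")
    for index, sequence in enumerate(sequences):
        # Drop lowercase insertions; gaps ("-") and uppercase residues remain.
        aligned = "".join(c for c in sequence if not c.islower())
        if len(aligned) != len(query):
            problems.append(
                f"row {index} has {len(aligned)} aligned columns, "
                f"expected {len(query)}"
            )
    return problems


print(check_unpaired_msa("PVLSCGEWQL", ">seq1\nPVLSCGEWQL\n>seq2\nPILSCADWQ-"))
```

An empty list means both checks passed; a string that still contains literal backslash-n sequences is reported as a mismatch with the query.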

Hanziwww commented 28 minutes ago

I'm glad to hear that the input is working now.

By the way, I'd like to introduce a user-friendly graphical interface that I developed to simplify generating the input JSON and running AlphaFold 3 predictions. Feel free to check out the GUI repository.

google-deepmind / alphafold3

MSA format for "unpairedMsa" in fold_input.json #47