google-deepmind / alphafold3

AlphaFold 3 inference pipeline.

mmCIF to input JSON #95

Open darkcorvushhh opened 1 day ago

darkcorvushhh commented 1 day ago

Is there any code for data processing? For example, something that quickly reads all PDB files in a directory and converts them to the JSON format that AlphaFold 3 needs.

Augustin-Zidek commented 1 day ago

We don't support PDB as a format, but if they are in mmCIF format, you can use the Input.from_mmcif() method to read those, see https://github.com/google-deepmind/alphafold3/blob/main/src/alphafold3/common/folding_input.py#L795.

You could use this to write a simple script like this (note that I have not tested this):

import os
from alphafold3.common import folding_input
from alphafold3.constants import chemical_components

input_dir = ...
output_dir = ...

for mmcif_name in os.listdir(input_dir):
  print(f'Converting {mmcif_name}')
  mmcif_file_path = os.path.join(input_dir, mmcif_name)

  with open(mmcif_file_path) as f:
    mmcif = f.read()

  alphafold_input = folding_input.Input.from_mmcif(
      mmcif, ccd=chemical_components.cached_ccd()
  )

  with open(os.path.join(output_dir, f'{mmcif_name.removesuffix(".cif")}.json'), 'wt') as f:
    f.write(alphafold_input.to_json())

print('Done')

darkcorvushhh commented 8 hours ago

Thanks! But how can I get the ccd input? from_mmcif needs two arguments.

Augustin-Zidek commented 8 hours ago

Ah, sorry:

from alphafold3.constants import chemical_components
...

folding_input.Input.from_mmcif(mmcif, ccd=chemical_components.cached_ccd())

I amended the example above as well.

darkcorvushhh commented 6 hours ago

It works! But here's another problem: I tested this code on a small portion of the RCSB PDB database, and some of the results seem to be wrong. Some of the generated JSON files contain bondedAtomPairs, and prediction on them fails with the error: ValueError: Invalid chain ID(s) in bond [['A', 1, 'N3'], ['B', 10, 'N1']]. Is this the correct conversion?

Augustin-Zidek commented 6 hours ago

Could you give me the PDB ID so I can reproduce?

darkcorvushhh commented 5 hours ago

I tested 100D from https://www.rcsb.org/structure/100D and 200L from https://www.rcsb.org/structure/200L. The generated JSON files are attached:

100d.json 200l.json

When I tested 100D, it raised ValueError: Invalid chain ID(s) in bond [['A', 1, 'N3'], ['B', 10, 'N1']].

When I tested 200L, the terminal kept printing `W1122 11:40:03.627055 125448981841472 templates.py:699] Failed to get mmCIF for ****` (for example 6h7o), even though the result eventually appeared.

Augustin-Zidek commented 2 hours ago

So the problem with 100D is that it doesn't contain any protein, RNA, or DNA chains; it has just an RNA/DNA hybrid, which AlphaFold 3 doesn't support. That being said, Input.from_mmcif should not include bonds that involve chains that haven't been included. I will send a fix for that.
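
Until such a fix lands, a check along these lines could flag converted files whose bonds point at chains that were dropped. This is a minimal sketch and not part of the AlphaFold 3 codebase: the helper name find_dangling_bonds is hypothetical, and it assumes the input JSON layout from the AlphaFold 3 input documentation (a top-level "sequences" list of single-key entries whose "id" is a string or a list of strings, plus an optional "bondedAtomPairs" list).

```python
import json


def find_dangling_bonds(input_json: str) -> list:
    """Returns bonds that reference a chain ID missing from "sequences".

    Hypothetical helper, not an AlphaFold 3 API.
    """
    data = json.loads(input_json)

    # Collect every chain ID declared under "sequences". Each entry is assumed
    # to be a single-key dict such as {"protein": {"id": "A", ...}}, where
    # "id" is either one string or a list of strings.
    known_ids = set()
    for entry in data.get("sequences", []):
        for entity in entry.values():
            ids = entity["id"]
            known_ids.update([ids] if isinstance(ids, str) else ids)

    # Each bond is a pair of [chain_id, residue_index, atom_name] triples.
    # "bondedAtomPairs" may be absent or null, so fall back to an empty list.
    return [
        bond
        for bond in data.get("bondedAtomPairs") or []
        if any(atom[0] not in known_ids for atom in bond)
    ]


# Made-up input that mimics the failure above: the bond names chains A and B,
# but neither of them is declared in "sequences".
example = json.dumps({
    "name": "example",
    "sequences": [{"ligand": {"id": "C", "ccdCodes": ["HOH"]}}],
    "bondedAtomPairs": [[["A", 1, "N3"], ["B", 10, "N1"]]],
})
print(find_dangling_bonds(example))
```

An empty result means every bond refers to a declared chain. A non-empty result lists the bonds that would trigger the "Invalid chain ID(s) in bond" error.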

200L works as intended. The warnings indicate that you are missing (some) mmCIF template files -- are you sure your paths are set correctly and that you have downloaded all of the PDB mmCIF files?
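
One way to check the template files is to list which PDB IDs have no mmCIF file on disk. The sketch below is an assumption-laden illustration and not AlphaFold 3 code: missing_mmcif_ids is a hypothetical helper, and it assumes the template files sit in one flat directory and are named <pdb_id>.cif.

```python
import pathlib


def missing_mmcif_ids(mmcif_dir, pdb_ids):
    """Returns the PDB IDs that have no <pdb_id>.cif file in mmcif_dir.

    Hypothetical helper; the flat <pdb_id>.cif layout is an assumption.
    """
    # Compare case-insensitively, since PDB IDs appear in both cases.
    present = {p.stem.lower() for p in pathlib.Path(mmcif_dir).glob("*.cif")}
    return [pdb_id for pdb_id in pdb_ids if pdb_id.lower() not in present]
```

For example, calling missing_mmcif_ids(mmcif_dir, ["6h7o"]) with mmcif_dir set to your own template directory returns ["6h7o"] if that file is absent, which would match the warning above.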