Open darkcorvushhh opened 1 day ago
We don't support PDB as an input format, but if your structures are in mmCIF format, you can read them with the Input.from_mmcif()
method; see https://github.com/google-deepmind/alphafold3/blob/main/src/alphafold3/common/folding_input.py#L795.
You could use this to write a simple script like the following (note that I have not tested this):
import os

from alphafold3.common import folding_input
from alphafold3.constants import chemical_components

input_dir = ...
output_dir = ...

# Load the chemical components dictionary once; from_mmcif needs it.
ccd = chemical_components.cached_ccd()

for mmcif_name in os.listdir(input_dir):
    # Skip anything that is not an mmCIF file.
    if not mmcif_name.endswith('.cif'):
        continue
    print(f'Converting {mmcif_name}')
    with open(os.path.join(input_dir, mmcif_name)) as f:
        mmcif = f.read()
    alphafold_input = folding_input.Input.from_mmcif(mmcif, ccd=ccd)
    json_name = f'{mmcif_name.removesuffix(".cif")}.json'
    with open(os.path.join(output_dir, json_name), 'wt') as f:
        f.write(alphafold_input.to_json())

print('Done')
Thanks! But how can I get the ccd input? from_mmcif needs two arguments.
Ah, sorry:
from alphafold3.constants import chemical_components
...
folding_input.Input.from_mmcif(mmcif, ccd=chemical_components.cached_ccd())
I amended the example above as well.
It works! But here's another problem: I tested this code on a small portion of the RCSB PDB database, and some of the results seem to be wrong. Some of the generated JSON files contain bondedAtomPairs, but the prediction fails with the error: ValueError: Invalid chain ID(s) in bond [['A', 1, 'N3'], ['B', 10, 'N1']]. Is this the correct conversion?
Could you give me the PDB ID so I can reproduce?
I tested 100D from https://www.rcsb.org/structure/100D and 200L from https://www.rcsb.org/structure/200L. And the jsons are listed as follows:
When I tested 100D, it raised ValueError: Invalid chain ID(s) in bond [['A', 1, 'N3'], ['B', 10, 'N1']].
When I tested 200L, the terminal kept printing `W1122 11:40:03.627055 125448981841472 templates.py:699] Failed to get mmCIF for ****` (for example 6h7o), even though the result eventually appeared.
So the problem with 100D is that it doesn't contain any protein, RNA, or DNA chains; it has only an RNA/DNA hybrid, which AlphaFold 3 doesn't support. That being said, Input.from_mmcif should not include bonds that involve chains that haven't been included. I will send a fix for that.
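Until such a fix is available, one possible workaround is to clean the generated JSON before running the prediction. The following is only a sketch and not part of AlphaFold 3: drop_orphan_bonds is a hypothetical helper, the example dictionary is toy data with made-up atom names, and the code assumes that each entity under "sequences" carries an "id" that is either a string or a list of strings.

```python
import json


def drop_orphan_bonds(af3_input: dict) -> dict:
    """Return a copy of the input without bonds that reference unknown chains."""
    # Collect every chain ID declared under "sequences". An entity's "id" is
    # assumed to be a single string or a list of strings.
    known_ids = set()
    for entry in af3_input.get("sequences", []):
        for entity in entry.values():
            ids = entity["id"]
            known_ids.update([ids] if isinstance(ids, str) else ids)

    # Keep a bond only if both of its atoms sit on a declared chain.
    kept = [
        bond
        for bond in af3_input.get("bondedAtomPairs") or []
        if all(atom[0] in known_ids for atom in bond)
    ]

    cleaned = dict(af3_input)
    if kept:
        cleaned["bondedAtomPairs"] = kept
    else:
        cleaned.pop("bondedAtomPairs", None)
    return cleaned


# Toy input (hypothetical): chains C and D are declared, A and B are not.
example = {
    "name": "demo",
    "sequences": [{"protein": {"id": ["C", "D"], "sequence": "CG"}}],
    "bondedAtomPairs": [
        [["A", 1, "N3"], ["B", 10, "N1"]],
        [["C", 1, "SG"], ["D", 1, "SG"]],
    ],
}
cleaned_example = drop_orphan_bonds(example)
print(json.dumps(cleaned_example["bondedAtomPairs"]))
```

This only removes the dangling bonds; it does not make an unsupported entry such as an RNA/DNA hybrid predictable.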
200L works as intended. The warnings indicate that you are missing (some) mmCIF template files -- are you sure your paths are set correctly and that you have downloaded all of the PDB mmCIF files?
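A quick way to check whether specific template files are present is to look for them in the mmCIF directory. This is a rough sketch, not an AlphaFold 3 utility: missing_mmcif_ids is a hypothetical helper, and it assumes the directory holds one file per entry named with the lowercase PDB ID plus ".cif", which may not match your setup. The demo uses a temporary directory in place of a real database path.

```python
import os
import tempfile


def missing_mmcif_ids(mmcif_dir: str, pdb_ids: list[str]) -> list[str]:
    """Return the PDB IDs that have no '<id>.cif' file in mmcif_dir."""
    # Assumption: one file per entry, named with the PDB ID and a .cif suffix.
    present = {
        name.removesuffix(".cif").lower()
        for name in os.listdir(mmcif_dir)
        if name.endswith(".cif")
    }
    return [pdb_id for pdb_id in pdb_ids if pdb_id.lower() not in present]


# Demo with a throwaway directory that contains only one entry.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "200l.cif"), "w").close()
    demo_missing = missing_mmcif_ids(demo_dir, ["200L", "6h7o"])

print(demo_missing)
```

Running the helper against your own mmCIF directory with the IDs named in the warnings (for example 6h7o) would show whether those files are actually missing or whether the path is wrong.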
Is there any code for data processing, such as quickly reading all PDB files in a given directory and converting them to the JSON format that AlphaFold 3 needs?