Questions about the chain/interface clustering files

zqcai19 commented 6 days ago

@lucidrains @amorehead Hi, thank you very much for your efforts in the reproduction of AlphaFold3. I have downloaded the preprocessed mmCIF files and chain/interface clustering files as described in the README and would like to use the clustered test set to evaluate AF3.

Based on my understanding, the JSON, CSV, and FASTA files should contain the chain IDs, cluster mappings, and sequences. However, I noticed inconsistencies between them and the RCSB PDB. For example, in filtered_all_chain_sequences.json:

8a14-assembly1: The file only records 2 chains, whereas RCSB shows that it has 6 chains.
8sza-assembly1: The file does not seem to include ligand information.
The sequences in both cases appear to be cropped compared to the original sequences in RCSB.

Other entries have similar inconsistencies as well. Am I missing something here? How should I use the chain/interface clustering files to evaluate AF3?
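A minimal sketch of the kind of check described above, for anyone who wants to reproduce the comparison. It assumes the JSON maps each entry ID (such as 8a14-assembly1) to a mapping of chain ID to sequence; that layout is an assumption, not something confirmed in this thread, and the entry ID and sequences in the toy data are made up. The helper name summarize_entry is hypothetical.

```python
import json


def summarize_entry(chain_sequences, entry_id):
    """Return {chain_id: sequence_length} for one entry, or {} if it is absent.

    Assumes `chain_sequences` has the (hypothetical) layout
    {entry_id: {chain_id: sequence}}.
    """
    chains = chain_sequences.get(entry_id, {})
    return {chain_id: len(seq) for chain_id, seq in chains.items()}


# Toy stand-in for filtered_all_chain_sequences.json; not real data.
toy_json = '{"xxxx-assembly1": {"A": "MKT", "B": "GSHMA"}}'
chain_sequences = json.loads(toy_json)

summary = summarize_entry(chain_sequences, "xxxx-assembly1")
print(f"{len(summary)} chains recorded: {summary}")
```

With the real file, the toy string would be replaced by `json.load` on the downloaded file, and the resulting chain count and lengths could then be compared against the entry's page on RCSB.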

Thank you in advance for your help!

amorehead commented 6 days ago

Hi, @zqcai19.

My first thought is that these differences may be the result of the PDB dataset's preprocessing scripts, as described in the AF3 paper. These scripts will (in several cases) drop residues or chains that do not meet AF3's strict filtering criteria. For more details, I recommend reviewing the preprocessing scripts in scripts/, and let me know if you have any other questions.
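To illustrate how this kind of filtering can reduce both the chain count and the sequence lengths, here is a rough sketch. It is not the code in scripts/; the Chain class, the function names, the threshold of 4 resolved residues, and the all-unknown-residue rule are illustrative assumptions, and the toy chains are made up.

```python
from dataclasses import dataclass
from typing import List

MIN_RESOLVED_RESIDUES = 4  # illustrative threshold, an assumption


@dataclass
class Chain:
    chain_id: str
    sequence: str          # one-letter codes, "X" for unknown residues
    resolved: List[bool]   # True where the residue has coordinates


def filter_chains(chains, min_resolved=MIN_RESOLVED_RESIDUES):
    """Keep chains with enough resolved residues and at least one known residue."""
    kept = []
    for chain in chains:
        if sum(chain.resolved) < min_resolved:
            continue  # too few resolved residues: drop the whole chain
        if set(chain.sequence) <= {"X"}:
            continue  # every residue is unknown: drop the whole chain
        kept.append(chain)
    return kept


def resolved_sequence(chain):
    """Return only the resolved residues, which shortens the stored sequence."""
    return "".join(res for res, ok in zip(chain.sequence, chain.resolved) if ok)


# Toy chains, not real data.
chains = [
    Chain("A", "MKTAYIAK", [False, True, True, True, True, True, True, False]),
    Chain("B", "GSH", [True, True, False]),
    Chain("C", "XXXXX", [True] * 5),
]
kept = filter_chains(chains)
kept_ids = [c.chain_id for c in kept]
cropped = resolved_sequence(kept[0])
print(kept_ids, cropped)
```

In this toy case three deposited chains shrink to one, and the surviving sequence is shorter than the deposited one, which matches the kind of discrepancy reported above.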

zqcai19 commented 6 days ago

@amorehead Thank you for the quick response! I still have some doubts regarding the evaluation process. Should I use the filtered and cropped sequences from filtered_all_chain_sequences.json for inference? I couldn’t find any description in the AF3 paper or its Supplementary Information about cropping the sequences for the evaluation (only the training process was mentioned). Did I miss something?

amorehead commented 6 days ago

Hi, @zqcai19. Filtering the train, val, and test structures this way (particularly the test structures) seems to be implicitly suggested by the AF3 paper. I interpreted the paper this way so that all three dataset splits are processed consistently.

lucidrains / alphafold3-pytorch

Questions about the chain/interface clustering files #307