lucidrains / alphafold3-pytorch

Implementation of Alphafold 3 in Pytorch
MIT License

Missing Unresolved Residues in `mmcif_parsing` #284

Open v-shaoningli opened 1 month ago

v-shaoningli commented 1 month ago

Hi All!

Thank you for your effort in developing the open-source AF3.

Issue Description

I have encountered an issue with the `mmcif_parsing` module related to unresolved residues. Because the protein sequence is parsed directly from the Biopython structure object, unresolved residues (those that do not appear in the mmCIF coordinate section, `_atom_site`) are not included in the `MmcifObject`.

Impact

We need the unresolved residues for some computations, such as calculating the relative solvent accessible surface area (RASA) of unresolved residues.

Example

For instance, the actual sequence for the protein with PDB ID 7a4d is:

QVQLQESGGGLVQPGGSLRLSCAAPGFRLDNYVIGWFRQAPGKEREGVSCISSSAGSTYYADSVKGRFTISRDNAKNTVYLQMNSLKPEDTAVYYCATACYSSYVTYWGQGTQVTVSSGRYPYDVPDYGSGRA

However, when using mmcif_parsing, the parsed sequence is:

QLQESGGGLVQPGGSLRLSCAAPGFRLDNYVIGWFRQAPGKEREGVSCISSSAGSTYYADSVKGRFTISRDNAKNTVYLQMNSLKPEDTAVYYCATACYSSYVTYWGQGTQVTVSSGR
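
To show where the two sequences above differ, here is a minimal sketch. The helper name `locate_missing_termini` is made up for this illustration and is not part of `alphafold3-pytorch`. It assumes the parsed sequence is one contiguous stretch of the full sequence, and returns `None` otherwise, for example when there is an internal gap.

```python
from typing import Optional, Tuple


def locate_missing_termini(full_seq: str, parsed_seq: str) -> Optional[Tuple[str, str]]:
    """Return (residues missing before, residues missing after) the parsed span.

    Returns None if parsed_seq is not a contiguous substring of full_seq,
    in which case a real sequence alignment would be needed instead.
    """
    start = full_seq.find(parsed_seq)
    if start == -1:
        return None
    end = start + len(parsed_seq)
    return full_seq[:start], full_seq[end:]


# The two 7a4d sequences quoted in this issue.
full_seq = (
    "QVQLQESGGGLVQPGGSLRLSCAAPGFRLDNYVIGWFRQAPGKEREGVSCISSSAGSTYYADSVKGRF"
    "TISRDNAKNTVYLQMNSLKPEDTAVYYCATACYSSYVTYWGQGTQVTVSSGRYPYDVPDYGSGRA"
)
parsed_seq = (
    "QLQESGGGLVQPGGSLRLSCAAPGFRLDNYVIGWFRQAPGKEREGVSCISSSAGSTYYADSVKGRF"
    "TISRDNAKNTVYLQMNSLKPEDTAVYYCATACYSSYVTYWGQGTQVTVSSGR"
)

result = locate_missing_termini(full_seq, parsed_seq)
if result is None:
    print("parsed sequence is not a contiguous substring of the full sequence")
else:
    missing_before, missing_after = result
    print(f"missing before the parsed span: {missing_before!r}")
    print(f"missing after the parsed span:  {missing_after!r}")
```

A substring search is enough when only the termini are unresolved. Chains with unresolved internal loops would need a proper alignment.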

Additional Observations

We also noticed that the cached MSAs in data/pdb_data/data_caches/msa/train_msas/7a4d-assembly1A_protein.a3m are computed based on the latter sequence, which excludes the unresolved residues.
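
One way to check which sequence a cached MSA was built from is to read the first record of the `.a3m` file, which by convention holds the query. The sketch below assumes that convention applies to these caches. `parse_a3m_query` is a hypothetical helper written for this comment, not a function from the repository.

```python
from typing import Iterable


def parse_a3m_query(lines: Iterable[str]) -> str:
    """Return the sequence of the first record in an A3M/FASTA-like file.

    Blank lines and lines starting with '#' (some tools write such a header
    line) are skipped. Reading stops at the second '>' header.
    """
    chunks = []
    in_first_record = False
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if in_first_record:
                break  # second record reached; the query is complete
            in_first_record = True
            continue
        if in_first_record:
            chunks.append(line)
    return "".join(chunks)


# Usage (path as given in this issue):
# with open("data/pdb_data/data_caches/msa/train_msas/7a4d-assembly1A_protein.a3m") as fh:
#     query = parse_a3m_query(fh)
# print(len(query), query)
```

Comparing the returned query with the full and the parsed sequence shows which of the two the MSA was computed from.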

Request for Assistance

Is there a solution to include the unresolved residues in the parsed sequence? Any guidance or help with this issue would be greatly appreciated.

Best regards, Shaoning

amorehead commented 1 month ago

Hi, @v-shaoningli. I'm glad to see you've found this work useful so far.

As you said, mmcif_parsing relies initially on Biopython to parse the mmCIF input files' metadata, after which we manually collect all atoms associated with coordinate data here (following AF2's parsing logic). This parsing logic is quite complex to account for numerous edge cases that can arise when working with heterogeneous PDB complexes. It's possible to modify this function to accommodate the use case you've outlined above, but I will warn you that other side effects may easily "leak" into the downstream components of the codebase without rigorous unit testing afterwards.

If you have additional questions along the way, let me know. Best of luck.

Best, Alex