steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0
695 stars, 92 forks

Inquiry on Establishing Correspondence between 3Di Structure and Amino Acid Sequences in the Database #220

Closed: CNwangbin closed this issue 6 months ago

CNwangbin commented 6 months ago

I am currently using Foldseek for protein structure analysis. It is a significant piece of work that has made exploring protein structures in the post-AlphaFold era much more convenient. I would now like to determine the exact one-to-one correspondence between a protein's amino acid sequence and its 3Di structure sequence. For example, I recently downloaded the AlphaFold Swiss-Prot database (from https://foldseek.steineggerlab.workers.dev/afdb_swissprot.tar.gz), and after extracting the archive I found two key files named afdb_swissprot and afdb_swissprot_ss.

Based on my interpretation, the afdb_swissprot file contains the amino acid sequences, while the afdb_swissprot_ss file contains the corresponding 3Di structure sequences. It seems that the same line in each file corresponds to the same protein. I would also like to obtain the AlphaFold identifier or UniProt identifier for each protein. Could you kindly confirm whether my understanding of the file contents is accurate? If so, could you explain how I can obtain the AlphaFold or UniProt identifier for each protein in the database? Finally, is there an associated metadata file that contains further details?

I appreciate your assistance in this matter and look forward to your guidance. Thank you for your time and consideration.

milot-mirdita commented 6 months ago

You can follow the instructions in this GitHub issue to create "normal" FASTA files: https://github.com/steineggerlab/foldseek/issues/201
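Once both databases have been exported to FASTA as that issue describes, the amino acid and 3Di records can be matched by header rather than by line position. The following is a minimal sketch of that pairing, not the maintainers' method. It assumes that both FASTA files share the same header for the same protein, that each header starts with an AlphaFold DB entry name of the form AF-<UniProt accession>-F<fragment>-model_v<version>, and that a 3Di sequence has one state per residue. The file names afdb_swissprot.fasta and afdb_swissprot_ss.fasta and all function names are hypothetical.

```python
import re
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Assumption: headers begin with an AlphaFold DB entry name such as
# AF-<UniProt accession>-F<fragment>-model_v<version>.
AFDB_ID = re.compile(r"AF-([A-Z0-9]+)-F\d+-model_v\d+")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def entry_id(header: str) -> str:
    """Use the first whitespace-delimited token of the header as the key."""
    parts = header.split()
    return parts[0] if parts else ""


def uniprot_accession(header: str) -> Optional[str]:
    """Extract the UniProt accession from an AlphaFold-style header, if any."""
    match = AFDB_ID.search(header)
    return match.group(1) if match else None


def pair_sequences(
    aa_lines: Iterable[str], ss_lines: Iterable[str]
) -> Dict[str, Tuple[str, str]]:
    """Map each entry id to its (amino acid, 3Di) sequence pair."""
    ss_by_id = {entry_id(h): seq for h, seq in read_fasta(ss_lines)}
    pairs = {}
    for header, aa in read_fasta(aa_lines):
        key = entry_id(header)
        ss = ss_by_id.get(key)
        if ss is None:
            continue  # no 3Di record with a matching header
        if len(ss) != len(aa):
            raise ValueError(f"length mismatch for {key}")
        pairs[key] = (aa, ss)
    return pairs


if __name__ == "__main__":
    # Hypothetical file names; replace with the FASTA files you exported.
    with open("afdb_swissprot.fasta") as aa_fh, \
            open("afdb_swissprot_ss.fasta") as ss_fh:
        for key, (aa, ss) in pair_sequences(aa_fh, ss_fh).items():
            print(key, uniprot_accession(key), len(aa), sep="\t")
```

Keying on the header instead of the line number keeps the pairing correct even if the two exports list entries in a different order. The length check reports any entry whose 3Di string does not align residue by residue with its amino acid sequence.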

CNwangbin commented 6 months ago

Thanks for your quick reply. It helps me a lot.