steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

How to download AFDB clusters? #275

Open jzhanghzau opened 1 month ago

jzhanghzau commented 1 month ago

Hi,

First of all, thanks for your amazing work!

I would like to use the AFDB clusters for some analysis. The fields I need are the representative FASTA sequence and the cluster size; it would also be nice to have the MSA file for each representative sequence. Which dataset should I download from the Foldseek server, and is there a detailed description of these datasets?

Looking forward to your reply.

Thank you.

JJ

jzhanghzau commented 1 month ago

Ah, it seems I can download the file below and then calculate the size of the clusters. Based on their size, I can perform some filtering and subsequently retrieve the FASTA files through the UniProt API. Is that correct? By the way, are both the entryID and repID UniProt IDs? Thanks!

[Attached screenshot: Screen Shot 2024-05-19 at 17 01 05]

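The size calculation described above could be sketched roughly as follows. This is only an illustration under assumptions: it supposes the downloaded file is tab-separated with one member per line, the entry ID in the first column and the representative ID in the second. The column positions, the function names `cluster_sizes` and `filter_by_size`, and the toy IDs are all hypothetical and should be checked against the real file.

```python
import csv
import io
from collections import Counter


def cluster_sizes(tsv_file, entry_col=0, rep_col=1):
    """Count how many entries map to each representative ID.

    Assumes a tab-separated file with one member per line; the
    column indices are assumptions, not a documented layout.
    """
    sizes = Counter()
    for row in csv.reader(tsv_file, delimiter="\t"):
        if len(row) <= max(entry_col, rep_col):
            continue  # skip blank or truncated lines
        sizes[row[rep_col]] += 1
    return sizes


def filter_by_size(sizes, min_size):
    """Keep only representatives whose cluster has at least min_size members."""
    return {rep: n for rep, n in sizes.items() if n >= min_size}


# Toy data with made-up IDs (entry ID, representative ID).
demo = io.StringIO("REP1\tREP1\nMEM1\tREP1\nREP2\tREP2\n")
sizes = cluster_sizes(demo)
print(filter_by_size(sizes, min_size=2))  # {'REP1': 2}
```

For a real download, the same function can be given an open file handle, for example `with open(path, newline="") as fh: sizes = cluster_sizes(fh)`. Whether a representative is listed as its own entry affects the counts, so it is worth inspecting a few lines first.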
yeojingi commented 1 month ago

Hi JJ,

As I understand it, you are looking for data that contain (1) the FASTA sequences of the representatives and (2) the cluster sizes, and you are wondering whether the IDs in file no. 1 are UniProt IDs.

First, we do not provide the raw sequence data for the representatives. As you found, you can take the UniProt IDs of the representatives and retrieve the sequences through any UniProt API.
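As one possible way to do that retrieval, here is a minimal sketch that uses the public UniProt REST endpoint `https://rest.uniprot.org/uniprotkb/<accession>.fasta` and only the Python standard library. The helper names `fasta_url`, `parse_fasta`, and `fetch_fasta` are made up for this example, and it assumes each representative ID is a UniProtKB accession that still resolves (obsolete entries may not return a sequence).

```python
from urllib.parse import quote
from urllib.request import urlopen

# Public UniProt REST endpoint for a single entry in FASTA format.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{acc}.fasta"


def fasta_url(accession):
    """Build the FASTA download URL for one accession."""
    return UNIPROT_FASTA_URL.format(acc=quote(accession.strip()))


def parse_fasta(text):
    """Parse a single-record FASTA string into (header, sequence)."""
    header, seq = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            header = line[1:]
        elif line.strip():
            seq.append(line.strip())
    return header, "".join(seq)


def fetch_fasta(accession, timeout=30):
    """Download and parse one entry; needs network access."""
    with urlopen(fasta_url(accession), timeout=timeout) as resp:
        return parse_fasta(resp.read().decode("utf-8"))
```

Calling `fetch_fasta(rep_id)` for each representative works for a handful of IDs. For many thousands of representatives, a batch approach such as the UniProt ID mapping service or a bulk download is likely a better fit than one request per ID.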

We provide the cluster information in file no. 2. The caveat is that it covers only the Foldseek clusters. If you want to include sequence cluster members, you have to compute that yourself.

Lastly, the IDs in the picture you attached are UniProt IDs.

Hope this helps.

Jingi Yeo

jzhanghzau commented 1 month ago

Thanks!