steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

AFDB cluster Reps for Alphafold/UniProt50-minimal #245

Closed timghaly closed 8 months ago

timghaly commented 8 months ago

Thanks Foldseek team for this great tool and the amazing work clustering the AFDB. I am wondering where I can find out which AFDB cluster each of the AFDB50 representative sequences in the Foldseek database 'Alphafold/UniProt50-minimal' belongs to. The 1-AFDBClusters-entryId_repId_taxId.tsv.gz file from https://afdb-cluster.steineggerlab.workers.dev/ has the info that I'm after, but it seems that it has not been updated to match the increase in AFDB50 size. The number of member sequences in '1-AFDBClusters-entryId_repId_taxId.tsv.gz' is ~30 million, while there are ~50 million protein sequences in the Alphafold/UniProt50-minimal database. Do you have the cluster-member relationships for the remaining 20 million sequences?

Many thanks for your help!

Kindest regards, Tim

timghaly commented 8 months ago

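My apologies, I just realised that the AFDB50 includes proteins that did not end up in any AFDB cluster, because fragments and Foldseek cluster singletons were removed before the final clusters were reported. That accounts for the ~20 million sequences missing from the TSV file, so I think that solves my issue.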

Cheers, Tim
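
For anyone who lands here with the same question, the check described above can be sketched in Python. This is only a rough sketch under stated assumptions: it assumes the TSV has tab-separated columns in the order entryId, repId, taxId (as the file name suggests), and that the identifiers you pass in use the same format as the TSV, so IDs taken from the Foldseek database may need normalising first. The function names `load_cluster_map` and `split_by_cluster_membership` are hypothetical helpers, not part of Foldseek.

```python
import csv
import gzip
from typing import Dict, Iterable, List, Tuple


def load_cluster_map(tsv_gz_path: str) -> Dict[str, str]:
    """Map each member entry ID to its cluster representative ID.

    Assumption: tab-separated columns in the order entryId, repId, taxId.
    """
    member_to_rep: Dict[str, str] = {}
    with gzip.open(tsv_gz_path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            member_to_rep[row[0]] = row[1]
    return member_to_rep


def split_by_cluster_membership(
    entry_ids: Iterable[str], member_to_rep: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Separate IDs that have a cluster representative from those that do not.

    IDs absent from the map are the ones that were not assigned to any
    reported cluster (for example removed fragments or singletons).
    """
    clustered: Dict[str, str] = {}
    unclustered: List[str] = []
    for entry_id in entry_ids:
        rep = member_to_rep.get(entry_id)
        if rep is None:
            unclustered.append(entry_id)
        else:
            clustered[entry_id] = rep
    return clustered, unclustered
```

Loading ~30 million rows into a dict needs a few GB of memory; if that is a problem, streaming the file and checking against a set of only the IDs of interest would be a lighter alternative.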

martin-steinegger commented 8 months ago

Thank you!