steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

How can I extract uniprot ids and corresponding foldseek sequences from a pre-generated database? #201

Open LTEnjoy opened 8 months ago

LTEnjoy commented 8 months ago

Hi!

Thank you for your great work! I would like to know whether I can download a pre-generated database and generate a FASTA file from it containing all protein names and their corresponding Foldseek sequences.

For example, from the alphafold_swissprot database, I want to extract all UniProt IDs and Foldseek sequences and write them to a FASTA file like:

uniprot_id_1 xxxxxxxxxxxxxxxxxxx

uniprot_id_2 xxxxxxxxxxxxxxxxxxxxxxxxxx

Thank you in advance and I'm looking forward to your reply!

milot-mirdita commented 8 months ago

You can use createsubdb with a list of accessions and then call convert2fasta to make a FASTA file:

foldseek createsubdb accession_list alphafold_swissprot afsp_subset --id-mode 1
foldseek convert2fasta afsp_subset afsp_subset.fasta

Please check that the accessions you pass are in the same format as the ones stored in the second column of the alphafold_swissprot.lookup file.
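
One way to run this check before calling createsubdb is to compare the accession list against the second column of the .lookup file. The snippet below is a minimal sketch and not part of Foldseek: `missing_accessions` is a hypothetical helper name, and it assumes the .lookup file is whitespace-separated with the accession in column 2 and that the list holds one accession per line.

```shell
# missing_accessions LOOKUP LIST
# Print every accession in LIST that does not occur in column 2 of LOOKUP.
# (Hypothetical helper; file layout as described in the lead-in is an assumption.)
missing_accessions() {
    awk '
        NR == FNR { seen[$2] = 1; next }    # first file: remember stored accessions
        NF && !($1 in seen) { print $1 }    # second file: report unmatched entries
    ' "$1" "$2"
}

# Example call (file names taken from the commands above):
#   missing_accessions alphafold_swissprot.lookup accession_list
```

If the call prints nothing, every accession in the list was found. Any printed lines are entries to reformat, or entries that are not in the database at all.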

LTEnjoy commented 8 months ago

Thanks for your quick reply! I tried the commands above and they did generate a FASTA file!

It is slightly different from what I expected, though: I want the sequences encoded by Foldseek, not the residue sequences. Could you tell me how to generate that kind of FASTA file?

Thank you again!

milot-mirdita commented 8 months ago

You mean the 3Di sequences?

foldseek createsubdb accession_list alphafold_swissprot_ss afsp_subset_ss --id-mode 1
foldseek lndb alphafold_swissprot_h afsp_subset_ss_h
foldseek convert2fasta afsp_subset_ss afsp_subset_ss.fasta
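
The resulting FASTA file can be turned into the one-entry-per-line layout asked for at the top of the thread (identifier, then sequence). This is a rough sketch rather than a Foldseek feature: `fasta_to_table` is a hypothetical helper name, and it assumes the identifier is the first whitespace-delimited token of each header line.

```shell
# fasta_to_table FASTA
# Print "identifier sequence" on one line per FASTA record.
# (Hypothetical helper; not a Foldseek command.)
fasta_to_table() {
    awk '
        /^>/ {
            if (id != "") print id, seq    # flush the previous record
            id = substr($1, 2)             # header token without the leading ">"
            seq = ""
            next
        }
        { seq = seq $0 }                   # join sequences wrapped over several lines
        END { if (id != "") print id, seq }
    ' "$1"
}

# Example call (file name taken from the command above):
#   fasta_to_table afsp_subset_ss.fasta > afsp_subset_ss.txt
```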

LTEnjoy commented 8 months ago

That's exactly what I want!

Thank you very much! Have a nice day!

LTEnjoy commented 8 months ago

Hello,

When I tried the command foldseek createsubdb accession_list alphafold_swissprot_ss afsp_subset_ss --id-mode 1 on afdb50, I got these errors:

[image]

Could you tell me how I can fix this problem? I want to generate the 3Di sequences for all of UniProt from this database.

milot-mirdita commented 8 months ago

I think you first have to run:

ln -s alphafold_swissprot.lookup alphafold_swissprot_ss.lookup

LTEnjoy commented 8 months ago

I just tried this command, but the errors persist. [image]

Also, here is part of my accession_list.txt: [image]

milot-mirdita commented 8 months ago

Could you please post all commands you executed (preferably as text and not as screenshots)? I am not sure what's going on currently.

LTEnjoy commented 8 months ago

Hi,

I think I found the problem. afdb50 contains only about 50M sequences after clustering, but I need to generate sequences for the whole UniProt database (~200M sequences). So I downloaded the afdb database, which I think should solve the problem.
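
When every entry of a database is wanted rather than a subset, the createsubdb step may not be needed, and the lndb and convert2fasta commands from earlier in the thread could be applied to the full database directly. The sketch below only prints those two commands for review instead of running them. `print_3di_export_commands` is a hypothetical helper name, the prefix afdb is a placeholder for whatever name the database was downloaded under, and whether this works unchanged on a database of that size is an assumption.

```shell
# print_3di_export_commands DB
# Echo the commands that would export all 3Di sequences of database prefix DB.
# (Hypothetical helper; it prints the commands and does not run Foldseek.)
print_3di_export_commands() {
    # Link the header database so the 3Di database can resolve entry names.
    printf 'foldseek lndb %s_h %s_ss_h\n' "$1" "$1"
    # Write the 3Di sequences to a FASTA file.
    printf 'foldseek convert2fasta %s_ss %s_ss.fasta\n' "$1" "$1"
}

# Example call (afdb is a placeholder prefix):
#   print_3di_export_commands afdb
```

Once the printed commands look right for the local setup, they can be run by hand.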