steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

Create human-readable taxonomy lookup table from precomputed database #266

Open cvigilv opened 2 months ago

cvigilv commented 2 months ago

I'm currently using foldseek to prepare some datasets, and I would like to check whether the taxonomic information in the Alphafold/Proteome database matches what I obtained from the AlphaFold FTP server.

Is there any way to convert the binary _taxonomy file into a tab-separated values (TSV) file?

milot-mirdita commented 1 month ago

The easiest workaround for this is probably to slightly abuse addtaxonomy:

mmseqs databases UniProtKB/Swiss-Prot sprot tmp
MMSEQS_FORCE_MERGE=1 mmseqs addtaxonomy sprot sprot_h out
tr -d '\000' < out > sprot_headers_with_taxonomy.tsv
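
A minimal sketch of the final step, assuming `out` is the single merged file written by addtaxonomy and that its entries are separated by NUL bytes. The helper names `strip_nul` and `count_fields` are hypothetical; only `tr` and `awk` are used. Note that `tr` reads from standard input, so the file has to be redirected in rather than passed as an argument.

```shell
# Hypothetical helper: remove the NUL bytes that separate entries
# in a merged database file and write the result to stdout.
strip_nul() {
    tr -d '\000' < "$1"
}

# Hypothetical helper: print "<field count> <number of lines>" for each
# distinct tab-separated field count, as a quick sanity check of the TSV.
count_fields() {
    awk -F '\t' '{ n[NF]++ } END { for (k in n) print k, n[k] }' "$1"
}

# Usage, assuming `out` comes from the addtaxonomy call above:
# strip_nul out > sprot_headers_with_taxonomy.tsv
# count_fields sprot_headers_with_taxonomy.tsv
```

If `count_fields` reports a single field count for all lines, the file is at least consistently delimited. Several different counts suggest that some headers contain tabs or that the taxonomy columns were not appended to every entry.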

Adding a module that exports the nodes/names taxonomy dmp files would also be possible, but that would need to come from an external contribution, as I don't currently have time to implement it.

milot-mirdita commented 1 month ago

That also works the same way for foldseek, just use a Foldseek database and the foldseek binary instead of mmseqs.

cvigilv commented 1 month ago

Thanks! I'll give it a test and come back if I encounter any problems.

Andy-B-123 commented 2 weeks ago

Hi @cvigilv, could I check if this worked for you? I am trying to follow the process but am getting an unparseable output file. My foldseek version is the binary from ~2 weeks ago.

foldseek databases UniProtKB/Swiss-Prot sprot tmp
FOLDSEEK_FORCE_MERGE=1 ../foldseek/bin/foldseek addtaxonomy sprot sprot_h out

Output:

addtaxonomy sprot sprot_h out

MMseqs Version:                 62a2558bcad0d78976f6275b896afcd7a38136a9
Column with taxonomic lineage   0
LCA ranks
Extract mode                    2
Compressed                      0
Threads                         128
Verbosity                       3

[=================================================================] 100.00% 542.38K 1s 513ms
Taxonomy for 542378 entries not found and 0 are deleted
Time for merging to out: 0h 0m 1s 330ms
Time for processing: 0h 0m 4s 344ms

Parsing the 'out' file with the tr method doesn't work. When I open it with less, I get a warning that the file is binary, and the content looks uniformly malformed, with nothing to delimit or read. I've tried setting the force-merge variable with both the MMSEQS_ and FOLDSEEK_ prefixes, and I've also tried exporting it.

I am able to get the mmseqs version working; I just can't seem to get the Foldseek one working.