Running mmseqs databases creates a significantly reduced tax database

soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite

GNU General Public License v3.0


Expected Behavior

Run mmseqs databases to create a taxonomic database that uses recent data from UniRef, e.g. UniRef100.

Current Behavior

A recently downloaded UniRef100 FASTA file has 342650444 entries; however, mmseqs databases creates a taxonomic database that has only 1462990. I didn't get any error about a step failing and couldn't find any documentation about a reduced database.
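The size comparison above can be reproduced with a short script. This is a minimal sketch, assuming the FASTA count is the number of header lines and the database count is the number of lines in the database's .index file (MMseqs2 writes one index line per entry). The report does not say how the two numbers were obtained, so both helper functions and the file name uniref100.fasta are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing entry counts; not part of MMseqs2.

# Count FASTA records by counting header lines (lines starting with '>').
count_fasta_entries() {
    # grep -c exits non-zero when the count is 0, so ignore its exit status.
    grep -c '^>' "$1" || true
}

# Count entries in an MMseqs2 database via its .index file
# (assumption: one line per entry in <db>.index).
count_db_entries() {
    wc -l < "$1.index" | tr -d ' '
}

# Example usage (paths are assumptions; adjust to your own layout):
#   count_fasta_entries uniref100.fasta
#   count_db_entries databases_mmseqs/uniref100
```

If the download is still gzip-compressed, decompress it first or pipe it through zcat into grep instead of passing the file name.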

Steps to Reproduce (for bugs)

mmseqs databases UniRef100 databases_mmseqs/uniref100 tmp2

Context

I am trying to run easy-taxonomy on a FASTA file using UniRef100. I got a result using the taxonomy database generated by mmseqs databases, but the taxonomic assignments were so poor that I started to investigate. Eventually I realized that the database was missing many of the UniRef entries I expected and that it is significantly smaller than a recent UniRef100 FASTA download.

rm uniref100_taxonomy uniref100_*.dmp uniref100_mapping
mmseqs prefixid uniref100_h uniref100_h.tsv --tsv
awk '{ match($0, / OX=[0-9]+ /); if (RLENGTH != -1) { print $1"\t"substr($0, RSTART+4, RLENGTH-5); next; } match($0, / TaxID=[0-9]+ /); print $1"\t"substr($0, RSTART+7, RLENGTH-8); }' uniref100_h.tsv | LC_ALL=C sort -n > uniref100_mapping
rm uniref100_h.tsv
mmseqs createtaxdb uniref100 tmp
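To see what the awk step in the commands above does, it can be wrapped in a function and run on a couple of sample lines. This is only a sketch: the function name extract_taxids and the sample headers are made up for illustration, and the input is assumed to be one "key, tab, header" line per entry, as written by the prefixid step. The awk program itself is copied unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the awk program from the commands above.
# Reads "key<TAB>header" lines on stdin and writes "key<TAB>taxid" lines,
# sorted numerically by key.
extract_taxids() {
    awk '{
        # Prefer a UniProtKB-style " OX=<digits> " field when present.
        match($0, / OX=[0-9]+ /);
        if (RLENGTH != -1) {
            print $1"\t"substr($0, RSTART+4, RLENGTH-5);
            next;
        }
        # Otherwise fall back to a UniRef-style " TaxID=<digits> " field.
        match($0, / TaxID=[0-9]+ /);
        print $1"\t"substr($0, RSTART+7, RLENGTH-8);
    }' | LC_ALL=C sort -n
}

# Example usage with two made-up header lines:
#   printf '0\tUniRef100_A n=1 Tax=Frog virus 3 TaxID=654924 RepID=X\n' | extract_taxids
#   printf '1\tsp|P1|N Desc OS=Homo sapiens OX=9606 GN=ABC\n' | extract_taxids
```

Both patterns require a space after the number, so a TaxID= or OX= field at the very end of a header will not match. A line where neither pattern matches is still printed, with an empty taxon column.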
