soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Running mmseqs databases creates a significantly reduced tax database #684

Open GiotaKyr opened 1 year ago

GiotaKyr commented 1 year ago

Expected Behavior

Running mmseqs databases should create a taxonomic database built from recent UniRef data, e.g., UniRef100.

Current Behavior

A recently downloaded UniRef100 FASTA file has 342650444 entries; however, mmseqs databases creates a taxonomic database that has only 1462990. I didn't get any error about a step failing and couldn't find any information about a reduced database.
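One way the two counts being compared could be obtained is sketched below. It assumes that every FASTA record starts with a `>` header line and that an MMseqs2 database's `.index` file holds one line per entry. The function names and the tiny demo files are made up for illustration and are not part of MMseqs2.

```shell
#!/bin/sh
# Count FASTA records: each record starts with a '>' header line.
count_fasta_entries() {
    grep -c '^>' "$1"
}

# Count entries in an MMseqs2 database via its .index file
# (assumption: one line per entry).
count_index_entries() {
    wc -l < "$1" | tr -d ' '
}

# Tiny made-up inputs so the sketch runs on its own.
demo_dir=$(mktemp -d)
printf '>a\nMKV\n>b\nMLL\n>c\nMAA\n' > "$demo_dir/example.fasta"
printf '0\t0\t10\n1\t10\t10\n' > "$demo_dir/example.index"

fasta_count=$(count_fasta_entries "$demo_dir/example.fasta")
index_count=$(count_index_entries "$demo_dir/example.index")
echo "FASTA entries: $fasta_count"
echo "DB entries:    $index_count"
```

On real data the same functions would be pointed at the downloaded FASTA file and at `databases_mmseqs/uniref100.index`, assuming the index file sits next to the database created by the command in this report.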

Steps to Reproduce (for bugs)

mmseqs databases UniRef100 databases_mmseqs/uniref100 tmp2

Context

I am trying to run easy-taxonomy on a FASTA file against UniRef100. I got a result using the taxonomy database generated by mmseqs databases, but the taxonomic assignments were so poor that I started to investigate. Eventually I realized that the database was missing many of the UniRef entries I expected and that it is significantly smaller than a recent UniRef100 FASTA download.

milot-mirdita commented 1 year ago

Can you try to repeat the last step of the taxonomic db creation with the following commands:

rm uniref100_taxonomy uniref100_*.dmp uniref100_mapping
mmseqs prefixid uniref100_h uniref100_h.tsv --tsv
awk '{ match($0, / OX=[0-9]+ /); if (RLENGTH != -1) { print $1"\t"substr($0, RSTART+4, RLENGTH-5); next; } match($0, / TaxID=[0-9]+ /); print $1"\t"substr($0, RSTART+7, RLENGTH-8); }' uniref100_h.tsv | LC_ALL=C sort -n > uniref100_mapping
rm uniref100_h.tsv
mmseqs createtaxdb uniref100 tmp

How many lines does uniref100_mapping have after you execute this?
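For anyone following along, the awk step above can be tried on a couple of header lines before running it on the full database. This is a minimal sketch: the two sample lines are made up, and it assumes that `mmseqs prefixid ... --tsv` writes an internal key, a tab, and then the original header. The awk program is copied unchanged from the commands above, and the last line shows how the requested line count can be read off.

```shell
#!/bin/sh
# Hypothetical sample of the header TSV: key <TAB> original header.
workdir=$(mktemp -d)
printf '0\tUniRef100_Q6GZX4 Putative transcription factor 001R n=1 Tax=Frog virus 3 TaxID=654924 RepID=001R_FRG3G\n' > "$workdir/sample_h.tsv"
printf '1\tsp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2\n' >> "$workdir/sample_h.tsv"

# Same awk program as above: prefer a UniProt-style " OX=<id> " field,
# otherwise fall back to a UniRef-style " TaxID=<id> " field.
awk '{ match($0, / OX=[0-9]+ /); if (RLENGTH != -1) { print $1"\t"substr($0, RSTART+4, RLENGTH-5); next; } match($0, / TaxID=[0-9]+ /); print $1"\t"substr($0, RSTART+7, RLENGTH-8); }' "$workdir/sample_h.tsv" | LC_ALL=C sort -n > "$workdir/sample_mapping"

# Each output line is: key <TAB> taxon id.
cat "$workdir/sample_mapping"

# Line count of the mapping, i.e. the number asked for above.
mapping_lines=$(wc -l < "$workdir/sample_mapping" | tr -d ' ')
echo "mapping lines: $mapping_lines"
```

On the real files, `wc -l uniref100_mapping` gives the number in question. Comparing it with `wc -l uniref100.index` would show whether every database entry received a taxon id, assuming the index file holds one line per entry.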