Open GiotaKyr opened 1 year ago
Can you try to repeat the last step of the taxonomic db creation with the following commands:
rm uniref100_taxonomy uniref100_*.dmp uniref100_mapping
mmseqs prefixid uniref100_h uniref100_h.tsv --tsv
awk '{ match($0, / OX=[0-9]+ /); if (RLENGTH != -1) { print $1"\t"substr($0, RSTART+4, RLENGTH-5); next; } match($0, / TaxID=[0-9]+ /); print $1"\t"substr($0, RSTART+7, RLENGTH-8); }' uniref100_h.tsv | LC_ALL=C sort -n > uniref100_mapping
rm uniref100_h.tsv
mmseqs createtaxdb uniref100 tmp
How many lines does the uniref100_mapping have after you execute this?
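To make it easier to see what the awk step extracts, here is a minimal sketch that runs the same awk program on two made-up input lines. The sample lines are assumptions about what uniref100_h.tsv looks like (a numeric key, a tab, then the header text); they are not taken from a real database. The function name extract_taxids is hypothetical and only used for illustration.

```shell
# Same awk program as in the command above, wrapped in a function so it
# can be fed sample input. It prints "<key><TAB><taxid>", preferring an
# " OX=<digits> " field and falling back to " TaxID=<digits> ".
extract_taxids() {
  awk '{ match($0, / OX=[0-9]+ /); if (RLENGTH != -1) { print $1"\t"substr($0, RSTART+4, RLENGTH-5); next; } match($0, / TaxID=[0-9]+ /); print $1"\t"substr($0, RSTART+7, RLENGTH-8); }'
}

# Hypothetical sample lines (assumed layout: key, tab, header).
printf '0\tUniRef100_Q6GZX4 Putative transcription factor 001R n=1 Tax=Frog virus 3 TaxID=654924 RepID=001R_FRG3G\n' | extract_taxids
printf '1\tsp|P12345|NAME_HUMAN Example protein OS=Homo sapiens OX=9606 GN=ABC PE=1 SV=1\n' | extract_taxids
```

Each input line should yield exactly one output line, so the number of lines in uniref100_mapping is expected to match the number of headers in uniref100_h.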
Expected Behavior
Running mmseqs databases should create a taxonomic database that uses recent data from UniRef, e.g. UniRef100.
Current Behavior
A recently downloaded UniRef100 FASTA file has 342650444 entries; however, running mmseqs databases creates a taxonomic database that has only 1462990. I didn't get any error indicating that a step failed, and I couldn't find any information about the database being reduced.
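For anyone wanting to reproduce the comparison, the counts can be obtained roughly as sketched below. This assumes an uncompressed FASTA file, that the MMseqs2 .index file holds one line per database entry, and that the taxonomy mapping sits next to the database as a _mapping file. The file paths and function names are hypothetical and need adjusting.

```shell
# Count sequence records in a FASTA file (one '>' header line per record).
count_fasta_records() {
  grep -c '^>' "$1"
}

# Count lines in a text file, e.g. an MMseqs2 .index or _mapping file.
count_lines() {
  awk 'END { print NR }' "$1"
}

# Hypothetical paths; only runs if all three files are actually present.
fasta=uniref100.fasta
db=databases_mmseqs/uniref100
if [ -f "$fasta" ] && [ -f "$db.index" ] && [ -f "${db}_mapping" ]; then
  echo "FASTA records:   $(count_fasta_records "$fasta")"
  echo "DB entries:      $(count_lines "$db.index")"
  echo "Mapping entries: $(count_lines "${db}_mapping")"
fi
```

If the first two numbers agree but the mapping count is much smaller, the sequences were imported and only the taxonomy mapping step lost entries.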
Steps to Reproduce (for bugs)
mmseqs databases UniRef100 databases_mmseqs/uniref100 tmp2
Context
I am trying to run easy-taxonomy on a FASTA file using UniRef100. I got a result using the taxdb generated by mmseqs databases, but the taxonomic assignments were bad enough to make me suspicious. Eventually I realized that the database was missing many of the UniRef entries I expected and that it is significantly smaller than a recent UniRef100 FASTA download.