muellan / metacache

memory efficient, fast & precise taxnomomic classification system for metagenomic read mapping
GNU General Public License v3.0
57 stars 12 forks source link

Classification abundances output not showing all ranks #15

Closed punnettsun closed 3 years ago

punnettsun commented 3 years ago

Hi again,

I just ran MetaCache on a sample dataset. I used the following command: ./metacache query mydb.db test_genomes/ -out test_results.txt -lowest subspecies -abundances test_abundances.txt When I took a look at the test_abundances.txt file, I only see rows for domain, family, and subspecies abundances like below:

domain:Bacteria |   3   |   56893   |   0.123854%
family:Actinomycetaceae |   101 |   319 |   0.00069445%
subspecies:strain_158623    |   8058    |   1   |   2.17696e-06%
subspecies:strain_287365    |   8174    |   4422    |   0.00962652%
subspecies:strain_627946    |   8239    |   167 |   0.000363552%
subspecies:strain_623479    |   10643   |   246807  |   0.537289%

I do not see species, genus, order, class, or phylum abundances. However, when I ran the above command with -abundance-per phylum, I was able to get the abundances for all the phylum ranks. Then I decided to use -allranks command that was mentioned in the documentation to get all ranks but this always brings me to the help documentation on my command line. I am not sure how to get -allranks working. My main goal is to classify all my reads at all the different ranks. Please let me know how I can do this or if I am making any mistakes here.

Edit: I suppose -allranks only works with -precision and for that I would need the header information set correctly.

Thank you.

muellan commented 3 years ago

Hi,

Metacache generally shows the lowest possible rank/taxon, i.e., the most precise mapping it can make. The -lowest <rank> and -highest <rank> options only restrict the range along the taxonomic tree.

If the option -abundance-per <rank> is given, it tries to restrict abundance calculation to one rank only. This will put a second table below the abundance mapping table (after the line starting with "# estimated abundance ..."). If you want this for a number of different ranks then you could run it in a loop (loading the database only once) with a shell script:

#!/bin/bash
DATABASE="mydb.db"
INPUT="test_genomes/"
OPTIONS=""
COMMANDS=""
for RANK in subspecies species genus family phylum; do
    COMMANDS="${COMMANDS}${INPUT} -abundance-per ${RANK} -abundances abundances_${RANK}.txt -out results_${RANK}.txt ${OPTIONS}\n"
done
# pipe all commands into Metacache running in interactive mode
echo -e $COMMANDS | ./metacache query ${DATABASE}

If you want to calculate/visualize the abundances using some other tools (e.g., Krona) you probably want the full output of all assigned taxon ids on all ranks for each individual read using options -omit-ranks -taxids-only -separate-cols -lineages.

punnettsun commented 3 years ago

Great! Thank you for the shell script. It's just what I needed.

punnettsun commented 3 years ago

I used the script you provided and am seeing a FAIL message like below. The resulting files do have the contents inside but I do not know why it says file format is not recognized on the terminal.

metacache_error
punnettsun commented 3 years ago

This problem was solved after creating the database properly.