muellan / metacache

memory efficient, fast & precise taxnomomic classification system for metagenomic read mapping
GNU General Public License v3.0
57 stars 12 forks source link

Results Output Visualization #31

Closed punnettsun closed 2 years ago

punnettsun commented 2 years ago

Hi,

I built a custom database and have used the following command to query my paired end fastq files: ./metacache query Metacache_DB ${INPUT1} ${INPUT2} -pairfiles -abundances test_abundance.txt -lowest species -highest domain -out test_results.txt -threads ${THREAD_NUM} -omit-ranks -taxids-only -separate-cols -lineages

This produces an abundance text file and a results file. The results file looks something like below:

# TABLE_LAYOUT: query_header    |   taxid   |   taxid   |   taxid   |   taxid   |   taxid   |   taxid   |   taxid   |   taxid   |   taxid   |   taxid   |   taxid   |   taxid   |   taxid   |   taxid   |   taxid   |   taxid
READ:INFO:12701/1   |   10115   |   0   |   8909    |   0   |   0   |   0   |   8103    |   0   |   6795    |   5817    |   0   |   3719    |   0   |   0   |   2

I know that the first taxid column will contain the lowest possible rank for that read, so I am thinking of going through each read in the results file, extract the first taxid, and then obtain its corresponding higher ranks like genus, family, order, etc. using my custom taxonomic hierarchy file and add a count to each of those ranks. In this particular example, the taxid 10115 is the species Acetobacter persici. The genus would be Acetobacter, family Acetobacteraceae, order Acetobacterales, class Alphaproteobacteria, phylum Proteobacteria, and domain bacteria. I would add one count to each of these names, so by the end of my script run, I should have the read counts for each of the taxonomic ranks.

Is this the right way of going about this if I want to visualize my samples across various taxonomic ranks using an R package like phyloseq?

Thank you.

Funatiq commented 2 years ago

That seems reasonable, but I have not used phyloseq so I don't know which input it expects. However, we have used Krona before to visualize the abundances. For Krona you don't need to sum up each taxon yourself IIRC (see Importing NCBI Taxonomy IDs).