muellan / metacache

memory efficient, fast & precise taxonomic classification system for metagenomic read mapping
GNU General Public License v3.0

Interpretation of reported abundance table #10

Closed donovan-h-parks closed 4 years ago

donovan-h-parks commented 4 years ago

Hi. I'm running MetaCache query with the -abundances profile.tsv and -abundance-per species flags. It appears this writes two profiling results to profile.tsv: the full taxon profile and a species profile. However, these profiles do not appear to agree. For example, the taxon profile reports Streptococcus dysgalactiae subsp. equisimilis at 1.15% and no other results for S. dysgalactiae. The species profile reports S. dysgalactiae at 1.93%. Why is there a discrepancy?

Relevant lines from profile.tsv:

# query summary: number of queries mapped per taxon
# rank:name | taxid | number of reads | abundance
...
subspecies:Streptococcus dysgalactiae subsp. equisimilis    119602  80740   1.15343%
...
# estimated abundance (number of queries) per species
# rank:name | taxid | number of reads | abundance
...
species:Streptococcus dysgalactiae  1334    135610  1.93728%
...

Is the best prediction of the abundance of S. dysgalactiae by MetaCache 1.15% or 1.93%?

Thanks, Donovan

muellan commented 4 years ago

The first table represents the raw abundances based on the read mapping.

The second table shows the estimated abundance on a specific taxonomic rank. This works as follows (will be described in our upcoming paper about food ingredient detection):

For each taxon in the dataset we count the number of reads assigned to it. Taxa on lower levels than the requested taxonomic rank are pruned and their read counts are added to their respective parents, while reads from taxa on higher levels are distributed among their children in proportion to the weights of the sub-trees rooted at each child. After the redistribution the estimated number of reads and abundance percentages are returned as outputs.
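The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not MetaCache's actual code: the function name `estimate_abundance`, the three-level `RANKS` list, and the toy taxa are all hypothetical. It assumes every taxon sits exactly one rank below its parent, that a sub-tree's weight is the number of reads mapped to the taxa inside it, that reads stay where they are if no child has any weight, and that percentages are taken relative to the reads in the table.

```python
from collections import defaultdict

# Hypothetical rank order, most general first (real taxonomies have more ranks).
RANKS = ["genus", "species", "subspecies"]


def estimate_abundance(parent, rank, reads, target):
    """Estimate read counts per taxon at the rank `target`.

    parent: taxon -> parent taxon (None for the root)
    rank:   taxon -> rank name from RANKS
    reads:  taxon -> number of reads mapped directly to that taxon
    Returns taxon -> (estimated reads, percentage of all reads).
    """
    level = {t: RANKS.index(r) for t, r in rank.items()}
    target_level = RANKS.index(target)
    est = {t: float(reads.get(t, 0)) for t in parent}

    children = defaultdict(list)
    for t, p in parent.items():
        if p is not None:
            children[p].append(t)

    # 1) Prune taxa below the target rank, deepest first,
    #    adding their reads to their parents.
    for t in sorted(parent, key=lambda t: level[t], reverse=True):
        if level[t] > target_level:
            est[parent[t]] += est.pop(t)

    # 2) Sub-tree weight: reads mapped to a taxon plus all of its descendants.
    weight = dict(est)
    for t in sorted(est, key=lambda t: level[t], reverse=True):
        if parent[t] is not None:
            weight[parent[t]] += weight[t]

    # 3) Push reads of taxa above the target rank down to their children,
    #    in proportion to the children's sub-tree weights.
    for t in sorted(est, key=lambda t: level[t]):
        if level[t] < target_level:
            kids = [c for c in children[t] if c in est]
            total = sum(weight[c] for c in kids)
            if total > 0:  # assumption: otherwise the reads stay at this taxon
                share = est.pop(t)
                for c in kids:
                    est[c] += share * weight[c] / total

    all_reads = sum(est.values())
    return {t: (n, 100.0 * n / all_reads if all_reads else 0.0)
            for t, n in est.items()}


# Toy data (made up): genus G with species A and B, subspecies A1 under A.
parent = {"G": None, "A": "G", "B": "G", "A1": "A"}
rank = {"G": "genus", "A": "species", "B": "species", "A1": "subspecies"}
reads = {"G": 30, "A": 10, "B": 20, "A1": 50}
print(estimate_abundance(parent, rank, reads, "species"))
# A -> (82.5, 75.0), B -> (27.5, 25.0)
```

In the toy data, A first absorbs the 50 reads of its subspecies, then receives 60/80 of the 30 genus-level reads. This is the same effect that makes a species-level estimate larger than the raw count of a single subspecies.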

I will also add a more detailed explanation to the Markdown documentation of the output options.

muellan commented 4 years ago

So the best prediction would be the second table.

donovan-h-parks commented 4 years ago

Thanks for the quick response. Very helpful.

tothuhien commented 9 months ago

Hi, could I ask in this thread about one detail of how the reads are redistributed? How do you define the weight of each sub-tree in the taxonomic tree? Thanks

Funatiq commented 8 months ago

The weight is the number of reads mapped to taxa in the sub-tree.
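Written as a formula, under the assumption that the weights are computed from the mapped read counts before any redistribution: let $n_t$ be the number of reads mapped to taxon $t$, and let $T(c)$ be the sub-tree rooted at child $c$ (these symbols are introduced here for illustration only). A parent $p$ above the requested rank with children $c_1, \dots, c_k$ would then pass on its reads as follows:

```latex
w(c) = \sum_{t \in T(c)} n_t ,
\qquad
\Delta n_{c_i} = n_p \cdot \frac{w(c_i)}{\sum_{j=1}^{k} w(c_j)}
```

For example, a parent with 30 reads and two children of weights 60 and 20 would pass on $30 \cdot 60/80 = 22.5$ and $30 \cdot 20/80 = 7.5$ reads, respectively.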

tothuhien commented 8 months ago

Thank you for your prompt response!