muellan / metacache

memory efficient, fast & precise taxnomomic classification system for metagenomic read mapping
GNU General Public License v3.0
57 stars 12 forks source link

Interpreting abundance profile #26

Closed donovan-h-parks closed 2 years ago

donovan-h-parks commented 2 years ago

Hi,

I'm a little confused at how to interpret the -abundance file generated by MetaCache. In particular, I'm confused at how to interpret the results with -lower set to subspecies. In a simple mock I have, if I run MetaCache with -lower at the default I get Salmonella enterica subsp. enterica reported at 8.9%. However, if I run MetaCache with -lower set to subspecies, I get Salmonella enterica subsp. enterica reported at 1.9%.

I'm unclear at why this discrepancy occurs? Which result is generally consider more accurate?

Thanks, Donovan

donovan-h-parks commented 2 years ago

Hi,

Related to the above question, I understand that -abundance-per can be used to redistribute the assignment of reads based on a statistical argument. This seems useful and like it will often be the best estimate at any given rank. As such, is it possible to generate such results at multiple ranks (e.g. species, genus, family, ...). I'm often interested in results at multiple ranks.

Thanks, Donovan

Funatiq commented 2 years ago

Hi Donovan,

regarding your first question: Per default MetaCache considers the top 2 targets (sequences from different genomes) with the most hits in the database for each read. If both hit counts are similar, MetaCache uses the lowest common ancestor (LCA) of the two taxa for classification, which could be the common subspecies in your case. If you set -lowest to subspecies MetaCache considers the top 2 subspecies with most hits and determines the LCA if the hit counts are close. In your case targets with less hits than the top 2 may come from a different subspecies, but still get a relevant number of hits. These would be discarded on the sequence level (default) but are taken into account on the subspecies level. You can also use -maxcand to increase the number of considered candidate targets.

Regarding -abundance-per on multiple ranks: It's currently not possible to generate results for multiple ranks in a single run. I would suggest to use the interactive query mode for multiple runs with different -abundance-per settings. This way the database is loaded only once, which avoids a time consuming part of the query.

donovan-h-parks commented 2 years ago

Great - thanks.

donovan-h-parks commented 2 years ago

Hi,

How does one use "interactive query mode"? I can't seem to find documentation regarding this mode of operation.

Thanks.

donovan-h-parks commented 2 years ago

Nevermind - found it. :)