merenlab / merenlab.org

Web content for Meren Lab
http://merenlab.org
MIT License

removing 'relative abundance' from krakenuniq description #92

Closed (AstrobioMike closed 4 years ago)

AstrobioMike commented 4 years ago

Hiya!

So it turns out krakenuniq (and all the krakens) have never been about relative abundance, and it isn't a good idea to interpret their output that way. I've gotten pretty deep into this before and have been looking at it again recently, and a colleague was leaning towards using the krakenuniq output as relative abundances, in part because of the wording on this anvi'o workflows page. So I wanted to get that specific wording out of there. In case this is new info, I'll provide more details below. But if that was just an oversight/over-simplification, then feel free to stop reading here and accept my change :)

Details if wanted

kraken* were designed for classification/detection rather than relative abundance estimation. When kraken is unsure about a read, it bumps it up a taxonomic rank, which throws off the "relative abundance" at the lower rank because that read is no longer assigned there. This directly and artificially inflates the relative abundance of anything that is actually assigned at that level (while completely eliminating those that just happen to be equally similar to more than one thing in the database).
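Here is a toy sketch of that effect in Python. Everything in it is an assumption made up for illustration: the taxon names, the read counts, and the number of reads treated as unique to each species. It does not call kraken or parse its output; it only mimics the "ambiguous reads go up to the shared genus" behavior and then compares the species-level proportions against the true ones.

```python
from collections import Counter

# Hypothetical sample: true number of reads contributed by each species.
TRUE_READS = {"species_A": 600, "species_B": 300, "species_C": 100}

# Hypothetical number of those reads that match only that species.
# A and B are close relatives, so most of their reads are ambiguous;
# C has no close relative in this toy database.
UNIQUE_READS = {"species_A": 120, "species_B": 60, "species_C": 100}

# Hypothetical taxonomy: species -> genus.
GENUS = {"species_A": "genus_G", "species_B": "genus_G", "species_C": "genus_H"}


def classify_lca(true_reads, unique_reads, genus):
    """Assign unique reads to their species and ambiguous reads to the genus."""
    counts = Counter()
    for species, total in true_reads.items():
        unique = unique_reads[species]
        counts[species] += unique
        # Reads shared with a sibling species are bumped up one rank.
        counts[genus[species]] += total - unique
    return counts


def proportions(counts, taxa):
    """Proportion of each taxon among the reads assigned to the given taxa."""
    total = sum(counts[t] for t in taxa)
    return {t: counts[t] / total for t in taxa}


counts = classify_lca(TRUE_READS, UNIQUE_READS, GENUS)
species = list(TRUE_READS)

truth = proportions(TRUE_READS, species)
naive = proportions(counts, species)  # species-level counts read as abundance

for s in species:
    print(f"{s}: true {truth[s]:.1%}, naive species-level {naive[s]:.1%}")
print(f"reads stuck at genus_G: {counts['genus_G']}")
```

In this made-up sample, species_C is 10% of the reads but ends up as roughly a third of everything assigned at the species level, simply because most reads from A and B were bumped up to their genus.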

Jen Lu et al. created Bracken specifically to address this. As noted in their post here about it:

Last year we discovered that some people were using Kraken directly for abundance estimation – for estimating the relative proportions of species in a sample – and were publishing papers based on the assumption that Kraken’s output can be used this way. However, this is incorrect. If you give Kraken a set of metagenomic reads to classify, it will assign to each read the most specific label it can. Many times, though, these labels are not at the species level. For instance, if a 150bp read is 100% identical to two different species, Kraken will assign it to their lowest common ancestor (LCA), which could be at the genus level or higher. For a sample containing two or more highly similar species, this means that the number of species-specific reads may be far less than expected. (We should note that Kraken often assigns reads at the strain level as well.)

To address this issue, we developed Bracken: Bayesian Re-estimation of Abundance after Classification with KrakEN. Bracken uses a Bayesian algorithm and the Kraken classification results to estimate species-level or genus-level abundances for a metagenomic sample.

This is also the case for kraken2 (which works with bracken), and though it is not yet supported by bracken, krakenuniq is in the same boat. This is discussed a little in this krakenuniq issue, and in this bracken issue, though it has been inactive for some time.

Just to further confirm that krakenuniq wasn't redesigned in a way that makes it usable for relative abundance information, I point to this paper that came out recently evaluating centrifuge, clark, and krakenuniq. They note in the abstract:

Binary mixtures of bacteria showed all three reliably identified organisms down to 1% relative abundance, while only the relative abundance estimates of Centrifuge and CLARK were accurate.

Which they expound upon in the discussion:

In contrast to Centrifuge and CLARK’s abundance estimates, KrakenUniq classified the majority of S. flexneri and E. coli reads to the family (Enterbacteriacae) level, only estimating 11.6% relative abundance of S. flexneri when it was in fact 99.9% of the sample. KrakenUniq’s assignment of the majority of the S. flexneri/E. coli reads to a higher taxonomic level results from its strategy for taxonomic assignment of reads. Specifically, reads from closely related organisms in which a read that could be assigned to multiple species are instead assigned to the nearest common taxonomic level. Therefore, the KrakenUniq abundance estimates are not strictly comparable to CLARK and Centrifuge without further analysis and re-calibration.

So yeah, just wanted to remove the line saying krakenuniq was providing relative abundance information to help lessen the confusion out there about this :)

meren commented 4 years ago

Good catch! Thank you for the correction :)