Effect of Kraken2 confidence threshold on Bracken abundance estimation

jenniferlu717 / Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.

http://ccb.jhu.edu/software/bracken/index.shtml

GNU General Public License v3.0

273 stars 50 forks source link

Effect of Kraken2 confidence threshold on Bracken abundance estimation #154

Open manu-script opened 3 years ago

manu-script commented 3 years ago

Hi @jenniferlu717, Since Bracken calculates the probabilities of taxonomic assignment at different levels for reads of a certain length using the default Kraken2 confidence threshold of 0 during the database generation step, doesn't it require the Kraken2 report file provided for abundance estimation also be run with the same confidence threshold of 0? If we use a certain confidence threshold higher than 0, doesn't it alter the assumptions of Bracken regarding the probabilities of read classification at higher taxonomic levels? Does it make sense or the Kraken2 confidence thresholds doesn't matter for Bracken's estimation?

Thanks, Manu

andrewjmc commented 2 years ago

Manu, I was just thinking the same thing! In the manual, instructions are given to manually generate the data for bracken using kraken, and no confidence threshold is specified. It seems almost certain that the confidence threshold would have a large bearing.

A simple thought experiment shows that the probability of a given kmer being classified at genus level rather than species level will depend on the number of species.

If we're expecting coding sequences (and let's imagine same for all sequences) to be 95% identical between species within a genus, the average 35-mer (default size for kraken2) should have 1-2 differences between any two species, and the probability of any two species within a genus sharing a specific kmer should be 17% (0.95^35). If there are, let's say, 20 species within a genus, a given kmer from one species would (on average) be shared by 3 other species. I estimate a 3% chance that a kmer from one species would have no duplicates among the 19 sibling species. If there were only 5 species, then it's probably more like 50/50.

It's crude logic, but hopefully helps! If you alter the call to kraken from the bracken manual, you could easily generate a confidence-level specific database. I will do the same shortly.

manu-script commented 2 years ago

Exactly Andrew! Glad I'm not the only one thinking this way! I actually ended up doing the same by manually generating a confidence-level specific Bracken database. Since this is not documented anywhere, I was wondering if everyone were just blindly using the default Bracken database generated with a Kraken2 confidence threshold of 0 to estimate the abundance on Kraken2 reports generated with a higher confidence threshold! If this indeed affects abundance estimation, it would be helpful if a parameter to specify a confidence threshold is implemented in the bracken-build program.

andrewjmc commented 2 years ago

Have you compared the differences in abundances when you use the default or customised bracken database? Would be interesting to see if the deviations in taxa relate to the number of siblings

manu-script commented 2 years ago

No I haven't compared the differences since I am dealing with quite large datasets. Would be interesting to systematically compare with a smaller dataset. Will report here if I get a chance to do it but do post here if you run a comparison.

jianshu93 commented 2 years ago

Hello All, it is also my concern when using the default kraken2 database, I would be very interested to know how this will affect the final results. And many thanks for @manu-script and @andrewjmc

Jianshu

janepascar commented 2 years ago

@manu-script can you please elaborate how you generated a confidence-level specific Bracken database? I am also assigning a custom --confidence threshold but had not thought about how this was impacting my Bracken results. I think that is why I am getting so many species identified by bracken within the same genus but in reality when I look at the kraken2 output files only a small proportion of the kmers are actually matching kmers from the species Tax ID while most match the genus ID. Thank you!

manu-script commented 2 years ago

@janepascar, since the bracken-build command doesn't allow to specify a confidence level, I used the hard way of building the bracken database as explained in the readme section here. Specifically, in the step 1a, you need to run kraken2 with the confidence level that you used to process your data previously. The rest of the steps 1b & 1c remain the same as described. Hope it is clear.

janepascar commented 2 years ago

Thank you for the fast response @manu-script!

Specifically, in the step 1a, you need to run kraken2 with the confidence level that you used to process your data previously.

So the command would be: kraken2 --db=${KRAKEN_DB} --threads=10 --confidence [# between 0, 1] <( find -L library (-name ".fna" -o -name ".fa" -o -name "*.fasta" ) -exec cat {} + ) > database.kraken

manu-script commented 2 years ago

That's right, @janepascar! It would be great to hear your thoughts on using bracken with and without the confidence thresholds once you have analyzed your data both ways.

andrewjmc commented 2 years ago

I did the same, by the way. I'll try to find the time to compare the conf-specific bracken correction with the usual one.

jorondo1 commented 1 year ago

Hi! I am exploring the impact of --confidence on the kraken/bracken output as well. Has any one of you come to some conclusion or do you have any recommendations ? Here is what I did so far https://github.com/DerrickWood/kraken2/issues/265#issuecomment-1285615638

looking forward to hearing your thoughts...

anam-25 commented 2 weeks ago

Dear @manu-script @andrewjmc @janepascar @jianshu93 @jorondo1, I would like to know if you people found any difference in confidence specific bracken database and default (confidence = 0.0 ) bracken database. is there any chance by which it is reducing the number of genomes classified in the very first step.