jenniferlu717 / Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.
http://ccb.jhu.edu/software/bracken/index.shtml
GNU General Public License v3.0
286 stars 50 forks source link

Reads being redistributed to lower abundance species over higher abundance species within the Bacillus genus #233

Closed nphage closed 11 months ago

nphage commented 1 year ago

I am using Kraken2 to classify some libraries with Bacillus subtilis in them using default parameters and the 12/9/2022 version of the pre-made PlusPF index and then running Bracken with the same index and a read size of 200. I am having an issue where Bracken is consistently redistributing reads to specific lower abundance species over the other more abundant Bacillus species. It seems like it is first redistributing more reads than it should to the next higher taxonomic level (G1, unclassified Bacillus, TaxID: 185979) and then preferentially assigning the majority of those extra reads somewhat consistently to specific species like (Bacillus sp. LUNF1, TaxID: 2709781). In some cases these low abundance species are getting assigned more reads than B. subtilis which clearly has more reads in the Kraken2 report file. For example:

This is from the Kraken2 report. image

This is from the Bracken report. image

This is the most simple example I have, but in a variety of experiments with Bacillus subtilis Bracken redistributes more reads than it seems like it should to the G1 unclassified Bacillus level and then assigns most of those reads to one or more of the species under that taxon even when there are multiple other species with more reads assigned by Kraken2. This happens with version 2.6 and 2.8 of Bracken. And I am pretty sure I have seen this with some other organism as well so I don't think this is just a Bacillus issue. I am happy to provide any files that may be helpful for figuring out what is going on here.

nphage commented 1 year ago

Nevermind. This was a mistake on my part. I had a script that was running Kraken2 and Bracken and when I updated the Kraken2 database I was using I had changed the database in the script for Kraken2, but forgot to update if for Bracken. So they were using two different databases which seems to have caused a variety of issues.

nphage commented 1 year ago

Ok, that was a bit premature. While I was accidentally using two different databases for Kraken2 and Bracken and switching to using the same database for both did fix another issue I was seeing it did not fix this issue. Even with the same database Bracken is still redistributing more reads than it seems like it should to some taxon like in the original post above.

One thing that helps reduce the impact of this issue is setting a threshold for Bracken to something like 10. And despite what the README says the default threshold for running Bracken is 0 and not 10. However, if one runs est_abundance.py directly it looks like the default threshold value there is 10.

jenniferlu717 commented 11 months ago

The distribution is not just based on which species has more reads, but also based on which species has more unique reads (reads not shared with others). Its a bit more complicated, but I understand the thresholding issue. I fixed the threshold to make it consistently 10 across both scripts.