Lower nr of reads at Class compared to Order level Bracken

jenniferlu717 / Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.

http://ccb.jhu.edu/software/bracken/index.shtml

GNU General Public License v3.0

273 stars 50 forks source link

Lower nr of reads at Class compared to Order level Bracken #257

Open CETHuyghe opened 3 months ago

CETHuyghe commented 3 months ago

Hi,

I was using Bracken to re-estimate reads at Class level for Eukaryotes. Now, there is a certain fish Order that I don't want included in the general fish Class, because it's the Order of my host species and the risk that those reads are actually from the host and not the Eukaryotic community is too high.

When I tried to retract the reads re-assigned to this fish Order by Bracken, from the reads assigned to the fish Class, I got several negative numbers for the abundance of this fish Class. I don't understand how this is possible since Bracken will re-assign reads from higher and lower taxonomic levels to a certain taxonomic level. I would even expect that the higher the taxonomic level, the more reads it has assigned. This doesn't seem the case here where the higher taxonomic level has a lower number of reads than the lower taxonomic level.

Additionally, in this case when looking at a higher tax level than Genus or Species, might it be better to use the Kraken2 output and just sum the reads assigned to lower taxonomic levels up to Class and Phylum level?

I would be very grateful for some feedback on this.

aldertzomer commented 2 months ago

We observe a similar problem with Species and Genus. counts are missing in the Genus output from Bracken. This is from an older version of Bracken (the code says 2.2).

BirgitRijvers commented 1 month ago

Hi, I have encountered the same problem, also with species and genus levels.

When I run Bracken on the default settings (species level), the read counts seem to be correct and only differ slightly from the Kraken2 reports. However, when I run Bracken on genus level (with -l G), almost all of my reads dissapear and this results in only a few of the genera originally in the Kraken2 report being in the Bracken output.

For example: When I count all fragments for sample X in the Kraken2 report, it adds up to 62945533 fragments. When I do the same for my Bracken species level report, I get 63073387 fragments in total. However, when I do this for the Bracken genus level report only 1070 fragments are left. This results in 200+ genera also not being reported, while they are in both the Kraken2 and Bracken species level reports.

I am using Bracken version 2.9, installed with Conda.

@jenniferlu717 If more information is needed, please let me know. I would greatly appreciate any advice on this issue!

Kamouyiaraki commented 1 month ago

Hi,

I'm having the same issue (hence how I ended up here) both when running with -l F and without. Has anyone found a solution yet?

jenniferlu717 commented 1 month ago

@CETHuyghe what version of Bracken are you using? Bracken in one version did reasign reads above and below the taxonomy but I found this caused problems when tbere was contamination in the genomes/shared sequences between very unrelated clades so counts got reassigned poorly. It no longer reassigns reads below.

@aldertzomer this may also explain your problem.

@BirgitRijvers do you have a threshold set when running the code? Bracken may remove genera that have read counts below the threshold. I realize this is confusing because -t is not threads for running Bracken, but a threshold min of read counts. Also Bracken may have slightly different total read counts due to rounding but this normally yields slightly less

@BirgitRijvers it would help to see the screen output after running or even if you email me your kraken report and Bracken Xmers.kmer_distrib file to jennifer.lu717@gmail.com

@Kamouyiaraki ill need more details to see the problem

CETHuyghe commented 1 month ago

Hi,

Thank you for coming back to this. The version of Bracken that I use is 2.5

BirgitRijvers commented 1 month ago

Hi Jennifer,

Thank you for looking into this.

I just figured out what was causing my issue, and it is not Bracken... Since I have a large amount of samples that I want to process after running Kraken2 and Bracken, I use the tool kraken-biom to combine all Bracken reports into one BIOM file. This works just fine when using Bracken output on species level, but unfortunately removes classifications when using Bracken on a different taxonomic level. So, my Bracken reports contain all information as expected, but the BIOM file does not. This is also explained in an issue in the kraken-biom repository.

Apologies for the oversight, I really should have checked this earlier. Thank you again for your time!

aldertzomer commented 1 month ago

@jenniferlu717 Thank you! I will discuss with my coworkers. I think we can now solve it :)

skose82 commented 1 week ago

@CETHuyghe what version of Bracken are you using? Bracken in one version did reasign reads above and below the taxonomy but I found this caused problems when tbere was contamination in the genomes/shared sequences between very unrelated clades so counts got reassigned poorly. It no longer reassigns reads below.

@aldertzomer this may also explain your problem.

@BirgitRijvers do you have a threshold set when running the code? Bracken may remove genera that have read counts below the threshold. I realize this is confusing because -t is not threads for running Bracken, but a threshold min of read counts. Also Bracken may have slightly different total read counts due to rounding but this normally yields slightly less

@BirgitRijvers it would help to see the screen output after running or even if you email me your kraken report and Bracken Xmers.kmer_distrib file to jennifer.lu717@gmail.com

@Kamouyiaraki ill need more details to see the problem

Hi @jenniferlu717

I wanted to follow up your answer of -t as threshold for minimum reads/not threads. Is there a default minimum that you would recommend for standard metagenomics/transcriptomics data?

Thanks.