jenniferlu717 / Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.
http://ccb.jhu.edu/software/bracken/index.shtml
GNU General Public License v3.0
286 stars 50 forks source link

Strange Bracken results indicating Brassica Rapa reads in the Human RNASeq COVID Samples #243

Open Rohit-Satyam opened 8 months ago

Rohit-Satyam commented 8 months ago

Dear @jenniferlu717

I have been using Bracken for a long time but recently something changed. I have clinical COVID RNAseq samples (380) and I realized that in many samples Bracken shows the abundance of Brassica Rapa and other Plant-related reads such as Juglans regia. This is completely opposed to what we observe in Kraken2 results. Besides, these are human samples so I shouldn't observe Viridiplantae reads. I see that a similar misassignment has also been reported elsewhere

Version Used: Bracken v2.9

kraken2 --threads 12 --confidence 0.1 --paired --db PlusPF_database_indexes/ \
 file.R1.fastq.gz file.R1.fastq.gz --output $file --report-minimizer-data \
  --report file_kraken.report --gzip-compressed --unclassified-out file_unclassified#.fasta
bracken -r 100 -d <path> -i file_kraken.report -o file_abundance.txt -w file_bracken.report

image

Also attaching the results:

bracken_misassignment.zip

jenniferlu717 commented 8 months ago

I looked back into the code and what Bracken is trying to do is redistribute reads based on the kmer distribution file.

So for example, if any sequence (100bp for example) from Brassica Rapa is classified as Bacteria or Acinetobacter, then reads from those levels were redistributed to Brassica Rapa.

Are you using a complete genome of Brassica Rapa?

What I could do is require that any redistribution has to remain in the same tree. This normally doesn't cause problems unless genomes are contaminated.

Rohit-Satyam commented 8 months ago

Hi @jenniferlu717. My data is RNAseq (not genomes) but the read length is greater than 100 post trimming and I provide these trimmed fastq files to Kraken2. All I am using is Kraken2/ Bracken index PlusPF from here (Downloaded on Dec 8, 2023).

What I could do is require that any redistribution has to remain in the same tree. This normally doesn't cause problems unless genomes are contaminated.

Yep, that would be great. In addition, can you tell why there are missing taxa for the Bacteria in the Sankey plot (Bacteria directly linked to Pseudomonadota)? I observe that such a discrepancy has been reported in Pavian issue threads (see here and here) but this must be stemming from the Kraken2 reports right? This wasn't the issue with the previous indexes.

Rohit-Satyam commented 8 months ago

Hi, I was wondering if this is a quick fix and can be inculcated by the end of this week or if I have to wait for a few months. kindly let me know because I have a presentation coming so I would be able to include/exclude results accordingly. No hurry please take your time. I just want to know for my personal goals.

Thanks

Rohit-Satyam commented 6 months ago

Hi @jenniferlu717 I was wondering if this issue was fixed.