jenniferlu717 / Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.
http://ccb.jhu.edu/software/bracken/index.shtml
GNU General Public License v3.0

Read length for nanopore #60

Open mbhall88 opened 5 years ago

mbhall88 commented 5 years ago

Upfront, I know Bracken wasn't necessarily designed to run on nanopore data.

How would you recommend setting the read length parameter? Median read length, average, minimum (as in #30), or a hard threshold?
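
For anyone comparing these options on their own data, here is a minimal Python sketch that summarizes the read-length distribution of an uncompressed, four-line-per-record FASTQ. The function names `read_lengths` and `summarize` and the demo reads are made up for illustration; nothing here is part of Bracken, and it does not settle which statistic is the right choice.

```python
import statistics
from typing import Iterable, Iterator


def read_lengths(fastq_lines: Iterable[str]) -> Iterator[int]:
    """Yield the sequence length of each record in a 4-line-per-record FASTQ stream."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the second line of every record holds the bases
            yield len(line.strip())


def summarize(lengths: list[int]) -> dict[str, float]:
    """Return the candidate statistics mentioned above (minimum, median, mean)."""
    return {
        "minimum": min(lengths),
        "median": statistics.median(lengths),
        "mean": statistics.mean(lengths),
    }


if __name__ == "__main__":
    # Hypothetical in-memory FASTQ with reads of 4, 8, and 12 bases
    demo = [
        "@r1", "ACGT", "+", "IIII",
        "@r2", "ACGTACGT", "+", "IIIIIIII",
        "@r3", "ACGTACGTACGT", "+", "IIIIIIIIIIII",
    ]
    print(summarize(list(read_lengths(demo))))
```

For a real file, `read_lengths(open(path))` streams the records without loading them all at once. On nanopore data the three numbers can differ by orders of magnitude, which is why the choice matters.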

jenniferlu717 commented 5 years ago

I honestly am hesitant to say as it could affect the results a bit. I have yet to test this on nanopore data where read lengths are so varied.

My gut says minimum read length, but I would really like to test this further before being certain.

narsapuramvijaykumar commented 5 years ago

@jenniferlu717

I'm currently facing a similar issue with Illumina HiSeq NGS read data whose read lengths vary from 30 to 301 bp after QC (Trimmomatic followed by FastQC). Is this issue resolved, or is it still under development? I can see in the README.md that the easy version of Bracken has some ways to handle reads of multiple lengths (see link below). Please confirm whether this suits my requirement. https://github.com/jenniferlu717/Bracken#running-bracken-easy-version

Thanks in advance,

Regards, Vijay N

wolfgangrumpf commented 5 years ago

Facing the same problem here - we have a variety of sequencers generating anything from 150bp to 5kb (PacBio). I'm tempted to create two databases so that I can do chemistry-dependent analyses, but if the 150bp db would work for the longer reads, well, it would simplify handing this off to other folks. Any update on your tests, @jenniferlu717 ?

lancer-lu commented 5 years ago

According to the paper, the length r is used to generate a database of k-mers of length r, where r equals the read length; from that we can tell how many k-mers are unique to genome Si. I am facing the same problem as you, and I still don't know how to choose r. My reads range from 150 to 300 bp.
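
To make the role of r concrete, here is a toy Python sketch that counts, for each genome, how many length-r substrings occur in no other genome. It only illustrates the idea as described above and is not Bracken's actual implementation; the sequences and the names `rmers` and `unique_rmer_counts` are hypothetical.

```python
def rmers(genome: str, r: int) -> set[str]:
    """Return the set of all length-r substrings of a genome string."""
    return {genome[i:i + r] for i in range(len(genome) - r + 1)}


def unique_rmer_counts(genomes: dict[str, str], r: int) -> dict[str, int]:
    """Count, per genome, the length-r substrings found in no other genome."""
    sets = {name: rmers(seq, r) for name, seq in genomes.items()}
    counts: dict[str, int] = {}
    for name, own in sets.items():
        # Union of the r-mers of every other genome (empty if there are none)
        others = set().union(*(s for n, s in sets.items() if n != name))
        counts[name] = len(own - others)
    return counts


if __name__ == "__main__":
    # Two made-up, heavily overlapping "genomes"
    toy = {"S1": "ACGTAC", "S2": "CGTACG"}
    for r in (2, 4):
        print(r, unique_rmer_counts(toy, r))
```

In this toy case no 2-mer is unique to either sequence, while at r = 4 each sequence has one unique 4-mer. That dependence on r is why a database built for one read length may behave differently on reads of another length.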

lancer-lu commented 5 years ago

@narsapuramvijaykumar, have you solved this problem?

Midnighter commented 4 years ago

Hi @jenniferlu717,

Similarly to the other folks posting here, I was wondering about what kind of read length I should build a database for. I'm analyzing a fairly diverse dataset where reads are 45, 75, or 100 bp long. Additionally, I will have to trim some of the reads even further due to poor quality. Do you recommend preparing and using different databases or one database based on the minimum length?

Thank you for your insights!

Midnighter commented 2 years ago

I've been thinking a bit more about this and I'm actually wondering if Bracken is needed at all for long reads. I wonder if someone here has more experience because I would assume that with the long reads, kraken2 can match them quite specifically to one of the reference genomes. So I wonder if there is a need even to post-process with Bracken.

pgcudahy commented 1 year ago

I also have a diverse dataset with multiple read lengths. I'm thinking of setting the read length to the minimum, but would appreciate any guidance.

jackwgoodall commented 1 year ago

Please can I ask if you have had a chance to test this @jenniferlu717? It would be great to know if we can use Bracken with confidence for nanopore.

Many thanks,

Jack

iaposto commented 1 year ago

Hello @Midnighter, I wonder if you have any updates on this issue. I am analysing nanopore 16S data (MinION) and have already classified the reads with Kraken2. Is further processing with Bracken necessary? If yes, is the minimum read length the optimal choice? If not, how would one calculate the relative taxonomic abundance from the Kraken2 output?
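
For reference, one simple way to get relative abundances straight from a Kraken2 report is sketched below in Python. It assumes the standard six-column, tab-separated report (percentage, clade reads, directly assigned reads, rank code, taxid, name) and just normalizes the clade read counts at one rank. This is plain normalization, not Bracken's re-estimation: reads classified only above the chosen rank are ignored. The function name and the demo numbers are made up.

```python
def relative_abundance(report_text: str, rank: str = "S") -> dict[str, float]:
    """Normalize clade-level read counts at one rank of a Kraken2 report."""
    clade_reads: dict[str, int] = {}
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # Column 4 is the rank code, column 2 the clade read count, column 6 the name
        if fields[3].strip() == rank:
            clade_reads[fields[5].strip()] = int(fields[1])
    total = sum(clade_reads.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in clade_reads.items()}


if __name__ == "__main__":
    # Hypothetical report lines; the counts are invented for illustration
    demo = "\n".join([
        "20.00\t10\t10\tU\t0\tunclassified",
        "80.00\t40\t0\tR\t1\troot",
        "60.00\t30\t30\tS\t562\t      Escherichia coli",
        "20.00\t10\t10\tS\t1280\t      Staphylococcus aureus",
    ])
    print(relative_abundance(demo))
```

Whether this is adequate for 16S nanopore data is a separate question. It simply reports what Kraken2 assigned at that rank, with no redistribution of reads that stopped at the genus level or higher.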

Many thanks in advance!

Midnighter commented 1 year ago

I don't have a real answer but I can say that we decided to not run Bracken on nanopore reads for taxprofiler.