jenniferlu717 / Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.
http://ccb.jhu.edu/software/bracken/index.shtml
GNU General Public License v3.0

Question: Would it be possible to concatenate the Bracken read assignments from 2 different databases (1 bacterial database & 1 viral database)? #250

Closed by the-eon-flux 6 months ago

the-eon-flux commented 8 months ago

Hi all,

I searched this GitHub repo but couldn't find any relevant discussion or information on this topic, so I am starting a discussion here.

I am trying to classify my metagenomic reads using 2 standard databases (GTDB & PlusPF) separately. I ran Kraken-Bracken (with the same database for both steps) on all my samples, so I now have 2 feature tables (1 with the GTDB read counts and 1 with the PlusPF read counts).

Is it valid to merge unique features (for each database) from different databases for classification purposes? This is for expanding taxonomic coverage. I want to study the ecology of the taxa within these samples.

Why merge features from 2 different databases? GTDB has standardized and reassigned bacterial and archaeal genomes/taxa using phylogenetic information in addition to the genomic input. Therefore, I get more accurate classifications within Bacteria and Archaea.

The PlusPF database is built from genomes across almost all domains of life (excluding plant genomes). My samples most likely contain a lot of viral genomes as well. I am only interested in the fungal and viral taxa from this database.

I read the Kraken and Bracken papers, and if I understood them correctly, each DNA read is uniquely assigned to a taxon (within a database). If GTDB has only archaeal and bacterial genomes, then the unclassified reads should belong to the missing taxa/genomes. So I am just adding the classification for those unclassified reads.

Unfortunately, merging the 2 databases is a huge task on its own, and I don't want to go there. Is it fine if I take only the bacterial read counts from the GTDB Bracken output (excluding unclassified, of course) and just add the list of viral/fungal read counts (if any) from the PlusPF output for a given sample?

Does my logic make sense? Or will this violate any assumptions? Any pointers or discussion is welcome! Thank you for your time.
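
For concreteness, the concatenation described above could be sketched as follows. This is only an illustration under stated assumptions: it reads the `name` and `new_est_reads` columns of a standard Bracken output table, while the helper names (`read_bracken`, `concatenate`), the `keep` set, and all taxa and counts are hypothetical toy values. Deciding which PlusPF taxa are viral or fungal is left to the caller, since a species-level Bracken table does not carry the lineage.

```python
import csv
import io

HEADER = ("name\ttaxonomy_id\ttaxonomy_lvl\tkraken_assigned_reads\t"
          "added_reads\tnew_est_reads\tfraction_total_reads\n")


def read_bracken(handle):
    """Return {taxon name: new_est_reads} from a Bracken output table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["name"]: int(row["new_est_reads"]) for row in reader}


def concatenate(primary, secondary, secondary_keep):
    """Keep every taxon from `primary`; add only the `secondary` taxa
    listed in `secondary_keep`. A taxon present in both is an error."""
    merged = dict(primary)
    for name, reads in secondary.items():
        if name not in secondary_keep:
            continue  # e.g. bacterial taxa from the second database
        if name in merged:
            raise ValueError(f"taxon present in both tables: {name}")
        merged[name] = reads
    return merged


# Toy tables standing in for one sample's two Bracken outputs (hypothetical).
gtdb_tsv = HEADER + "bacterium_A\t1\tS\t90\t10\t100\t0.5\n" \
                    "bacterium_B\t2\tS\t95\t5\t100\t0.5\n"
pluspf_tsv = HEADER + "bacterium_A\t11\tS\t80\t0\t80\t0.4\n" \
                      "virus_X\t12\tS\t50\t10\t60\t0.3\n" \
                      "fungus_Y\t13\tS\t55\t5\t60\t0.3\n"

gtdb = read_bracken(io.StringIO(gtdb_tsv))
pluspf = read_bracken(io.StringIO(pluspf_tsv))
keep = {"virus_X", "fungus_Y"}  # assumed list of viral/fungal taxa

merged = concatenate(gtdb, pluspf, keep)
print(merged)
```

Note that the merged table mixes counts estimated against two different reference sets, so relative abundances would need to be recomputed from the combined counts rather than taken from either `fraction_total_reads` column.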

jenniferlu717 commented 8 months ago

It sounds fine to me to concatenate, but only if the unclassified reads from DB 1 are the ones being run against DB 2. I think that if there's any overlap in which reads are classified, it doesn't work. I have not tested whether this would give wildly different results, but assuming the genomes are complete and not contaminated, it should work(?)
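
One way to check the overlap condition above is to compare the read IDs that Kraken 2 marked as classified in each run. The sketch below assumes Kraken 2's standard per-read output (tab-separated, with `C` or `U` in the first field and the read ID in the second); the function names and the toy lines are hypothetical. It operates on the Kraken step only, since Bracken redistributes counts rather than individual reads.

```python
def classified_ids(lines):
    """Collect IDs of reads marked 'C' (classified) in Kraken 2 per-read output."""
    ids = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0] == "C":
            ids.add(fields[1])
    return ids


def classified_in_both(lines_db1, lines_db2):
    """Return read IDs classified against both databases."""
    return classified_ids(lines_db1) & classified_ids(lines_db2)


# Toy per-read output for the same three reads against two databases
# (taxids and k-mer mappings are made up for illustration).
run_db1 = [
    "C\tread1\t101\t150\t101:116",
    "U\tread2\t0\t150\t0:116",
    "C\tread3\t102\t150\t102:116",
]
run_db2 = [
    "U\tread1\t0\t150\t0:116",
    "C\tread2\t201\t150\t201:116",
    "C\tread3\t202\t150\t202:116",
]

shared = classified_in_both(run_db1, run_db2)
print(sorted(shared))
```

An empty result would mean no read was counted twice. Alternatively, Kraken 2's `--unclassified-out` option can write the reads left unclassified by DB 1 to a file, so that only those reads are run against DB 2 and overlap cannot occur by construction.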

the-eon-flux commented 6 months ago

I managed to create a database with genomes from both of these databases, and I could successfully run Kraken2 on the files. Thank you for the discussion; I'm closing this issue.