the-eon-flux closed this issue 6 months ago
Concatenating sounds fine to me, but only if the unclassified reads from DB 1 are the ones being run against DB 2. I think that if there's any overlap in which reads are classified, it doesn't work. I have not tested whether this would give wildly different results, but assuming the genomes are complete and not contaminated, I think it should work.
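One way to check the overlap concern is to compare which read IDs each database classified. The sketch below is only an illustration, assuming both runs were saved as standard Kraken2 per-read output (tab-separated, with `C`/`U` in the first column and the read ID in the second). The function names, read IDs, and taxids are made up for the example.

```python
def classified_read_ids(lines):
    """Collect read IDs flagged 'C' (classified) in Kraken2 per-read output lines."""
    ids = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Column 1 is 'C' or 'U', column 2 is the read ID.
        if len(fields) >= 2 and fields[0] == "C":
            ids.add(fields[1])
    return ids


def overlap_summary(gtdb_lines, pluspf_lines):
    """Count reads classified by only one database or by both."""
    gtdb = classified_read_ids(gtdb_lines)
    pluspf = classified_read_ids(pluspf_lines)
    return {
        "gtdb_only": len(gtdb - pluspf),
        "pluspf_only": len(pluspf - gtdb),
        "both": len(gtdb & pluspf),
    }


# Toy per-read output (hypothetical read IDs and taxids, not real data).
gtdb_out = [
    "C\tread1\t101\t150\t101:116",
    "U\tread2\t0\t150\t0:116",
    "C\tread3\t102\t150\t102:116",
]
pluspf_out = [
    "C\tread1\t562\t150\t562:116",
    "C\tread2\t4932\t150\t4932:116",
    "U\tread3\t0\t150\t0:116",
]

summary = overlap_summary(gtdb_out, pluspf_out)
print(summary)
```

With real data, the lines would come from the files written by each run's `--output` option. A large `both` count would suggest that simply adding the two tables counts some reads twice.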
I managed to create a database with genomes from both of these databases, and I could successfully run Kraken2 on the files. Thank you for the discussion; I'm closing this issue.
Hi all,
I couldn't find any relevant discussion or information on this topic, including in this GitHub repo, so I am starting this discussion here.
So, I am trying to classify my metagenomic reads using 2 standard databases (GTDB & PlusPF) separately. I ran Kraken2-Bracken (with the same database for both steps) on all my samples. So, now I have 2 feature tables (1 with the GTDB read counts and 1 with the PlusPF read counts).
Is it valid to merge the features unique to each database into a single table for classification purposes? The goal is to expand taxonomic coverage. I want to study the ecology of the taxa within these samples.
Why merge features from 2 different databases? GTDB has standardized and reassigned bacterial and archaeal genomes/taxa using phylogenetic information in addition to the genomic input. Therefore, I get more accurate classifications within the Bacteria and Archaea.
The PlusPF database is built from genomes of almost all domains of life (excluding plant genomes). My samples most likely have a lot of viral genomes as well. I am only interested in the fungal and viral taxa from this database.
I read the Kraken and Bracken papers, and if I understood them correctly, each DNA read is uniquely assigned to a taxon (within a database). And if GTDB has only archaeal and bacterial genomes, then the unclassified reads should belong to the missing taxa/genomes. So, I am just adding the classification for those unclassified reads.
Unfortunately, merging the 2 databases is a huge task on its own. I don't want to go there. Is it fine if I take only the bacterial read counts from the GTDB Bracken output (excluding unclassified, of course) and just add the list of viral/fungal read counts (if any) in a given sample?
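To make the proposal concrete, here is a rough sketch of the table merge I have in mind. It assumes each Bracken table has already been reduced to a mapping of taxon name to (group, estimated reads). The group labels, function names, and counts below are hypothetical, since Bracken output does not carry the group itself and it would have to come from the taxonomy.

```python
def merge_feature_tables(gtdb, pluspf,
                         gtdb_groups=("Bacteria", "Archaea"),
                         pluspf_groups=("Viruses", "Fungi")):
    """Keep bacterial/archaeal counts from GTDB and viral/fungal counts from PlusPF.

    Keys are tagged with the source database so identical names never collide.
    """
    merged = {}
    for name, (group, reads) in gtdb.items():
        if group in gtdb_groups:
            merged[("GTDB", name)] = reads
    for name, (group, reads) in pluspf.items():
        if group in pluspf_groups:
            merged[("PlusPF", name)] = reads
    return merged


def relative_abundance(merged):
    """Recompute fractions over the merged total rather than reusing per-database fractions."""
    total = sum(merged.values())
    return {key: reads / total for key, reads in merged.items()} if total else {}


# Toy counts for one sample (made-up numbers, for illustration only).
gtdb_table = {
    "Escherichia coli": ("Bacteria", 1200),
    "Methanobrevibacter smithii": ("Archaea", 300),
}
pluspf_table = {
    "Escherichia coli": ("Bacteria", 1100),
    "Saccharomyces cerevisiae": ("Fungi", 50),
    "Escherichia virus T4": ("Viruses", 25),
    "Homo sapiens": ("Eukaryota", 900),
}

merged = merge_feature_tables(gtdb_table, pluspf_table)
fractions = relative_abundance(merged)
for key, reads in merged.items():
    print(key, reads, round(fractions[key], 4))
```

The fractions are recomputed because the per-database fractions use different denominators. This sketch does not check whether the same read was counted in both tables, which is the assumption I am asking about.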
Does my logic make sense? Or will this violate any assumptions? Any pointers or discussion is welcome! Thank you for your time.