Open ohickl opened 1 year ago
The basic answer is the difference in representation between the two databases. You should be able to determine how microbial detection differs at resolved taxonomic levels (e.g., genus) by comparing the abundances of the taxonomic groups classified by each database (e.g., x-axis: genus-gtdb, y-axis: genus-maxikraken, point size or color: median relative abundance across metagenome samples).
I'm guessing that the maxikraken database is more biased towards certain bacteria, especially well-characterized ones that are often found in the gut, while the GTDB includes a very broad representation of microbes found in all biomes.
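To make the comparison suggested above concrete, here is a minimal sketch of how the genus-level percentages from two Kraken 2 reports could be put side by side. It assumes the standard six-column, tab-separated report layout (percentage, clade reads, direct reads, rank code, taxid, name). The function names, the `g__` prefix stripping, and the toy report text are assumptions for illustration, not output from either database; genus names also differ between the GTDB and NCBI taxonomies, so matching by name is only a rough first pass.

```python
def genus_percentages(report_text):
    """Return {genus_name: clade_percentage} from Kraken 2 style report text."""
    out = {}
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        pct, _clade, _direct, rank, _taxid, name = fields[:6]
        if rank.strip() != "G":
            continue  # keep genus-level rows only
        genus = name.strip()
        # Assumption: GTDB-derived names may carry a "g__" prefix.
        if genus.startswith("g__"):
            genus = genus[3:]
        out[genus] = float(pct)
    return out


def compare_genera(pct_a, pct_b):
    """Return sorted (genus, pct_a, pct_b) rows; a missing genus counts as 0.0."""
    rows = []
    for genus in sorted(set(pct_a) | set(pct_b)):
        rows.append((genus, pct_a.get(genus, 0.0), pct_b.get(genus, 0.0)))
    return rows


# Made-up toy reports, only to show the expected input shape.
gtdb_toy = "\n".join([
    "40.00\t400\t0\tG\t101\t      g__Bacteroides",
    "5.00\t50\t0\tG\t102\t      g__Methanobrevibacter",
])
other_toy = "\n".join([
    "55.00\t550\t0\tG\t816\t      Bacteroides",
    "30.00\t300\t300\tS\t817\t        Bacteroides fragilis",
])

rows = compare_genera(genus_percentages(gtdb_toy), genus_percentages(other_toy))
for genus, a, b in rows:
    print(f"{genus}\t{a:.2f}\t{b:.2f}")
```

The resulting rows can be fed into any plotting tool for the scatter plot described above, one point per genus, after aggregating across samples (e.g., taking the median per genus).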
Hi,
I have some trouble understanding the differences in classification rates between your GTDB 207 release Kraken database and the widely used maxikraken db from 2019, which is roughly half the size. I am classifying ~150 human stool sample metagenomes with kraken2 (2.1.2), using a confidence score of 0.75 and default parameters otherwise, and I am consistently getting a ~10% higher unclassified rate with the GTDB database. This seems to stem from a higher classification rate of bacteria with the maxikraken db. On the other hand, I do get substantially higher sensitivity for Archaea with the GTDB one. Example (only highest levels):

GTDB 207:
maxikraken:
I am confused as to why that is. I could understand that, given the much higher information content in the GTDB db, some classifications would be 'pushed' higher in the taxonomic hierarchy with the confidence threshold used, since with more data some k-mers are no longer specific/unique for a taxon at that rank. But since in my case they aren't even pushed to the root node but to unclassified, it seems to me that quite a few k-mers are entirely missing from the GTDB db but present in the maxikraken one? Is this expected?
Best, Oskar
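On the question of why reads end up unclassified rather than at the root: per the Kraken 2 manual, a label's confidence score is C/Q, where C is the number of k-mers mapped to the clade rooted at that label and Q is the number of k-mers in the read without ambiguous nucleotides. K-mers with no hit in the database still count in Q, and if even the root does not reach the threshold, the read is called unclassified. The sketch below is a simplified toy model of that rule, not Kraken 2's actual code: the tree, the taxon IDs, and the function names are hypothetical, and tie-breaking and minimizer details are ignored.

```python
from collections import Counter


def in_clade(taxon, clade_root, parent):
    """True if taxon equals clade_root or descends from it."""
    while taxon is not None:
        if taxon == clade_root:
            return True
        taxon = parent[taxon]
    return False


def classify(kmer_hits, parent, threshold):
    """Return the assigned taxon, or None for 'unclassified'.

    kmer_hits: one entry per k-mer, a taxon ID or None if absent from the db.
    parent: dict mapping each taxon to its parent (root maps to None).
    """
    counts = Counter(h for h in kmer_hits if h is not None)
    if not counts:
        return None
    total = len(kmer_hits)  # misses stay in the denominator

    def path_score(taxon):
        # Sum of hits on the path from this taxon up to the root.
        score = 0
        while taxon is not None:
            score += counts.get(taxon, 0)
            taxon = parent[taxon]
        return score

    # Start from the taxon with the best root-to-leaf path (ties not handled).
    node = max(counts, key=path_score)
    while node is not None:
        clade_hits = sum(n for t, n in counts.items() if in_clade(t, node, parent))
        if clade_hits / total >= threshold:
            return node
        node = parent[node]  # not confident enough, move up one rank
    return None  # even the root fell short of the threshold


# Hypothetical tree: 1 = root, 2 = Bacteria, 3 and 4 = two genera.
parent = {1: None, 2: 1, 3: 2, 4: 2}

# 3 of 8 k-mers missing from the db: root clade reaches only 5/8 = 0.625.
read_with_misses = [3, 3, 3, 2, None, None, None, 4]
# 1 of 8 k-mers missing: clade of taxon 2 reaches 7/8 = 0.875.
read_mostly_hit = [3, 3, 3, 2, 4, 4, None, 3]

print(classify(read_with_misses, parent, 0.75))  # None (unclassified)
print(classify(read_mostly_hit, parent, 0.75))   # 2
```

Under this model, with a threshold of 0.75, any read in which more than a quarter of the k-mers have no database hit cannot be classified at any rank, including the root.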