leylabmpi / Struo2

Scalable creating/updating of metagenome profiling databases
MIT License

GTDB 207 Kraken db vs maxikraken2_1903_140GB db classification rate #44

Open ohickl opened 1 year ago

ohickl commented 1 year ago

Hi,

I am having trouble understanding the differences in classification rates between your GTDB 207 release Kraken database and the widely used maxikraken db from 2019, which is roughly half the size. I am classifying ~150 human stool sample metagenomes with kraken2 (2.1.2), using a confidence score of 0.75 and otherwise default parameters, and am consistently getting an unclassified rate that is roughly 10 percentage points higher with the GTDB database. This seems to stem from a higher classification rate of bacteria with the maxikraken db. On the other hand, I do get substantially higher sensitivity for Archaea with the GTDB one.

Example (only highest levels):

GTDB 207:

 37.77  12279770  12279770  U  0           unclassified
 62.23  20227993  68658     R  1           root
 62.01  20159210  3273025   D  609216830     Bacteria
  0.00  125       0         D  2587168575    Archaea

maxikraken:

 26.57  8637310   8637310   U  0      unclassified
 73.43  23870453  5007      R  1      root
 73.19  23793211  1103748   D  2        Bacteria
  0.00  68        0         D  2157     Archaea
  0.00  68        0         D  2759     Eukaryota
  0.00  257       0         D  10239    Viruses
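As a quick sanity check on the two excerpts above, the following sketch parses the pasted rows and compares the unclassified and Bacteria counts. It assumes the usual Kraken 2 report column order (percent, reads in clade, reads assigned directly, rank code, taxid, name); the variable and function names are only illustrative.

```python
# Rows copied from the two report excerpts in this comment.
GTDB_ROWS = """\
37.77 12279770 12279770 U 0 unclassified
62.23 20227993 68658 R 1 root
62.01 20159210 3273025 D 609216830 Bacteria
0.00 125 0 D 2587168575 Archaea
"""

MAXI_ROWS = """\
26.57 8637310 8637310 U 0 unclassified
73.43 23870453 5007 R 1 root
73.19 23793211 1103748 D 2 Bacteria
0.00 68 0 D 2157 Archaea
0.00 68 0 D 2759 Eukaryota
0.00 257 0 D 10239 Viruses
"""


def parse_rows(text):
    """Return {name: {"pct": float, "clade": int, "direct": int}} for each row."""
    rows = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        pct, clade, direct, _rank, _taxid, name = line.split(None, 5)
        rows[name.strip()] = {
            "pct": float(pct),
            "clade": int(clade),
            "direct": int(direct),
        }
    return rows


gtdb = parse_rows(GTDB_ROWS)
maxi = parse_rows(MAXI_ROWS)

# Both runs should account for the same number of reads.
total_gtdb = gtdb["unclassified"]["clade"] + gtdb["root"]["clade"]
total_maxi = maxi["unclassified"]["clade"] + maxi["root"]["clade"]

# Gap in unclassified reads (GTDB minus maxikraken), in points and in reads.
unclassified_gap_pct = round(gtdb["unclassified"]["pct"] - maxi["unclassified"]["pct"], 2)
unclassified_gap_reads = gtdb["unclassified"]["clade"] - maxi["unclassified"]["clade"]

# Extra reads that maxikraken places anywhere under Bacteria.
bacteria_gap_reads = maxi["Bacteria"]["clade"] - gtdb["Bacteria"]["clade"]

print(total_gtdb, total_maxi)
print(unclassified_gap_pct, unclassified_gap_reads, bacteria_gap_reads)
```

For this sample, both reports cover the same 32,507,763 reads, the unclassified gap is 3,642,460 reads (11.2 percentage points), and the Bacteria clade differs by 3,634,001 reads, so almost the entire gap is reads that maxikraken assigns somewhere under Bacteria.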

I am confused as to why that is. I could understand that, given the much higher information content in the GTDB db, some classifications would be 'pushed' higher in the taxonomic hierarchy at the confidence threshold used, since with more data some k-mers are no longer specific/unique to a taxon at that rank. But since in my case the reads aren't even pushed to the root node but to unclassified, it seems to me that quite a few k-mers are entirely missing from the GTDB db but present in the maxikraken one. Is this expected?
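The reasoning above can be illustrated with a simplified sketch of confidence scoring as the Kraken 2 manual describes it: the score for a candidate taxon is the fraction of a read's k-mers that map inside that taxon's clade, and the label moves up to the parent until the threshold is met or the read is left unclassified. This is not Kraken 2's actual code; the function name `apply_confidence`, the toy taxonomy, and the k-mer counts are all hypothetical.

```python
def apply_confidence(kmer_hits, parent, candidate, threshold):
    """Simplified confidence filter (illustrative only).

    kmer_hits: taxid -> number of the read's k-mers whose LCA is that taxid;
               taxid 0 holds k-mers that are absent from the database.
    parent:    taxid -> parent taxid; the root's parent is 0.
    Returns the final taxid, or 0 for "unclassified".
    """
    total = sum(kmer_hits.values())
    if total == 0:
        return 0

    def in_clade(taxid, ancestor):
        # Walk up from taxid; True if we pass through ancestor.
        while taxid != 0:
            if taxid == ancestor:
                return True
            taxid = parent[taxid]
        return False

    node = candidate
    while node != 0:
        clade_hits = sum(
            n for t, n in kmer_hits.items() if t != 0 and in_clade(t, node)
        )
        if clade_hits / total >= threshold:
            return node
        node = parent[node]  # not confident enough, try the parent
    return 0  # even the root fell short of the threshold


# Toy taxonomy (hypothetical ids): root(1) > Bacteria(2) > genus(10) > species(11)
PARENT = {1: 0, 2: 1, 10: 2, 11: 10}

# Read A: k-mers are shared across ranks, few are missing from the database.
read_a = {11: 50, 10: 20, 2: 10, 0: 20}
# Read B: hits are species-specific, but 40% of k-mers are not in the database.
read_b = {11: 60, 0: 40}

result_a = apply_confidence(read_a, PARENT, 11, 0.75)
result_b = apply_confidence(read_b, PARENT, 11, 0.75)
print(result_a, result_b)
```

In this toy setting, read A is pushed up to Bacteria, while read B ends up unclassified: k-mers that are absent from the database count against every rank including the root, so at a threshold of 0.75 any read with more than 25% unmatched k-mers cannot be classified at all.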

Best,
Oskar

nick-youngblut commented 1 year ago

The basic answer is the difference in representation between the two databases. You should be able to determine how microbial detection differs at resolved taxonomic levels (e.g., genus) by comparing the abundances of taxonomic groups classified by each database (e.g., x-axis: genus-gtdb, y-axis: genus-maxikraken, point size or color: median relative abundance across metagenome samples).

I'm guessing that the maxikraken database is more biased towards certain bacteria, especially well-characterized ones that are often found in the gut, while the GTDB includes a very broad representation of microbes found in all biomes.
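One way to prepare the data for such a comparison is sketched below. It assumes standard six-column, tab-delimited Kraken 2 reports with rank code `G` for genus rows, and it assumes that stripping a `g__` prefix is enough to align genus names between the two taxonomies, which will not hold for genera that GTDB splits or renames. The function names are hypothetical.

```python
from statistics import median


def genus_percentages(report_text):
    """Map genus name -> percent of reads in that clade for one report."""
    out = {}
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 6:
            continue  # skip blank lines or reports with extra columns
        pct, _clade, _direct, rank, _taxid, name = fields
        if rank == "G":
            genus = name.strip()
            if genus.startswith("g__"):  # assumption: GTDB-style prefix
                genus = genus[3:]
            out[genus] = float(pct)
    return out


def median_by_genus(reports):
    """Median percent per genus across samples; absent genera count as 0."""
    per_sample = [genus_percentages(text) for text in reports]
    genera = set()
    for sample in per_sample:
        genera.update(sample)
    return {g: median(s.get(g, 0.0) for s in per_sample) for g in genera}


def compare(gtdb_reports, maxi_reports):
    """Return sorted (genus, median_gtdb, median_maxikraken) tuples."""
    a = median_by_genus(gtdb_reports)
    b = median_by_genus(maxi_reports)
    return sorted((g, a.get(g, 0.0), b.get(g, 0.0)) for g in set(a) | set(b))
```

Each returned tuple gives one point for the scatter plot: the second value on the x-axis, the third on the y-axis. Genera that sit far from the diagonal are the ones that the two databases detect differently.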