Open NienkeMekkes opened 4 years ago
It may be that the Kraken report contains no strains other than the E. coli strains.
Bracken recalculates the percentages using only the reads assigned at the level you are estimating.
In other words, if no strains are listed for any other species, the S1 percentages describe only the E. coli strains and will not match the sample-wide abundances.
In that case, compare the read counts rather than only the percentages.
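
To illustrate the point about per-level recalculation, here is a minimal Python sketch using the `new_est_reads` values from the S1 rows quoted in this issue. It assumes the reported fraction is each taxon's estimated reads divided by the total over the taxa reported at that level. The helper `level_fractions` is a hypothetical name, not part of Bracken. The results should come out close to the reported `fraction_total_reads` column, though the last digit can differ because the report prints rounded values.

```python
# new_est_reads for the S1 rows quoted in this issue.
s1_est_reads = {
    "Escherichia_coli_B3008": 18058,
    "Escherichia_coli_JM109": 57813,
    "Escherichia_coli_B1109": 19639,
    "Escherichia coli W": 18736,
    "Escherichia_coli_b2207": 15975,
}


def level_fractions(est_reads):
    """Divide each taxon's reads by the total over the taxa reported at this level.

    Assumption: this mirrors how the fraction column is normalised; it is an
    illustration, not Bracken's actual code.
    """
    total = sum(est_reads.values())
    return {name: reads / total for name, reads in est_reads.items()}


s1_fractions = level_fractions(s1_est_reads)
for name, frac in s1_fractions.items():
    # Print the read count next to the fraction so both can be compared.
    print(f"{name}\t{s1_est_reads[name]}\t{frac:.5f}")
print("sum of fractions:", round(sum(s1_fractions.values()), 5))
```

Because only the five E. coli strains appear at S1, the fractions sum to 1 regardless of how much of the sample E. coli represents, which is why the read counts are the more informative column here.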
Dear authors,
I have been using Bracken to experiment with strain-level classification, and I find the tool very useful. I do have a question regarding the output. I am testing on reads from a gut microbiome standard (known input abundances), using a test Kraken2/Bracken database that contains only the specific strains/genomes in the sample (but with full taxonomy, of course). For clarification, this input standard contains several different species, including 5 different E. coli strains. I noticed the following behaviour:
At species level the abundances are estimated the way I expected:

| name | taxonomy_id | taxonomy_lvl | kraken_assigned_reads | added_reads | new_est_reads | fraction_total_reads |
| --- | --- | --- | --- | --- | --- | --- |
| Saccharomyces cerevisiae | 4932 | S | 7290 | 13 | 7303 | 0.01037 |
| Fusobacterium nucleatum | 851 | S | 17850 | 0 | 17850 | 0.02535 |
| Faecalibacterium prausnitzii | 853 | S | 98999 | 17 | 99016 | 0.14059 |
| Escherichia coli | 562 | S | 89813 | 265 | 90078 | 0.12790 |

Etc...
At S1 level, the following behaviour occurred:

| name | taxonomy_id | taxonomy_lvl | kraken_assigned_reads | added_reads | new_est_reads | fraction_total_reads |
| --- | --- | --- | --- | --- | --- | --- |
| Escherichia_coli_B3008 | 9999994 | S1 | 6551 | 11507 | 18058 | 0.13867 |
| Escherichia_coli_JM109 | 9999991 | S1 | 2203 | 55610 | 57813 | 0.44396 |
| Escherichia_coli_B1109 | 9999993 | S1 | 3790 | 15849 | 19639 | 0.15082 |
| Escherichia coli W | 566546 | S1 | 11122 | 7614 | 18736 | 0.14388 |
| Escherichia_coli_b2207 | 9999992 | S1 | 4539 | 11436 | 15975 | 0.12268 |
What I noticed is that at S1 level the 5 E. coli abundances add up to 100%, while at species level E. coli made up only 12.79%. I expected the S1 abundances to add up to this 12.79%. I think this happened because in my database I listed only the 5 E. coli strains as "strain", while I listed the other genomes as "species".
However, does this mean that Bracken's fractions always add up to 100% within a given taxon level (G, S, S1), even when the level above carries information about the abundance in the sample? As a follow-up question: can I combine the S1 and S abundances to calculate the "actual" abundance? For example, if E. coli makes up 12.79% of my sample and "E. coli strain A" is found at 50% at S1 level, then the abundance of "E. coli strain A" in the sample would be about 6.4% (0.1279 × 0.50).
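
The combination I have in mind can be sketched in Python as below. This is only the arithmetic from my question, not something Bracken does or something the authors have confirmed. The function name `sample_level_abundance` is made up for illustration, and the inputs are the `fraction_total_reads` values quoted above. It assumes the S1 fractions can be read as shares within E. coli, which may not hold exactly: the S1 `new_est_reads` values sum to 130221, while the species-level estimate for E. coli is 90078.

```python
def sample_level_abundance(species_fraction, strain_fractions):
    """Scale within-species strain shares by the species' share of the sample.

    Hypothetical helper: treats each S1 fraction as that strain's share of the
    species, then multiplies by the species-level fraction.
    """
    return {
        name: species_fraction * share
        for name, share in strain_fractions.items()
    }


# fraction_total_reads for E. coli at species (S) level, from the output above.
e_coli_species_fraction = 0.12790

# fraction_total_reads for the five strains at S1 level, from the output above.
e_coli_s1_fractions = {
    "Escherichia_coli_B3008": 0.13867,
    "Escherichia_coli_JM109": 0.44396,
    "Escherichia_coli_B1109": 0.15082,
    "Escherichia coli W": 0.14388,
    "Escherichia_coli_b2207": 0.12268,
}

combined = sample_level_abundance(e_coli_species_fraction, e_coli_s1_fractions)
for name, frac in combined.items():
    print(f"{name}\t{frac:.5f}")
```

With this scaling, the five combined values add back up to roughly the species-level 12.79% instead of 100%.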