Abundance calculation question

This is not something that can really be handled at the software level for either Kraken or kraken-biom. It's really just a feature of the sequencing process and the fact that Kraken uses kmers to identify organisms. There's a similar issue in amplicon (16S) sequencing. Many organisms have multiple copies of the 16S gene (Some Strep. species have up to 9). There have been attempts to normalize abundance results by copy number, but actually getting a complete database of copy numbers is not simple.

So, I would point you to Mick Watson's excellent article The Madness of Microbiome: Attempting To Find Consensus “Best Practice” for 16S Microbiome Studies. He has a couple sections in the article that discuss copy number and strategies for correction. He ends with the statement:

It is also important to note that in comparisons of OTUs between samples rather than within a sample (e.g., in comparisons of treatment effects), the impact of copy number variation is reduced, as the under- or overrepresentation of OTUs would be consistent across samples as long as the same methodology had been used.

It's not quite the same situation in metagenomics, but the principle is similar.

If you do really want to ensure you are getting one count per cell, then I would suggest you look into software like SingleM that uses a database of true Single Copy genes and runs HMMs trained on those genes against your metagenomic reads to obtain counts.

smdabdoub / kraken-biom

Abundance calculation question #19