smdabdoub / kraken-biom

Create BIOM-format tables (http://biom-format.org) from Kraken output (http://ccb.jhu.edu/software/kraken/, https://github.com/DerrickWood/kraken).
MIT License
47 stars 15 forks source link

Abundance calculation question #19

Closed Valentin-Bio-zz closed 2 years ago

Valentin-Bio-zz commented 2 years ago

Hello I used kraken-biom to perfrom an abundance calculation from kraken2 taxonomic classification output.

My question is given a classification made from shotgun metagenomic reads. kraken2 can assign more than two reads (for this example lets say 4) that proceed from the same cell as the same species. at this point when calculating the relative abundance of a species it can mean that this species is present 4 times on the sample but in reality those 4 reads came from the same cell so the real relative abundance is 1.

How kraken-biom handles this? or is it handled under kraken2 classification?

smdabdoub commented 2 years ago

This is not something that can really be handled at the software level for either Kraken or kraken-biom. It's really just a feature of the sequencing process and the fact that Kraken uses kmers to identify organisms. There's a similar issue in amplicon (16S) sequencing. Many organisms have multiple copies of the 16S gene (Some Strep. species have up to 9). There have been attempts to normalize abundance results by copy number, but actually getting a complete database of copy numbers is not simple.

So, I would point you to Mick Watson's excellent article The Madness of Microbiome: Attempting To Find Consensus “Best Practice” for 16S Microbiome Studies. He has a couple sections in the article that discuss copy number and strategies for correction. He ends with the statement:

It is also important to note that in comparisons of OTUs between samples rather than within a sample (e.g., in comparisons of treatment effects), the impact of copy number variation is reduced, as the under- or overrepresentation of OTUs would be consistent across samples as long as the same methodology had been used.

It's not quite the same situation in metagenomics, but the principle is similar.

If you do really want to ensure you are getting one count per cell, then I would suggest you look into software like SingleM that uses a database of true Single Copy genes and runs HMMs trained on those genes against your metagenomic reads to obtain counts.