jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

Compare gene proportions among samples #62

Closed fconstancias closed 4 years ago

fconstancias commented 4 years ago

What would be the best strategy to compare gene proportions among different samples? Can I directly compare TPM values, or should I subsample the count values (e.g., KEGG abundance data) to the same number of sequences per sample and compare those 'rarefied' values?

Should I take into consideration the samples' mapping rates against the coassembly?

fpusan commented 4 years ago

TL;DR: it's complicated but don't rarefy.

No need to rarefy, and some authors would be strongly against it. We normally don't consider mapping rates against the coassembly: as long as the mapping rate for a sample is ca. 70% or more, we consider that sample "good enough". In the end it comes down to how you want to treat unclassified reads in your analysis.

TPM can be directly compared among different samples, and we routinely use it for preliminary analyses and visualization. Note, however, that sequencing data are compositional and TPM does not correct for this (see e.g. https://www.frontiersin.org/articles/10.3389/fmicb.2017.02224/full). Some analysis methods are more or less robust to this issue (e.g. Spearman correlations are more robust than Pearson correlations, but methods that address the problem explicitly, such as SparCC, are always a safer bet). Most statistical methods intended for the analysis of sequencing data (SparCC, DESeq2...) work with raw sequencing counts and normalize the data internally. The centered log-ratio (CLR) transformation is also becoming popular for transforming raw compositional abundance matrices into a Euclidean space, which in theory should render them amenable to analysis with common statistical methods such as PCA. Still, its use is also contested (https://stats.stackexchange.com/questions/305965/can-i-use-the-clr-centered-log-ratio-transformation-to-prepare-data-for-pca).
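For illustration, here is a minimal Python/numpy sketch of the centered log-ratio transformation mentioned above (this is not part of SqueezeMeta; the function name `clr`, the pseudocount handling, and the toy matrix are assumptions made for the example):

```python
import numpy as np

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of a samples-by-features count matrix.

    A pseudocount is added so that zero counts do not produce -inf.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    logx = np.log(x)
    # Subtract each sample's mean log-abundance
    # (i.e. the log of its geometric mean)
    return logx - logx.mean(axis=1, keepdims=True)

# Hypothetical example: 2 samples x 3 features
table = np.array([[10,  0, 90],
                  [ 5,  5, 40]])
print(clr(table))
```

How to handle zeros (pseudocount, multiplicative replacement, etc.) is itself debated, which is part of why the CLR approach remains contested.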

While we are confident that SqueezeMeta provides a reasonable "one-size-fits-all" pipeline for metagenomic analysis, we can't say the same about the statistical analyses that come afterwards. When working with our own data, we decide what to do on a case-by-case basis, after carefully considering the literature. As such, we do not make strong recommendations on how to perform them.

fconstancias commented 4 years ago

Hi all,

Thanks again for developing this fantastic pipeline. Regarding the statement that "TPM can be directly compared among different samples":

TPM (transcripts per million) seems a bit misleading in the context of a DNA-based gene catalog. Shouldn't we use counts per million (CPM), i.e., mapped reads per million?

Best

fpusan commented 4 years ago

No, because TPM is not actually the number of counts from that gene that you observe per million counts; unlike CPM, it also accounts for gene length.

See this excerpt from our recently published paper (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03703-2)

The TPM (transcripts per million) metric was introduced by Wagner et al. [14] as an improved way to account for gene length and sequencing depth in transcriptomic experiments: we find it equally useful in metagenomics. The TPM of a feature (be it a transcript, a gene or a functional category) is the number of times that we would find that feature when randomly sampling 1 million features, given the abundances of the different features in our sample. [...]. For the sake of being consistent with previous works, we maintain the nomenclature "TPM", even when we use it to measure the abundances of features other than transcripts.

So, strictly speaking, we could use the more generic FPM (features per million), but we find it less confusing to stick to the existing nomenclature.
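To make the CPM vs. TPM distinction concrete, here is a minimal Python/numpy sketch based on the definition quoted above (the function names and the toy counts/lengths are hypothetical and not taken from SqueezeMeta's code):

```python
import numpy as np

def cpm(counts):
    """Counts per million: raw counts scaled by library size only."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum() * 1e6

def tpm(counts, lengths):
    """TPM-style metric: counts are first divided by feature length,
    then scaled so that all features sum to one million."""
    counts = np.asarray(counts, dtype=float)
    rate = counts / np.asarray(lengths, dtype=float)  # length-normalized counts
    return rate / rate.sum() * 1e6

# Hypothetical example: two genes with equal read counts but different lengths
counts = np.array([100, 100])
lengths = np.array([1000, 2000])
print(cpm(counts))           # both genes get the same CPM
print(tpm(counts, lengths))  # the shorter gene gets a higher TPM
```

With equal read counts, CPM assigns both genes the same value, whereas the length-corrected value is higher for the shorter gene, reflecting its larger per-base coverage.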