thelovelab / tximeta

Transcript quantification import with automatic metadata detection
https://thelovelab.github.io/tximeta/

Best measures for gene Fold change between pre- and post-treatment samples and for gene comparison within a sample #60

Closed mpunta closed 2 years ago

mpunta commented 2 years ago

Hi, I am new to bulk RNA-seq analysis and a bit confused about the different normalization measures that are used.

My understanding is that TPMs are fine for comparing gene expression levels within a sample. For looking at the fold change in expression of a list of genes of interest between two samples (e.g. pre and post treatment), the way to go would be to transform the estimated counts from e.g. kallisto (imported with tximport using the countsFromAbundance = "no" option) with the edgeR pipeline described in the tximport vignette. What confuses me is:

1) The values obtained at the end of the edgeR code in the vignette are called CPMs (cpms <- edgeR::cpm(y, offset = y$offset, log = FALSE) ), while to me they look more like TMM-normalized values (since they are calculated accounting both for different average transcript lengths between samples and for composition biases between samples). Am I getting this wrong?

2) The tximport vignette appears to suggest that processing the estimated ("raw") counts with the edgeR pipeline has a similar effect to using scaledTPM or lengthScaledTPM, but in fact what is obtained are very different counts. Quoting from one of your answers in (https://support.bioconductor.org/p/84883/): "These two have library size differences baked in. The column sum is equal to the number of mapped reads. So not comparable across samples." In contrast, as per 1) above, I believe that the "CPMs" obtained with the edgeR code in the vignette CAN be used for comparison across samples. Am I right?

In conclusion, which measure would you suggest using to calculate the logFC of a list of genes of interest between a pre- and a post-treatment sample? Is it the "CPMs" as obtained via the edgeR pipeline in the tximport vignette, or something else? And if the former, I guess those "CPMs" could also be used to look at how pairs of genes correlate across samples (one point per sample in a correlation plot) instead of using the TPMs, correct?
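To make the distinction in 1) and 2) concrete, here is a toy sketch in plain Python with made-up counts and gene lengths. It is not the edgeR pipeline from the vignette: it leaves out the TMM composition factors (edgeR's calcNormFactors) and the per-sample length offsets, and the gene names, numbers, and the prior of 1 in the fold change are all assumptions chosen for illustration. It only shows why raw counts carry the library size with them, while CPM and TPM each rescale every sample to one million.

```python
import math

# Made-up estimated counts for three genes in two samples (hypothetical numbers).
counts = {
    "pre":  {"geneA": 100.0, "geneB": 200.0, "geneC": 700.0},
    "post": {"geneA": 300.0, "geneB": 300.0, "geneC": 1400.0},
}
# Made-up effective gene lengths in bases, assumed identical in both samples.
lengths = {"geneA": 1000.0, "geneB": 2000.0, "geneC": 500.0}


def cpm(sample):
    """Counts per million: divides out the library size only."""
    total = sum(sample.values())
    return {gene: count / total * 1e6 for gene, count in sample.items()}


def tpm(sample, gene_lengths):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rate = {gene: count / gene_lengths[gene] for gene, count in sample.items()}
    total = sum(rate.values())
    return {gene: r / total * 1e6 for gene, r in rate.items()}


def log2_fold_change(post, pre, prior=1.0):
    """Per-gene log2(post / pre); the prior avoids division by zero."""
    return {gene: math.log2((post[gene] + prior) / (pre[gene] + prior))
            for gene in pre}


cpm_pre, cpm_post = cpm(counts["pre"]), cpm(counts["post"])
tpm_pre = tpm(counts["pre"], lengths)
lfc = log2_fold_change(cpm_post, cpm_pre)

for gene in sorted(lfc):
    print(gene, round(cpm_pre[gene]), round(cpm_post[gene]), round(lfc[gene], 3))
```

In this toy data the raw count of geneC doubles between samples, but so does the library size, so its CPM is unchanged and its log2 fold change is zero. Comparing raw counts directly would have reported a two-fold increase that only reflects sequencing depth.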

Thank you very much in advance for any help you will be able to give me.

Marco

mpunta commented 2 years ago

Apologies, this was posted under tximeta by mistake; I have now reposted it in the tximport space.