saezlab / decoupleR

R package to infer biological activities from omics data using a collection of methods.
https://saezlab.github.io/decoupleR/
GNU General Public License v3.0
176 stars, 23 forks

How to deal with gene readouts that are biased by gene length? #103

Closed. jgoldmann closed this issue 7 months ago

jgoldmann commented 7 months ago

Hi,

Thanks a lot for putting together this package! One thing I am wondering about: when analyzing gene expression, one typically starts from per-gene read counts, since these are the usual input for differential expression calling with DESeq2. That is, the table would look like this:

gene,        expression
long_gene_A,          5
long_gene_B,        500
short_gene_C,         5
short_gene_D,       500

A side effect is that expression measured as raw counts is biased by gene length: 500 reads from a very short gene mean something very different from 500 reads from a long gene. How does decoupleR deal with this? Does it expect expression values that are already corrected for gene length?

PauBadiaM commented 7 months ago

Hi @jgoldmann,

decoupleR is agnostic to upstream preprocessing choices; we leave those to the user. If you think gene length might play a big role in your data, I would check whether and how DESeq2, or any other differential expression analysis (DEA) framework, deals with it. Hope this is helpful!

jgoldmann commented 7 months ago

Thank you. In that case I will not use read counts but a derived measure that is corrected for gene length, such as RPKM or TPM.
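For illustration, here is a minimal Python sketch of a TPM conversion applied to the example table from the question. It assumes a single sample. The gene lengths are hypothetical, since the thread does not give any, and `counts_to_tpm` is an illustrative helper, not a function from decoupleR or DESeq2.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts for one sample to TPM (transcripts per million).

    counts:     dict mapping gene name -> raw read count
    lengths_bp: dict mapping gene name -> gene length in base pairs
    """
    # Step 1: divide each count by gene length in kilobases (reads per kilobase).
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    # Step 2: rescale so that the values of the sample sum to one million.
    per_million = sum(rpk.values()) / 1e6
    return {gene: value / per_million for gene, value in rpk.items()}


# Counts taken from the table in the question.
counts = {
    "long_gene_A": 5,
    "long_gene_B": 500,
    "short_gene_C": 5,
    "short_gene_D": 500,
}

# Hypothetical lengths (assumption): long genes 10 kb, short genes 1 kb.
lengths_bp = {
    "long_gene_A": 10_000,
    "long_gene_B": 10_000,
    "short_gene_C": 1_000,
    "short_gene_D": 1_000,
}

tpm = counts_to_tpm(counts, lengths_bp)
for gene, value in tpm.items():
    print(f"{gene}: {value:.1f}")
```

Under these assumed lengths, the two genes with 500 reads no longer get the same value: the short gene ends up ten times higher than the long one, which is the length effect described in the question.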