myles-lewis / glmmSeq

Gene-level general linear mixed model
https://myles-lewis.github.io/glmmSeq/

Using TPM normalised vs raw count expression data #20

Closed RikLindeboom closed 2 years ago

RikLindeboom commented 2 years ago

Hello,

Great package, many thanks for making this! I had a question regarding the example in the vignette provided. The example uses TPM-normalised expression data, and with this data it calculates size factors and dispersions and fits a negative binomial (NB) model.

I was wondering if this is the recommended way of using this package? A negative binomial distribution can be unsuitable for TPM-normalised data according to some (see for example https://biostars.org/p/471335/ ), and I'm confused about why it would still be useful to calculate size factors on normalised data.

Thanks in advance. Best wishes, Rik

myles-lewis commented 2 years ago

Hi Rik,

If you use TPM count data, then we still tend to use a negative binomial distribution via DESeq2 for our standard RNA-Seq analyses, and we still use glmmSeq for mixed models such as longitudinal data. The sizeFactors argument is optional. One of our team prefers using TMM count data, which is generated via edgeR. TMM already includes the normalisation for library size, so when we use a TMM count data matrix as input, we leave sizeFactors = NULL.

If you use fully normalised and transformed data, e.g. VST, then we have added the lmmSeq function, which fits Gaussian mixed models using lmer (much faster than glmer). You could take the TPM data, apply log2(TPM + 1), and then use lmmSeq.
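As a rough illustration of the log2(TPM + 1) step mentioned above, here is a minimal Python sketch. It is not glmmSeq code (the package itself is R), and the gene names and TPM values are made up purely for the example. It only shows the transformation that would be applied to each entry of the expression matrix before passing it to a Gaussian model.

```python
import math

# Hypothetical TPM values: keys are genes, list entries are samples.
tpm = {
    "geneA": [0.0, 3.0, 15.0],
    "geneB": [1.0, 7.0, 1023.0],
}

def log2_plus_one(values):
    """Apply log2(x + 1) to each value; the +1 keeps zero TPM finite."""
    return [math.log2(v + 1.0) for v in values]

# Transformed matrix on an approximately Gaussian-friendly log scale.
log_tpm = {gene: log2_plus_one(vals) for gene, vals in tpm.items()}

for gene, vals in log_tpm.items():
    print(gene, vals)
```

The pseudocount of 1 maps a TPM of zero to zero on the log scale, which is why it is a common choice for this kind of transformation.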

An offset argument has been included with lmmSeq, but only for completeness. This allows the user to include an offset in the model. However, note that this is more complicated than for a GLMM. For the GLMM, the offset in each model is log(sizeFactors). Since the offset is added to the linear predictor and everything is on a log scale, this is equivalent to the counts being scaled by (or "divided by") the sizeFactor for library size. But for a Gaussian model the offset is simply added to the linear predictor on the scale of the data, and it does not have a beta coefficient. Note that for the GLMM, sizeFactors needs to be centred around 1 (since log 1 = 0). For an LMM, the offset should be centred around 0 across the samples.
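The offset argument above can be checked numerically. The following Python sketch is only an arithmetic illustration under assumed values, not how glmmSeq or lmmSeq is implemented: the linear predictor and size factors are hypothetical numbers for a single gene in three samples. It shows that with a log link, adding log(sizeFactor) as an offset is the same as dividing the expected count by the size factor, and that size factors centred around 1 correspond to log offsets centred around 0.

```python
import math

# Hypothetical values for one gene in three samples (assumptions for illustration).
linear_predictor = [1.2, 1.5, 0.9]  # fixed + random effects, on the log scale
size_factors = [0.8, 1.0, 1.25]     # library-size factors with geometric mean 1

# GLMM with log link: log(mu) = linear_predictor + log(size_factor)
mu = [math.exp(eta + math.log(s)) for eta, s in zip(linear_predictor, size_factors)]

# Dividing the expected count by the size factor recovers exp(linear_predictor),
# i.e. the offset acts as a per-sample scaling of the counts.
scaled = [m / s for m, s in zip(mu, size_factors)]
expected = [math.exp(eta) for eta in linear_predictor]

# On the log scale, size factors centred around 1 give offsets centred around 0.
log_offsets = [math.log(s) for s in size_factors]
mean_offset = sum(log_offsets) / len(log_offsets)

# For a Gaussian model the offset is added directly, so centre it explicitly.
centred_offsets = [o - mean_offset for o in log_offsets]

print(scaled, expected, mean_offset)
```

In the Gaussian case there is no exponentiation step, so the offset shifts the fitted values additively rather than rescaling them, which is the reason the centring convention changes from 1 to 0.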

So there will be some mathematical situations where some users do want to manipulate the offset for a linear mixed model, but hopefully these are power users who are aware of the consequences!

Bw, Myles