Using TPM normalised vs raw count expression data

Hi Rik,

If you use TPM count data, we still tend to use a negative binomial distribution via DESeq2 for our standard RNA-Seq analyses, and we still use glmmSeq for mixed models such as longitudinal data. The sizeFactors argument is optional. One of our team prefers TMM-normalised count data, which is generated via edgeR. TMM already includes the normalisation for library size, so when we use a TMM count matrix as input, we leave sizeFactors = NULL.

If you use fully normalised and transformed data, e.g. VST, then we have added the lmmSeq function, which fits Gaussian mixed models using lmer (much faster than glmer). You could take the TPM data, apply log2(TPM + 1), and then use lmmSeq.
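As a rough, package-free illustration of the log2(TPM + 1) step, here is a minimal sketch in Python. The function name `log2_tpm` and the example values are made up for this illustration; in practice you would apply the same transform to the whole TPM matrix in R before passing it to lmmSeq.

```python
import math

def log2_tpm(tpm_values):
    """Apply log2(TPM + 1) to a list of TPM values (hypothetical helper)."""
    # The +1 pseudocount keeps zero TPM values finite: log2(0 + 1) = 0.
    return [math.log2(v + 1.0) for v in tpm_values]

# Made-up TPM values for one gene across four samples.
example = [0.0, 1.0, 3.0, 1023.0]
print(log2_tpm(example))  # [0.0, 1.0, 2.0, 10.0]
```

The transform compresses the large dynamic range of TPM values, which is what makes a Gaussian model a more reasonable choice than it would be on the untransformed scale.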

An offset argument has been included in lmmSeq, but only for completeness. It allows the user to include an offset in the model. However, note that this is more complicated than in a GLMM. For the GLMM, the offset in each model is log(sizeFactors). Since the offset is added to the linear predictor and everything is on a log scale, this is equivalent to the counts being scaled by (or "divided by") the sizeFactor for library size. But for a Gaussian model the offset is simply added to the linear predictor on the scale of the data, and no beta coefficient is estimated for it. Note that for the GLMM, sizeFactors need to be centred around 1 (since log 1 = 0). For an LMM, the offset should be centred around 0 across the samples.
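The difference between the two kinds of offset can be checked numerically. The sketch below is plain Python rather than R so that it runs without any packages; the function names (`glmm_mean`, `lmm_mean`) and all numbers are hypothetical and only illustrate the arithmetic, not how glmmSeq or lmmSeq are implemented.

```python
import math

def glmm_mean(eta, size_factor):
    """Expected count under a log link with offset log(size_factor)."""
    # log(mu) = eta + log(s), so mu = s * exp(eta)
    return math.exp(eta + math.log(size_factor))

def lmm_mean(eta, offset):
    """Expected value under a Gaussian model: the offset is simply added."""
    return eta + offset

eta = 2.0                        # made-up linear predictor
size_factors = [0.5, 1.0, 2.0]   # made-up size factors, centred around 1
offsets = [-0.3, 0.0, 0.3]       # made-up LMM offsets, centred around 0

for s in size_factors:
    # Dividing the expected count by the size factor recovers exp(eta).
    assert math.isclose(glmm_mean(eta, s) / s, math.exp(eta))

# A size factor of 1 gives a log offset of 0 and leaves the mean unchanged.
assert math.isclose(glmm_mean(eta, 1.0), math.exp(eta))

# With offsets centred on 0, the average fitted value is still eta.
fitted = [lmm_mean(eta, o) for o in offsets]
assert math.isclose(sum(fitted) / len(fitted), eta)
```

In the log-link case the offset acts multiplicatively on the counts, whereas in the Gaussian case it shifts the response additively, which is why the two need to be centred around 1 and 0 respectively.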

So there will be mathematical situations where some users do want to manipulate the offset for a linear mixed model, but hopefully these are power users who are aware of the consequences!

Bw, Myles

myles-lewis / glmmSeq #20