myles-lewis / glmmSeq

Gene-level general linear mixed model
https://myles-lewis.github.io/glmmSeq/

vector memory exhausted issue #31

Open leqi0001 opened 1 year ago

leqi0001 commented 1 year ago

Hi,

Thanks for developing this package!

I'm following the vignette and trying to run glmmSeq on a relatively small dataset (26 samples * 20k genes). The 26 samples are 13 pairs, modelled as a random effect (1|individual). If I use the model ~disease+(1|condition)+covar1+covar2+covar3+covar4, R gives me Error: cannot allocate vector of size 6223.5 Gb. It runs fine if I remove one fixed-effect variable. It wouldn't run on an HPC either, and I suppose no machine can allocate a vector of that size.
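
For reference, the call is along these lines (a sketch roughly following the vignette; counts, meta and disp stand for my count matrix, metadata and dispersion objects):

```r
# Sketch of the glmmSeq call; `counts` (20k genes x 26 samples), `meta`
# (per-sample covariates) and `disp` (per-gene dispersions) are placeholders.
library(glmmSeq)

results <- glmmSeq(~ disease + (1 | condition) + covar1 + covar2 + covar3 + covar4,
                   countdata  = counts,
                   metadata   = meta,
                   dispersion = disp,
                   cores      = 1)
```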

myles-lewis commented 1 year ago

Hi leqi0001,

Thanks. I haven't seen this error before. I suggest you try to isolate the issue as follows:

  1. Take a column of data for just 1 gene
  2. Apply log2(x + 1) so that it is closer to gaussian
  3. Add your metadata

Fit your model using fit <- lme4::lmer(formula, data), where your formula is of the form gene ~ disease+(1|condition)+covar1+covar2+covar3+covar4

Examine the result using summary(fit). See if this works on a single gene. If it does, move on to the negative binomial model: fit <- lme4::glmer(formula, data, family = MASS::negative.binomial(theta = 1/disp))
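
A minimal sketch of steps 1-3 and the gaussian fit, assuming a count matrix counts (genes x samples), a metadata data frame metadata, and a hypothetical gene ID "GENE1":

```r
# Sketch only: `counts`, `metadata` and "GENE1" are assumed/hypothetical names.
library(lme4)

one_gene <- data.frame(
  gene = log2(as.numeric(counts["GENE1", ]) + 1),  # step 2: log2(x + 1) transform
  metadata                                         # step 3: attach the sample metadata
)

# Gaussian mixed model for a single gene
fit <- lmer(gene ~ disease + (1 | condition) + covar1 + covar2 + covar3 + covar4,
            data = one_gene)
summary(fit)
```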

Try fixing the dispersion disp to a simple value, e.g. 1, which makes the model simpler, as it is then essentially a Poisson model. This time you'll need to provide count data, not gaussian data: count ~ disease+(1|condition)+covar1+covar2+covar3+covar4
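
A sketch of that check, reusing the single-gene data frame from above and an assumed fixed dispersion of 1:

```r
# Sketch only: reuses the `one_gene` data frame and `counts` from the previous block.
library(lme4)

disp <- 1  # assumed fixed dispersion for this test
one_gene$count <- as.numeric(counts["GENE1", ])  # raw counts, not log2-transformed

fit_nb <- glmer(count ~ disease + (1 | condition) + covar1 + covar2 + covar3 + covar4,
                data = one_gene,
                family = MASS::negative.binomial(theta = 1 / disp))
summary(fit_nb)
```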

This way you will find out whether a mixed model of such a magnitude is feasible.

I suspect the model is too large. Mixed models get big quickly because in essence there's a regression for each 'individual' or random effect level.

Best, Myles