Experimental design - fixed effects model

Iwo-K commented 3 years ago

Thank you very much for tackling the problems associated with differential expression for scRNA-Seq experiment in your paper!

I have a question regarding your suggestion about using MAST with fixed-effects, from this paragraph:

"While we recommend computing differential expression analysis using MAST with RE, alternative methods include using MAST with fixed-effects for individual, a tweedie GLMM, or permutation testing. Accounting for the within-sample correlation with a fixed-effect term for individual, will have a slight difference in interpretation, but should be considered an alternative option to random effects — particularly when the number of independent experimental units is modest."

I will use the scRNA-Seq dataset I am currently trying to analyse as an example. The design is as follows:	condition	mouse_id
control	Mouse1
control	Mouse2
control	Mouse3
treated	Mouse4
treated	Mouse5
treated	Mouse6

Each Mouse with approx 150 single cells sampled.

If I understood correctly your approach, you are suggesting to fit a random effects model, like this:

zlm(~condition + ngenes_scaled + (1 | mouse_id), 
        sca,
        method='glmer',
        ebayes=FALSE,
        strictConvergence = FALSE, 
        exprs_value = 'logcounts')

In my experiment there are only 3 animals, so I wasn't sure whether this is enough to fit the random effects model. You mentioned in your paper to consider a fixed effect model instead when the number of experimental units is small.
Would you mind explaining how the model with fixed effect should look like?

kdzimm commented 3 years ago

@Iwo-K Thanks for the kind remark and for the thoughtful question. I would think that either a mixed or fixed effects model will perform comparably here. There is an “asymptotic” assumption for random effects that is a stretch to make with 3 samples per group, so it might be worth running both to determine if they lead to similar stories. Overall, fixed effects might be more stable. To do that, just moved "mouse_id" outside of the "(1 | mouse_id)" random effect convention and move glmer to glm. The ebayes and strictConvergence statements will probably no longer be necessary too. Just make sure you are specifying "condition" in the LRT statement downstream for both scenarios.

I hope this helps!

Iwo-K commented 3 years ago

@kdzimm Thank you for answering to the question so quickly.

I've tested a few models. In my case the differences in gene expression are expected to be quite small (and indeed are), so I am seeking for the most sensitive method, that is still statistically sound.

In case anyone is interested in the outcome. For the zlm models I used the summary(model, doLRT='codnitiontreated') to perform the test, as suggested above. I compared:

Simply treating all cells as replicates (if I understand correctly this is not recommended): m1 = zlm(~ condition + n_genes_scaled, sca)

With fixed effect to account for mouse effects: m2 = zlm(~ condition + n_genes_scaled + mouse_id, sca)

Pseudobulk analysis with ~condition formula. pb - pseudobulk analisys with edgeR with expression profiles summed up for cells from each mouse

Venn diagram of DE genes: The overlap between m2 and pb is modest, I suspect this is because of the lack of power.

I've also checked the outcome of running the random effect model (which causes some convergence warnings but manages to run for most genes): m3 = zlm(~time + ngenes_scaled + (1 | mouse_id), sca, method='glmer', ebayes=FALSE, strictConvergence = FALSE)

This yields only 10 differentially expressed genes, but 7 of them are common with the fixed-effect model (m2).

Thanks a lot for the help!

kdzimm commented 3 years ago

@Iwo-K Really nice, thorough work. Here are some thoughts:

1) You're exactly right. m1 is pseudoreplication. These should not be used.

2) It is interesting that the pseudobulk method pulled out more genes. I'd be interested to see if you did a boxplot of those gene expression values for each cell split out by each sample on the x axis what those genes (the ones not significant in m2, but significant in pb) would look like. I'd wager that the aggregation might be slightly biasing your results if there are more cells in your treatment group than in your control group. For example, imagine if you had a mean of 140 cells in each of your control samples, but a mean of 160 cells in each of your treatment samples. Even if each cell has identical expression of 30, the sum of the expression values will all hover around 480 in your treatment group and 420 in your control group, which appears different but it is only just because there are more cells per sample. This is an extreme example, of course, but I hope it gets you to understand the issue of summation. This could also happen if the cells that are expressing show large values and there are slight differences in the proportion of expressed/not-expressed between conditions (MAST's hurdle model should pick these up if the difference is truly significant). Using these results would be reasonable, but I would look at some of the boxplots first to make sure what you are finding is real and not due to differences in the number of expressing cells that were captured. And if the result is due to differences in the number of expressing cells, just check that the difference in the number of expressing cells in each condition is, in fact, significant. I would think it won't be, otherwise MAST's hurdle model should have picked it up.

3) Personally, if the fixed-effect model is more sensitive and you aren't getting the convergence warnings, I would probably stick with that. But I don't think you can go wrong with either m2 or m3.

I am going to close this thread, but if you have any more questions, please feel free to reopen this.

Take care,

Kip

Iwo-K commented 3 years ago

Thanks a lot for the helpful suggestions, I will have a closer look at the DE genes considering the cell proportions as well!

All the best, Iwo

kdzimm / PseudoreplicationPaper

Experimental design - fixed effects model #3