Closed esrasefik closed 3 years ago
We currently do not support this, but you may try altering our MAST wrapper to incorporate a random effect. The hurdle model is specified by the formula
parameter in the zlm
function, called in MASTDETest
:
zlmCond <- MAST::zlm(formula = fmla, sca = sca, ...)
@esrasefik just wondering if you've gotten this to work, and if you wouldn't mind sharing your script? I am trying to do the same with my differential expression analysis.
@jgamache014 - I unfortunately haven't been able to. I decided to switch to using MAST directly instead of using the Seurat interface given the little support the developers provided for this issue. I don't yet have a solution I can share - still working on it. Sorry!
@esrasefik Thanks, I'll try using MAST directly as well!
@jgamache014 @esrasefik have any of you succeeded? If not, what have been the stumbling blocks? I'd like to have a crack at it.
This clunky solution will add this functionality to FindMarkers
by making minor changes to two internal functions in Seurat and MAST. I haven't tested this extensively, but it seems to be working as expected so far. It's helpful for exploratory analyses when you only have a seurat object. However, I'd strongly suggest going through the steps in the MAST vignette (converting Seurat to SummarizedExperiment, etc) if you are planning to use this for important analyses.
Note that it doesn't allow using empirical Bayes estimation to regularize variance, and it's considerably slower.
# We want a different behavior from MASTDETest, where the latent variable is
# used as a random effect in the model
rem_MASTDETest <- function(
data.use,
cells.1,
cells.2,
latent.vars = NULL,
verbose = TRUE,
# New option - random effect variable (should be included in latent.vars)
re.var = NULL,
...
) {
# Check for MAST
if (!PackageCheck('MAST', error = FALSE)) {
stop("Please install MAST - learn more at https://github.com/RGLab/MAST")
}
group.info <- data.frame(row.names = c(cells.1, cells.2))
latent.vars <- latent.vars %||% group.info
group.info[cells.1, "group"] <- "Group1"
group.info[cells.2, "group"] <- "Group2"
group.info[, "group"] <- factor(x = group.info[, "group"])
latent.vars.names <- c("condition", colnames(x = latent.vars))
latent.vars <- cbind(latent.vars, group.info)
latent.vars$wellKey <- rownames(x = latent.vars)
fdat <- data.frame(rownames(x = data.use))
colnames(x = fdat)[1] <- "primerid"
rownames(x = fdat) <- fdat[, 1]
sca <- MAST::FromMatrix(
exprsArray = as.matrix(x = data.use),
check_sanity = FALSE,
cData = latent.vars,
fData = fdat
)
cond <- factor(x = SummarizedExperiment::colData(sca)$group)
cond <- relevel(x = cond, ref = "Group1")
SummarizedExperiment::colData(sca)$condition <- cond
# This is the main change in the code - we want ~ ( 1 | re.var) in the formula:
# ~ condition + lat.vars + (1 | re.var)
if (!is.null(re.var)) {
if (!re.var %in% latent.vars.names) {
stop("Random effect variable should be included in latent variables!")
}
latent.vars.names <- latent.vars.names[!latent.vars.names %in% re.var]
fmla <- as.formula(
object = paste0(
" ~ ", paste(latent.vars.names, collapse = "+"), glue("+ (1|{re.var})")
)
)
# print(fmla)
# We need glmer to make this work
method <- "glmer" # trying to troubleshoot this - it can clash with the already existing method var in the function call
zlmCond <- MAST::zlm(formula = fmla, sca = sca, method = "glmer", ...)
} else {
# Original code
fmla <- as.formula(
object = paste0(" ~ ", paste(latent.vars.names, collapse = "+"))
)
zlmCond <- MAST::zlm(formula = fmla, sca = sca, ...)
}
summaryCond <- MAST::summary(object = zlmCond, doLRT = 'conditionGroup2')
summaryDt <- summaryCond$datatable
# The output format is slightly different, so we need adapt the code
if(!is.null(re.var)) {
p_val <- summaryDt[summaryDt$"component" == "H", 4]$`Pr(>Chisq)`
genes.return <- summaryDt[summaryDt$"component" == "H", 1]$primerid
} else {
p_val <- summaryDt[summaryDt[, "component"] == "H", 4]
genes.return <- summaryDt[summaryDt[, "component"] == "H", 1]
}
to.return <- data.frame(p_val, row.names = genes.return)
return(to.return)
}
# This is the zlm function with a minor change to fix a buggy variable being passed possibly due to namespace clash
my_zlm <- function (formula, sca, method = "bayesglm", silent = TRUE, ebayes = TRUE,
ebayesControl = NULL, force = FALSE, hook = NULL, parallel = TRUE,
LMlike, onlyCoef = FALSE, exprs_values = MAST::assay_idx(sca)$aidx,
...)
{
dotsdata = list(...)$data
if (!is.null(dotsdata)) {
if (!missing(sca))
stop("Cannot provide both `sca` and `data`")
sca = dotsdata
}
if (!inherits(sca, "SingleCellAssay")) {
if (inherits(sca, "data.frame")) {
if (!is.null(dotsdata)) {
return(.zlm(formula, method = method, silent = silent,
...))
}
else {
return(.zlm(formula, data = sca, method = method,
silent = silent, ...))
}
}
else {
stop("`sca` must inherit from `data.frame` or `SingleCellAssay`")
}
}
if (missing(LMlike)) {
method <- match.arg(method, MAST:::methodDict[, keyword])
method <- MAST:::methodDict[keyword == method, lmMethod]
if (!is(sca, "SingleCellAssay"))
stop("'sca' must be (or inherit) 'SingleCellAssay'")
if (!is(formula, "formula"))
stop("'formula' must be class 'formula'")
Formula <- MAST:::removeResponse(formula)
priorVar <- 1
priorDOF <- 0
if (ebayes) {
# if (!methodDict[lmMethod == method, implementsEbayes])
if (!all(MAST:::methodDict[lmMethod == method, implementsEbayes]))
stop("Method", method, " does not implement empirical bayes variance shrinkage.")
ebparm <- MAST:::ebayes(t(SummarizedExperiment:::assay(sca, exprs_values)), ebayesControl,
model.matrix(Formula, SummarizedExperiment::colData(sca)))
priorVar <- ebparm[["v"]]
priorDOF <- ebparm[["df"]]
stopifnot(all(!is.na(ebparm)))
}
obj <- MAST:::new_with_repaired_slots(classname = method, design = SummarizedExperiment::colData(sca),
formula = Formula, priorVar = priorVar, priorDOF = priorDOF,
extra = list(...))
}
else {
if (!missing(formula))
warning("Ignoring formula and using model defined in 'objLMLike'")
if (!inherits(LMlike, "LMlike"))
stop("'LMlike' must inherit from class 'LMlike'")
obj <- LMlike
}
ee <- t(SummarizedExperiment:::assay(sca, exprs_values))
genes <- colnames(ee)
ng <- length(genes)
MM <- MAST:::model.matrix(obj)
coefNames <- colnames(MM)
listEE <- setNames(seq_len(ng), genes)
obj <- MAST:::fit(obj, ee[, 1], silent = silent)
nerror <- totalerr <- 0
pb = progress::progress_bar$new(total = ng, format = " Completed [:bar] :percent with :err failures")
.fitGeneSet <- function(idx) {
hookOut <- NULL
tt <- try({
obj <- MAST:::fit(obj, response = ee[, idx], silent = silent,
quick = TRUE)
if (!is.null(hook))
hookOut <- hook(obj)
nerror <- 0
})
if (is(tt, "try-error")) {
obj@fitC <- obj@fitD <- NULL
obj@fitted <- c(C = FALSE, D = FALSE)
nerror <- nerror + 1
totalerr = totalerr + 1
if (nerror > 5 & !force) {
stop("We seem to be having a lot of problems here...are your tests specified correctly? \n If you're sure, set force=TRUE.",
tt)
}
}
pb$tick(tokens = list(err = totalerr))
if (onlyCoef)
return(cbind(C = coef(obj, "C"), D = coef(obj, "D")))
summaries <- MAST:::summarize(obj)
structure(summaries, hookOut = hookOut)
}
if (!parallel || getOption("mc.cores", 1L) == 1) {
listOfSummaries <- lapply(listEE, .fitGeneSet)
}
else {
listOfSummaries <- parallel::mclapply(listEE, .fitGeneSet,
mc.preschedule = TRUE, mc.silent = silent)
}
if (onlyCoef) {
out <- do.call(abind, c(listOfSummaries, rev.along = 0))
return(aperm(out, c(3, 1, 2)))
}
cls <- sapply(listOfSummaries, function(x) class(x))
complain <- if (force)
warning
else stop
if (mean(cls == "try-error") > 0.5)
complain("Lots of errors here..something is amiss.")
hookOut <- NULL
if (!is.null(hook))
hookOut <- lapply(listOfSummaries, attr, which = "hookOut")
message("\nDone!")
summaries <- MAST:::collectSummaries(listOfSummaries)
summaries[["LMlike"]] <- obj
summaries[["sca"]] <- sca
summaries[["priorVar"]] <- obj@priorVar
summaries[["priorDOF"]] <- obj@priorDOF
summaries[["hookOut"]] <- hookOut
summaries[["exprs_values"]] <- exprs_values
summaries[["Class"]] <- "ZlmFit"
zfit <- do.call(new, as.list(summaries))
zfit
}
# We will replace the original function with ours inside the Seurat namespace
assignInNamespace('MASTDETest', rem_MASTDETest, asNamespace("Seurat"))
# getFromNamespace("MASTDETest", "Seurat")
assignInNamespace('zlm', my_zlm, asNamespace("MAST"))
# getFromNamespace("zlm", "MAST")
# And now we can call our FindMarkers with random effects model for sample_ID
# Don't forget to include ebayes = F
markers <- FindMarkers(
sobj, ident.1 = "foo", ident.2 = "bar", only.pos = TRUE,
test.use = "MAST", verbose = F
, latent.vars = "sample_ID", re.var = "sample_ID", ebayes = F
)
@rdalbanus thanks for sharing!
I also wrote a function - it's a while back now, but here it is: https://github.com/CBMR-Single-Cell-Omics-Platform/SCOPfunctions/blob/main/R/differential_expression.R
This clunky solution will add this functionality to
FindMarkers
by making minor changes to two internal functions in Seurat and MAST. I haven't tested this extensively, but it seems to be working as expected so far. It's helpful for exploratory analyses when you only have a seurat object. However, I'd strongly suggest going through the steps in the MAST vignette (converting Seurat to SummarizedExperiment, etc) if you are planning to use this for important analyses.Note that it doesn't allow using empirical Bayes estimation to regularize variance, and it's considerably slower.
# We want a different behavior from MASTDETest, where the latent variable is # used as a random effect in the model rem_MASTDETest <- function( data.use, cells.1, cells.2, latent.vars = NULL, verbose = TRUE, # New option - random effect variable (should be included in latent.vars) re.var = NULL, ... ) { # Check for MAST if (!PackageCheck('MAST', error = FALSE)) { stop("Please install MAST - learn more at https://github.com/RGLab/MAST") } group.info <- data.frame(row.names = c(cells.1, cells.2)) latent.vars <- latent.vars %||% group.info group.info[cells.1, "group"] <- "Group1" group.info[cells.2, "group"] <- "Group2" group.info[, "group"] <- factor(x = group.info[, "group"]) latent.vars.names <- c("condition", colnames(x = latent.vars)) latent.vars <- cbind(latent.vars, group.info) latent.vars$wellKey <- rownames(x = latent.vars) fdat <- data.frame(rownames(x = data.use)) colnames(x = fdat)[1] <- "primerid" rownames(x = fdat) <- fdat[, 1] sca <- MAST::FromMatrix( exprsArray = as.matrix(x = data.use), check_sanity = FALSE, cData = latent.vars, fData = fdat ) cond <- factor(x = SummarizedExperiment::colData(sca)$group) cond <- relevel(x = cond, ref = "Group1") SummarizedExperiment::colData(sca)$condition <- cond # This is the main change in the code - we want ~ ( 1 | re.var) in the formula: # ~ condition + lat.vars + (1 | re.var) if (!is.null(re.var)) { if (!re.var %in% latent.vars.names) { stop("Random effect variable should be included in latent variables!") } latent.vars.names <- latent.vars.names[!latent.vars.names %in% re.var] fmla <- as.formula( object = paste0( " ~ ", paste(latent.vars.names, collapse = "+"), glue("+ (1|{re.var})") ) ) # print(fmla) # We need glmer to make this work method <- "glmer" # trying to troubleshoot this - it can clash with the already existing method var in the function call zlmCond <- MAST::zlm(formula = fmla, sca = sca, method = "glmer", ...) } else { # Original code fmla <- as.formula( object = paste0(" ~ ", paste(latent.vars.names, collapse = "+")) ) zlmCond <- MAST::zlm(formula = fmla, sca = sca, ...) } summaryCond <- MAST::summary(object = zlmCond, doLRT = 'conditionGroup2') summaryDt <- summaryCond$datatable # The output format is slightly different, so we need adapt the code if(!is.null(re.var)) { p_val <- summaryDt[summaryDt$"component" == "H", 4]$`Pr(>Chisq)` genes.return <- summaryDt[summaryDt$"component" == "H", 1]$primerid } else { p_val <- summaryDt[summaryDt[, "component"] == "H", 4] genes.return <- summaryDt[summaryDt[, "component"] == "H", 1] } to.return <- data.frame(p_val, row.names = genes.return) return(to.return) } # This is the zlm function with a minor change to fix a buggy variable being passed possibly due to namespace clash my_zlm <- function (formula, sca, method = "bayesglm", silent = TRUE, ebayes = TRUE, ebayesControl = NULL, force = FALSE, hook = NULL, parallel = TRUE, LMlike, onlyCoef = FALSE, exprs_values = MAST::assay_idx(sca)$aidx, ...) { dotsdata = list(...)$data if (!is.null(dotsdata)) { if (!missing(sca)) stop("Cannot provide both `sca` and `data`") sca = dotsdata } if (!inherits(sca, "SingleCellAssay")) { if (inherits(sca, "data.frame")) { if (!is.null(dotsdata)) { return(.zlm(formula, method = method, silent = silent, ...)) } else { return(.zlm(formula, data = sca, method = method, silent = silent, ...)) } } else { stop("`sca` must inherit from `data.frame` or `SingleCellAssay`") } } if (missing(LMlike)) { method <- match.arg(method, MAST:::methodDict[, keyword]) method <- MAST:::methodDict[keyword == method, lmMethod] if (!is(sca, "SingleCellAssay")) stop("'sca' must be (or inherit) 'SingleCellAssay'") if (!is(formula, "formula")) stop("'formula' must be class 'formula'") Formula <- MAST:::removeResponse(formula) priorVar <- 1 priorDOF <- 0 if (ebayes) { # if (!methodDict[lmMethod == method, implementsEbayes]) if (!all(MAST:::methodDict[lmMethod == method, implementsEbayes])) stop("Method", method, " does not implement empirical bayes variance shrinkage.") ebparm <- MAST:::ebayes(t(SummarizedExperiment:::assay(sca, exprs_values)), ebayesControl, model.matrix(Formula, SummarizedExperiment::colData(sca))) priorVar <- ebparm[["v"]] priorDOF <- ebparm[["df"]] stopifnot(all(!is.na(ebparm))) } obj <- MAST:::new_with_repaired_slots(classname = method, design = SummarizedExperiment::colData(sca), formula = Formula, priorVar = priorVar, priorDOF = priorDOF, extra = list(...)) } else { if (!missing(formula)) warning("Ignoring formula and using model defined in 'objLMLike'") if (!inherits(LMlike, "LMlike")) stop("'LMlike' must inherit from class 'LMlike'") obj <- LMlike } ee <- t(SummarizedExperiment:::assay(sca, exprs_values)) genes <- colnames(ee) ng <- length(genes) MM <- MAST:::model.matrix(obj) coefNames <- colnames(MM) listEE <- setNames(seq_len(ng), genes) obj <- MAST:::fit(obj, ee[, 1], silent = silent) nerror <- totalerr <- 0 pb = progress::progress_bar$new(total = ng, format = " Completed [:bar] :percent with :err failures") .fitGeneSet <- function(idx) { hookOut <- NULL tt <- try({ obj <- MAST:::fit(obj, response = ee[, idx], silent = silent, quick = TRUE) if (!is.null(hook)) hookOut <- hook(obj) nerror <- 0 }) if (is(tt, "try-error")) { obj@fitC <- obj@fitD <- NULL obj@fitted <- c(C = FALSE, D = FALSE) nerror <- nerror + 1 totalerr = totalerr + 1 if (nerror > 5 & !force) { stop("We seem to be having a lot of problems here...are your tests specified correctly? \n If you're sure, set force=TRUE.", tt) } } pb$tick(tokens = list(err = totalerr)) if (onlyCoef) return(cbind(C = coef(obj, "C"), D = coef(obj, "D"))) summaries <- MAST:::summarize(obj) structure(summaries, hookOut = hookOut) } if (!parallel || getOption("mc.cores", 1L) == 1) { listOfSummaries <- lapply(listEE, .fitGeneSet) } else { listOfSummaries <- parallel::mclapply(listEE, .fitGeneSet, mc.preschedule = TRUE, mc.silent = silent) } if (onlyCoef) { out <- do.call(abind, c(listOfSummaries, rev.along = 0)) return(aperm(out, c(3, 1, 2))) } cls <- sapply(listOfSummaries, function(x) class(x)) complain <- if (force) warning else stop if (mean(cls == "try-error") > 0.5) complain("Lots of errors here..something is amiss.") hookOut <- NULL if (!is.null(hook)) hookOut <- lapply(listOfSummaries, attr, which = "hookOut") message("\nDone!") summaries <- MAST:::collectSummaries(listOfSummaries) summaries[["LMlike"]] <- obj summaries[["sca"]] <- sca summaries[["priorVar"]] <- obj@priorVar summaries[["priorDOF"]] <- obj@priorDOF summaries[["hookOut"]] <- hookOut summaries[["exprs_values"]] <- exprs_values summaries[["Class"]] <- "ZlmFit" zfit <- do.call(new, as.list(summaries)) zfit } # We will replace the original function with ours inside the Seurat namespace assignInNamespace('MASTDETest', rem_MASTDETest, asNamespace("Seurat")) # getFromNamespace("MASTDETest", "Seurat") assignInNamespace('zlm', my_zlm, asNamespace("MAST")) # getFromNamespace("zlm", "MAST") # And now we can call our FindMarkers with random effects model for sample_ID # Don't forget to include ebayes = F markers <- FindMarkers( sobj, ident.1 = "foo", ident.2 = "bar", only.pos = TRUE, test.use = "MAST", verbose = F , latent.vars = "sample_ID", re.var = "sample_ID", ebayes = F )
@rdalbanus Thanks for writing this script! I'm wondering why you wrote it in a way such that the random effect variable needs to also be a latent variable?
@andrewdchen
I only did it this way to keep code changes to a minimum - you can see that there's some internal processing of the latent.vars
flag, and I didn't want risk introducing bugs.
The latent.vars
flag captures all the covariates you want to include in the analyses, which get converted into a formula in the fmla <- as.formula(...)
line (e.g. latent.vars = c("sex", "age")
results in a model ~ sex + age
in the original code).
I only added a flag that selects one of these covariates and puts it in a (1 | x )
enclosing (e.g. latent.vars = c("donor", "sex", "age"), re.var = "donor"
results in ~ age + sex + (1|donor)
).
Dear Seurat Team,
I am contacting you in regards to a question about how to use your FindMarkers function to run MAST with a random effect added for subject. I have a single cell RNAseq dataset with two genotypes (4 subjects each) and I’m trying to run a cluster specific differential expression analysis between these two genotypes to identify cell-type specific DEGs. I was able to successfully run MAST for this purpose via your FindMarkers function by using the instructions provided in your vignettes. I now would like to slightly modify my approach and add a random effect for subject to avoid a simple pseudoreplication type issue while keeping other parameters (i.e., logfc.threshold, min.pct etc.) the same as my original approach. To my knowledge, it should be possible to tweak the MAST algorithm to incorporate a random effect but I couldn’t find any instructions online on using Seurat to do this. I would appreciate any recommendations you could give me.
Here is the code I used for running MAST without a random effect (ident.1 and ident.2 indicate wild type and mutant genotypes):
FindMarkers(s_obj4, ident.1 = "0_2", ident.2 = "0_1", min.pct = 0.1, logfc.threshold = 0.1, test.use = "MAST", verbose = FALSE, assay = "RNA", slot = "data", latent.vars = "nCount_RNA", pseudocount.use = 0.001)
Thank you! Esra