satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.29k stars 914 forks source link

Running MAST with random effect for subject via FindMarkers #3712

Closed esrasefik closed 3 years ago

esrasefik commented 3 years ago

Dear Seurat Team,

I am contacting you in regards to a question about how to use your FindMarkers function to run MAST with a random effect added for subject. I have a single cell RNAseq dataset with two genotypes (4 subjects each) and I’m trying to run a cluster specific differential expression analysis between these two genotypes to identify cell-type specific DEGs. I was able to successfully run MAST for this purpose via your FindMarkers function by using the instructions provided in your vignettes. I now would like to slightly modify my approach and add a random effect for subject to avoid a simple pseudoreplication type issue while keeping other parameters (i.e., logfc.threshold, min.pct etc.) the same as my original approach. To my knowledge, it should be possible to tweak the MAST algorithm to incorporate a random effect but I couldn’t find any instructions online on using Seurat to do this. I would appreciate any recommendations you could give me.

Here is the code I used for running MAST without a random effect (ident.1 and ident.2 indicate wild type and mutant genotypes):

FindMarkers(s_obj4, ident.1 = "0_2", ident.2 = "0_1", min.pct = 0.1, logfc.threshold = 0.1, test.use = "MAST", verbose = FALSE, assay = "RNA", slot = "data", latent.vars = "nCount_RNA", pseudocount.use = 0.001)

Thank you! Esra

jaisonj708 commented 3 years ago

We currently do not support this, but you may try altering our MAST wrapper to incorporate a random effect. The hurdle model is specified by the formula parameter in the zlm function, called in MASTDETest:

zlmCond <- MAST::zlm(formula = fmla, sca = sca, ...)

jgamache014 commented 3 years ago

@esrasefik just wondering if you've gotten this to work, and if you wouldn't mind sharing your script? I am trying to do the same with my differential expression analysis.

esrasefik commented 3 years ago

@jgamache014 - I unfortunately haven't been able to. I decided to switch to using MAST directly instead of using the Seurat interface given the little support the developers provided for this issue. I don't yet have a solution I can share - still working on it. Sorry!

jgamache014 commented 3 years ago

@esrasefik Thanks, I'll try using MAST directly as well!

JonThom commented 3 years ago

@jgamache014 @esrasefik have any of you succeeded? If not, what have been the stumbling blocks? I'd like to have a crack at it.

rdalbanus commented 1 year ago

This clunky solution will add this functionality to FindMarkers by making minor changes to two internal functions in Seurat and MAST. I haven't tested this extensively, but it seems to be working as expected so far. It's helpful for exploratory analyses when you only have a seurat object. However, I'd strongly suggest going through the steps in the MAST vignette (converting Seurat to SummarizedExperiment, etc) if you are planning to use this for important analyses.

Note that it doesn't allow using empirical Bayes estimation to regularize variance, and it's considerably slower.


# We want a different behavior from MASTDETest, where the latent variable is
# used as a random effect in the model
rem_MASTDETest <- function(
  data.use,
  cells.1,
  cells.2,
  latent.vars = NULL,
  verbose = TRUE,
  # New option - random effect variable (should be included in latent.vars)
  re.var = NULL,
  ...
) {
  # Check for MAST
  if (!PackageCheck('MAST', error = FALSE)) {
    stop("Please install MAST - learn more at https://github.com/RGLab/MAST")
  }
  group.info <- data.frame(row.names = c(cells.1, cells.2))
  latent.vars <- latent.vars %||% group.info
  group.info[cells.1, "group"] <- "Group1"
  group.info[cells.2, "group"] <- "Group2"
  group.info[, "group"] <- factor(x = group.info[, "group"])
  latent.vars.names <- c("condition", colnames(x = latent.vars))
  latent.vars <- cbind(latent.vars, group.info)
  latent.vars$wellKey <- rownames(x = latent.vars)
  fdat <- data.frame(rownames(x = data.use))
  colnames(x = fdat)[1] <- "primerid"
  rownames(x = fdat) <- fdat[, 1]
  sca <- MAST::FromMatrix(
    exprsArray = as.matrix(x = data.use),
    check_sanity = FALSE,
    cData = latent.vars,
    fData = fdat
  )
  cond <- factor(x = SummarizedExperiment::colData(sca)$group)
  cond <- relevel(x = cond, ref = "Group1")
  SummarizedExperiment::colData(sca)$condition <- cond

  # This is the main change in the code - we want ~ ( 1 | re.var) in the formula:
  # ~ condition + lat.vars + (1 | re.var)
  if (!is.null(re.var)) {
    if (!re.var %in% latent.vars.names) {
      stop("Random effect variable should be included in latent variables!")
    }
    latent.vars.names <- latent.vars.names[!latent.vars.names %in% re.var]
    fmla <- as.formula(
      object = paste0(
        " ~ ", paste(latent.vars.names, collapse = "+"), glue("+ (1|{re.var})")
      )
    )
    # print(fmla)
    # We need glmer to make this work
    method <-  "glmer" # trying to troubleshoot this - it can clash with the already existing method var in the function call    
    zlmCond <- MAST::zlm(formula = fmla, sca = sca, method = "glmer", ...)
  } else {
    # Original code
    fmla <- as.formula(
      object = paste0(" ~ ", paste(latent.vars.names, collapse = "+"))
    )
    zlmCond <- MAST::zlm(formula = fmla, sca = sca, ...)
  }
  summaryCond <- MAST::summary(object = zlmCond, doLRT = 'conditionGroup2')
  summaryDt <- summaryCond$datatable

  # The output format is slightly different, so we need adapt the code
  if(!is.null(re.var)) {
    p_val <- summaryDt[summaryDt$"component" == "H", 4]$`Pr(>Chisq)`
    genes.return <- summaryDt[summaryDt$"component" == "H", 1]$primerid
  } else {
    p_val <- summaryDt[summaryDt[, "component"] == "H", 4]
    genes.return <- summaryDt[summaryDt[, "component"] == "H", 1]
  }

  to.return <- data.frame(p_val, row.names = genes.return)
  return(to.return)
}

# This is the zlm function with a minor change to fix a buggy variable being passed possibly due to namespace clash
my_zlm <- function (formula, sca, method = "bayesglm", silent = TRUE, ebayes = TRUE, 
    ebayesControl = NULL, force = FALSE, hook = NULL, parallel = TRUE, 
    LMlike, onlyCoef = FALSE, exprs_values = MAST::assay_idx(sca)$aidx, 
    ...) 
{
    dotsdata = list(...)$data
    if (!is.null(dotsdata)) {
        if (!missing(sca)) 
            stop("Cannot provide both `sca` and `data`")
        sca = dotsdata
    }
    if (!inherits(sca, "SingleCellAssay")) {
        if (inherits(sca, "data.frame")) {
            if (!is.null(dotsdata)) {
                return(.zlm(formula, method = method, silent = silent, 
                  ...))
            }
            else {
                return(.zlm(formula, data = sca, method = method, 
                  silent = silent, ...))
            }
        }
        else {
            stop("`sca` must inherit from `data.frame` or `SingleCellAssay`")
        }
    }
    if (missing(LMlike)) {
        method <- match.arg(method, MAST:::methodDict[, keyword])
        method <- MAST:::methodDict[keyword == method, lmMethod]
        if (!is(sca, "SingleCellAssay")) 
            stop("'sca' must be (or inherit) 'SingleCellAssay'")
        if (!is(formula, "formula")) 
            stop("'formula' must be class 'formula'")
        Formula <- MAST:::removeResponse(formula)
        priorVar <- 1
        priorDOF <- 0
        if (ebayes) {
            # if (!methodDict[lmMethod == method, implementsEbayes]) 
          if (!all(MAST:::methodDict[lmMethod == method, implementsEbayes]))
          stop("Method", method, " does not implement empirical bayes variance shrinkage.")
            ebparm <- MAST:::ebayes(t(SummarizedExperiment:::assay(sca, exprs_values)), ebayesControl, 
                model.matrix(Formula, SummarizedExperiment::colData(sca)))
            priorVar <- ebparm[["v"]]
            priorDOF <- ebparm[["df"]]
            stopifnot(all(!is.na(ebparm)))
        }
        obj <- MAST:::new_with_repaired_slots(classname = method, design = SummarizedExperiment::colData(sca), 
            formula = Formula, priorVar = priorVar, priorDOF = priorDOF, 
            extra = list(...))
    }
    else {
        if (!missing(formula)) 
            warning("Ignoring formula and using model defined in 'objLMLike'")
        if (!inherits(LMlike, "LMlike")) 
            stop("'LMlike' must inherit from class 'LMlike'")
        obj <- LMlike
    }
    ee <- t(SummarizedExperiment:::assay(sca, exprs_values))
    genes <- colnames(ee)
    ng <- length(genes)
    MM <- MAST:::model.matrix(obj)
    coefNames <- colnames(MM)
    listEE <- setNames(seq_len(ng), genes)
    obj <- MAST:::fit(obj, ee[, 1], silent = silent)
    nerror <- totalerr <- 0
    pb = progress::progress_bar$new(total = ng, format = " Completed [:bar] :percent with :err failures")
    .fitGeneSet <- function(idx) {
        hookOut <- NULL
        tt <- try({
            obj <- MAST:::fit(obj, response = ee[, idx], silent = silent, 
                quick = TRUE)
            if (!is.null(hook)) 
                hookOut <- hook(obj)
            nerror <- 0
        })
        if (is(tt, "try-error")) {
            obj@fitC <- obj@fitD <- NULL
            obj@fitted <- c(C = FALSE, D = FALSE)
            nerror <- nerror + 1
            totalerr = totalerr + 1
            if (nerror > 5 & !force) {
                stop("We seem to be having a lot of problems here...are your tests specified correctly?  \n If you're sure, set force=TRUE.", 
                  tt)
            }
        }
        pb$tick(tokens = list(err = totalerr))
        if (onlyCoef) 
            return(cbind(C = coef(obj, "C"), D = coef(obj, "D")))
        summaries <- MAST:::summarize(obj)
        structure(summaries, hookOut = hookOut)
    }
    if (!parallel || getOption("mc.cores", 1L) == 1) {
        listOfSummaries <- lapply(listEE, .fitGeneSet)
    }
    else {
        listOfSummaries <- parallel::mclapply(listEE, .fitGeneSet, 
            mc.preschedule = TRUE, mc.silent = silent)
    }
    if (onlyCoef) {
        out <- do.call(abind, c(listOfSummaries, rev.along = 0))
        return(aperm(out, c(3, 1, 2)))
    }
    cls <- sapply(listOfSummaries, function(x) class(x))
    complain <- if (force) 
        warning
    else stop
    if (mean(cls == "try-error") > 0.5) 
        complain("Lots of errors here..something is amiss.")
    hookOut <- NULL
    if (!is.null(hook)) 
        hookOut <- lapply(listOfSummaries, attr, which = "hookOut")
    message("\nDone!")
    summaries <- MAST:::collectSummaries(listOfSummaries)
    summaries[["LMlike"]] <- obj
    summaries[["sca"]] <- sca
    summaries[["priorVar"]] <- obj@priorVar
    summaries[["priorDOF"]] <- obj@priorDOF
    summaries[["hookOut"]] <- hookOut
    summaries[["exprs_values"]] <- exprs_values
    summaries[["Class"]] <- "ZlmFit"
    zfit <- do.call(new, as.list(summaries))
    zfit
}

# We will replace the original function with ours inside the Seurat namespace
assignInNamespace('MASTDETest', rem_MASTDETest, asNamespace("Seurat"))
# getFromNamespace("MASTDETest", "Seurat")

assignInNamespace('zlm', my_zlm, asNamespace("MAST"))
# getFromNamespace("zlm", "MAST")

# And now we can call our FindMarkers with random effects model for sample_ID
# Don't forget to include ebayes = F
markers <- FindMarkers(
    sobj, ident.1 = "foo", ident.2 = "bar", only.pos = TRUE, 
    test.use = "MAST", verbose = F
    , latent.vars = "sample_ID", re.var = "sample_ID", ebayes = F
  )
JonThom commented 1 year ago

@rdalbanus thanks for sharing!

I also wrote a function - it's a while back now, but here it is: https://github.com/CBMR-Single-Cell-Omics-Platform/SCOPfunctions/blob/main/R/differential_expression.R

andrewdchen commented 1 year ago

This clunky solution will add this functionality to FindMarkers by making minor changes to two internal functions in Seurat and MAST. I haven't tested this extensively, but it seems to be working as expected so far. It's helpful for exploratory analyses when you only have a seurat object. However, I'd strongly suggest going through the steps in the MAST vignette (converting Seurat to SummarizedExperiment, etc) if you are planning to use this for important analyses.

Note that it doesn't allow using empirical Bayes estimation to regularize variance, and it's considerably slower.

# We want a different behavior from MASTDETest, where the latent variable is
# used as a random effect in the model
rem_MASTDETest <- function(
  data.use,
  cells.1,
  cells.2,
  latent.vars = NULL,
  verbose = TRUE,
  # New option - random effect variable (should be included in latent.vars)
  re.var = NULL,
  ...
) {
  # Check for MAST
  if (!PackageCheck('MAST', error = FALSE)) {
    stop("Please install MAST - learn more at https://github.com/RGLab/MAST")
  }
  group.info <- data.frame(row.names = c(cells.1, cells.2))
  latent.vars <- latent.vars %||% group.info
  group.info[cells.1, "group"] <- "Group1"
  group.info[cells.2, "group"] <- "Group2"
  group.info[, "group"] <- factor(x = group.info[, "group"])
  latent.vars.names <- c("condition", colnames(x = latent.vars))
  latent.vars <- cbind(latent.vars, group.info)
  latent.vars$wellKey <- rownames(x = latent.vars)
  fdat <- data.frame(rownames(x = data.use))
  colnames(x = fdat)[1] <- "primerid"
  rownames(x = fdat) <- fdat[, 1]
  sca <- MAST::FromMatrix(
    exprsArray = as.matrix(x = data.use),
    check_sanity = FALSE,
    cData = latent.vars,
    fData = fdat
  )
  cond <- factor(x = SummarizedExperiment::colData(sca)$group)
  cond <- relevel(x = cond, ref = "Group1")
  SummarizedExperiment::colData(sca)$condition <- cond

  # This is the main change in the code - we want ~ ( 1 | re.var) in the formula:
  # ~ condition + lat.vars + (1 | re.var)
  if (!is.null(re.var)) {
    if (!re.var %in% latent.vars.names) {
      stop("Random effect variable should be included in latent variables!")
    }
    latent.vars.names <- latent.vars.names[!latent.vars.names %in% re.var]
    fmla <- as.formula(
      object = paste0(
        " ~ ", paste(latent.vars.names, collapse = "+"), glue("+ (1|{re.var})")
      )
    )
    # print(fmla)
    # We need glmer to make this work
    method <-  "glmer" # trying to troubleshoot this - it can clash with the already existing method var in the function call    
    zlmCond <- MAST::zlm(formula = fmla, sca = sca, method = "glmer", ...)
  } else {
    # Original code
    fmla <- as.formula(
      object = paste0(" ~ ", paste(latent.vars.names, collapse = "+"))
    )
    zlmCond <- MAST::zlm(formula = fmla, sca = sca, ...)
  }
  summaryCond <- MAST::summary(object = zlmCond, doLRT = 'conditionGroup2')
  summaryDt <- summaryCond$datatable

  # The output format is slightly different, so we need adapt the code
  if(!is.null(re.var)) {
    p_val <- summaryDt[summaryDt$"component" == "H", 4]$`Pr(>Chisq)`
    genes.return <- summaryDt[summaryDt$"component" == "H", 1]$primerid
  } else {
    p_val <- summaryDt[summaryDt[, "component"] == "H", 4]
    genes.return <- summaryDt[summaryDt[, "component"] == "H", 1]
  }

  to.return <- data.frame(p_val, row.names = genes.return)
  return(to.return)
}

# This is the zlm function with a minor change to fix a buggy variable being passed possibly due to namespace clash
my_zlm <- function (formula, sca, method = "bayesglm", silent = TRUE, ebayes = TRUE, 
    ebayesControl = NULL, force = FALSE, hook = NULL, parallel = TRUE, 
    LMlike, onlyCoef = FALSE, exprs_values = MAST::assay_idx(sca)$aidx, 
    ...) 
{
    dotsdata = list(...)$data
    if (!is.null(dotsdata)) {
        if (!missing(sca)) 
            stop("Cannot provide both `sca` and `data`")
        sca = dotsdata
    }
    if (!inherits(sca, "SingleCellAssay")) {
        if (inherits(sca, "data.frame")) {
            if (!is.null(dotsdata)) {
                return(.zlm(formula, method = method, silent = silent, 
                  ...))
            }
            else {
                return(.zlm(formula, data = sca, method = method, 
                  silent = silent, ...))
            }
        }
        else {
            stop("`sca` must inherit from `data.frame` or `SingleCellAssay`")
        }
    }
    if (missing(LMlike)) {
        method <- match.arg(method, MAST:::methodDict[, keyword])
        method <- MAST:::methodDict[keyword == method, lmMethod]
        if (!is(sca, "SingleCellAssay")) 
            stop("'sca' must be (or inherit) 'SingleCellAssay'")
        if (!is(formula, "formula")) 
            stop("'formula' must be class 'formula'")
        Formula <- MAST:::removeResponse(formula)
        priorVar <- 1
        priorDOF <- 0
        if (ebayes) {
            # if (!methodDict[lmMethod == method, implementsEbayes]) 
          if (!all(MAST:::methodDict[lmMethod == method, implementsEbayes]))
          stop("Method", method, " does not implement empirical bayes variance shrinkage.")
            ebparm <- MAST:::ebayes(t(SummarizedExperiment:::assay(sca, exprs_values)), ebayesControl, 
                model.matrix(Formula, SummarizedExperiment::colData(sca)))
            priorVar <- ebparm[["v"]]
            priorDOF <- ebparm[["df"]]
            stopifnot(all(!is.na(ebparm)))
        }
        obj <- MAST:::new_with_repaired_slots(classname = method, design = SummarizedExperiment::colData(sca), 
            formula = Formula, priorVar = priorVar, priorDOF = priorDOF, 
            extra = list(...))
    }
    else {
        if (!missing(formula)) 
            warning("Ignoring formula and using model defined in 'objLMLike'")
        if (!inherits(LMlike, "LMlike")) 
            stop("'LMlike' must inherit from class 'LMlike'")
        obj <- LMlike
    }
    ee <- t(SummarizedExperiment:::assay(sca, exprs_values))
    genes <- colnames(ee)
    ng <- length(genes)
    MM <- MAST:::model.matrix(obj)
    coefNames <- colnames(MM)
    listEE <- setNames(seq_len(ng), genes)
    obj <- MAST:::fit(obj, ee[, 1], silent = silent)
    nerror <- totalerr <- 0
    pb = progress::progress_bar$new(total = ng, format = " Completed [:bar] :percent with :err failures")
    .fitGeneSet <- function(idx) {
        hookOut <- NULL
        tt <- try({
            obj <- MAST:::fit(obj, response = ee[, idx], silent = silent, 
                quick = TRUE)
            if (!is.null(hook)) 
                hookOut <- hook(obj)
            nerror <- 0
        })
        if (is(tt, "try-error")) {
            obj@fitC <- obj@fitD <- NULL
            obj@fitted <- c(C = FALSE, D = FALSE)
            nerror <- nerror + 1
            totalerr = totalerr + 1
            if (nerror > 5 & !force) {
                stop("We seem to be having a lot of problems here...are your tests specified correctly?  \n If you're sure, set force=TRUE.", 
                  tt)
            }
        }
        pb$tick(tokens = list(err = totalerr))
        if (onlyCoef) 
            return(cbind(C = coef(obj, "C"), D = coef(obj, "D")))
        summaries <- MAST:::summarize(obj)
        structure(summaries, hookOut = hookOut)
    }
    if (!parallel || getOption("mc.cores", 1L) == 1) {
        listOfSummaries <- lapply(listEE, .fitGeneSet)
    }
    else {
        listOfSummaries <- parallel::mclapply(listEE, .fitGeneSet, 
            mc.preschedule = TRUE, mc.silent = silent)
    }
    if (onlyCoef) {
        out <- do.call(abind, c(listOfSummaries, rev.along = 0))
        return(aperm(out, c(3, 1, 2)))
    }
    cls <- sapply(listOfSummaries, function(x) class(x))
    complain <- if (force) 
        warning
    else stop
    if (mean(cls == "try-error") > 0.5) 
        complain("Lots of errors here..something is amiss.")
    hookOut <- NULL
    if (!is.null(hook)) 
        hookOut <- lapply(listOfSummaries, attr, which = "hookOut")
    message("\nDone!")
    summaries <- MAST:::collectSummaries(listOfSummaries)
    summaries[["LMlike"]] <- obj
    summaries[["sca"]] <- sca
    summaries[["priorVar"]] <- obj@priorVar
    summaries[["priorDOF"]] <- obj@priorDOF
    summaries[["hookOut"]] <- hookOut
    summaries[["exprs_values"]] <- exprs_values
    summaries[["Class"]] <- "ZlmFit"
    zfit <- do.call(new, as.list(summaries))
    zfit
}

# We will replace the original function with ours inside the Seurat namespace
assignInNamespace('MASTDETest', rem_MASTDETest, asNamespace("Seurat"))
# getFromNamespace("MASTDETest", "Seurat")

assignInNamespace('zlm', my_zlm, asNamespace("MAST"))
# getFromNamespace("zlm", "MAST")

# And now we can call our FindMarkers with random effects model for sample_ID
# Don't forget to include ebayes = F
markers <- FindMarkers(
    sobj, ident.1 = "foo", ident.2 = "bar", only.pos = TRUE, 
    test.use = "MAST", verbose = F
    , latent.vars = "sample_ID", re.var = "sample_ID", ebayes = F
  )

@rdalbanus Thanks for writing this script! I'm wondering why you wrote it in a way such that the random effect variable needs to also be a latent variable?

rdalbanus commented 1 year ago

@andrewdchen I only did it this way to keep code changes to a minimum - you can see that there's some internal processing of the latent.vars flag, and I didn't want risk introducing bugs.

The latent.vars flag captures all the covariates you want to include in the analyses, which get converted into a formula in the fmla <- as.formula(...) line (e.g. latent.vars = c("sex", "age") results in a model ~ sex + age in the original code).

I only added a flag that selects one of these covariates and puts it in a (1 | x ) enclosing (e.g. latent.vars = c("donor", "sex", "age"), re.var = "donor" results in ~ age + sex + (1|donor)).