AverageExpression() function can be 10x faster

zdebruine commented 3 years ago

I've noticed that AverageExpression() is a bit slower than a colsums-based function should be. Also, using sparse operations are way faster than dense operations.

I had a look at source code and it looks like there are a few edge case checks and additional stuff going on, but have a look at a basic SparseMatrix-based operation to average expression. This function simply merges by the active ident:

AverageExpression.Sparse <- function(object){
  require(Matrix)
  data <- GetAssayData(object, slot = "counts")
  data <- Matrix(data, sparse = TRUE)
  ident.names <- unique(Idents(object))
  result <- NULL
  for(ident.name in ident.names){
    m <- Matrix::rowMeans(data[,which(idents == ident.name)])
    ifelse(is.null(result), result <- m, result <- cbind(result, m))
  }
  colnames(result) <- ident.names
  return(result)
}

I've checked and this function gives the same result as Seurat's function.

Here's some Rbenchmark on the pbmc3k dataset with some clusters:

library(rbenchmark)
benchmark("sparse" = {
            avg.sparse <- AverageExpression.Sparse(pbmc.divisive)
          },
          "Seurat" = {
            avg.seurat <- suppressMessages(AverageExpression(pbmc.divisive, assays = "SCT"))
          },
          replications = 10,
          columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self")
        )

And the timing results:

    test replications elapsed relative user.self sys.self
2 Seurat           10    9.71   11.034      9.58     0.08
1 sparse           10    0.88    1.000      0.78     0.09

It would be nice to have Seurat use faster and sparse operations, otherwise I'll probably end up pushing a new function that does the same thing, only 10x faster.

chansigit commented 3 years ago

awesome, my ligand receptor analysis codes rely heavily on this function

hope you can improve its efficiency

zdebruine commented 3 years ago

@chansigit Here you go. This is a fully functional function:

#' Average feature expression across clustered samples in a Seurat object using fast sparse matrix methods
#'
#' @param object Seurat object
#' @param ident Ident with sample clustering information (default is the active ident)
#' @param assay Assay to average (default is the active assay)
#' @param slot Slot to average (default is counts)
#' @param verbose Boolean or integer, show progress bar (default is TRUE)
#' @return dense matrix of cluster centers
average.expression <- function(object, ident = "active.ident", assay = "default", slot = "counts", verbose = TRUE) {
  require(Matrix)

  ifelse(assay == "default",
    data <- GetAssayData(object, slot = slot),
    data <- GetAssayData(object, assay = assay, slot = slot))

  # convert data to a sparse matrix if it isn't already
  if (class(data)[1] != "dgCMatrix") data <- Matrix(data, sparse = TRUE)

  # get the idents class over which to average
  ifelse(ident == "active.ident", idents <- as.vector(Idents(object)), idents <- as.vector(object@meta.data[[ident]]))

  # loop through all idents, averaging them in data
  ident.names <- unique(idents)

  if (verbose > 0) pb <- txtProgressBar(char = "=", style = 3, max = length(ident.names), width = 50)
  m <- list()
  for (i in 1:length(ident.names)) {
    m[[i]] <- Matrix::rowMeans(data[, which(idents == ident.names[i])])
    if (verbose > 0) setTxtProgressBar(pb = pb, value = i)
  }
  result <- do.call(cbind, m)
  colnames(result) <- ident.names
  return(result)
}

You should be able to use it much like Seurat's Average.Expression function. If you have any issues, please post here and I'll fix it. I hope to publish this at zdebruine/scNMF as part of an scNMF R package soon. At present that repo is unstable so don't use it quite yet.

zdebruine commented 3 years ago

Also, this is not the fastest implementation. It can be faster if implemented in Rcpp with sparse matrix indexing and parallelization with openmp. However, that takes more effort than the payoff would be for me.

Any takers though let me know.

saketkc commented 3 years ago

Thanks for your suggestion @zdebruine! We are working on a faster and more flexible AverageExpression that should land soon on develop.

zdebruine commented 3 years ago

Awesome, I'll be looking for it!

chansigit commented 3 years ago

@chansigit Here you go. This is a fully functional function:

#' Average feature expression across clustered samples in a Seurat object using fast sparse matrix methods
#'
#' @param object Seurat object
#' @param ident Ident with sample clustering information (default is the active ident)
#' @param assay Assay to average (default is the active assay)
#' @param slot Slot to average (default is counts)
#' @param verbose Boolean or integer, show progress bar (default is TRUE)
#' @return dense matrix of cluster centers
average.expression <- function(object, ident = "active.ident", assay = "default", slot = "counts", verbose = TRUE) {
  require(Matrix)

  ifelse(assay == "default",
    data <- GetAssayData(object, slot = slot),
    data <- GetAssayData(object, assay = assay, slot = slot))

  # convert data to a sparse matrix if it isn't already
  if (class(data)[1] != "dgCMatrix") data <- Matrix(data, sparse = TRUE)

  # get the idents class over which to average
  ifelse(ident == "active.ident", idents <- as.vector(Idents(object)), idents <- as.vector(object@meta.data[[ident]]))

  # loop through all idents, averaging them in data
  ident.names <- unique(idents)

  if (verbose > 0) pb <- txtProgressBar(char = "=", style = 3, max = length(ident.names), width = 50)
  m <- list()
  for (i in 1:length(ident.names)) {
    m[[i]] <- Matrix::rowMeans(data[, which(idents == ident.names[i])])
    if (verbose > 0) setTxtProgressBar(pb = pb, value = i)
  }
  result <- do.call(cbind, m)
  colnames(result) <- ident.names
  return(result)
}

You should be able to use it much like Seurat's Average.Expression function. If you have any issues, please post here and I'll fix it. I hope to publish this at zdebruine/scNMF as part of an scNMF R package soon. At present that repo is unstable so don't use it quite yet.

I noticed that this function differs from the AverageExpression provided by Seurat when slot="data". I guess the implementation is different? Is it a better idea to be consistent with Seurat's AverageExpression? (or I met some bugs)

chansigit commented 3 years ago

@chansigit Here you go. This is a fully functional function:

#' Average feature expression across clustered samples in a Seurat object using fast sparse matrix methods
#'
#' @param object Seurat object
#' @param ident Ident with sample clustering information (default is the active ident)
#' @param assay Assay to average (default is the active assay)
#' @param slot Slot to average (default is counts)
#' @param verbose Boolean or integer, show progress bar (default is TRUE)
#' @return dense matrix of cluster centers
average.expression <- function(object, ident = "active.ident", assay = "default", slot = "counts", verbose = TRUE) {
  require(Matrix)

  ifelse(assay == "default",
    data <- GetAssayData(object, slot = slot),
    data <- GetAssayData(object, assay = assay, slot = slot))

  # convert data to a sparse matrix if it isn't already
  if (class(data)[1] != "dgCMatrix") data <- Matrix(data, sparse = TRUE)

  # get the idents class over which to average
  ifelse(ident == "active.ident", idents <- as.vector(Idents(object)), idents <- as.vector(object@meta.data[[ident]]))

  # loop through all idents, averaging them in data
  ident.names <- unique(idents)

  if (verbose > 0) pb <- txtProgressBar(char = "=", style = 3, max = length(ident.names), width = 50)
  m <- list()
  for (i in 1:length(ident.names)) {
    m[[i]] <- Matrix::rowMeans(data[, which(idents == ident.names[i])])
    if (verbose > 0) setTxtProgressBar(pb = pb, value = i)
  }
  result <- do.call(cbind, m)
  colnames(result) <- ident.names
  return(result)
}

You should be able to use it much like Seurat's Average.Expression function. If you have any issues, please post here and I'll fix it. I hope to publish this at zdebruine/scNMF as part of an scNMF R package soon. At present that repo is unstable so don't use it quite yet.

I also noticed that the doc of AverageExpression() stated Output is in log-space when return.seurat = TRUE, otherwise it's in non-log space. Averaging is done in non-log space.

I suggested your snippets could follow this implementation or provided options.

satijalab / seurat

AverageExpression() function can be 10x faster #3756