Closed by zdebruine 3 years ago
Awesome, my ligand-receptor analysis code relies heavily on this function.
I hope you can improve its efficiency.
@chansigit Here you go. This is a fully functional function:
#' Average feature expression across clustered samples in a Seurat object using fast sparse matrix methods
#'
#' @param object Seurat object
#' @param ident Ident with sample clustering information (default is the active ident)
#' @param assay Assay to average (default is the active assay)
#' @param slot Slot to average (default is counts)
#' @param verbose Boolean or integer, show progress bar (default is TRUE)
#' @return dense matrix of cluster centers
average.expression <- function(object, ident = "active.ident", assay = "default", slot = "counts", verbose = TRUE) {
  # Seurat must be attached for GetAssayData() and Idents(); Matrix is called by namespace
  stopifnot(requireNamespace("Matrix", quietly = TRUE))
  # use the object's default assay unless one is named explicitly
  if (assay == "default") {
    data <- GetAssayData(object, slot = slot)
  } else {
    data <- GetAssayData(object, assay = assay, slot = slot)
  }
  # convert data to a sparse matrix if it isn't already
  if (!inherits(data, "dgCMatrix")) data <- Matrix::Matrix(data, sparse = TRUE)
  # get the idents class over which to average
  if (ident == "active.ident") {
    idents <- as.vector(Idents(object))
  } else {
    idents <- as.vector(object@meta.data[[ident]])
  }
  # loop through all idents, averaging them in data
  ident.names <- unique(idents)
  if (verbose > 0) pb <- txtProgressBar(char = "=", style = 3, max = length(ident.names), width = 50)
  m <- vector("list", length(ident.names))
  for (i in seq_along(ident.names)) {
    # drop = FALSE keeps a one-cell cluster as a matrix so rowMeans still works
    m[[i]] <- Matrix::rowMeans(data[, which(idents == ident.names[i]), drop = FALSE])
    if (verbose > 0) setTxtProgressBar(pb = pb, value = i)
  }
  if (verbose > 0) close(pb)
  result <- do.call(cbind, m)
  colnames(result) <- ident.names
  return(result)
}
You should be able to use it much like Seurat's AverageExpression function. If you have any issues, please post here and I'll fix it. I hope to publish this at zdebruine/scNMF as part of an scNMF R package soon. At present that repo is unstable, so don't use it quite yet.
Also, this is not the fastest implementation. It could be faster if implemented in Rcpp with sparse matrix indexing and parallelized with OpenMP. However, that would take more effort than the payoff is worth to me.
If anyone wants to take that on, let me know.
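Short of Rcpp, one pure-R way to drop the loop over idents is a single sparse matrix product against a cells-by-groups indicator matrix. This is only a sketch of that alternative technique, not the function above and not Seurat's implementation; `average_by_group` and the toy data are made up for illustration.

```r
library(Matrix)

# Hypothetical helper (not part of Seurat or scNMF): average the columns of a
# genes x cells matrix within each group using one sparse matrix product.
average_by_group <- function(data, groups) {
  groups <- factor(groups, levels = unique(groups))
  # cells x groups indicator matrix: entry (j, k) is 1 if cell j is in group k
  indicator <- sparseMatrix(
    i = seq_along(groups),
    j = as.integer(groups),
    x = 1,
    dims = c(length(groups), nlevels(groups))
  )
  # per-group sums, then scale each column by 1 / group size
  sums <- data %*% indicator
  means <- sums %*% Diagonal(x = 1 / Matrix::colSums(indicator))
  result <- as.matrix(means)
  dimnames(result) <- list(rownames(data), levels(groups))
  result
}

# toy data: 4 genes x 5 cells, with one group ("c") containing a single cell
set.seed(1)
counts <- rsparsematrix(4, 5, density = 0.5,
                        rand.x = function(n) rpois(n, 3) + 1)
rownames(counts) <- paste0("gene", 1:4)
clusters <- c("a", "b", "a", "c", "b")
centers <- average_by_group(counts, clusters)
```

The result has one column per group, in order of first appearance, which matches what the loop produces with `unique(idents)`. Whether this beats the loop will depend on the number of groups and the sparsity of the data.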
Thanks for your suggestion @zdebruine! We are working on a faster and more flexible AverageExpression that should land soon on develop.
Awesome, I'll be looking for it!
I noticed that this function gives different results from the AverageExpression provided by Seurat when slot="data". I guess the implementation is different? Would it be better to stay consistent with Seurat's AverageExpression? (Or have I hit a bug?)
I also noticed that the documentation of AverageExpression() states: "Output is in log-space when return.seurat = TRUE, otherwise it's in non-log space. Averaging is done in non-log space."
I suggest your snippet either follow this behavior or provide an option for it.
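To make the difference concrete: assuming the data slot holds log1p-normalized values, averaging them directly is not the same as averaging in non-log space. Here is a small sketch on a toy matrix; `group_means` and the numbers are made up for illustration and are not Seurat code.

```r
library(Matrix)

# toy expression values: 3 genes x 4 cells, then stored as log1p(expression)
expr <- Matrix(c(0, 10, 0, 2,
                 5,  5, 1, 0,
                 0,  0, 8, 8), nrow = 3, byrow = TRUE, sparse = TRUE)
logdata <- log1p(expr)
groups <- c("a", "a", "b", "b")

# Hypothetical helper: average each group of columns after applying
# `value_fun` to the stored values (identity leaves them untouched)
group_means <- function(data, groups, value_fun = identity) {
  sapply(unique(groups), function(g) {
    Matrix::rowMeans(value_fun(data[, groups == g, drop = FALSE]))
  })
}

# plain mean of the stored (log-space) values
mean_of_logs <- group_means(logdata, groups)
# undo log1p first, so the averaging happens in non-log space;
# expm1(0) is 0, so zero entries stay zero
mean_nonlog <- group_means(logdata, groups, expm1)
```

For the first gene in group "a" (values 0 and 10), the non-log mean is 5, while the mean of the log1p values is log(11) / 2, which corresponds to a much smaller value once transformed back. An option on the snippet could switch between the two behaviors.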
I've noticed that AverageExpression() is a bit slower than a colSums-based function should be. Also, sparse operations are much faster than dense operations here.
I had a look at the source code and it looks like there are a few edge-case checks and some additional work going on, but have a look at a basic sparse-matrix-based operation to average expression. This function simply merges by the active ident:
I've checked and this function gives the same result as Seurat's function.
Here's an rbenchmark run on the pbmc3k dataset with some clusters:
And the timing results:
It would be nice to have Seurat use faster, sparse operations; otherwise I'll probably end up pushing a new function that does the same thing, only 10x faster.
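To compare dense and sparse averaging yourself, a minimal sketch on synthetic data could look like the following. It does not use pbmc3k: the matrix size, sparsity, cluster labels and the `cluster_means` helper are all made up, and it uses base `system.time` rather than rbenchmark.

```r
library(Matrix)

set.seed(42)
# synthetic stand-in for a counts matrix: 2000 genes x 1000 cells, ~5% non-zero
sparse_counts <- rsparsematrix(2000, 1000, density = 0.05,
                               rand.x = function(n) rpois(n, 2) + 1)
dense_counts <- as.matrix(sparse_counts)
clusters <- sample(paste0("c", 1:8), ncol(sparse_counts), replace = TRUE)

# Hypothetical helper: average columns per cluster; only `row_means` differs
# between the dense and the sparse path
cluster_means <- function(data, clusters, row_means) {
  sapply(unique(clusters), function(g) {
    row_means(data[, clusters == g, drop = FALSE])
  })
}

dense_time <- system.time(
  dense_result <- cluster_means(dense_counts, clusters, base::rowMeans)
)
sparse_time <- system.time(
  sparse_result <- cluster_means(sparse_counts, clusters, Matrix::rowMeans)
)

# both paths should agree before the timings mean anything
stopifnot(isTRUE(all.equal(dense_result, sparse_result)))
print(c(dense = dense_time[["elapsed"]], sparse = sparse_time[["elapsed"]]))
```

The timings depend heavily on the machine, the matrix size and how sparse the data is, so treat the numbers as a rough guide rather than a reproduction of the results reported above.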