satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat

IntegrateLayers very slow without group.by #7879

Open mihem opened 9 months ago

mihem commented 9 months ago

Using IntegrateLayers without specifying group.by is slow. With my ~400,000-cell dataset, just the "preprocessing" before the actual integration takes rather long (~10 min).

If I specify group.by, this "preprocessing" only takes a few seconds. When I tried to debug this, I found that the labels function is the culprit: https://github.com/satijalab/seurat/blob/678141f74c3bada0fc2b599378430d1ba46b441d/R/integration5.R#L448-L450
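
For reference, roughly how one could locate the slow step with base R profiling (a sketch; sc_merge and the scVI arguments mirror my call posted below and are only illustrative):

# Sketch: profile the un-grouped IntegrateLayers() call to see where the time goes.
# Rprof()/summaryRprof() are base R; the object and arguments are placeholders.
Rprof("integrate_profile.out")
sc_merge <- IntegrateLayers(
  object = sc_merge,
  method = scVIIntegration,
  orig.reduction = "pca",
  new.reduction = "integrated.scvi",
  conda_env = "~/miniconda3/envs/scvi",
  verbose = TRUE
)
Rprof(NULL)
# the functions with the largest total time point to the bottleneck
# (here the labels() call linked above)
head(summaryRprof("integrate_profile.out")$by.total)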

So maybe find some faster alternative for that code, or maybe even force the user to specify group.by (I don't really see the advantage in making this argument optional). Maybe @Gesmira?

Thank you!

mhkowalski commented 9 months ago

Hi, can you please post the full code you are using to run IntegrateLayers? We're actually not going to support this in future releases, so I'd recommend not using the group.by parameter (and instead splitting your layers according to the variable you'd like to integrate by).
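
For example, splitting the layers by the variable you want to integrate on (here "sample", assuming it exists as a cell-level metadata column on your object) would look roughly like this:

# Sketch: split the v5 RNA assay into one layer per sample, then integrate
# across those layers directly (no group.by needed). "obj" and "sample" are
# placeholders for your object and metadata column.
obj[["RNA"]] <- split(obj[["RNA"]], f = obj$sample)

obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj)

obj <- IntegrateLayers(
  object = obj,
  method = scVIIntegration,
  orig.reduction = "pca",
  new.reduction = "integrated.scvi",
  conda_env = "~/miniconda3/envs/scvi",
  verbose = TRUE
)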

mihem commented 9 months ago

Hi @mhkowalski, thanks for your response.

Sure.

sc_merge <- IntegrateLayers(
  object = sc_merge,
  method = scVIIntegration,
  orig.reduction = "pca",
  new.reduction = "integrated.scvi",
  conda_env = "~/miniconda3/envs/scvi",
  group.by = "sample",
  verbose = TRUE
)

I actually split it by sample and I would like to integrate by sample, but IntegrateLayers does not seem to pick that up.

Previous related code

# merge seurat objects  ------------------------------------------
sc_merge_pre <- merge(x = sc_filter[[1]], y = sc_filter[-1], merge.data = TRUE, add.cell.ids = names(sc_list))
sc_merge_pre$sample <- str_extract(colnames(sc_merge_pre), pattern = "[^_]+")

#this needs to be rejoined and then split again to get the correct names (better naming and required for integration)
sc_merge_pre <- JoinLayers(sc_merge_pre)
sc_merge_pre <- split(x = sc_merge_pre, f = sc_merge_pre$sample)

for (i in Layers(sc_merge_pre)){
    sc_merge_pre[["RNA"]][[i]] <- as(sc_merge_pre[["RNA"]][[i]], "dgCMatrix")
    sc_merge_pre[["RNA"]][[i]] <- write_matrix_dir(sc_merge_pre[["RNA"]][[i]], dir = paste0("matrix_final/", gsub("counts.", "", i)))
}

Not sure if this is related, but I would like to keep separate Seurat objects at the beginning (as it was in Seurat v4) and then merge them at a later stage (because this is easier for QC plotting, doublet detection, etc., which I think should be done on each sample separately), so Gesmira came up with this: https://github.com/satijalab/seurat/issues/7373#issuecomment-1629249019 which worked great in my case.

And using group.by is a solution I can live with.

But that is of course problematic if you want to remove that parameter. I also think this parameter should not be removed: for example, what if I want to integrate by another column that defines 3 groups, but the layers were split by the 30 samples?

mhkowalski commented 9 months ago

Thanks for the additional info! Are the results you observe identical with and without the group.by parameter? We found that this option was likely not implemented correctly in some cases, hence the decision to remove it. If you'd like to integrate by a different column, I would recommend joining and splitting the layers according to that column.
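
For the 3-groups-vs-30-samples scenario mentioned above, that would look roughly like this (a sketch; "group" is a hypothetical metadata column with the 3 levels):

# Sketch: collapse the per-sample layers, then re-split by the column you
# actually want to integrate by. "obj" and "group" are placeholders.
obj <- JoinLayers(obj)
obj[["RNA"]] <- split(obj[["RNA"]], f = obj$group)

# re-run the per-layer preprocessing on the new split before IntegrateLayers()
obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj)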

mihem commented 9 months ago

Thanks for your response, @mhkowalski.

I had a large object with ~500,000 cells, which is not optimal for testing.

So I produced a minimal reproducible example (well, if you have the h5 files, but it should be the same with other samples I think). I mostly follow this tutorial https://satijalab.org/seurat/articles/parsebio_sketch_integration, except that I use a list of Seurat objects at the beginning.

library(Seurat)
library(Azimuth)
library(SeuratObject)
library(SeuratDisk)
library(SeuratData)
library(SeuratWrappers)
library(BPCells)
library(tidyverse)

h5_path <- list.files(pattern = ".h5", recursive = TRUE)
sc_raw_list <- lapply(h5_path, BPCells::open_matrix_10x_hdf5)
sample_names <- c("S01", "S02")
names(sc_raw_list) <- sample_names

map2(
  .x = sc_raw_list,
  .y = file.path("matrix", names(sc_raw_list)),
  .f = BPCells::write_matrix_dir
)

sc_raw_mat <-
  map(
    .x = file.path("matrix", names(sc_raw_list)),
    .f = open_matrix_dir
  ) |>
  setNames(sample_names)

sc_raw_mat <-
  map(
    .x = sc_raw_mat,
    .f = function(x) Azimuth:::ConvertEnsembleToSymbol(mat = x, species = "human")
  )

sc_list <- map(sc_raw_mat, CreateSeuratObject, min.cells = 3, min.features = 200)

# filter cells and remove doublets which works better with a list of Seurat objects
# ....

# merge seurat objects now
sc_merge <- merge(x = sc_list[[1]], y = sc_list[-1], merge.data = TRUE, add.cell.ids = names(sc_list))
sc_merge$sample <- str_extract(colnames(sc_merge), pattern = "[^_]+")

#this needs to be rejoined and then split again to get the correct names (better naming and required for integration)
sc_merge <- JoinLayers(sc_merge)
sc_merge <- split(x = sc_merge, f = sc_merge$sample)

for (i in Layers(sc_merge)){
    sc_merge[["RNA"]][[i]] <- as(sc_merge[["RNA"]][[i]], "dgCMatrix")
    sc_merge[["RNA"]][[i]] <- write_matrix_dir(sc_merge[["RNA"]][[i]], dir = paste0("matrix_final/", gsub("counts.", "", i)))
}

sc_merge <- JoinLayers(sc_merge)

sc_merge <- split(x = sc_merge, f = sc_merge$sample)

for (i in Layers(sc_merge)){
    sc_merge[["RNA"]][[i]] <- as(sc_merge[["RNA"]][[i]], "dgCMatrix")
    sc_merge[["RNA"]][[i]] <- write_matrix_dir(sc_merge[["RNA"]][[i]], dir = paste0("matrix_final2/", gsub("counts.", "", i)))
}

sc_merge <- NormalizeData(sc_merge, verbose = TRUE, normalization.method = "LogNormalize", scale.factor = 10000)
sc_merge <- FindVariableFeatures(sc_merge, selection.method = "vst", nfeatures = 2000)

sc_merge <- SketchData(sc_merge, ncells = 5000, method = "LeverageScore", sketched.assay = "sketch")

DefaultAssay(sc_merge) <- "sketch"
sc_merge <- FindVariableFeatures(sc_merge)
sc_merge <- ScaleData(sc_merge)
sc_merge <- RunPCA(sc_merge)

qs::qsave(sc_merge, "sc_merge.qs")

system.time(
  test1 <- IntegrateLayersDebug(
    object = sc_merge,
    method = scVIIntegration,
    orig.reduction = "pca",
    new.reduction = "integrated.scvi",
    conda_env = "~/miniconda3/envs/scvi",
    group.by = "sample",
    verbose = TRUE
  )
)

system.time(
  test2 <- IntegrateLayersDebug(
    object = sc_merge,
    method = scVIIntegration,
    orig.reduction = "pca",
    new.reduction = "integrated.scvi",
    conda_env = "~/miniconda3/envs/scvi",
    verbose = TRUE
  )
)

I created a custom function IntegrateLayersDebug where I removed the last lines (so that it doesn't actually run the integration, only the "preprocessing") and instead returns groups.

```
IntegrateLayersDebug <- function(
    object,
    method,
    orig.reduction = "pca",
    group.by = NULL,
    assay = NULL,
    features = NULL,
    layers = NULL,
    scale.layer = "scale.data",
    ...) {
  assay <- assay %||% DefaultAssay(object = object)
  if (inherits(x = object[[assay]], what = "SCTAssay")) {
    layers <- "data"
    scale.layer <- "scale.data"
    features <- features %||% SelectSCTIntegrationFeatures(
      object = object, assay = assay
    )
  } else if (inherits(x = object[[assay]], what = "StdAssay")) {
    layers <- Layers(object = object, assay = assay, search = layers %||% "data")
    scale.layer <- Layers(object = object, search = scale.layer)
    features <- features %||% VariableFeatures(
      object = object, assay = assay, nfeatures = 2000L
    )
  } else {
    abort(message = "'assay' must be a v5 or SCT assay")
  }
  if (!is.null(scale.layer)) {
    features <- intersect(x = features, y = Features(
      x = object, assay = assay, layer = scale.layer
    ))
  }
  if (!length(x = features)) {
    abort(message = "None of the features provided are found in this assay")
  }
  if (!is.null(orig.reduction)) {
    orig.reduction <- orig.reduction %||% DefaultDimReduc(
      object = object, assay = assay
    )
    if (!orig.reduction %in% Reductions(object = object)) {
      abort(message = paste(
        sQuote(x = orig.reduction), "is not a dimensional reduction"
      ))
    }
    obj.orig <- object[[orig.reduction]]
    if (is.null(x = DefaultAssay(object = obj.orig))) {
      DefaultAssay(object = obj.orig) <- assay
    }
  }
  groups <- if (inherits(x = object[[assay]], what = "SCTAssay")) {
    if (!is.null(x = group.by)) {
      warn(message = "Groups are set automatically by model when integrating SCT assays")
    }
    df <- SeuratObject::EmptyDF(n = ncol(x = object[[assay]]))
    row.names(x = df) <- colnames(x = object[[assay]])
    for (model in levels(x = object[[assay]])) {
      cc <- Cells(x = object[[assay]], layer = model)
      df[cc, "group"] <- model
    }
    df
  } else if (is.null(x = group.by) && length(x = layers) > 1L) {
    cmap <- slot(object = object[[assay]], name = "cells")[, layers]
    as.data.frame(x = labels(object = cmap, values = Cells(
      x = object[[assay]], layer = scale.layer
    )))
  } else if (rlang::is_scalar_character(x = group.by) && group.by %in% names(x = object[[]])) {
    FetchData(object = object, vars = group.by, cells = colnames(x = object[[assay]]))
  } else {
    abort(message = "'group.by' must correspond to a column of cell-level meta data")
  }
  names(x = groups) <- "group"
  return(groups)
}
```

If you provide group.by it's ~70x faster, although 2 s is of course absolutely acceptable. But since BPCells is all about speed, and you may have datasets with >100 samples, this is relevant I think.

# run time of test1 (so with group.by)
 user  system elapsed 
  0.031   0.004   0.036 

# run time of test2 (so without group.by)
user  system elapsed 
  2.292   0.285   2.570 

The result has the same meaning, but it's not identical:

head(test1$group)
[1] "S01" "S01" "S01" "S01" "S01" "S01"

head(test2$group)
[1] "data.S01" "data.S01" "data.S01" "data.S01" "data.S01" "data.S01"

I think both produce the same result? At least when I provided group.by, the integration looked fine.
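
One way to check that the two only differ by the layer prefix (a quick sketch, not something from the debug function itself):

# align by cell name, strip the "data." layer prefix from test2, and compare
common_cells <- intersect(rownames(test1), rownames(test2))
identical(
  test1[common_cells, "group"],
  gsub("^data\\.", "", test2[common_cells, "group"])
)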

mihem commented 8 months ago

@mhkowalski sorry, any update on this?

ksoleary commented 5 months ago

I may have found a way to speed this process up until it is resolved.

If I drop all scaled and normalized layers, I can merge all the counts matrices and then just renormalize and scale this larger matrix. Doing this was much faster than just letting JoinLayers() run. For example:

merged[["RNA"]]$data.1 <- NULL

Do this for all normalized and scaled layers. Then run JoinLayers(). You'll get a merged counts matrix that you can normalize and scale.
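
Something like this (a sketch; the layer names follow the usual data.* / scale.data.* pattern and may differ in your object):

# Sketch: drop every normalized/scaled layer so JoinLayers() only has to
# concatenate the counts matrices, then redo normalization and scaling once.
for (layer in Layers(merged[["RNA"]])) {
  if (grepl("^(data|scale\\.data)\\.", layer)) {
    merged[["RNA"]][[layer]] <- NULL  # same effect as merged[["RNA"]]$data.1 <- NULL above
  }
}
merged <- JoinLayers(merged)
merged <- NormalizeData(merged)
merged <- FindVariableFeatures(merged)
merged <- ScaleData(merged)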

mihem commented 1 month ago

@mhkowalski or @Gesmira Sorry, are there any updates on this issue?

I still have the same issue, now even worse because group.by is no longer an argument in IntegrateLayers.

CreateIntegrationGroups is the rate-limiting step in my analysis; all previous steps (NormalizeData, FindVariableFeatures, ScaleData) and even the integration itself are faster than this rather trivial function.

Here is another reproducible example:

library(Seurat)
library(SeuratData)
options(future.globals.maxSize = 1e9)

InstallData("pbmcsca")
obj <- LoadData("pbmcsca")

obj[["RNA"]] <- split(obj[["RNA"]], f = obj$Method)

obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)

layers <- Layers(object = obj, assay = "RNA", search = "data")
scale.layer <- Layers(object = obj, search = "scale.data")

system.time(
    groups <- Seurat:::CreateIntegrationGroups(obj[["RNA"]], layers = layers, scale.layer = scale.layer)
)

This takes 50 s on my computer

>>>user  system elapsed 
>>> 50.069   0.280  50.476

To be more specific, the rate-limiting step is the labels() call on the cell map (cmap below is built the same way as inside CreateIntegrationGroups):

# cmap: the LogMap of cells x data layers, constructed as in CreateIntegrationGroups
cmap <- slot(object = obj[["RNA"]], name = "cells")[, layers]

system.time(
    labels(
        object = cmap,
        values = Cells(x = obj, layer = scale.layer)
    )
)

>>>  user  system elapsed 
>>> 49.619   0.265  49.892

On my actual novel dataset (120,000 cells, so about 3x smaller than the previous one), this step alone takes 589 s (while the actual integration with Harmony takes less than a minute).

mihem commented 1 month ago

So the culprit is the labels function:

function (object, values, select = c("first", "last", "common", 
    "all"), simplify = TRUE, ...) 
{
    select <- select[1L]
    select <- match.arg(arg = select)
    values <- intersect(x = values, y = rownames(x = object))
    p <- progressor(along = values)
    obs <- sapply(X = values, FUN = function(x) {
        vals <- colnames(x = object)[which(x = object[x, , drop = TRUE])]
        p()
        return(vals)
    }, simplify = FALSE, USE.NAMES = TRUE)
    obs <- Filter(f = length, x = obs)
    obs <- switch(EXPR = select, first = lapply(X = obs, FUN = "[[", 
        1L), last = lapply(X = obs, FUN = function(x) {
        return(x[[length(x = x)]])
    }), common = {
        counts <- table(unlist(x = obs))
        tmp <- obs
        obs <- vector(mode = "character", length = length(x = tmp))
        names(x = obs) <- names(x = tmp)
        for (i in seq_along(along.with = obs)) {
            obs[i] <- names(x = which.max(x = counts[names(x = counts) %in% 
                tmp[[i]]]))
        }
        obs
    }, obs)
    if (isTRUE(x = simplify)) {
        tmp <- obs
        obs <- unlist(x = tmp)
        names(x = obs) <- make.unique(names = rep.int(x = names(x = tmp), 
            times = vapply(X = tmp, FUN = length, FUN.VALUE = numeric(length = 1L))))
    }
    return(obs)
}
<bytecode: 0x557ceaaba5b0>
<environment: namespace:SeuratObject>

With the help of ChatGPT, I tried to optimize the function (mainly by materializing the cell map as a plain logical matrix and indexing rows directly instead of calling sapply over each value):

optimized_labels <- function(object, values, select = c("first", "last", "common", "all"), simplify = TRUE, ...) {
    select <- select[1L]
    select <- match.arg(arg = select)
    values <- intersect(values, rownames(object))

    if (length(values) == 0) {
        return(character(0))
    }

    # Precompute logical matrix and column names
    cmap_data <- as.matrix(object)
    colnames_object <- colnames(object)
    rownames_object <- rownames(object)

    # Get row indices for values
    row_indices <- match(values, rownames_object)

    # Initialize the list to store results
    obs <- vector("list", length(values))
    names(obs) <- values

    # Direct indexing to replace sapply
    for (i in seq_along(row_indices)) {
        row_idx <- row_indices[i]
        vals <- colnames_object[cmap_data[row_idx, , drop = FALSE]]
        if (length(vals) > 0) {
            obs[[i]] <- vals
        }
    }

    obs <- Filter(length, obs)

    obs <- switch(select, 
                  first = lapply(obs, `[[`, 1L), 
                  last = lapply(obs, function(x) x[[length(x)]]), 
                  common = {
                      counts <- table(unlist(obs))
                      tmp <- obs
                      obs <- vector("character", length(tmp))
                      names(obs) <- names(tmp)
                      for (i in seq_along(obs)) {
                          obs[i] <- names(which.max(counts[names(counts) %in% tmp[[i]]]))
                      }
                      obs
                  }, 
                  obs)

    if (isTRUE(simplify)) {
        tmp <- obs
        obs <- unlist(tmp)
        names(obs) <- make.unique(rep(names(tmp), times = lengths(tmp)))
    }

    return(obs)
}

So at least in this use case it leads to the same result and gives a >800x speedup.

Any chance this could be integrated @Gesmira or @dcollins15 after some further testing?

system.time(
    test1 <- labels(
        object = cmap,
        values = Cells(x = obj, layer = scale.layer)
    )
)

>>>  user  system elapsed 
>>>  53.363   0.304  54.158 

system.time(
    test2 <- optimized_labels(cmap, values = Cells(x = obj, layer = scale.layer))
)

>>>  user  system elapsed 
 >>> 0.067   0.000   0.068 

identical(test1, test2)
>>> TRUE
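
In case it helps the review, a quick sanity check across the other select modes could look like this (a sketch, reusing cmap, obj and scale.layer from the example above):

# compare the original and optimized implementations for every select mode
cells <- Cells(x = obj, layer = scale.layer)
for (mode in c("first", "last", "common", "all")) {
  stopifnot(identical(
    labels(object = cmap, values = cells, select = mode),
    optimized_labels(cmap, values = cells, select = mode)
  ))
}
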
igrabski commented 4 weeks ago

Thanks for looking into this so thoroughly! We are currently reviewing your PR and will follow up in the near future.