Open mihem opened 9 months ago
Hi, can you please post the full code you are using to run IntegrateLayers
? We're actually not going to support this in future releases, so I'd recommend not using this parameter (and splitting your layers according to the variable you'd like to integrate by).
Hi @mhkowalski thanks your for response.
Sure.
sc_merge <- IntegrateLayers(
object = sc_merge,
method = scVIIntegration,
orig.reduction = "pca",
new.reduction = "integrated.scvi",
conda_env = "~/miniconda3/envs/scvi",
group.by = "sample",
verbose = TRUE
)
I actually split it by sample
and I would like to integrate by sample
, but IntegrateLayers
does not to pick that up.
Previous related code
# merge seurat objects ------------------------------------------
sc_merge_pre <- merge(x = sc_filter[[1]], y = sc_filter[-1], merge.data = TRUE, add.cell.ids = names(sc_list))
sc_merge_pre$sample <- str_extract(colnames(sc_merge_pre), pattern = "[^_]+")
#this needs to be rejoined and then split again to get the correct names (better naming and required for integration)
sc_merge_pre <- JoinLayers(sc_merge_pre)
sc_merge_pre <- split(x = sc_merge_pre, f = sc_merge_pre$sample)
for (i in Layers(sc_merge_pre)){
sc_merge_pre[["RNA"]][[i]] <- as(sc_merge_pre[["RNA"]][[i]], "dgCMatrix")
sc_merge_pre[["RNA"]][[i]] <- write_matrix_dir(sc_merge_pre[["RNA"]][[i]], dir = paste0("matrix_final/", gsub("counts.", "", i)))
}
Not sure if this is related. But I would like to keep separate seurat objects at the beginning (as it was in Seurat v4) and them merge them at a later stage (because this is easier then with QC plotting and doublet detection etc. which should be done on each sample separately i think), so Gesmira came up with this: https://github.com/satijalab/seurat/issues/7373#issuecomment-1629249019 which worked great in my case.
And using group.by
is a solution i can live with.
But of course problematic if you want to remove that parameter. I also think this parameter should not be removed because let's say I want to integrate by another column, which defines 3 groups, but the layers were split by the 30 samples?
Thanks for the additional info! Are the results you observe identical with and without the group.by
parameter? We found that this option was likely not implemented correctly in some cases, hence the decision to remove it. If you'd like to integrate by a different column, I would recommend joining and splitting the layers according to that column.
Thanks for your repnse @mhkowalski .
I had a large object with ~500 000 cells, which is not optimal for testing.
So I produced a minimal reproducible (well if you have the h5 files, but should be the same with other samples I think) example. I mostly follow this tutorial https://satijalab.org/seurat/articles/parsebio_sketch_integration, only that I use a list of Seurat objects at the beginning.
library(Seurat)
library(Azimuth)
library(SeuratObject)
library(SeuratDisk)
library(SeuratData)
library(SeuratWrappers)
library(BPCells)
library(tidyverse)
h5_path <- list.files(pattern = ".h5", recursive = TRUE)
sc_raw_list <- lapply(h5_path, BPCells::open_matrix_10x_hdf5)
sample_names <- c("S01", "S02")
names(sc_raw_list) <- sample_names
map2(
.x = sc_raw_list,
.y = file.path("matrix", names(sc_raw_list)),
.f = BPCells::write_matrix_dir
)
sc_raw_mat <-
map(
.x = file.path("matrix", names(sc_raw_list)),
.f = open_matrix_dir
) |>
setNames(sample_names)
sc_raw_mat <-
map(
.x = sc_raw_mat,
.f = function(x) Azimuth:::ConvertEnsembleToSymbol(mat = x, species = "human")
)
sc_list <- map(sc_raw_mat, CreateSeuratObject, min.cells = 3, min.features = 200)
# filter cells and remove doublets which works better with a list of Seurat objects
# ....
# merge seurat objects now
sc_merge <- merge(x = sc_list[[1]], y = sc_list[-1], merge.data = TRUE, add.cell.ids = names(sc_list))
sc_merge$sample <- str_extract(colnames(sc_merge), pattern = "[^_]+")
#this needs to be rejoined and then split again to get the correct names (better naming and required for integration)
sc_merge <- JoinLayers(sc_merge)
sc_merge <- split(x = sc_merge, f = sc_merge$sample)
for (i in Layers(sc_merge)){
sc_merge[["RNA"]][[i]] <- as(sc_merge[["RNA"]][[i]], "dgCMatrix")
sc_merge[["RNA"]][[i]] <- write_matrix_dir(sc_merge[["RNA"]][[i]], dir = paste0("matrix_final/", gsub("counts.", "", i)))
}
sc_merge <- JoinLayers(sc_merge)
sc_merge <- split(x = sc_merge, f = sc_merge$sample)
for (i in Layers(sc_merge)){
sc_merge[["RNA"]][[i]] <- as(sc_merge[["RNA"]][[i]], "dgCMatrix")
sc_merge[["RNA"]][[i]] <- write_matrix_dir(sc_merge[["RNA"]][[i]], dir = paste0("matrix_final2/", gsub("counts.", "", i)))
}
sc_merge <- NormalizeData(sc_merge, verbose = TRUE, normalization.method = "LogNormalize", scale.factor = 10000)
sc_merge <- FindVariableFeatures(sc_merge, selection.method = "vst", nfeatures = 2000)
sc_merge <- SketchData(sc_merge, ncells = 5000, method = "LeverageScore", sketched.assay = "sketch")
DefaultAssay(sc_merge) <- "sketch"
sc_merge <- FindVariableFeatures(sc_merge)
sc_merge <- ScaleData(sc_merge)
sc_merge <- RunPCA(sc_merge)
qs::qsave(sc_merge, "sc_merge.qs")
system.time(
test1 <- IntegrateLayersDebug(
object = sc_merge,
method = scVIIntegration,
orig.reduction = "pca",
new.reduction = "integrated.scvi",
conda_env = "~/miniconda3/envs/scvi",
group.by = "sample",
verbose = TRUE
)
)
system.time(
test2 <- IntegrateLayersDebug(
object = sc_merge,
method = scVIIntegration,
orig.reduction = "pca",
new.reduction = "integrated.scvi",
conda_env = "~/miniconda3/envs/scvi",
verbose = TRUE
)
)
I create a custom function IntegrateLayersDebug
where I removed the last lines (so that it doesn't actually run the integration only the preprocessing") and then return groups
.
If you provide group.by
it's ~70x faster, although 2s are of course absolutely acceptable. But thinks BPCells is all about speed, and you have datasets that are >100 samples, this is relevant I think.
# run time of test1 (so with group.by)
user system elapsed
0.031 0.004 0.036
# runt time of test2 (so without group.by)
user system elapsed
2.292 0.285 2.570
The result has the same meaning, but it's not the same
>head(test1$group)
[1] "S01" "S01" "S01" "S01" "S01" "S01"
head(test2$group)
[1] "data.S01" "data.S01" "data.S01" "data.S01" "data.S01" "data.S01"
I think both produce the same result? At least when I provided group.by
integration looked fine.
@mhkowalski sorry any update on this?
I may have found a way to speed this process up until it is resolved.
If I drop all scaled and normalized layers, I can merge all the counts matrices then just renormalize and scale this larger matrix. Doing this was much faster than just letting JoinLayers() run. For example:
merged[["RNA"]]$data.1 <- NULL
Do this for all normalized and scaled layers. Then run JoinLayers(). You'll get a merged counts matrix that you can normalize and scale.
@mhkowalski or @Gesmira Sorry, are there any updates on this issue?
I still have the same issue, now even worse because group.by
is no longer an argument in IntegrateLayers
.
CreateIntegrationGroups
is the rate limiting step in my analysis, all previous steps (Normalize, FindVariableFeatures, Scale), and also the integration itself is faster than this rather trivial function.
Here another reproducible example:
library(Seurat)
library(SeuratData)
options(future.globals.maxSize = 1e9)
InstallData("pbmcsca")
obj <- LoadData("pbmcsca")
obj[["RNA"]] <- split(obj[["RNA"]], f = obj$Method)
obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
layers <- Layers(object = obj, assay = "RNA", search = "data")
scale.layer <- Layers(object = obj, search = "scale.data")
system.time(
groups <- Seurat:::CreateIntegrationGroups(obj[["RNA"]], layers = layers, scale.layer = scale.layer)
)
This takes 50 s on my computer
>>>user system elapsed
>>> 50.069 0.280 50.476
To be more specific, the rate limiting step is:
system.time(
labels(
object = cmap,
values = Cells(x = obj, layer = scale.layer)
)
)
>>> user system elapsed
>>> 49.619 0.265 49.892
On my actualn novel dataset (120 000 cells, so 3x smaller than the previous one) only this step takes 589s (while the actual integration with harmony takes less than a minute).
So the culprit is the labels function:
function (object, values, select = c("first", "last", "common",
"all"), simplify = TRUE, ...)
{
select <- select[1L]
select <- match.arg(arg = select)
values <- intersect(x = values, y = rownames(x = object))
p <- progressor(along = values)
obs <- sapply(X = values, FUN = function(x) {
vals <- colnames(x = object)[which(x = object[x, , drop = TRUE])]
p()
return(vals)
}, simplify = FALSE, USE.NAMES = TRUE)
obs <- Filter(f = length, x = obs)
obs <- switch(EXPR = select, first = lapply(X = obs, FUN = "[[",
1L), last = lapply(X = obs, FUN = function(x) {
return(x[[length(x = x)]])
}), common = {
counts <- table(unlist(x = obs))
tmp <- obs
obs <- vector(mode = "character", length = length(x = tmp))
names(x = obs) <- names(x = tmp)
for (i in seq_along(along.with = obs)) {
obs[i] <- names(x = which.max(x = counts[names(x = counts) %in%
tmp[[i]]]))
}
obs
}, obs)
if (isTRUE(x = simplify)) {
tmp <- obs
obs <- unlist(x = tmp)
names(x = obs) <- make.unique(names = rep.int(x = names(x = tmp),
times = vapply(X = tmp, FUN = length, FUN.VALUE = numeric(length = 1L))))
}
return(obs)
}
<bytecode: 0x557ceaaba5b0>
<environment: namespace:SeuratObject>
With the help of ChatGPT i tried to optimize the function:
optimized_labels <- function(object, values, select = c("first", "last", "common", "all"), simplify = TRUE, ...) {
select <- select[1L]
select <- match.arg(arg = select)
values <- intersect(values, rownames(object))
if (length(values) == 0) {
return(character(0))
}
# Precompute logical matrix and column names
cmap_data <- as.matrix(object)
colnames_object <- colnames(object)
rownames_object <- rownames(object)
# Get row indices for values
row_indices <- match(values, rownames_object)
# Initialize the list to store results
obs <- vector("list", length(values))
names(obs) <- values
# Direct indexing to replace sapply
for (i in seq_along(row_indices)) {
row_idx <- row_indices[i]
vals <- colnames_object[cmap_data[row_idx, , drop = FALSE]]
if (length(vals) > 0) {
obs[[i]] <- vals
}
}
obs <- Filter(length, obs)
obs <- switch(select,
first = lapply(obs, `[[`, 1L),
last = lapply(obs, function(x) x[[length(x)]]),
common = {
counts <- table(unlist(obs))
tmp <- obs
obs <- vector("character", length(tmp))
names(obs) <- names(tmp)
for (i in seq_along(obs)) {
obs[i] <- names(which.max(counts[names(counts) %in% tmp[[i]]]))
}
obs
},
obs)
if (isTRUE(simplify)) {
tmp <- obs
obs <- unlist(tmp)
names(obs) <- make.unique(rep(names(tmp), times = lengths(tmp)))
}
return(obs)
}
So at least in this usecase it leads to the same result and gives a > 800x speedup.
Any chance this could be integrated @Gesmira or @dcollins15 after some further testing?
system.time(
test1 <- labels(
object = cmap,
values = Cells(x = obj, layer = scale.layer)
)
)
>>> user system elapsed
>>> 53.363 0.304 54.158
system.time(
test2 <- optimized_labels(cmap, values = Cells(x = obj, layer = scale.layer))
)
>>> user system elapsed
>>> 0.067 0.000 0.068
identical(test13, test14)
>>> TRUE
Thanks for looking into this so thoroughly! We are currently reviewing your PR and will follow up in the near future.
Using
IntegrateLayers
without specifyinggroup.by
is slow. With my ~ 400 000 cells dataset, only the "preprocessing" before the actual integration, takes rather long (~10 min).If I specify
group.by
this "preprocessing" only takes a few seconds. When I tried to debug this, I've found that thelabels
function is the culprit. https://github.com/satijalab/seurat/blob/678141f74c3bada0fc2b599378430d1ba46b441d/R/integration5.R#L448-L450So maybe find some faster alternative code for that or maybe even force the user to specify
group.by
(don't really see the advantage in making this argument optional), Maybe @Gesmira?Thank you!