Closed: Alwash-317 closed this issue 2 weeks ago
Hi Alwash, have you tried using highly variable genes (2,000 HVGs in Seurat)? Another optimization to try is to first reduce to 2,000 HVGs, then reduce further to 100 PCs. This 100-PC by 5.4-million-cell matrix becomes your new feature-cell ("gene"-cell) matrix, and you run the rest of the pipeline on it.
This should greatly reduce the dataset size and make the processing feasible. Another idea is to use the BPCells package for on-disk matrices (which Seurat supports; see their pages/vignettes). A rough sketch of both ideas follows.
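Here is a minimal sketch of the HVG + PCA route, assuming a merged Seurat v5 object named `merged_seu` with a "Xenium" assay and spatial coordinates already stored as in your script; the new assay name "pcafeat" and the commented-out RunBanksy call are untested assumptions, not a verified recipe:

```r
library(Seurat)

# 1) Restrict to 2,000 highly variable genes, then compress to 100 PCs
DefaultAssay(merged_seu) <- "Xenium"
merged_seu <- NormalizeData(merged_seu)
merged_seu <- FindVariableFeatures(merged_seu, nfeatures = 2000)
merged_seu <- ScaleData(merged_seu, features = VariableFeatures(merged_seu))
merged_seu <- RunPCA(merged_seu, npcs = 100)

# 2) Treat the 100-PC x cell embedding as the new feature-cell ("gene"-cell)
#    matrix by wrapping the PC scores in a fresh assay ("pcafeat" is arbitrary)
pc_scores <- t(Embeddings(merged_seu, reduction = "pca"))   # 100 x n_cells
merged_seu[["pcafeat"]] <- CreateAssayObject(data = pc_scores)

# 3) BANKSY could then be run on the reduced matrix instead of the full panel:
# merged_seu <- RunBanksy(merged_seu, lambda = 0.8, assay = "pcafeat", slot = "data",
#                         features = "all", group = "Sample_ID",
#                         dimx = "sdimx", dimy = "sdimy",
#                         split.scale = TRUE, k_geom = 15)
```

For the BPCells route, the general pattern from the Seurat/BPCells vignettes is to move the counts to a compressed on-disk matrix so the object only holds a lightweight reference. A rough sketch (the directory name is a placeholder, and this assumes the "Xenium" assay is a v5 Assay5 object):

```r
library(BPCells)

# Write the in-memory counts to disk, then point the assay at the on-disk copy
write_matrix_dir(mat = merged_seu[["Xenium"]]$counts, dir = "xenium_counts_bp")
merged_seu[["Xenium"]]$counts <- open_matrix_dir(dir = "xenium_counts_bp")
```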
Best, Vipul
Also: let me know how it goes! Best, Vipul
Hi,
I’m working with an integrated Xenium dataset consisting of 12 samples, totaling approximately 5.4 million cells. After pre-processing the individual Xenium samples, I merged them into a single Seurat object for downstream analysis. However, I’m encountering issues when trying to run BANKSY due to the large size of the dataset. The R script is as follows:
```r
file_paths   <- c("path_1", "path_2", ..., "path_12")
sample_names <- c("sample_1", "sample_2", ..., "sample_12")

seu_list <- list()
for (i in seq_along(file_paths)) {
  seu <- readRDS(file_paths[i])
  # pull centroid coordinates from each sample's FOV and store them as metadata
  coords <- seu[[paste0("fov", sample_names[i])]]$centroids@coords
  seu$sdimx <- coords[, 1]
  seu$sdimy <- coords[, 2]
  seu_list[[i]] <- seu
}

merged_seu <- Reduce(merge, seu_list)
merged_seu <- JoinLayers(merged_seu)
DefaultAssay(merged_seu) <- "Xenium"

merged_seu <- RunBanksy(
  merged_seu, lambda = 0.8, assay = 'Xenium', slot = 'data',
  features = 'all', group = 'Sample_ID', dimx = 'sdimx', dimy = 'sdimy',
  split.scale = TRUE, k_geom = 15
)
```
It crashes at the `RunBanksy` step with the following log error:

I attempted to allocate more memory for the script (up to 800 GB) and monitored memory usage, which did not exceed this limit at the time of the crash. I also used the future package with `options(future.globals.maxSize = 256 * 1024^3)`, but the issue persists.
Given the size of the dataset, are there any less computationally intensive approaches or optimizations you would recommend for running BANKSY on datasets this large? Any suggestions for handling memory more efficiently, or alternative strategies, would be greatly appreciated.
Thank you for your help!