prabhakarlab / Banksy

BANKSY: spatial clustering
https://prabhakarlab.github.io/Banksy

Running Banksy on large Xenium Dataset #39

Closed · Alwash-317 closed this 2 weeks ago

Alwash-317 commented 2 months ago

Hi,

I’m working with an integrated Xenium dataset consisting of 12 samples, totaling approximately 5.4 million cells. After pre-processing the individual Xenium samples, I merged them into a single Seurat object for downstream analysis. However, I’m encountering issues when trying to run BANKSY due to the large size of the dataset. The R script is as follows:

```r
file_paths <- c("path_1", "path_2", ..., "path_12")
sample_names <- c("sample_1", "sample_2", ..., "sample_12")

seu_list <- list()

# Read each sample and attach its spatial coordinates as cell metadata
for (i in seq_along(file_paths)) {
  seu <- readRDS(file_paths[i])
  coords <- seu[[paste0("fov", sample_names[i])]]$centroids@coords
  seu$sdimx <- coords[, 1]
  seu$sdimy <- coords[, 2]
  seu_list[[i]] <- seu
}

merged_seu <- Reduce(merge, seu_list)
merged_seu <- JoinLayers(merged_seu)
DefaultAssay(merged_seu) <- "Xenium"

merged_seu <- RunBanksy(
  merged_seu, lambda = 0.8, assay = 'Xenium', slot = 'data',
  features = 'all', group = 'Sample_ID', dimx = 'sdimx', dimy = 'sdimy',
  split.scale = TRUE, k_geom = 15
)
```

It crashes at the RunBanksy step with the following error in the log:

```
Error in [.data.table(knn_df, , abs(gcm[, to, drop = FALSE] %*% (weight *  :
  negative length vectors are not allowed
Calls: RunBanksy ... mapply -> <Anonymous> -> <Anonymous> -> [ -> [.data.table
In addition: Warning message:
In asMethod(object) :
  sparse->dense coercion: allocating vector of size 19.3 GiB
Execution halted.
```

I attempted to allocate more memory to the script (up to 800 GB) and monitored memory usage, which did not exceed this limit at the time of the crash. I also used the future package with `options(future.globals.maxSize = 256 * 1024^3)`, but the issue persists.

Given the size of the dataset, are there any computationally less intensive approaches or optimizations you would recommend for running BANKSY on such large datasets? Any suggestions to handle memory usage more efficiently or alternative strategies would be greatly appreciated.

Thank you for your help!

vipulsinghal02 commented 1 month ago

Hi Alwash, have you tried using highly variable genes (2000 HVGs in Seurat)? Another optimization to try is to first reduce to 2000 HVGs, then further reduce to 100 PCs. This 100 PC x 5.4 million cell matrix is now your new feature-cell matrix ("gene"-cell matrix), and you run the usual pipeline on it.

This should greatly reduce the dataset size and allow the processing to complete. Another idea is to use the BPCells package (which Seurat supports; see their pages/vignettes).
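
For illustration, a minimal sketch of the HVG-subsetting route, assuming standard Seurat calls and the object names from the script above:

```r
# Sketch only: subset to ~2000 highly variable genes before RunBanksy.
# Object names follow the script above; adjust to your workflow.
merged_seu <- FindVariableFeatures(merged_seu, nfeatures = 2000)
merged_seu_hvg <- subset(merged_seu, features = VariableFeatures(merged_seu))

# features = 'all' now means "all 2000 retained HVGs"
merged_seu_hvg <- RunBanksy(
  merged_seu_hvg, lambda = 0.8, assay = 'Xenium', slot = 'data',
  features = 'all', group = 'Sample_ID', dimx = 'sdimx', dimy = 'sdimy',
  split.scale = TRUE, k_geom = 15
)
# The further reduction to ~100 PCs would require building a new assay from
# the PC embeddings (not shown here).
```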

Best, Vipul

vipulsinghal02 commented 1 month ago

Also:

  1. Construct the BANKSY matrix separately for each sample, merge them, and run PCA on the merged matrix (a rough sketch follows below).
  2. See this: https://github.com/prabhakarlab/Banksy_py/issues/12#issuecomment-2268114768
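
For illustration, a rough sketch of point 1, assuming the same RunBanksy arguments as in the original script and that the wrapper writes its output to a 'BANKSY' assay (as in the package vignette):

```r
# Sketch only: run BANKSY per sample, then merge and run PCA on the BANKSY assay
banksy_list <- lapply(seu_list, function(seu) {
  RunBanksy(seu, lambda = 0.8, assay = 'Xenium', slot = 'data',
            features = 'all', dimx = 'sdimx', dimy = 'sdimy', k_geom = 15)
})

merged_banksy <- Reduce(merge, banksy_list)
merged_banksy <- JoinLayers(merged_banksy)  # if the merge produced per-sample layers
DefaultAssay(merged_banksy) <- 'BANKSY'     # assumed assay name created by RunBanksy

# Depending on your Seurat version you may need ScaleData() on the BANKSY assay first
merged_banksy <- RunPCA(
  merged_banksy, assay = 'BANKSY', features = rownames(merged_banksy),
  npcs = 30, reduction.name = 'pca.banksy'
)
```

Downstream clustering would then use the 'pca.banksy' reduction as usual.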

Let me know how it goes! Best, Vipul