satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat

Severe decrease in processing speed after CRAN v5 update #8127

Open Dario-Rocha opened 10 months ago

Dario-Rocha commented 10 months ago

Hello dear Seurat team, I've been using Seurat v5 for over half a year now, and since the official release, some scripts that used to run in under half an hour can no longer finish within a whole working day. Two steps that have become significantly slower are "Calculating Leverage Scores" during SketchData (which can now take over 10 hours) and NormalizeData (which used to be almost instantaneous and now takes almost a minute). I've noticed that during these and other processes no swap memory is being used, which suggests to me that the parallelization isn't working anymore, or that for some reason R + RStudio are refusing to use swap.

# future configuration used in the session
plan(multisession, workers = 14, gc = TRUE)
options(future.globals.maxSize = 3e+09)  # ~3 GB limit for globals exported to workers
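
A quick sanity check (a minimal sketch added for context, not part of the original session) is to confirm that the plan actually registered the workers before suspecting Seurat itself:

library(future)

# With the plan above in place, this should report 14; a value of 1 means
# everything is effectively running sequentially, whatever Seurat then does.
nbrOfWorkers()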

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.3.1

jcorn427 commented 10 months ago

I've been running into the same issue. I was wondering if it was just me, but "Calculating Leverage Scores" seems to take forever.

rsatija commented 10 months ago

Thanks for pointing this out - are you also observing this behavior on any of our example datasets (either small or large)? If you're able to provide an example that we can debug, we will figure out what is going on here.
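
For anyone who wants to try, a minimal timing sketch along those lines (using the pbmc_small object bundled with SeuratObject; this is an illustration, not the reporters' data) might look like:

library(Seurat)

data("pbmc_small")  # tiny example object shipped with SeuratObject

# Time the steps reported as slow in this thread; on ~80 cells these should
# complete in well under a second, so a large jump between Seurat versions
# would already be informative.
system.time(pbmc_small <- NormalizeData(pbmc_small))
system.time(pbmc_small <- FindVariableFeatures(pbmc_small))
system.time(pbmc_small <- ScaleData(pbmc_small))
system.time(pbmc_small <- RunPCA(pbmc_small, npcs = 10))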

dango147 commented 10 months ago

I thought I was the only one. Two days ago I updated to v5.0.1 because of the highlighting issue in DimPlot, and since then things that used to run in under 3 minutes run for hours. I also started running out of memory when running PCA, which had never been an issue before (if I remember correctly, I never needed that much memory). I work with ~100K cells, have 32 GB of RAM, and load counts from disk using BPCells. Here is part of my script (I don't know how helpful it will be, but just in case).

library(Seurat)
library(SeuratWrappers)
options(Seurat.object.assay.version = "v3")

main_dir <- "somedir"
setwd(main_dir)
sample_dirs <- list.dirs(main_dir, full.names = TRUE, recursive = FALSE)
ldat <- list()

# Read the velocyto loom file for each sample and build per-sample Seurat objects
for (sample_dir in sample_dirs) {
  if (grepl("^Pig", basename(sample_dir))) {
    loom_path <- file.path(sample_dir, "velocyto", paste0(basename(sample_dir), ".loom"))
    if (file.exists(loom_path)) {
      bm <- ReadVelocity(file = loom_path)
      rownames(bm$spliced) <- make.unique(rownames(bm$spliced), sep = "_")
      # rownames(bm$unspliced) <- make.unique(rownames(bm$unspliced), sep = "_")
      # rownames(bm$ambiguous) <- make.unique(rownames(bm$ambiguous), sep = "_")
      bm <- as.Seurat(bm)
      bm[["Sample"]] <- basename(sample_dir)
      ldat[[basename(sample_dir)]] <- bm
      rm(bm)
    }
  }
}

# Merge the per-sample objects into a single (v3 assay) object
combined_spliced <- merge(ldat[[1]], y = ldat[-1])

rm(ldat)

# Switch to v5 assays and write the count matrices to disk with BPCells
library(BPCells)
options(Seurat.object.assay.version = "v5")
options(future.globals.maxSize = 1e9)

combined_spliced <- UpdateSeuratObject(combined_spliced)

Metadata <- combined_spliced@meta.data

dir_spliced   <- file.path(getwd(), "spliced_BP")
dir_unspliced <- file.path(getwd(), "unspliced_BP")
dir_ambiguous <- file.path(getwd(), "ambiguous_BP")

write_matrix_dir(mat = combined_spliced[["spliced"]]$counts,   dir = dir_spliced)
write_matrix_dir(mat = combined_spliced[["unspliced"]]$counts, dir = dir_unspliced)
write_matrix_dir(mat = combined_spliced[["ambiguous"]]$counts, dir = dir_ambiguous)

rm(combined_spliced)

# Rebuild a v5 object backed by the on-disk BPCells matrices
load("Metadata_Combined_BP.Rdata")
mat <- open_matrix_dir(dir = dir_spliced)
Combined_BP <- CreateSeuratObject(counts = mat, meta.data = Metadata, assay = "spliced")
Combined_BP[["unspliced"]] <- CreateAssay5Object(counts = open_matrix_dir(dir = dir_unspliced))
Combined_BP[["ambiguous"]] <- CreateAssay5Object(counts = open_matrix_dir(dir = dir_ambiguous))

######### INTEGRATE BASED ON SPLICED ASSAY ###################

Combined_BP[["spliced"]] <- split(Combined_BP[["spliced"]], f = Combined_BP$Sample)

Combined_BP <- NormalizeData(Combined_BP)

Combined_BP <- FindVariableFeatures(Combined_BP, nfeatures = 3000)

Combined_BP <- ScaleData(Combined_BP, features = rownames(Combined_BP))

Combined_BP <- RunPCA(Combined_BP, npcs = 30)  # <----- CRASH

[...] Continues ...

##############################################################################

NOTE: The session reported below doesn't include the first part of the script. I only loaded the matrices that I had generated in a previous iteration, created the Seurat object, and processed it up to the RunPCA line, where it crashed after a few minutes.

sessionInfo()

R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale: [1] LC_COLLATE=English_United Kingdom.utf8 LC_CTYPE=English_United Kingdom.utf8
[3] LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.utf8

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] BPCells_0.1.0 Seurat_5.0.1 SeuratObject_5.0.1 sp_2.1-2

loaded via a namespace (and not attached): [1] spam_2.10-0 plyr_1.8.9 igraph_1.6.0
[4] lazyeval_0.2.2 splines_4.2.3 RcppHNSW_0.5.0
[7] BiocParallel_1.32.6 listenv_0.9.0 scattermore_1.2
[10] usethis_2.2.2 GenomeInfoDb_1.34.9 ggplot2_3.4.4
[13] digest_0.6.33 htmltools_0.5.7 fansi_1.0.6
[16] magrittr_2.0.3 memoise_2.0.1 tensor_1.5
[19] cluster_2.1.4 ROCR_1.0-11 remotes_2.4.2.1
[22] globals_0.16.2 Biostrings_2.66.0 matrixStats_1.1.0
[25] spatstat.sparse_3.0-3 colorspace_2.1-0 ggrepel_0.9.4
[28] dplyr_1.1.4 crayon_1.5.2 RCurl_1.98-1.13
[31] jsonlite_1.8.8 spatstat.data_3.0-3 progressr_0.14.0
[34] survival_3.5-3 zoo_1.8-12 glue_1.6.2
[37] polyclip_1.10-6 gtable_0.3.4 zlibbioc_1.44.0
[40] XVector_0.38.0 leiden_0.4.3.1 DelayedArray_0.24.0
[43] pkgbuild_1.4.3 future.apply_1.11.0 BiocGenerics_0.44.0
[46] abind_1.4-5 scales_1.3.0 spatstat.random_3.2-2
[49] miniUI_0.1.1.1 Rcpp_1.0.11 viridisLite_0.4.2
[52] xtable_1.8-4 reticulate_1.34.0 dotCall64_1.1-1
[55] stats4_4.2.3 profvis_0.3.8 htmlwidgets_1.6.4
[58] httr_1.4.7 RColorBrewer_1.1-3 ellipsis_0.3.2
[61] ica_1.0-3 urlchecker_1.0.1 pkgconfig_2.0.3
[64] XML_3.99-0.16 uwot_0.1.16 deldir_2.0-2
[67] utf8_1.2.4 tidyselect_1.2.0 rlang_1.1.2
[70] reshape2_1.4.4 later_1.3.2 munsell_0.5.0
[73] tools_4.2.3 cachem_1.0.8 cli_3.6.2
[76] generics_0.1.3 devtools_2.4.5 ggridges_0.5.4
[79] stringr_1.5.1 fastmap_1.1.1 goftest_1.2-3
[82] yaml_2.3.8 fs_1.6.3 fitdistrplus_1.1-11
[85] purrr_1.0.2 RANN_2.6.1 nlme_3.1-162
[88] pbapply_1.7-2 future_1.33.0 mime_0.12
[91] compiler_4.2.3 rstudioapi_0.15.0 plotly_4.10.3
[94] png_0.1-8 spatstat.utils_3.0-4 tibble_3.2.1
[97] stringi_1.8.2 RSpectra_0.16-1 lattice_0.20-45
[100] Matrix_1.6-3 vctrs_0.6.5 pillar_1.9.0
[103] lifecycle_1.0.4 spatstat.geom_3.2-7 lmtest_0.9-40
[106] RcppAnnoy_0.0.21 data.table_1.14.10 cowplot_1.1.1
[109] bitops_1.0-7 irlba_2.3.5.1 httpuv_1.6.13
[112] patchwork_1.1.3 rtracklayer_1.58.0 GenomicRanges_1.50.2
[115] R6_2.5.1 BiocIO_1.8.0 promises_1.2.1
[118] KernSmooth_2.23-20 gridExtra_2.3 IRanges_2.32.0
[121] parallelly_1.36.0 sessioninfo_1.2.2 codetools_0.2-19
[124] fastDummies_1.7.3 MASS_7.3-58.2 pkgload_1.3.3
[127] SummarizedExperiment_1.28.0 rjson_0.2.21 GenomicAlignments_1.34.1
[130] sctransform_0.4.1 Rsamtools_2.14.0 S4Vectors_0.36.2
[133] GenomeInfoDbData_1.2.9 parallel_4.2.3 grid_4.2.3
[136] tidyr_1.3.0 MatrixGenerics_1.10.0 Rtsne_0.17
[139] spatstat.explore_3.2-5 Biobase_2.58.0 shiny_1.8.0
[142] restfulr_0.0.15

rsatija commented 10 months ago

Would you be able to share the Seurat object where the RunPCA step crashes, or alternatively, share the loom file? You can send the link to Seurat Help (seuratpackage@gmail.com), and we will certainly take a look.

dango147 commented 10 months ago

> Would you be able to share the Seurat object where the RunPCA step crashes, or alternatively, share the loom file? You can send the link to Seurat Help (seuratpackage@gmail.com), and we will certainly take a look.

Done

dango147 commented 10 months ago

Sorry, this could be helpful to someone. I just ran the same code on our server, which still had the v5 beta installed (v4.9.9.9060), and it completed the RunPCA step using ~8 GB of RAM, so I don't know what happened after I updated Seurat on my laptop. With the beta version, my laptop used to work even better than our server.

UPDATE: I know this isn't the right way to do it, but it's the only way I could come up with. Since I couldn't find a way to re-install the v5 beta on my laptop, I compressed the Seurat and SeuratObject libraries I had on the server and installed them on my laptop. Then I downgraded Matrix to v1.6.1, and now everything works as it used to.

Dario-Rocha commented 10 months ago

Well, I need to move forward with my project, so I need to downgrade for the moment. Could you be so kind, @dango147, as to share the older Seurat and SeuratObject libraries that are working fine for you? Also, any hints for successfully downgrading to Matrix v1.6.1?

dango147 commented 10 months ago

> Well, I need to move forward with my project, so I need to downgrade for the moment. Could you be so kind, @dango147, as to share the older Seurat and SeuratObject libraries that are working fine for you? Also, any hints for successfully downgrading to Matrix v1.6.1?

Sure. First, I removed the Seurat, SeuratObject and Matrix libraries from the RStudio packages menu. Then I restarted R and ran devtools::install_version("Matrix", version = "1.6.1"). Once Matrix was installed, I manually installed the two Seurat packages from these two zip files:

SeuratObject.zip Seurat.zip

I don't know how helpful this will be, as we could be using completely different environments, but I hope it helps! There is nothing else I can do.
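
For reference, a rough sketch of that downgrade path (versions taken from this thread; the archive paths are placeholders, and this is a workaround rather than an official fix):

# Remove the current versions, then restart R
remove.packages(c("Seurat", "SeuratObject", "Matrix"))

# Pin Matrix to the version reported to work in this thread
devtools::install_version("Matrix", version = "1.6.1")

# Install the older SeuratObject/Seurat builds from local archives
# (paths are placeholders; on Windows these are .zip binaries)
install.packages(c("path/to/SeuratObject.zip", "path/to/Seurat.zip"),
                 repos = NULL, type = "win.binary")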

Dario-Rocha commented 10 months ago

Somehow I can't downgrade the Matrix package, and I am on a Mac while the binaries you kindly provided are for Windows. Thank you a lot for the effort anyway! Hopefully we can get some kind of quick, temporary official solution.

jcorn427 commented 10 months ago

Is there any way to still access the beta releases? I have a project with 1.36 million cells, and the "Calculating Leverage Score" step hasn't finished after running for multiple days. Any way to speed this up would be much appreciated.
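
As an aside, released versions can usually be pinned from the CRAN archive, whereas the v5 beta (e.g. 4.9.9.9060) was only distributed via GitHub; a hedged sketch, with the GitHub refs left as placeholders:

library(remotes)

# CRAN archives released versions only (e.g. the last v4 releases)
install_version("SeuratObject", version = "4.1.4")
install_version("Seurat", version = "4.4.0")

# The v5 beta would have to come from a specific GitHub ref/commit instead
# (placeholder refs; the original beta branch may no longer be available)
# install_github("satijalab/seurat-object", ref = "<commit-or-branch>")
# install_github("satijalab/seurat", ref = "<commit-or-branch>")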

sknaack commented 8 months ago

Hi all, after updating to Seurat v5 a few weeks ago I am experiencing the issue described here when running Seurat functions within RStudio on an M2 Ultra machine (a Mac Studio). I don't believe the future parallelization is being applied when I request a "multisession" or "multicore" plan across my 24 CPUs. Has this been resolved? Is there an update to Seurat (or a version of future or RStudio) that I should look for? I might have to try downgrading.

jtourig commented 8 months ago

I'm also curious whether the {future} parallelization applies to Seurat v5, and what sort of performance gains we should expect. The v4.3 documentation does not apply exactly (e.g. you can enable multisession but not multiprocess), and I'm not clear whether it's actually implemented in v5.

Currently I'm working with a modest dataset of ~10K features across ~180K cells, and iterating over the workflow to test different parameters is awfully slow.

sknaack commented 8 months ago

In case it's helpful to anyone: I've had much better luck running Seurat v5 processes outside of RStudio, i.e. from command-line scripts passed to Rscript, generally using a multicore plan() in future. Pretty much the same code and libraries, but much better processing time. I suspect this comes down to hitches with future inside RStudio on macOS. For now I'm doing well via the command line. If anyone has advice on how best to set up future (multicore vs multisession vs multiprocess) in R on M-class processors under macOS, I'd be curious.
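
A minimal sketch of that command-line setup (file name, input paths, and parameters are placeholders; whether Seurat v5 actually dispatches these steps to the workers is exactly what this thread is questioning):

# run_seurat.R -- execute with:  Rscript run_seurat.R
library(Seurat)
library(future)

# Forked (multicore) workers are not supported inside RStudio, where future
# falls back to sequential processing; from Rscript on macOS/Linux they work.
plan(multicore, workers = 8)
options(future.globals.maxSize = 8e9)

obj <- readRDS("combined_object.rds")  # placeholder input
obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- SketchData(obj, ncells = 50000, method = "LeverageScore")
saveRDS(obj, "combined_sketched.rds")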

philjurm commented 5 months ago

Does anybody have any updates on this? I'm using Seurat 5.1.0 and R 4.3.2, and SketchData has been running for 4 days straight now. I have an extensive dataset of 250+ samples, but the cell number should be manageable (1.4 million). Running directly in R (not RStudio) and using future's multicore implementation did not improve this, although it does not seem like SketchData even utilizes multiple cores, since I only see a single R process running. Any help would be appreciated!

Dario-Rocha commented 4 months ago

I've just tried again after updating to Seurat 5.1.0, and the issue is still the same on a dataset of 1.3 million cells and 92 samples.

jfwhalen commented 4 months ago

I have the same issue, and it seems to be this line that causes the slowdown, at roughly 2 hours per layer in my data:

try(
  expr = VariableFeatures(object = sketched, method = "sketch", layer = lyr) <-
    VariableFeatures(object = object[[assay]], layer = lyr),
  silent = FALSE
)
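
One way to confirm where the time goes (a sketch using base R's sampling profiler; obj stands in for the full object):

# Profile a SketchData call and inspect the hot spots
Rprof("sketchdata_profile.out")
obj <- SketchData(obj, ncells = 50000)
Rprof(NULL)

# Top entries by total time; the VariableFeatures assignment quoted above
# should show up here if it is indeed the bottleneck
head(summaryRprof("sketchdata_profile.out")$by.total, 10)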

Pentayouth commented 1 month ago

NormalizeData is slow, FindVariableFeatures is slow, ScaleData is slow, RunPCA is slow, and SketchData is slow (I started SketchData before going to sleep and found it still running when I woke up). Everything gets slow with BPCells and a large dataset.

JoToCu commented 1 month ago

Has anyone found a solution to this yet? I've been stuck on "Calculating Leverage Scores" for hours. Using Seurat v5.1.0.

eviho commented 1 month ago

Hi, do we have a solution for this issue? I am now trying to use SketchData() on a dataset with ~660,000 cells and ~38,000 features. It cannot complete within 24 hours.

eviho commented 3 weeks ago

> Hi, do we have a solution for this issue? I am now trying to use SketchData() on a dataset with ~660,000 cells and ~38,000 features. It cannot complete within 24 hours.

If of interest: the SketchData() call completed in ~44 hours and used up to ~96 GB of RAM.

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2