satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.26k stars 908 forks source link

Large Integration Causes RSession Abort #2905

Closed AndrewSkelton closed 1 year ago

AndrewSkelton commented 4 years ago

Hi,

First and foremost, thanks for the hard work in making such a lovely framework to analyse with!

My problem is around 65 10x samples that I'm trying to integrate, which comes to around 1M Cells. I'll outline my code, session info, and machine stats below. This code runs fine with 1/3 of the data (~18 samples, ~330,000 Cells) in about an hour post SCT.

Methodology Each Sample ran through SCT, then Integration via SelectIntegrationFeatures, PrepSCTIntegration, RunPCA, FindIntegrationAnchors (rpca mode with reference), IntegrateData.

Expected Behaviour - Integration Anchors generated Actual Behaviour - RSession Abort / Crash. No error message produced

Machine 0.5TB Memory, 16 Cores

Code

1. Run SCT on Each Sample

var_files_in <- list.files("../Data/GRCh38/", ".h5", full.names = T)
for(j in 1:length(var_files_in)) {
  var_file                 <- var_files_in[j]
  var_file_nme             <- basename(var_file) %>% gsub(".filtered_gene_bc_matrices_h5.h5","",.)
  message(paste0("Processing Sample ",j," of ",length(var_files_in)," - ", var_file_nme))
  data_10x                 <- Read10X_h5(var_file)
  data_seu                 <- CreateSeuratObject(counts = data_10x, project = var_file_nme, min.cells = 3, min.features = 200)
  ## Cell wise filtering steps
  data_seu                 <- SCTransform(data_seu, vars.to.regress = c("percent_mt"), verbose = T, variable.features.n = 3000)
  ## Save Object
}

2. Import and Integrate

var_files_in <- list.files("../RData/", "SingleSample__", full.names = T) 
for(j in 1:length(var_files_in)) {
  var_file                 <- var_files_in[j]
  var_file_nme             <- basename(var_file) %>% gsub("SingleSample__|.RData","",.)
  load(var_file)
  data_list[[var_file_nme]] <- data_seu_list$Seurat
}
var_dims                  <- 1:30
var_features              <- SelectIntegrationFeatures(object.list = data_list, nfeatures = 3000)
data_list                 <- PrepSCTIntegration(object.list = dataist, anchor.features = var_features)
data_list                 <- lapply(data_list, FUN = RunPCA, verbose = F, features = var_features)
var_ref_sample            <- which(names(data_list) == "norm_1_1") 
data_anchors              <- FindIntegrationAnchors(object.list          = data_list,
                                                    normalization.method = "SCT",
                                                    anchor.features      = var_features,
                                                    reference            = var_ref_sample,
                                                    reduction            = "rpca")
rm(data_list); gc()
data_integrated <- IntegrateData(anchorset = data_anchors, normalization.method = "SCT")
data_integrated <- RunPCA(data_integrated, verbose = F)
data_integrated <- RunUMAP(data_integrated, dims = 1:30)

The code above fails at the FindIntegrationAnchors step - Doesn't progress to any sample by sample operations via messages in the console, and then after about an hour it crashes the RSession. The above code works perfectly on less samples, and I'm doubtful this is a memory error. Any way to narrow down what the problem might be?

Thanks for any help!

Andrew

Session Info

> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] SingleR_1.1.11              SummarizedExperiment_1.16.1 DelayedArray_0.12.3         BiocParallel_1.20.1         matrixStats_0.56.0          Biobase_2.46.0              GenomicRanges_1.38.0       
 [8] GenomeInfoDb_1.22.1         IRanges_2.20.2              S4Vectors_0.24.4            BiocGenerics_0.32.0         ggrepel_0.8.2               ggsci_2.9                   Seurat_3.1.5               
[15] forcats_0.5.0               stringr_1.4.0               dplyr_0.8.5                 purrr_0.3.4                 readr_1.3.1                 tidyr_1.0.2                 tibble_3.0.1               
[22] ggplot2_3.3.0               tidyverse_1.3.0            

loaded via a namespace (and not attached):
  [1] readxl_1.3.1                  backports_1.1.6               AnnotationHub_2.18.0          BiocFileCache_1.10.2          plyr_1.8.6                    igraph_1.2.5                  lazyeval_0.2.2               
  [8] splines_3.6.2                 listenv_0.8.0                 digest_0.6.25                 htmltools_0.4.0               gdata_2.18.0                  fansi_0.4.1                   memoise_1.1.0                
 [15] magrittr_1.5                  cluster_2.1.0                 ROCR_1.0-7                    globals_0.12.5                modelr_0.1.6                  colorspace_1.4-1              rappdirs_0.3.1               
 [22] blob_1.2.1                    rvest_0.3.5                   haven_2.2.0                   xfun_0.13                     crayon_1.3.4                  RCurl_1.98-1.2                jsonlite_1.6.1               
 [29] survival_3.1-12               zoo_1.8-7                     ape_5.3                       glue_1.4.0                    gtable_0.3.0                  zlibbioc_1.32.0               XVector_0.26.0               
 [36] leiden_0.3.3                  future.apply_1.5.0            scales_1.1.0                  DBI_1.1.0                     Rcpp_1.0.4.6                  isoband_0.2.1                 xtable_1.8-4                 
 [43] viridisLite_0.3.0             reticulate_1.15               bit_1.1-15.2                  rsvd_1.0.3                    tsne_0.1-3                    htmlwidgets_1.5.1             httr_1.4.1                   
 [50] gplots_3.0.3                  RColorBrewer_1.1-2            ellipsis_0.3.0                ica_1.0-2                     pkgconfig_2.0.3               farver_2.0.3                  uwot_0.1.8                   
 [57] dbplyr_1.4.3                  utf8_1.1.4                    AnnotationDbi_1.48.0          later_1.0.0                   tidyselect_1.0.0              labeling_0.3                  rlang_0.4.5                  
 [64] reshape2_1.4.4                BiocVersion_3.10.1            munsell_0.5.0                 cellranger_1.1.0              tools_3.6.2                   cli_2.0.2                     RSQLite_2.2.0                
 [71] generics_0.0.2                ExperimentHub_1.12.0          broom_0.5.5                   ggridges_0.5.2                fastmap_1.0.1                 evaluate_0.14                 yaml_2.2.1                   
 [78] npsurv_0.4-0                  bit64_0.9-7                   knitr_1.28                    fs_1.4.1                      fitdistrplus_1.0-14           caTools_1.18.0                RANN_2.6.1                   
 [85] pbapply_1.4-2                 future_1.17.0                 nlme_3.1-147                  mime_0.9                      xml2_1.3.1                    compiler_3.6.2                rstudioapi_0.11              
 [92] interactiveDisplayBase_1.24.0 curl_4.3                      plotly_4.9.2.1                png_0.1-7                     lsei_1.2-0                    reprex_0.3.0                  stringi_1.4.6                
 [99] RSpectra_0.16-0               lattice_0.20-41               Matrix_1.2-18                 vctrs_0.2.4                   pillar_1.4.3                  lifecycle_0.2.0               BiocManager_1.30.10          
[106] lmtest_0.9-37                 RcppAnnoy_0.0.16              BiocNeighbors_1.4.2           data.table_1.12.8             cowplot_1.0.0                 bitops_1.0-6                  irlba_2.3.3                  
[113] httpuv_1.5.2                  patchwork_1.0.0               R6_2.4.1                      promises_1.1.0                KernSmooth_2.23-16            gridExtra_2.3                 codetools_0.2-16             
[120] MASS_7.3-51.5                 gtools_3.8.2                  assertthat_0.2.1              withr_2.1.2                   sctransform_0.2.1             GenomeInfoDbData_1.2.2        mgcv_1.8-31                  
[127] hms_0.5.3                     grid_3.6.2                    rmarkdown_2.1                 DelayedMatrixStats_1.8.0      Rtsne_0.15                    shiny_1.4.0.2                 lubridate_1.7.8              
[134] base64enc_0.1-3     
satijalab commented 4 years ago

Thanks for this - we had a few questions for debugging.

  1. Does this work with LogNormalization rather than SCT?

  2. If you run this in the terminal, rather than RStudio, can you see what error message is thrown?

  3. Are you able to merge the 65 objects together (not even worrying about integration)?

AndrewSkelton commented 4 years ago

Thanks for the direction. I started with question 3 (made most sense to me, and easy to implement). Running a merge in the terminal and on RStudio results in the same issue. RStudio produces the same RSession abort, and the terminal session produces the below.

> foo <- merge(x = data_list[[1]], y = data_list[-1])
zsh: killed     R 

So that obviously narrows the problem down but pretty sure I can't use the traceback function if it kills the session.

I tried increasing the memory to ~1TB (by setting ~/.Renviron to R_MAX_VSIZE=1000Gb). Error persists.

So I'm a bit stumped.

AndrewSkelton commented 4 years ago

Small update - I've narrowed it to 20 samples (404,327 cells) that can be successfully merged before this issue kicks in.

satijalab commented 4 years ago

Thanks so much - just to clarify again which helps us debug, does your example work successfully (using the full dataset) using the LogNormalize workflow?

AndrewSkelton commented 4 years ago

The logNormalise workflow successfully generates an AnchorSet for all 65 samples using rpca with a single reference sample. Unfortunately it crashes on the IntegrateData step. Same kind of error, RSession abort.

dlmatera commented 3 years ago

did this ever get figured out? I am getting a similar issue with the Rsession aborting during SCTransform & it only happens on the last step & for my dataset w/ 200k cells (but not with the one with 150k cells)

histonemark commented 3 years ago

Hi, thanks for the wonderful package. I am having the same issue with ~100k cells

dlmatera commented 3 years ago

Hi, thanks for the wonderful package. I am having the same issue with ~100k cells

i ended up having to use 500GB ram and it started working, even though I should have had enough before (120gb of ram for a 40gb dataset)

hoondy commented 3 years ago

I have a similar issue but with a different error message. LogNormalization works fine but it gives an error at "IntegrateData" if I use SCT.

Here's code:

...
ref.list <- lapply(X = ref.list, FUN = function(x) {
    x <- suppressWarnings(SCTransform(x, verbose = FALSE))
    })

sct.features <- SelectIntegrationFeatures(object.list = ref.list, nfeatures = 3000, verbose = FALSE)
ref.list <- PrepSCTIntegration(object.list = ref.list, anchor.features = sct.features)

ref.list <- lapply(X = ref.list, FUN = function(x) {
    x <- RunPCA(x, features = sct.features, verbose = FALSE)
})

sct.anchors <- FindIntegrationAnchors(object.list = ref.list, normalization.method = "SCT", reduction = "rpca", anchor.features = sct.features, verbose = FALSE)

integrated <- IntegrateData(anchorset = sct.anchors, normalization.method = 'SCT', dims = 1:50, verbose = FALSE)

Here's error msg:

Error in RowMergeMatricesList(mat_list = all.mat, mat_rownames = all.rownames, : Need S4 class dgRMatrix for a sparse matrix
Traceback:

1. IntegrateData(anchorset = sct.anchors, normalization.method = "SCT", 
 .     dims = 1:50, verbose = FALSE)
2. PairwiseIntegrateReference(anchorset = anchorset, new.assay.name = new.assay.name, 
 .     normalization.method = normalization.method, features = features, 
 .     features.to.integrate = features.to.integrate, dims = dims, 
 .     k.weight = k.weight, weight.reduction = weight.reduction, 
 .     sd.weight = sd.weight, sample.tree = sample.tree, preserve.order = preserve.order, 
 .     eps = eps, verbose = verbose)
3. merge(x = object.1, y = object.2, merge.data = TRUE)
4. merge.Seurat(x = object.1, y = object.2, merge.data = TRUE)
5. merge(x = assays.merge[[1]], y = assays.merge[2:length(x = assays.merge)], 
 .     merge.data = merge.data)
6. merge.Assay(x = assays.merge[[1]], y = assays.merge[2:length(x = assays.merge)], 
 .     merge.data = merge.data)
7. RowMergeSparseMatrices(mat1 = counts.mats[[1]], mat2 = counts.mats[2:length(x = counts.mats)])
8. RowMergeMatricesList(mat_list = all.mat, mat_rownames = all.rownames, 
 .     all_rownames = all.names)
IreneP19 commented 3 years ago

Hi, I'm having a similar issue while integrating 6 10x datasets (size of list with the 6 objects: 22.9gb)

I can run the NormalizeData + FindVariableFeatures but I get "R session abort - fatal error" while running the FindIntegrationAnchors. Also planned to try FindIntegrationAnchors with reduction = "rpca" but I run in a fatal error again during ScaleData.

I'm running on a machine with 12 cores and 128gb RAM, latest versions of R and Seurat.

I am actually using a published dataset (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE161621) and their code (https://github.com/neurojacob/blum_et_al_2021/blob/master/paper_analysis_jupyter.ipynb) so it's supposed to work... But I also tried to follow strictly the Seurat vignettes with the same results.

Any help would be appreciated. I don't know if I should try a more powerful machine for instance.

vwtlin commented 3 years ago

Had a similar problem when working with 200k+ cells. It went through the FindIntegrationAnchors step using CCA, although it took a few hours. And then Rstuido aborts during the integration step.

mode1990 commented 3 years ago

I've got also the same problem trying to integrate data of 200K nuclei from patients with different driver mutations. Everything works well up until the "FindIntegrationAnchors" function, I've tried RStudio and R (running under Ubuntu16), both do not work and keep returning "cannot allocate buffer" or "this vector is too large". I'm really stumped and would look forward to getting an update from anyone who has figured this out. Thanks!

hopehealey commented 2 years ago

I ran into a similar problem with 10K cells, has anyone figured out a solution to this?

giorgiatosoni commented 2 years ago

Hi, I'm having the same problem, R session aborted during the integration step (12 datasets, 75K nuclei).

sscien commented 2 years ago

Problem too large error when merging 2M cells

moverm0210 commented 2 years ago

I'm having the same problem here with 230k cells. I'm testing SCTransform+harmony

caparada commented 2 years ago

Same problem here. 340K cells Has it ever been solved? I am stuck.

RYY0722 commented 1 year ago

Same problem, 84 k cells with around 169 G RAM on a node of HPC cluster.

oyxf commented 1 year ago

Same problem, 68K cells with around 150G RAM

keggleigh commented 1 year ago

Same problem, failing at IntegrateData step for 17 datasets (~460K cells), although the function runs for 6 processes then fails. Running on HPC cluster, 120Gb RAM. Using RPCA method with SCTransform.

Gesmira commented 1 year ago

Hi all, please refer to the new beta release of Seurat v5 for support for analyzing and integrating large datasets!