satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.29k stars 915 forks source link

FindMultiModalNeighbors ComputeSNN std::bad_alloc error #6473

Closed erbon7 closed 2 years ago

erbon7 commented 2 years ago

Hi Seurat devs,

I'm trying to analyze CITE-seq (RNA+ADT) data on a rather large dataset, but I'm having a memory error message when trying to compute the WNN graph.

here is the code:

 CD3
An object of class Seurat
36775 features across 116593 samples within 2 assays
Active assay: ADT (174 features, 174 variable features)
 1 other assay present: RNA

DefaultAssay(CD3) <- 'RNA'
CD3 <- NormalizeData(CD3) %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA()

DefaultAssay(CD3) <- 'ADT'
VariableFeatures(CD3) <- rownames(CD3[["ADT"]])
CD3 <- NormalizeData(CD3, normalization.method = 'CLR', margin = 2) %>% ScaleData() %>% RunPCA(reduction.name = 'apca')

CD3<- FindMultiModalNeighbors(CD3, reduction.list = list("pca", "apca"), dims.list= list(1:30, 1:18), modality.weight.name = "RNA.weight", verbose=TRUE)

The error message I get is:

Calculating cell-specific modality weights
Finding 20 nearest neighbors for each modality.
Calculating kernel bandwidths
Error in ComputeSNN(nn_ranked = Indices(object = nn)[, 1:s.nn], prune = prune.SNN) :
  std::bad_alloc
Calls: FindMultiModalNeighbors -> FindModalityWeights -> lapply -> FUN -> ComputeSNN
Execution halted

I tried on large mem machines (RAM > 500 Gb) but the error is the same. If I monitor the memory used, the process is crashing with max 34 Gb used. I tried with different versions/installations of R but the problem is the same. If I'm running the same code on a smaller object (bmcite object from SeuratData) the program is not crashing.

session info:

> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /env/export/v_cng_n02_scratch/v_scratch_LBA/eb/tools/anaconda3/envs/r-seurat/lib/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] patchwork_1.1.2    clustree_0.5.0     ggraph_2.0.6       ggplot2_3.3.6
[5] dplyr_1.0.10       sp_1.5-0           SeuratObject_4.1.1 Seurat_4.1.1

loaded via a namespace (and not attached):
  [1] nlme_3.1-159          matrixStats_0.62.0    spatstat.sparse_2.1-1
  [4] RcppAnnoy_0.0.19      RColorBrewer_1.1-3    httr_1.4.4
  [7] sctransform_0.3.4     tools_4.0.5           utf8_1.2.2
 [10] R6_2.5.1              irlba_2.3.5           rpart_4.1.16
 [13] KernSmooth_2.23-20    uwot_0.1.14           mgcv_1.8-40
 [16] rgeos_0.5-9           lazyeval_0.2.2        colorspace_2.0-3
 [19] withr_2.5.0           tidyselect_1.1.2      gridExtra_2.3
 [22] compiler_4.0.5        progressr_0.11.0      cli_3.4.1
 [25] plotly_4.10.0         scales_1.2.1          lmtest_0.9-40
 [28] spatstat.data_2.2-0   ggridges_0.5.3        pbapply_1.5-0
 [31] goftest_1.2-3         stringr_1.4.1         digest_0.6.29
 [34] spatstat.utils_2.3-1  pkgconfig_2.0.3       htmltools_0.5.3
 [37] parallelly_1.32.1     fastmap_1.1.0         htmlwidgets_1.5.4
 [40] rlang_1.0.5           shiny_1.7.2           farver_2.1.1
 [43] generics_0.1.3        zoo_1.8-11            jsonlite_1.8.0
 [46] spatstat.random_2.2-0 ica_1.0-3             magrittr_2.0.3
 [49] Matrix_1.4-1          Rcpp_1.0.9            munsell_0.5.0
 [52] fansi_1.0.3           viridis_0.6.2         abind_1.4-5
 [55] reticulate_1.26       lifecycle_1.0.2       stringi_1.7.8
 [58] MASS_7.3-58.1         Rtsne_0.16            plyr_1.8.7
 [61] grid_4.0.5            parallel_4.0.5        listenv_0.8.0
 [64] promises_1.2.0.1      ggrepel_0.9.1         deldir_1.0-6
 [67] miniUI_0.1.1.1        lattice_0.20-45       graphlayouts_0.8.1
 [70] cowplot_1.1.1         splines_4.0.5         tensor_1.5
 [73] pillar_1.8.1          igraph_1.3.4          spatstat.geom_2.4-0
 [76] future.apply_1.9.1    reshape2_1.4.4        codetools_0.2-18
 [79] leiden_0.4.3          glue_1.6.2            data.table_1.14.2
 [82] tweenr_2.0.2          png_0.1-7             vctrs_0.4.1
 [85] httpuv_1.6.6          polyclip_1.10-0       gtable_0.3.1
 [88] RANN_2.6.1            purrr_0.3.4           spatstat.core_2.4-4
 [91] tidyr_1.2.1           scattermore_0.8       future_1.28.0
 [94] ggforce_0.3.4         mime_0.12             tidygraph_1.2.2
 [97] xtable_1.8-4          later_1.2.0           survival_3.4-0
[100] viridisLite_0.4.1     tibble_3.1.8          cluster_2.1.3
[103] globals_0.16.1        fitdistrplus_1.1-8    ellipsis_0.3.2
[106] ROCR_1.0-11
yuhanH commented 2 years ago

Hi, @erbon7 500 Gb RAM is much bigger than your data really needs, so I don't think it is a memory issue. Could you check if you can run FindNeighbors from pca and apca?

Another minor point. Since you have 170+ proteins, when you run ScaleData, you would better set do.scale to FALSE.

DefaultAssay(CD3) <- 'ADT'
VariableFeatures(CD3) <- rownames(CD3[["ADT"]])
CD3 <- NormalizeData(CD3, normalization.method = 'CLR', margin = 2) %>% ScaleData(do.scale = FALSE) %>% RunPCA(reduction.name = 'apca')
erbon7 commented 2 years ago

Hi @yuhanH Thanks for your message. I checked the clustering with FindNeighbors and it runs fine for pca but with apca I have the same error message.

Code used:

CD3 <- FindNeighbors(CD3, reduction = 'apca', dims = 1:10, verbose=TRUE)

I also tried with the option nn.method = 'annoy' but the problem remains the same.

Error message:

Computing nearest neighbor graph
Computing SNN
Error in ComputeSNN(nn_ranked = nn.ranked, prune = prune.SNN) :
  std::bad_alloc
Calls: FindNeighbors ... FindNeighbors -> FindNeighbors.default -> ComputeSNN
Execution halted
yuhanH commented 2 years ago

Hi @erbon7 I think the issue is from the apca dimensional reduction. Could you please check if there are NA or nan value in the cell embeddings of apca? You can also have a try to run UMAP from apca.

DefaultAssay(CD3) <- 'ADT'
CD3 <- RunUMAP(CD3, reduction = 'apca', dims = 1:18, reduction.name = 'adt.umap', reduction.key = 'Uadt_')

If there is something wrong in the cell embeddings of apca, you need to check the ADT counts and data matrices.

erbon7 commented 2 years ago

Hi @yuhanH Thanks for your suggestions. It seems that there are no NAs or nan values in the cell embeddings:

> Embeddings(CD3, reduction="apca") %>% is.na() %>% sum
[1] 0
> Embeddings(CD3, reduction="apca") %>% is.nan() %>% sum
[1] 0

UMAP is running fine:

> DefaultAssay(CD3) <- 'ADT'
> CD3 <- RunUMAP(CD3, reduction = 'apca', dims = 1:18, reduction.name = 'adt.umap', reduction.key = 'Uadt_')
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
14:40:20 UMAP embedding parameters a = 0.9922 b = 1.112
14:40:20 Read 116593 rows and found 18 numeric columns
14:40:20 Using Annoy for neighbor search, n_neighbors = 30
14:40:20 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
14:40:30 Writing NN index file to temp file /tmp/RtmpYLrII8/file1cfb45b53950
14:40:30 Searching Annoy index using 1 thread, search_k = 3000
14:40:59 Annoy recall = 19.72%
14:41:00 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
14:41:00 93631 smooth knn distance failures
14:41:04 Found 2 connected components, falling back to 'spca' initialization with init_sdev = 1
14:41:04 Using 'irlba' for PCA
14:41:04 PCA: 2 components explained 96.42% variance
14:41:04 Scaling init to sdev = 1
14:41:04 Commencing optimization for 200 epochs, with 6617896 positive edges
Using method 'umap'
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
14:52:01 Optimization finished
>
erbon7 commented 2 years ago

So it means that the nearest-neighbor procedure implemented in RunUMAP (via uwot package, if I'm not mistaken) is running fine with this dataset, right ?

yuhanH commented 2 years ago

Hi @erbon7 It is wired that adt PCA embeddings can be used for UMAP, but fail for FindNeighbors. Would you mind to send a reproducible example? yhao@nygenome.org

erbon7 commented 2 years ago

Turns out after investigation of the dataset by @yuhanH that the error message was caused by lots of "empty" cells in the ADT (proteins) assay. After removing the empty cells the FindMultiModalNeighbors function is working fine.

Empty cells removal snippet code:

CD3 <- subset(CD3, subset = nCount_ADT > 100)
erbon7 commented 2 years ago

Thanks @yuhanH for your help on this case. I now close this bug report.