satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat

future seed issue with SCtransform #4852

Closed rbutleriii closed 3 years ago

rbutleriii commented 3 years ago

SCTransform doesn't set future.seed=TRUE in its future_lapply calls, which results in an error and warnings. The argument can't be passed in through SCTransform either: it ends up as an unused argument in vst().

> plan("multiprocess", workers=16)
> options(future.globals.maxSize=2000*1024^2)
> sc.list = SplitObject(sc, split.by="group")
>
> # SCTransform on each group
> sc.list = lapply(X=sc.list, FUN=SCTransform, method="glmGamPoi",
+                  vars.to.regress=c("percent.mt", "batch", "orig.ident"))
Calculating cell attributes from input UMI matrix: log_umi
Variance stabilizing transformation of count matrix of size 24382 by 36417
Model formula is y ~ log_umi
Get Negative Binomial regression parameters per gene
Using 2000 genes, 5000 cells
  |======================================================================| 100%
Found 95 outliers - those will be ignored in fitting/regularization step

Second step: Get residuals using fitted parameters for 24382 genes
  |======================================================================| 100%
Computing corrected count matrix for 24382 genes
  |======================================================================| 100%
Calculating gene attributes
Wall clock passed: Time difference of 6.869874 mins
Determine variable features
Place corrected count matrix in counts slot
Regressing out percent.mt, batch, orig.ident
Centering data matrix
Set default assay to SCT
Calculating cell attributes from input UMI matrix: log_umi
Variance stabilizing transformation of count matrix of size 24462 by 43947
Model formula is y ~ log_umi
Get Negative Binomial regression parameters per gene
Using 2000 genes, 5000 cells
  |======================================================================| 100%
Found 76 outliers - those will be ignored in fitting/regularization step

Second step: Get residuals using fitted parameters for 24462 genes
  |======================================================================| 100%
Computing corrected count matrix for 24462 genes
  |======================================================================| 100%
Calculating gene attributes
Wall clock passed: Time difference of 7.829813 mins
Determine variable features
Place corrected count matrix in counts slot
Regressing out percent.mt, batch, orig.ident
Error: Failed to retrieve the result of MulticoreFuture (future_lapply-1) from the forked worker (on localhost; PID 1774). Post-mortem diagnostic: No process exists with this PID, i.e. the forked localhost worker is no longer alive.
In addition: There were 50 or more warnings (use warnings() to see the first 50)

From warnings()

50: UNRELIABLE VALUE: Future (‘future_lapply-1’) unexpectedly generated random numbers without specifying argument 'future.seed'. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify 'future.seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use 'future.seed=NULL', or set option 'future.rng.onMisuse' to "ignore".
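For readers unfamiliar with the warning above, here is a minimal sketch of what `future.seed` does in `future.apply::future_lapply()`. It is not the sctransform code; the `draw` function and the seed value 42 are made up for illustration, and the sequential plan is used only to keep the example portable.

```r
library(future)
library(future.apply)

plan(sequential)  # any backend would do; sequential keeps the sketch portable

# Hypothetical worker function that draws random numbers inside each future.
draw <- function(i) rnorm(3)

# future.seed = TRUE gives each element its own L'Ecuyer-CMRG stream,
# which is what the warning text asks for.
x <- future_lapply(1:4, draw, future.seed = TRUE)

# Passing an integer seed instead makes the draws reproducible across calls.
a <- future_lapply(1:4, draw, future.seed = 42L)
b <- future_lapply(1:4, draw, future.seed = 42L)
identical(a, b)
```

Calling `future_lapply(1:4, draw)` without `future.seed` is what triggers the "UNRELIABLE VALUE" warning quoted above.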

sessionInfo

R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.14.0     ifnb.SeuratData_3.1.0 SeuratData_0.2.1
[4] ggplot2_3.3.3         patchwork_1.1.1       SeuratObject_4.0.0
[7] Seurat_4.0.0          future_1.21.0

loaded via a namespace (and not attached):
  [1] Rtsne_0.15                  colorspace_2.0-0
  [3] deldir_0.2-10               ellipsis_0.3.1
  [5] ggridges_0.5.3              XVector_0.30.0
  [7] GenomicRanges_1.42.0        rstudioapi_0.13
  [9] spatstat.data_2.0-0         leiden_0.3.7
 [11] listenv_0.8.0               ggrepel_0.9.1
 [13] fansi_0.4.2                 codetools_0.2-18
 [15] splines_4.0.0               polyclip_1.10-0
 [17] jsonlite_1.7.2              ica_1.0-2
 [19] cluster_2.1.1               png_0.1-7
 [21] uwot_0.1.10                 shiny_1.6.0
 [23] sctransform_0.3.2           compiler_4.0.0
 [25] httr_1.4.2                  assertthat_0.2.1
 [27] Matrix_1.3-2                fastmap_1.1.0
 [29] lazyeval_0.2.2              cli_2.3.1
 [31] later_1.1.0.1               htmltools_0.5.1.1
 [33] tools_4.0.0                 igraph_1.2.6
 [35] gtable_0.3.0                glue_1.4.2
 [37] GenomeInfoDbData_1.2.4      RANN_2.6.1
 [39] reshape2_1.4.4              dplyr_1.0.5
 [41] rappdirs_0.3.3              Rcpp_1.0.6
 [43] spatstat_1.64-1             Biobase_2.50.0
 [45] scattermore_0.7             vctrs_0.3.6
 [47] debugme_1.1.0               nlme_3.1-152
 [49] lmtest_0.9-38               stringr_1.4.0
 [51] globals_0.14.0              ps_1.5.0
 [53] mime_0.10                   miniUI_0.1.1.1
 [55] lifecycle_1.0.0             irlba_2.3.3
 [57] goftest_1.2-2               zlibbioc_1.36.0
 [59] MASS_7.3-53.1               zoo_1.8-8
 [61] scales_1.1.1                promises_1.2.0.1
 [63] MatrixGenerics_1.2.1        spatstat.utils_2.0-0
 [65] parallel_4.0.0              SummarizedExperiment_1.20.0
 [67] RColorBrewer_1.1-2          reticulate_1.18
 [69] pbapply_1.4-3               gridExtra_2.3
 [71] rpart_4.1-15                stringi_1.5.3
 [73] S4Vectors_0.28.1            BiocGenerics_0.36.0
 [75] GenomeInfoDb_1.26.2         rlang_0.4.10
 [77] pkgconfig_2.0.3             matrixStats_0.58.0
 [79] bitops_1.0-6                lattice_0.20-41
 [81] glmGamPoi_1.2.0             ROCR_1.0-11
 [83] purrr_0.3.4                 tensor_1.5
 [85] htmlwidgets_1.5.3           cowplot_1.1.1
 [87] tidyselect_1.1.0            parallelly_1.23.0
 [89] RcppAnnoy_0.0.18            plyr_1.8.6
 [91] magrittr_2.0.1              R6_2.5.0
 [93] IRanges_2.24.1              generics_0.1.0
 [95] DelayedArray_0.16.2         DBI_1.1.1
 [97] pillar_1.5.1                withr_2.4.1
 [99] mgcv_1.8-34                 fitdistrplus_1.1-3
[101] survival_3.2-7              abind_1.4-5
[103] RCurl_1.98-1.2              tibble_3.1.0
[105] future.apply_1.7.0          crayon_1.4.1
[107] KernSmooth_2.23-18          utf8_1.1.4
[109] plotly_4.9.3                grid_4.0.0
[111] digest_0.6.27               xtable_1.8-4
[113] tidyr_1.1.3                 httpuv_1.5.5
[115] stats4_4.0.0                munsell_0.5.0
[117] viridisLite_0.3.0

ChristophH commented 3 years ago

You are right. In the sctransform package we use future_lapply without the future.seed parameter. I've just changed this in the develop branch. However, I don't think that's related to the error you are seeing. For a more informative error message, you might want to process the elements of your sc.list individually without multiprocessing.
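One way to follow that suggestion is sketched below. `run_each` is a hypothetical helper, not part of Seurat or sctransform. It applies a function to one list element at a time and records which element fails. The toy list stands in for `sc.list`.

```r
# Hypothetical helper: process list elements one by one and keep any error
# object in place of the result, so the failing element is easy to spot.
run_each <- function(xs, fun, ...) {
  out <- vector("list", length(xs))
  names(out) <- names(xs)
  for (i in seq_along(xs)) {
    label <- if (is.null(names(xs))) as.character(i) else names(xs)[i]
    message("Processing element: ", label)
    res <- tryCatch(
      fun(xs[[i]], ...),
      error = function(e) {
        message("  failed: ", conditionMessage(e))
        e  # return the condition so it can be inspected afterwards
      }
    )
    out[i] <- list(res)
  }
  out
}

# Toy stand-in for sc.list; the second element fails on purpose.
toy <- list(a = 1, b = "x", c = 3)
res <- run_each(toy, function(x) x + 1)
failed <- names(res)[vapply(res, inherits, logical(1), what = "error")]
failed
```

With the real data, the call would presumably look like `run_each(sc.list, SCTransform, method = "glmGamPoi", vars.to.regress = c("percent.mt", "batch", "orig.ident"))` after `plan("sequential")`. Note that `tryCatch` only catches R errors. If the operating system kills the process for running out of memory, nothing is caught, but the last "Processing element" message still shows where it happened.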

rbutleriii commented 3 years ago

I was also out of memory, and I did get it to run without errors by running it without future loaded (and with more memory). The future.seed warnings also popped up for IntegrateData and FindClusters, but there were no errors with sufficient memory.

Worth noting this is a big-ish batch analysis: 175k cells and four groups; SCT + rPCA + reference. This can be closed with the warnings handled, but perhaps there could be a more informative post-mortem message for a killed worker? Though when I exited the interactive session on the cluster it gave me a memory error, so it will be more obvious to batch users and local users.
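To put the memory pressure in numbers, here is a back-of-the-envelope sketch using the matrix sizes from the log in the original post (24382 by 36417 and 24462 by 43947). It assumes that a dense double-precision matrix of that shape is held in memory at some point. The helper name `dense_gib` is made up for illustration.

```r
# Size in GiB of a dense numeric matrix (8 bytes per double by default).
dense_gib <- function(n_genes, n_cells, bytes_per_value = 8) {
  n_genes * n_cells * bytes_per_value / 1024^3
}

sizes <- c(
  first  = dense_gib(24382, 36417),
  second = dense_gib(24462, 43947)
)
round(sizes, 1)  # roughly 6.6 and 8.0 GiB
```

Each such matrix alone is several times the 2000 MiB set via `future.globals.maxSize` in the original post, and any copies made along the way add to the total.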

saketkc commented 3 years ago

Hi @rbutleriii, besides using the develop branch of sctransform, where @ChristophH has addressed the warning issue, you could set conserve.memory=TRUE to prevent creating the residual matrix (which is non-sparse) in full (this would increase the overall run time). We'll try to address it by providing a more informative postmortem message in the future. Feel free to reopen with any follow-up issues.
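A minimal sketch of the `conserve.memory` suggestion follows. It assumes Seurat is installed and uses the small `pbmc_small` example object from SeuratObject as a stand-in for one element of `sc.list`. The `method` and `vars.to.regress` arguments from the original post are left out because the example object lacks those metadata columns.

```r
library(Seurat)

# pbmc_small stands in for one group of the real data (an assumption made
# only so that the sketch is self-contained).
data("pbmc_small", package = "SeuratObject")

obj <- SCTransform(
  pbmc_small,
  conserve.memory = TRUE,  # avoid building the full dense residual matrix
  verbose = FALSE
)
DefaultAssay(obj)
```

According to the Seurat documentation, `conserve.memory = TRUE` also sets `return.only.var.genes` to `TRUE`, so residuals are kept only for the variable features.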