plger / scDblFinder

Methods for detecting doublets in single-cell sequencing data
https://plger.github.io/scDblFinder/
GNU General Public License v3.0
163 stars 17 forks source link

Batch mode with BPPARAM=MulticoreParam() not working #38

Closed JTpath closed 3 years ago

JTpath commented 3 years ago

Hi, not sure if this is an issue with scDblFinder, BiocParallel, or me, but this worked in the past but isn't working any more for some reason.

library(Seurat)
#> Attaching SeuratObject
library(scDblFinder)
library(BiocParallel)

l <- c(pbmc_small, pbmc_small)
l[[1]][["batch"]] = "A"
l[[2]][["batch"]] = "B"
seu <- merge(x=l[[1]], y=l[[2]])
#> Warning in CheckDuplicateCellNames(object.list = objects): Some cell names are
#> duplicated across objects provided. Renaming to enforce unique cell names.

sce <- as.SingleCellExperiment(seu)
out <- scDblFinder(sce, samples = "batch", BPPARAM=MulticoreParam(2))
#> Warning in parallel::mccollect(wait = FALSE, timeout = 1): 1 parallel job did
#> not deliver a result
#> Error in result[[njob]] <- value: attempt to select less than one element in OneIndex

Created on 2021-05-02 by the reprex package (v2.0.0)

Session info ``` r sessioninfo::session_info() ─ Session info ───────────────────────────────────────────────────────────────────────────────────────── setting value version R version 4.0.5 (2021-03-31) os macOS Big Sur 10.16 system x86_64, darwin17.0 ui RStudio language (EN) collate en_GB.UTF-8 ctype en_GB.UTF-8 tz Europe/London date 2021-05-02 ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────── package * version date lib source abind 1.4-5 2016-07-21 [1] CRAN (R 4.0.2) assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2) backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.2) beachmat 2.6.4 2020-12-20 [1] Bioconductor beeswarm 0.3.1 2021-03-07 [1] CRAN (R 4.0.3) Biobase 2.50.0 2020-10-27 [1] Bioconductor BiocGenerics 0.36.1 2021-04-16 [1] Bioconductor BiocNeighbors 1.8.2 2020-12-07 [1] Bioconductor BiocParallel * 1.24.1 2020-11-06 [1] Bioconductor BiocSingular 1.6.0 2020-10-27 [1] Bioconductor bitops 1.0-7 2021-04-24 [1] CRAN (R 4.0.2) bluster 1.0.0 2020-10-27 [1] Bioconductor callr 3.7.0 2021-04-20 [1] CRAN (R 4.0.2) cli 2.5.0 2021-04-26 [1] CRAN (R 4.0.2) clipr 0.7.1 2020-10-08 [1] CRAN (R 4.0.2) cluster 2.1.2 2021-04-17 [1] CRAN (R 4.0.2) codetools 0.2-18 2020-11-04 [1] CRAN (R 4.0.5) colorspace 2.0-0 2020-11-11 [1] CRAN (R 4.0.2) cowplot 1.1.1 2020-12-30 [1] CRAN (R 4.0.2) crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.2) data.table 1.14.0 2021-02-21 [1] CRAN (R 4.0.2) DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.3) DelayedArray 0.16.3 2021-03-24 [1] Bioconductor DelayedMatrixStats 1.12.3 2021-02-03 [1] Bioconductor deldir 0.2-10 2021-02-16 [1] CRAN (R 4.0.2) digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2) dplyr 1.0.5 2021-03-05 [1] CRAN (R 4.0.2) dqrng 0.2.1 2019-05-17 [1] CRAN (R 4.0.2) edgeR 3.32.1 2021-01-14 [1] Bioconductor ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2) evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.1) fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.2) fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.0.2) fitdistrplus 1.1-3 2020-12-05 [1] CRAN (R 4.0.2) fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2) future 1.21.0 2020-12-10 [1] CRAN (R 4.0.3) future.apply 1.7.0 2021-01-04 [1] CRAN (R 4.0.2) generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.2) GenomeInfoDb 1.26.7 2021-04-08 [1] Bioconductor GenomeInfoDbData 1.2.4 2021-01-17 [1] Bioconductor GenomicRanges 1.42.0 2020-10-27 [1] Bioconductor ggbeeswarm 0.6.0 2017-08-07 [1] CRAN (R 4.0.2) ggplot2 3.3.3 2020-12-30 [1] CRAN (R 4.0.2) ggrepel 0.9.1 2021-01-15 [1] CRAN (R 4.0.2) ggridges 0.5.3 2021-01-08 [1] CRAN (R 4.0.2) globals 0.14.0 2020-11-22 [1] CRAN (R 4.0.2) glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2) goftest 1.2-2 2019-12-02 [1] CRAN (R 4.0.2) gridExtra 2.3 2017-09-09 [1] CRAN (R 4.0.2) gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.2) highr 0.9 2021-04-16 [1] CRAN (R 4.0.5) htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2) htmlwidgets 1.5.3 2020-12-10 [1] CRAN (R 4.0.3) httpuv 1.6.0 2021-04-23 [1] CRAN (R 4.0.2) httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2) ica 1.0-2 2018-05-24 [1] CRAN (R 4.0.2) igraph 1.2.6 2020-10-06 [1] CRAN (R 4.0.2) IRanges 2.24.1 2020-12-12 [1] Bioconductor irlba 2.3.3 2019-02-05 [1] CRAN (R 4.0.2) jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.0.2) KernSmooth 2.23-18 2020-10-29 [1] CRAN (R 4.0.5) knitr 1.33 2021-04-24 [1] CRAN (R 4.0.2) later 1.2.0 2021-04-23 [1] CRAN (R 4.0.2) lattice 0.20-41 2020-04-02 [1] CRAN (R 4.0.5) lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.0.2) leiden 0.3.7 2021-01-26 [1] CRAN (R 4.0.3) lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.2) limma 3.46.0 2020-10-27 [1] Bioconductor listenv 0.8.0 2019-12-05 [1] CRAN (R 4.0.2) lmtest 0.9-38 2020-09-09 [1] CRAN (R 4.0.2) locfit 1.5-9.4 2020-03-25 [1] CRAN (R 4.0.2) magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.2) MASS 7.3-53.1 2021-02-12 [1] CRAN (R 4.0.5) Matrix 1.3-2 2021-01-06 [1] CRAN (R 4.0.5) MatrixGenerics 1.2.1 2021-01-30 [1] Bioconductor matrixStats 0.58.0 2021-01-29 [1] CRAN (R 4.0.2) mgcv 1.8-35 2021-04-18 [1] CRAN (R 4.0.2) mime 0.10 2021-02-13 [1] CRAN (R 4.0.2) miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.0.2) munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.2) nlme 3.1-152 2021-02-04 [1] CRAN (R 4.0.5) parallelly 1.24.0 2021-03-14 [1] CRAN (R 4.0.2) patchwork 1.1.1 2020-12-17 [1] CRAN (R 4.0.2) pbapply 1.4-3 2020-08-18 [1] CRAN (R 4.0.2) pillar 1.6.0 2021-04-13 [1] CRAN (R 4.0.5) pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2) plotly 4.9.3 2021-01-10 [1] CRAN (R 4.0.2) plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.2) png 0.1-7 2013-12-03 [1] CRAN (R 4.0.2) polyclip 1.10-0 2019-03-14 [1] CRAN (R 4.0.2) processx 3.5.1 2021-04-04 [1] CRAN (R 4.0.2) promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.0.2) ps 1.6.0 2021-02-28 [1] CRAN (R 4.0.3) purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2) R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2) RANN 2.6.1 2019-01-08 [1] CRAN (R 4.0.2) RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 4.0.2) Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.2) RcppAnnoy 0.0.18 2020-12-15 [1] CRAN (R 4.0.2) RCurl 1.98-1.3 2021-03-16 [1] CRAN (R 4.0.2) reprex * 2.0.0 2021-04-02 [1] CRAN (R 4.0.2) reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.0.2) reticulate 1.19 2021-04-21 [1] CRAN (R 4.0.2) rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.2) rmarkdown 2.7 2021-02-19 [1] CRAN (R 4.0.2) ROCR 1.0-11 2020-05-02 [1] CRAN (R 4.0.2) rpart 4.1-15 2019-04-12 [1] CRAN (R 4.0.5) rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.2) rsvd 1.0.5 2021-04-16 [1] CRAN (R 4.0.5) Rtsne 0.15 2018-11-10 [1] CRAN (R 4.0.2) S4Vectors 0.28.1 2020-12-09 [1] Bioconductor scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.2) scater 1.18.6 2021-02-26 [1] Bioconductor scattermore 0.7 2020-11-24 [1] CRAN (R 4.0.2) scDblFinder * 1.5.16 2021-04-19 [1] Github (plger/scDblFinder@d11467b) scran 1.18.7 2021-04-16 [1] Bioconductor sctransform 0.3.2.9006 2021-04-01 [1] Github (ChristophH/sctransform@73e2e3e) scuttle 1.0.4 2020-12-17 [1] Bioconductor sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2) Seurat * 4.0.1 2021-04-13 [1] Github (satijalab/seurat@4e868fc) SeuratObject * 4.0.0 2021-01-15 [1] CRAN (R 4.0.2) shiny 1.6.0 2021-01-25 [1] CRAN (R 4.0.3) SingleCellExperiment 1.12.0 2020-10-27 [1] Bioconductor sparseMatrixStats 1.2.1 2021-02-02 [1] Bioconductor spatstat.core 2.1-2 2021-04-18 [1] CRAN (R 4.0.2) spatstat.data 2.1-0 2021-03-21 [1] CRAN (R 4.0.3) spatstat.geom 2.1-0 2021-04-15 [1] CRAN (R 4.0.2) spatstat.sparse 2.0-0 2021-03-16 [1] CRAN (R 4.0.2) spatstat.utils 2.1-0 2021-03-15 [1] CRAN (R 4.0.2) statmod 1.4.35 2020-10-19 [1] CRAN (R 4.0.2) stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2) stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2) styler 1.4.1 2021-03-30 [1] CRAN (R 4.0.2) SummarizedExperiment 1.20.0 2020-10-27 [1] Bioconductor survival 3.2-11 2021-04-26 [1] CRAN (R 4.0.2) tensor 1.5 2012-05-05 [1] CRAN (R 4.0.2) tibble 3.1.1 2021-04-18 [1] CRAN (R 4.0.2) tidyr 1.1.3 2021-03-03 [1] CRAN (R 4.0.3) tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2) utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.2) uwot 0.1.10 2020-12-15 [1] CRAN (R 4.0.2) vctrs 0.3.7 2021-03-29 [1] CRAN (R 4.0.2) vipor 0.4.5 2017-03-22 [1] CRAN (R 4.0.2) viridis 0.6.0 2021-04-15 [1] CRAN (R 4.0.5) viridisLite 0.4.0 2021-04-13 [1] CRAN (R 4.0.5) withr 2.4.2 2021-04-18 [1] CRAN (R 4.0.5) xfun 0.22 2021-03-11 [1] CRAN (R 4.0.2) xgboost 1.4.1.1 2021-04-22 [1] CRAN (R 4.0.2) xtable 1.8-4 2019-04-21 [1] CRAN (R 4.0.2) XVector 0.30.0 2020-10-28 [1] Bioconductor yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2) zlibbioc 1.36.0 2020-10-28 [1] Bioconductor zoo 1.8-9 2021-03-09 [1] CRAN (R 4.0.2) [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library ```
plger commented 3 years ago

Hi, do you get the same error without the MulticoreParam argument?

JTpath commented 3 years ago

No, it even works if I set MulticoreParam to(1)

library(Seurat)
#> Registered S3 method overwritten by 'spatstat.geom':
#>   method     from
#>   print.boxx cli
#> Attaching SeuratObject
library(scDblFinder)
library(BiocParallel)
l <- c(pbmc_small, pbmc_small)
l[[1]][["batch"]] = "A"
l[[2]][["batch"]] = "B"
seu <- merge(x=l[[1]], y=l[[2]])
#> Warning in CheckDuplicateCellNames(object.list = objects): Some cell names are
#> duplicated across objects provided. Renaming to enforce unique cell names.
sce <- as.SingleCellExperiment(seu)
out <- scDblFinder(sce, samples = "batch", BPPARAM=MulticoreParam(1))
#> Warning in scDblFinder(sce[sel_features, x], clusters = clusters, dims = dims, :
#> scDblFinder might not work well with very low numbers of cells.
#> Warning in scDblFinder(sce[sel_features, x], clusters = clusters, dims = dims, :
#> scDblFinder might not work well with very low numbers of cells.

Created on 2021-05-02 by the reprex package (v2.0.0)

plger commented 3 years ago

Wow, thanks. Considering that one of your samples apparently has very few cells (which can cause problems), one possibility is that the job fails once in a while depending on the random seed, but if you tried several times and one always fails while the other always work, I think we can exclude that.

Could you try with:

MulticoreParam(2, log=TRUE, threshold="TRACE")

?

JTpath commented 3 years ago

Hi, the examples above use Seurat's built in pbmc_small dataset so they're reproducible, but yes they're tiny datasets.

I get the same error with my own data, and with a the larger publicly accessible 10x dataset (below) so don't think it's low cells. MulticoreParam(2, log=TRUE, threshold="TRACE") doesn't help sadly.

Thanks

  library(Seurat)
#> Attaching SeuratObject
  library(scDblFinder)
  library(BiocParallel)
  url <- "https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_filtered_feature_bc_matrix.h5"
  tmp_filename <- tempfile()
  download.file(url, tmp_filename)
  seu <- Read10X_h5(tmp_filename, use.names = TRUE)
#> Warning in sparseMatrix(i = indices[] + 1, p = indptr[], x = as.numeric(x =
#> counts[]), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
  seu <- CreateSeuratObject(seu)
  seu$batch <- sample(c("rep1", "rep2"), size = ncol(seu), replace = TRUE)
  sce <- as.SingleCellExperiment(seu)
  out <- scDblFinder(sce, samples = "batch", BPPARAM=MulticoreParam(2, log=TRUE, threshold="TRACE"))
#> Warning in parallel::mccollect(wait = FALSE, timeout = 1): 1 parallel job did
#> not deliver a result
#> Error in result[[njob]] <- value: attempt to select less than one element in OneIndex

Created on 2021-05-02 by the reprex package (v2.0.0)

plger commented 3 years ago

hmm... I take it that BiocParallel works fine in other contexts?

JTpath commented 3 years ago

Hmm, think so in that it isn't erroring out but doesn't seem entirely happy.

Ran bplapply(1:6, sqrt, BPPARAM = MulticoreParam(3)) several times, all worked fine, then tried scDblFinder(sce, samples = "batch", BPPARAM=MulticoreParam(3)) and got the same error as above.

Ran bplapply(1:6, sqrt, BPPARAM = MulticoreParam(3)) straight after and got a warning the first time, subsequent tries were fine.

#> Warning message: In parallel::mccollect(wait = TRUE) : 1 parallel job did not deliver a result

LTLA commented 3 years ago

Chiming in here. The "1 parallel job did not deliver a result" warning (and subsequent indexing error) is usually observed when the child R process is killed by some kind of external system monitor. I often see this in Slurm jobs where the parallelization increases memory consumption to the point that the total usage exceeds the allocated memory, causing the out-of-memory killer to shut down the last offending process - usually the child. Because the R process is killed externally, R itself does not have a chance to give a reasonable error message back to the parent, leading to a rather cryptic set of warnings and errors.

It's rather unusual to see this on personal machines, as these tend to just go into swap and slow down rather than actively killing the process. Best bet is to try SnowParam() instead; this is slower but should avoid the difficulties associated with forking and shared memory.

plger commented 3 years ago

Thanks Aaron, that's very useful. If this is the cause then it would also explain that it's not occurring with toy examples...

JTpath commented 3 years ago

Thanks, it does seem to be a local problem, ran the same code on our HPC and it's fine. Not sure what I've done to upset MulticoreParam though.

I ran microbenchmark on my Mac using the PBMC_small example above,

res <- list(
    "SnowParam(2)" = microbenchmark(scDblFinder(sce, samples="batch", BPPARAM=SnowParam(2)), times = 10), 
    "lapply" = microbenchmark(scDblFinder(sce, samples="batch"), times = 10))

Results in seconds below, isn't any better with larger datasets. Have I done something wrong with snowparam?

       SnowParam(2)  lapply      
min    13.3744       3.415405
lq     13.71814      3.505418
mean   13.84331      3.685679
median 13.80867      3.620882
uq     13.86585      3.91428 
max    14.58285      4.106512
neval  10            10      

Simiar results on our HPC, but with an improvement for Multicore

       Multicore(2) SnowParam(2) lapply  
min    2.792165     10.64098     4.113479
lq     2.887825     10.87092     4.173983
mean   2.89723      11.17549     4.546509
median 2.913186     11.12241     4.314554
uq     2.930263     11.31667     4.879823
max    2.974608     11.95432     5.372606
neval  10           10           10      

Anyway, think i'm quite far off topic now... doesn't look like it has anything to do with scDblFinder (which is awesome btw)

LTLA commented 3 years ago

Yes, Snow has high overhead - it's starting up a whole other R process from scratch! - and so is not worth using unless your dataset is decently big. Multicore is fast because the new process shares the same memory as the old one, but this can occasionally lead to weird things happening where both processes overestimate their amount of available memory.

JTpath commented 3 years ago

Good to know, thank you

JTpath commented 3 years ago

I think this is caused by using the Mac OS X accelerate framework BLAS rather than R BLAS. Switched back to the standard one and it's working fine now

plger commented 3 years ago

great, thanks for reporting back on this!

ChristopherMancuso commented 1 year ago

@JTpath I'm having the same problem you are. Can you explain a bit more what you did to solve this? Also, do you have any idea what caused MulticoreParam to stop working in the first place?

JTpath commented 1 year ago

Switched back to the standard R BLAS and it worked fine. See a discussion of BLAS here, https://csantill.github.io/RPerformanceWBLAS/ might be a bit out of date now.