hemberg-lab / SC3

A tool for the unsupervised clustering of cells from single cell RNA-Seq experiments
http://bioconductor.org/packages/SC3
GNU General Public License v3.0
119 stars 55 forks source link

rewritten parallelism to BiocParallel package #104

Open gorgitko opened 3 years ago

gorgitko commented 3 years ago

Hello,

I have been looking at https://github.com/hemberg-lab/SC3/issues/83 and decided to rewrite the parallelism to the BiocParallel package. The reasons could be reviewed in the first bullet below. I have successfully tested those modifications on a relatively small dataset (1k PBMC) and also run devtools::check() resulting in zero errors and warnings.

wikiselev commented 3 years ago

Hi Jiri, many thanks for your request! It all looks awesome and I will try to merge it ASAP, depending on my availability. Thanks again!

wikiselev commented 3 years ago

Hi Jiri,

I just fetched your changes to my laptop and tried to run the vignette in a Terminal. When I run the first parallelisation it feels super slow, gets stuck at 67% and does not progress further:

> library(BiocParallel)
> BPPARAM <- SnowParam(workers = 4, type = "SOCK", RNGseed = 1, progressbar = TRUE)
> bpstart(BPPARAM)
> BiocParallel::register(BPPARAM)
> sce <- sc3(sce, ks = 2:4, biology = TRUE)
Setting SC3 parameters...
Calculating distances between the cells...
  |===============================================                       |  67%

It may be a problem with my laptop, but here is my sessionInfo():

> sessionInfo()
R Under development (unstable) (2020-09-24 r79253)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] BiocParallel_1.26.1         scater_1.20.1
 [3] ggplot2_3.3.5               scuttle_1.2.0
 [5] SC3_1.15.1                  SingleCellExperiment_1.14.1
 [7] SummarizedExperiment_1.22.0 Biobase_2.52.0
 [9] GenomicRanges_1.44.0        GenomeInfoDb_1.28.1
[11] IRanges_2.26.0              S4Vectors_0.30.0
[13] BiocGenerics_0.38.0         MatrixGenerics_1.4.0
[15] matrixStats_0.59.0

loaded via a namespace (and not attached):
 [1] viridis_0.6.1             BiocSingular_1.8.1
 [3] viridisLite_0.4.0         DelayedMatrixStats_1.14.0
 [5] shiny_1.6.0               GenomeInfoDbData_1.2.6
 [7] vipor_0.4.5               robustbase_0.93-8
 [9] pillar_1.6.1              lattice_0.20-41
[11] glue_1.4.2                beachmat_2.8.0
[13] digest_0.6.27             RColorBrewer_1.1-2
[15] promises_1.2.0.1          XVector_0.32.0
[17] colorspace_2.0-2          cowplot_1.1.1
[19] htmltools_0.5.1.1         httpuv_1.6.1
[21] Matrix_1.3-4              pcaPP_1.9-74
[23] WriteXLS_6.3.0            pkgconfig_2.0.3
[25] pheatmap_1.0.12           zlibbioc_1.38.0
[27] purrr_0.3.4               xtable_1.8-4
[29] mvtnorm_1.1-2             snow_0.4-3
[31] scales_1.1.1              ScaledMatrix_1.0.0
[33] later_1.2.0               tibble_3.1.2
[35] proxy_0.4-26              farver_2.1.0
[37] generics_0.1.0            ellipsis_0.3.2
[39] withr_2.4.2               ROCR_1.0-11
[41] magrittr_2.0.1            crayon_1.4.1
[43] mime_0.11                 fansi_0.5.0
[45] class_7.3-17              beeswarm_0.4.0
[47] tools_4.1.0               lifecycle_1.0.0
[49] munsell_0.5.0             cluster_2.1.0
[51] DelayedArray_0.18.0       irlba_2.3.3
[53] compiler_4.1.0            e1071_1.7-7
[55] rsvd_1.0.5                rlang_0.4.11
[57] grid_4.1.0                RCurl_1.98-1.3
[59] BiocNeighbors_1.10.0      labeling_0.4.2
[61] bitops_1.0-7              gtable_0.3.0
[63] rrcov_1.5-5               R6_2.5.0
[65] gridExtra_2.3             dplyr_1.0.7
[67] fastmap_1.1.0             utf8_1.2.1
[69] ggbeeswarm_0.6.0          Rcpp_1.0.7
[71] vctrs_0.3.8               DEoptimR_1.0-9
[73] tidyselect_1.1.1          sparseMatrixStats_1.4.0

Do you know what the problem can be?

gorgitko commented 3 years ago

Hello Vladimir,

I was trying my changes on an HPC server (~80 cores, 1TB RAM) without any problems, so could the performance of your laptop be the problem? Even on that HPC using 32 workers, the whole calculation takes a considerable amount of time.

The calculation of distances between cells takes a lot of memory and during parallel processing, the amount of used memory is even higher (basically three times more). Could you try first sequential processing with BPPARAM <- SerialParam()? Also, what is the size (dimensionality) of your sce object? I might suspect your laptop to start using swap space when there isn't enough RAM, which obviously leads to a significant calculation speed drop.

Just for curiosity, if you would try the original implementation of parallelism using the same number of workers, it would finish flawlessly?

wikiselev commented 3 years ago

Hi Jiri, many thanks for your reply! I am on vacation until Friday, will try your suggestions then. Thanks!

gorgitko commented 3 years ago

Hi, any updates on this? Soon I will be publishing a package to the Bioconductor, which doesn't allow remote sources (so I can't put my fork to DESCRIPTION).

wikiselev commented 3 years ago

Very sorry! I've changed jobs and have absolutely no time to work on this right now. Probably the best is not to rely on me updating it soon... I hope to get to this at some point... Very sorry again.