plger / scDblFinder

Methods for detecting doublets in single-cell sequencing data
https://plger.github.io/scDblFinder/
GNU General Public License v3.0
163 stars 17 forks source link

scDblFinder not deterministic when using batch+BPPARAM #53

Closed ATpoint closed 2 years ago

ATpoint commented 2 years ago

Hello,

I found that scDblFinder is not deterministic when running it with multiple batches and with the BPPARAM argument, producing quite different results. Below is a MWE, do you have any idea what is going on? Output is deterministic when either setting BPPARAM to SerialParam or removing the batch argument.


library(BiocParallel)
library(SingleCellExperiment)
library(scDblFinder)
library(parallel)
library(scRNAseq)
library(scran)

sce <- scRNAseq::LawlorPancreasData()

sce$batch <- factor(c(rep("A", 100), rep("B", 200), rep("C", 100), rep("D", 238)))

sce$cluster <- as.character(scran::quickCluster(sce))

k <- 
mclapply(1:3, mc.cores=3, function(x){

  set.seed(123)
  m <- 
    scDblFinder::scDblFinder(sce=sce, 
                             clusters=as.character(sce$cluster), 
                             samples=sce$batch,
                             BPPARAM=MulticoreParam(workers = 3))
  return(m$scDblFinder.score)

}); names(k) <- paste0("run_",1:length(k))

par(mfrow=c(2,2))
plot(k$run_1, k$run_2)
plot(k$run_1, k$run_3)
plot(k$run_2, k$run_3)

Screenshot 2021-12-07 at 22 27 41


R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 12.0.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] scran_1.18.3                scRNAseq_2.4.0              scDblFinder_1.4.0           SingleCellExperiment_1.12.0
 [5] SummarizedExperiment_1.20.0 Biobase_2.50.0              GenomicRanges_1.42.0        GenomeInfoDb_1.26.2        
 [9] IRanges_2.24.1              S4Vectors_0.28.1            BiocGenerics_0.36.0         MatrixGenerics_1.2.0       
[13] matrixStats_0.57.0          BiocParallel_1.24.1        

loaded via a namespace (and not attached):
  [1] ggbeeswarm_0.6.0              colorspace_2.0-0              ellipsis_0.3.1                scuttle_1.0.4                
  [5] bluster_1.0.0                 XVector_0.30.0                BiocNeighbors_1.8.2           rstudioapi_0.13              
  [9] bit64_4.0.5                   interactiveDisplayBase_1.28.0 AnnotationDbi_1.52.0          fansi_0.4.2                  
 [13] xml2_1.3.2                    sparseMatrixStats_1.2.0       cachem_1.0.1                  scater_1.18.3                
 [17] Rsamtools_2.6.0               dbplyr_2.1.1                  shiny_1.6.0                   BiocManager_1.30.10          
 [21] compiler_4.0.3                httr_1.4.2                    dqrng_0.2.1                   lazyeval_0.2.2               
 [25] assertthat_0.2.1              Matrix_1.3-2                  fastmap_1.1.0                 limma_3.46.0                 
 [29] later_1.1.0.1                 BiocSingular_1.6.0            htmltools_0.5.1.1             prettyunits_1.1.1            
 [33] tools_4.0.3                   rsvd_1.0.3                    igraph_1.2.6                  gtable_0.3.0                 
 [37] glue_1.4.2                    GenomeInfoDbData_1.2.4        dplyr_1.0.5                   rappdirs_0.3.1               
 [41] Rcpp_1.0.7                    vctrs_0.3.6                   Biostrings_2.58.0             ExperimentHub_1.16.1         
 [45] rtracklayer_1.50.0            DelayedMatrixStats_1.12.2     stringr_1.4.0                 beachmat_2.6.4               
 [49] mime_0.9                      lifecycle_1.0.0               irlba_2.3.3                   ensembldb_2.14.0             
 [53] renv_0.13.2                   statmod_1.4.35                XML_3.99-0.5                  AnnotationHub_2.22.0         
 [57] edgeR_3.32.1                  zlibbioc_1.36.0               scales_1.1.1                  ProtGenerics_1.22.0          
 [61] hms_1.0.0                     promises_1.1.1                AnnotationFilter_1.14.0       yaml_2.2.1                   
 [65] curl_4.3                      memoise_2.0.0                 gridExtra_2.3                 ggplot2_3.3.5                
 [69] biomaRt_2.46.1                stringi_1.5.3                 RSQLite_2.2.3                 BiocVersion_3.12.0           
 [73] GenomicFeatures_1.42.1        rlang_0.4.12                  pkgconfig_2.0.3               bitops_1.0-6                 
 [77] lattice_0.20-41               purrr_0.3.4                   GenomicAlignments_1.26.0      bit_4.0.4                    
 [81] tidyselect_1.1.0              magrittr_2.0.1                R6_2.5.0                      generics_0.1.0               
 [85] DelayedArray_0.16.1           DBI_1.1.1                     withr_2.4.2                   pillar_1.6.0                 
 [89] RCurl_1.98-1.2                tibble_3.1.1                  crayon_1.4.1                  xgboost_1.3.2.1              
 [93] utf8_1.1.4                    BiocFileCache_1.14.0          viridis_0.5.1                 progress_1.2.2               
 [97] locfit_1.5-9.4                grid_4.0.3                    data.table_1.13.6             blob_1.2.1                   
[101] digest_0.6.27                 xtable_1.8-4                  httpuv_1.5.5                  openssl_1.4.3                
[105] munsell_0.5.0                 beeswarm_0.2.3                viridisLite_0.3.0             vipor_0.4.5                  
[109] askpass_1.1 
plger commented 2 years ago

Hi, the question of seeds+multithreading is discussed in the vignette -- have you tried using the RNGseed argument of MulticoreParam?

(Regarding the extent of variation across runs, this is chiefly because the batches are so small, it should decrease with more realistic batch sizes)

ATpoint commented 2 years ago

Thanks for the quick reply. It fact the RNGseed argument is not sufficient. When I run it on my own data it is still different between runs. Example below using my own data, same holds true for the example dataset I used in above.

> table(sce$batch, sce$cluster)

             C1   C2   C3   C4   C5   C6   C7   C8
  A_rep1 1224  694 1674  150   37  106  324  365
  A_rep2 1432 1046 2055  371   89  166  465  640
  B_rep1  1238  603  545    3  855  107  389 1188
  B_rep2   926  481  956    7  873   94  269  798
k <- 
  mclapply(1:3, mc.cores=3, function(x){

    set.seed(123)
    m <- 
      scDblFinder::scDblFinder(sce=sce, 
                               clusters=as.character(sce$cluster), 
                               samples=sce$batch,
                               BPPARAM=MulticoreParam(workers=2, RNGseed=123))
    return(m$scDblFinder.score)

  }); names(k) <- paste0("run_",1:length(k))

  par(mfrow=c(2,2))
  plot(k$run_1, k$run_2)
  plot(k$run_1, k$run_3)
  plot(k$run_2, k$run_3)

Screenshot 2021-12-08 at 11 11 32

For context why I bring this up (tl;dr), one might argue that the differences are minor. That might be true at first glance. Still, the workflow we used was to first make an initial clustering, then identify doublet clusters (there were some in our data) and then repeat clustering after excluding doublets. Rerunning this analysis now gives me slightly different clustering after doublet removal, and this leads to slightly different DEGs, hence slightly different hclust output and eventually slightly different gene sets. Might not be critical in an all-computational analysis but we based wetlab experiments on the exact gene sets so changes cannot easily be compensates for. Not a catastrophe, we still have the RDS files of the initial scDblFinder run, but it kind of is an anticlimax to now be forced to load those files in the workflow rather than having a "fully reproducible" Rmarkdown output. Would be great if this could be fixed in the next release :)

plger commented 2 years ago

I just notice now that your scDblFinder version is rather old. I just ran your exact code (with the LawlorPancreasData) with mine (scDblFinder 1.9.3) and got considerably more robust results (correlation of 0.988). Still, they're not identical, so it's pretty clear that the random seed isn't the same, and I totally agree with you that this is a problem for reproducible research.

On my setup, however, the problem appears to be with BiocParallel rather than scDblFinder, as:

> res <- bplapply(1:2, BPPARAM=MulticoreParam(2, RNGseed=123), FUN=function(x) runif(10))
> identical(res[[1]],res[[2]])
FALSE
> res2 <- bplapply(1:2, BPPARAM=MulticoreParam(2, RNGseed=123), FUN=function(x) runif(10))
> identical(res,res2)
[1] FALSE

@LTLA any idea here? I remember this had been a BiocParallel problem in the past, but I know that when I wrote that part of the vignette it was working fine... (BiocParallel 1.26.2 here)

plger commented 2 years ago

(@LTLA false alarm, you can ignore this thread... although BiocParallel's behavior could be smoother :) )

okay now I just remembered: you have to do it exactly like in the vignette, e.g.:

bp <- MulticoreParam(2, RNGseed=123)
bpstart(bp)
m1 <- scDblFinder(sce, clusters="cluster", samples="batch", BPPARAM=bp)$scDblFinder.score
bpstop(bp)
bpstart(bp)
m2 <- scDblFinder(sce, clusters="cluster", samples="batch", BPPARAM=bp)$scDblFinder.score
bpstop(bp)

Then identical(m1,m2) is TRUE. The reason is that the seed is set upon cluster creation, and if you don't stop and create a new cluster, the seed will tick. So you have to make sure you're starting a new one (hence the explicit start stop).

ATpoint commented 2 years ago

Ah, indeed this works now and is identical between runs. Could (should?) indeed be smoother from the BiocParallel side.

k <- list()
for(i in 1:3){
  bp <- MulticoreParam(workers = 3, RNGseed=123)
  bpstart(bp)
  m <- 
    scDblFinder::scDblFinder(sce=sce, 
                             clusters=as.character(sce$cluster), 
                             samples=sce$batch,
                             BPPARAM=bp)
  bpstop(bp)
  k[[paste0("run_", i)]] <- m$scDblFinder.score
  rm(m, bp)
}
plger commented 2 years ago

great, and thanks for raising the issue, I've tried to make this more explicit in the vignette now.

ATpoint commented 2 years ago

Looks good. I just realized that I kept reading the vignette I had locally (1.4.0) so this is why I did not see the changes you made already in the newer vignettes towards bpstart(), sorry about that.

mtmorgan commented 2 years ago

One misunderstanding is

> res <- bplapply(1:2, BPPARAM=MulticoreParam(2, RNGseed=123), FUN=function(x) runif(10))
> identical(res[[1]],res[[2]])
FALSE

FALSE is correct -- the random number seed is used for reproducibility of the bplapply() function, rather than repeated calls to FUN -- this

res <- bplapply(1:2, BPPARAM=MulticoreParam(2, RNGseed=123), FUN=function(x) runif(10))
res2 <- bplapply(1:2, BPPARAM=MulticoreParam(2, RNGseed=123), FUN=function(x) runif(10))
identical(res, res2)

should (does) produce TRUE on current release / devel; you should have packageVersion("BiocParallel") greater than or equal to 1.28.0. You mention 1.26.0, which I guess means that you're not using the current release version of Bioconductor...

The use-case for bpstart(), in the context of reproducibility, is to allow multiple calls to bplapply() to be reproducible

p = bpstart(MulticoreParam(2, RNGseed = 123))
res <- list(
    bplapply(1:2, runif, BPPARAM = p), 
    bplapply(1:2, runif, BPPARAM = p)
)
identical(res[[1]], res[[2]])   # FALSE -- the random number generator has been advanced between calls to bplapply()

p = bpstart(MulticoreParam(2, RNGseed = 123))
res2 <- list(
    bplapply(1:2, runif, BPPARAM = p), 
    bplapply(1:2, runif, BPPARAM = p)
)
identical(res, res2)  # TRUE -- same RNGseed used for sequence of calls to `bplapply()`, so identical results

p = MulticoreParam(2, RNGseed = 123)
res <- list(
    bplapply(1:2, runif, BPPARAM = p), 
    bplapply(1:2, runif, BPPARAM = p)
)
identical(res[[1]], res[[2]])  # TRUE -- same unstarted param, so same seed used in each call
identical(res, res2)  # FALSE -- res is two calls with the same seed, res2 is two calls with the second seed different from the first

Automatically starting the param wouldn't allow for the second use case -- reproducible results from calls to bplapply().

In the sample code provided by the OP, I would expect

m <-scDblFinder::scDblFinder(sce=sce,
                               clusters=as.character(sce$cluster),
                               samples=sce$batch,
                               BPPARAM=MulticoreParam(workers=2, RNGseed=123))
m2 <-scDblFinder::scDblFinder(sce=sce,
                               clusters=as.character(sce$cluster),
                               samples=sce$batch,
                               BPPARAM=MulticoreParam(workers=2, RNGseed=123))

to produce identical results, which they do

> identical(m, m2)
[1] TRUE

So I guess the original poster is also not using the current release version of Bioconductor?? (yes, I see BiocVersion_3.12 in the output of sessionInfo()).

> BiocManager::version()
[1] '3.14'
> BiocManager::valid()
[1] TRUE   # or a message indicating how to make it valid

scDblFinder::scDblFinder makes multiple calls to bplapply(), resulting in a little unusual situation -- when RNGseed is provided in the *Param() constructor and the param has not been started, the same seed is used for each call to bplapply(). Likely the results are 'independent' because the code in each FUN is doing something quite different, but the results are not random. One could start scDblFinder::scDblFinder with something like

if (!bpisup(BPPARAM)) {
    bpstart(BPPARAM)
    on.exit(bpstop(BPPARAM))
}

to provide reproducibility and randomness in a way that is convenient to the user. If you make this change to scDblFinder(), then the results will change from the earlier version -- your user will lose reproducibility across versions, but actually they have already lost that because of underlying, necessary, changes in BiocParallel.

plger commented 2 years ago

Thanks a lot Martin for this explanation, this is very useful. Indeed I'm sadly still on 3.13 locally (rough fall), but glad to know this is smooth now! I think the multiple bplapply in scDblFinder isn't happening much (following the doc, users probably provide BPPARAM when using multiple batches, in which case the individual tasks within batches are not multithreaded), but I'll nevertheless make the changes you suggest -- it's simply better...