Closed ATpoint closed 2 years ago
Hi,
the question of seeds+multithreading is discussed in the vignette -- have you tried using the RNGseed
argument of MulticoreParam
?
(Regarding the extent of variation across runs, this is chiefly because the batches are so small, it should decrease with more realistic batch sizes)
Thanks for the quick reply. It fact the RNGseed argument is not sufficient. When I run it on my own data it is still different between runs. Example below using my own data, same holds true for the example dataset I used in above.
> table(sce$batch, sce$cluster)
C1 C2 C3 C4 C5 C6 C7 C8
A_rep1 1224 694 1674 150 37 106 324 365
A_rep2 1432 1046 2055 371 89 166 465 640
B_rep1 1238 603 545 3 855 107 389 1188
B_rep2 926 481 956 7 873 94 269 798
k <-
mclapply(1:3, mc.cores=3, function(x){
set.seed(123)
m <-
scDblFinder::scDblFinder(sce=sce,
clusters=as.character(sce$cluster),
samples=sce$batch,
BPPARAM=MulticoreParam(workers=2, RNGseed=123))
return(m$scDblFinder.score)
}); names(k) <- paste0("run_",1:length(k))
par(mfrow=c(2,2))
plot(k$run_1, k$run_2)
plot(k$run_1, k$run_3)
plot(k$run_2, k$run_3)
For context why I bring this up (tl;dr), one might argue that the differences are minor. That might be true at first glance. Still, the workflow we used was to first make an initial clustering, then identify doublet clusters (there were some in our data) and then repeat clustering after excluding doublets. Rerunning this analysis now gives me slightly different clustering after doublet removal, and this leads to slightly different DEGs, hence slightly different hclust output and eventually slightly different gene sets. Might not be critical in an all-computational analysis but we based wetlab experiments on the exact gene sets so changes cannot easily be compensates for. Not a catastrophe, we still have the RDS files of the initial scDblFinder run, but it kind of is an anticlimax to now be forced to load those files in the workflow rather than having a "fully reproducible" Rmarkdown output. Would be great if this could be fixed in the next release :)
I just notice now that your scDblFinder version is rather old. I just ran your exact code (with the LawlorPancreasData
) with mine (scDblFinder 1.9.3) and got considerably more robust results (correlation of 0.988). Still, they're not identical, so it's pretty clear that the random seed isn't the same, and I totally agree with you that this is a problem for reproducible research.
On my setup, however, the problem appears to be with BiocParallel
rather than scDblFinder
, as:
> res <- bplapply(1:2, BPPARAM=MulticoreParam(2, RNGseed=123), FUN=function(x) runif(10))
> identical(res[[1]],res[[2]])
FALSE
> res2 <- bplapply(1:2, BPPARAM=MulticoreParam(2, RNGseed=123), FUN=function(x) runif(10))
> identical(res,res2)
[1] FALSE
@LTLA any idea here? I remember this had been a BiocParallel
problem in the past, but I know that when I wrote that part of the vignette it was working fine... (BiocParallel 1.26.2 here)
(@LTLA false alarm, you can ignore this thread... although BiocParallel's behavior could be smoother :) )
okay now I just remembered: you have to do it exactly like in the vignette, e.g.:
bp <- MulticoreParam(2, RNGseed=123)
bpstart(bp)
m1 <- scDblFinder(sce, clusters="cluster", samples="batch", BPPARAM=bp)$scDblFinder.score
bpstop(bp)
bpstart(bp)
m2 <- scDblFinder(sce, clusters="cluster", samples="batch", BPPARAM=bp)$scDblFinder.score
bpstop(bp)
Then identical(m1,m2)
is TRUE. The reason is that the seed is set upon cluster creation, and if you don't stop and create a new cluster, the seed will tick. So you have to make sure you're starting a new one (hence the explicit start stop).
Ah, indeed this works now and is identical between runs. Could (should?) indeed be smoother from the BiocParallel side.
k <- list()
for(i in 1:3){
bp <- MulticoreParam(workers = 3, RNGseed=123)
bpstart(bp)
m <-
scDblFinder::scDblFinder(sce=sce,
clusters=as.character(sce$cluster),
samples=sce$batch,
BPPARAM=bp)
bpstop(bp)
k[[paste0("run_", i)]] <- m$scDblFinder.score
rm(m, bp)
}
great, and thanks for raising the issue, I've tried to make this more explicit in the vignette now.
Looks good. I just realized that I kept reading the vignette I had locally (1.4.0) so this is why I did not see the changes you made already in the newer vignettes towards bpstart()
, sorry about that.
One misunderstanding is
> res <- bplapply(1:2, BPPARAM=MulticoreParam(2, RNGseed=123), FUN=function(x) runif(10))
> identical(res[[1]],res[[2]])
FALSE
FALSE
is correct -- the random number seed is used for reproducibility of the bplapply()
function, rather than repeated calls to FUN
-- this
res <- bplapply(1:2, BPPARAM=MulticoreParam(2, RNGseed=123), FUN=function(x) runif(10))
res2 <- bplapply(1:2, BPPARAM=MulticoreParam(2, RNGseed=123), FUN=function(x) runif(10))
identical(res, res2)
should (does) produce TRUE
on current release / devel; you should have packageVersion("BiocParallel")
greater than or equal to 1.28.0. You mention 1.26.0, which I guess means that you're not using the current release version of Bioconductor...
The use-case for bpstart()
, in the context of reproducibility, is to allow multiple calls to bplapply()
to be reproducible
p = bpstart(MulticoreParam(2, RNGseed = 123))
res <- list(
bplapply(1:2, runif, BPPARAM = p),
bplapply(1:2, runif, BPPARAM = p)
)
identical(res[[1]], res[[2]]) # FALSE -- the random number generator has been advanced between calls to bplapply()
p = bpstart(MulticoreParam(2, RNGseed = 123))
res2 <- list(
bplapply(1:2, runif, BPPARAM = p),
bplapply(1:2, runif, BPPARAM = p)
)
identical(res, res2) # TRUE -- same RNGseed used for sequence of calls to `bplapply()`, so identical results
p = MulticoreParam(2, RNGseed = 123)
res <- list(
bplapply(1:2, runif, BPPARAM = p),
bplapply(1:2, runif, BPPARAM = p)
)
identical(res[[1]], res[[2]]) # TRUE -- same unstarted param, so same seed used in each call
identical(res, res2) # FALSE -- res is two calls with the same seed, res2 is two calls with the second seed different from the first
Automatically starting the param wouldn't allow for the second use case -- reproducible results from calls to bplapply()
.
In the sample code provided by the OP, I would expect
m <-scDblFinder::scDblFinder(sce=sce,
clusters=as.character(sce$cluster),
samples=sce$batch,
BPPARAM=MulticoreParam(workers=2, RNGseed=123))
m2 <-scDblFinder::scDblFinder(sce=sce,
clusters=as.character(sce$cluster),
samples=sce$batch,
BPPARAM=MulticoreParam(workers=2, RNGseed=123))
to produce identical results, which they do
> identical(m, m2)
[1] TRUE
So I guess the original poster is also not using the current release version of Bioconductor?? (yes, I see BiocVersion_3.12 in the output of sessionInfo()
).
> BiocManager::version()
[1] '3.14'
> BiocManager::valid()
[1] TRUE # or a message indicating how to make it valid
scDblFinder::scDblFinder
makes multiple calls to bplapply()
, resulting in a little unusual situation -- when RNGseed
is provided in the *Param()
constructor and the param has not been started, the same seed is used for each call to bplapply()
. Likely the results are 'independent' because the code in each FUN is doing something quite different, but the results are not random. One could start scDblFinder::scDblFinder
with something like
if (!bpisup(BPPARAM)) {
bpstart(BPPARAM)
on.exit(bpstop(BPPARAM))
}
to provide reproducibility and randomness in a way that is convenient to the user. If you make this change to scDblFinder()
, then the results will change from the earlier version -- your user will lose reproducibility across versions, but actually they have already lost that because of underlying, necessary, changes in BiocParallel.
Thanks a lot Martin for this explanation, this is very useful. Indeed I'm sadly still on 3.13 locally (rough fall), but glad to know this is smooth now!
I think the multiple bplapply in scDblFinder
isn't happening much (following the doc, users probably provide BPPARAM when using multiple batches, in which case the individual tasks within batches are not multithreaded), but I'll nevertheless make the changes you suggest -- it's simply better...
Hello,
I found that scDblFinder is not deterministic when running it with multiple batches and with the BPPARAM argument, producing quite different results. Below is a MWE, do you have any idea what is going on? Output is deterministic when either setting BPPARAM to SerialParam or removing the batch argument.