Closed pat-s closed 4 years ago
@mllg @berndbischl @jakob-r
If you find some time these days, I'd appreciate a review on this one.
Merging #80 into master will increase coverage by
4.32%
. The diff coverage is86.04%
.
@@ Coverage Diff @@
## master #80 +/- ##
==========================================
+ Coverage 80.91% 85.23% +4.32%
==========================================
Files 20 20
Lines 592 623 +31
==========================================
+ Hits 479 531 +52
+ Misses 113 92 -21
Impacted Files | Coverage Δ | |
---|---|---|
R/makeMulticoreCluster.R | 100% <ø> (+100%) |
:arrow_up: |
R/zzz.R | 0% <0%> (ø) |
:arrow_up: |
R/getOption.R | 96.49% <100%> (+0.33%) |
:arrow_up: |
R/parallelShowOptions.R | 82.75% <100%> (+1.27%) |
:arrow_up: |
R/parallelMap.R | 81.48% <100%> (+3.62%) |
:arrow_up: |
R/parallelStart.R | 85.41% <76.47%> (+5.18%) |
:arrow_up: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update b92b843...74f985f. Read the comment docs.
It makes total sense to use the right RNG in the correct way. The question is: Why isn't this the default?
I would probably go so far to not have the reproducible argument and always do it the way that it is reproducible. I mean, this is what R implicitly wants to guarantee anyway. There is no reason to set reproducible = FALSE
, or is there?
I discussed this also with @mllg but maybe @berndbischl also wants to have a word?
It makes total sense to use the right RNG in the correct way. The question is:
Why isn't this the default?
You mean in R in general? Well idk. There are a bunch of weird design decision imo when it comes to seeding and parallelization. I discussed this with Henrik already. But such things are also hard to change now...
I would probably go so far to not have the reproducible argument and always do it the way that it is reproducible. I mean, this is what R implicitly wants to guarantee anyway. There is no reason to set reproducible = FALSE, or is there?
By having this as an explicit arg, we more prominently show that we do something on our side here. Also this can then be nicely reflected in a script.
I think there are use-cases (even though not many) where this should at least be up to the user and we should provide an options to turn it off.
Before merging this I'd also like to have a look from @berndbischl and @mllg.
But such things are also hard to change now...
Why? Everything parallel has not been reproducible before anyhow. So changing the default behavior cannot break something that is already broken.
By having this as an explicit arg, we more prominently show that we do something on our side here. Also this can then be nicely reflected in a script.
But you agree that the default should be TRUE, right? If somebody wants another seeding they can write it in their code that runs in parallel.
Why? Everything parallel has not been reproducible before anyhow. So changing the default behavior cannot break something that is already broken.
Because of backward comp. R-core is very hesitant to make such changes. You always hear "things are documented and work (the way they do), its your thing to find out" :(
Idk, the {future} framework is the way to go anyways when it comes to parallelism. Sadly, its a lot easier to build a whole new framework than to makes (even small) changes to r-core. R 4.0 would be a good point in this case, so maybe we could even try to get some points to be changed to the better. So why not try and then "complain" ;) I just need more time... (but this should not be an excuse).
But you agree that the default should be TRUE, right? If somebody wants another seeding they can write it in their code that runs in parallel.
Oh yeah, the default should be reproducible = TRUE
. I just would like to have this as a visible default arg so people will notice it directly in the help page - rather than doing it "silently" internally.
Ping @berndbischl @mllg
fixes #43
Essentially, this PR ensures that parallel processes using the default RNG kind "Mersenne Twister" will be reproducible. This was not the case before and caused lots of confusion and SO Questions/GH issues.
Implementation
For mode "multicore", we do the following (the range for
sample()
is an arbitrary choice)https://github.com/mlr-org/parallelMap/blob/9b38747c713c3a052030f5d85f97efce8a47ee3f/R/parallelMap.R#L113-L120
To honor the standard seed for subsequent calls, we need to restore it after the parallel call.
for mode "socket" it is a bit easier:
https://github.com/mlr-org/parallelMap/blob/9b38747c713c3a052030f5d85f97efce8a47ee3f/R/parallelStart.R#L194-L196
When setting
reproducible = FALSE
, things behave like before. By default,reproducible
is set toTRUE
.If one uses the "L'Ecuyer-CMRG" manually, nothing happens and things will work as before.
Tests should cover all possible use cases and show the correct functionality.
Summary
This change will change model results for people who have used plain
set.seed()
in combination withparallelMap()
before since we are now using the "L'Ecuyer-CMRG" RNG kind by default for parallel processes. This is a breaking change and we could consider bumping to v2.0.