mlr-org / parallelMap

R package to interface some popular parallelization backends with a unified interface
https://parallelmap.mlr-org.com
Other
57 stars 14 forks source link

Proper default seeding for parallel processes #80

Closed pat-s closed 4 years ago

pat-s commented 4 years ago

fixes #43

Essentially, this PR ensures that parallel processes using the default RNG kind "Mersenne Twister" will be reproducible. This was not the case before and caused lots of confusion and SO Questions/GH issues.

Implementation

For mode "multicore", we do the following (the range for sample() is an arbitrary choice)

https://github.com/mlr-org/parallelMap/blob/9b38747c713c3a052030f5d85f97efce8a47ee3f/R/parallelMap.R#L113-L120

To honor the standard seed for subsequent calls, we need to restore it after the parallel call.

for mode "socket" it is a bit easier:

https://github.com/mlr-org/parallelMap/blob/9b38747c713c3a052030f5d85f97efce8a47ee3f/R/parallelStart.R#L194-L196

When setting reproducible = FALSE, things behave like before. By default, reproducible is set to TRUE.

If one uses the "L'Ecuyer-CMRG" manually, nothing happens and things will work as before.

Tests should cover all possible use cases and show the correct functionality.

Summary

This change will change model results for people who have used plain set.seed() in combination with parallelMap() before since we are now using the "L'Ecuyer-CMRG" RNG kind by default for parallel processes. This is a breaking change and we could consider bumping to v2.0.

pat-s commented 4 years ago

@mllg @berndbischl @jakob-r

If you find some time these days, I'd appreciate a review on this one.

codecov-io commented 4 years ago

Codecov Report

Merging #80 into master will increase coverage by 4.32%. The diff coverage is 86.04%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #80      +/-   ##
==========================================
+ Coverage   80.91%   85.23%   +4.32%     
==========================================
  Files          20       20              
  Lines         592      623      +31     
==========================================
+ Hits          479      531      +52     
+ Misses        113       92      -21
Impacted Files Coverage Δ
R/makeMulticoreCluster.R 100% <ø> (+100%) :arrow_up:
R/zzz.R 0% <0%> (ø) :arrow_up:
R/getOption.R 96.49% <100%> (+0.33%) :arrow_up:
R/parallelShowOptions.R 82.75% <100%> (+1.27%) :arrow_up:
R/parallelMap.R 81.48% <100%> (+3.62%) :arrow_up:
R/parallelStart.R 85.41% <76.47%> (+5.18%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update b92b843...74f985f. Read the comment docs.

jakob-r commented 4 years ago

It makes total sense to use the right RNG in the correct way. The question is: Why isn't this the default?

I would probably go so far to not have the reproducible argument and always do it the way that it is reproducible. I mean, this is what R implicitly wants to guarantee anyway. There is no reason to set reproducible = FALSE, or is there?

I discussed this also with @mllg but maybe @berndbischl also wants to have a word?

pat-s commented 4 years ago

It makes total sense to use the right RNG in the correct way. The question is:

Why isn't this the default?

You mean in R in general? Well idk. There are a bunch of weird design decision imo when it comes to seeding and parallelization. I discussed this with Henrik already. But such things are also hard to change now...


I would probably go so far to not have the reproducible argument and always do it the way that it is reproducible. I mean, this is what R implicitly wants to guarantee anyway. There is no reason to set reproducible = FALSE, or is there?

By having this as an explicit arg, we more prominently show that we do something on our side here. Also this can then be nicely reflected in a script.

I think there are use-cases (even though not many) where this should at least be up to the user and we should provide an options to turn it off.

Before merging this I'd also like to have a look from @berndbischl and @mllg.

jakob-r commented 4 years ago

But such things are also hard to change now...

Why? Everything parallel has not been reproducible before anyhow. So changing the default behavior cannot break something that is already broken.

By having this as an explicit arg, we more prominently show that we do something on our side here. Also this can then be nicely reflected in a script.

But you agree that the default should be TRUE, right? If somebody wants another seeding they can write it in their code that runs in parallel.

pat-s commented 4 years ago

Why? Everything parallel has not been reproducible before anyhow. So changing the default behavior cannot break something that is already broken.

Because of backward comp. R-core is very hesitant to make such changes. You always hear "things are documented and work (the way they do), its your thing to find out" :(

Idk, the {future} framework is the way to go anyways when it comes to parallelism. Sadly, its a lot easier to build a whole new framework than to makes (even small) changes to r-core. R 4.0 would be a good point in this case, so maybe we could even try to get some points to be changed to the better. So why not try and then "complain" ;) I just need more time... (but this should not be an excuse).

But you agree that the default should be TRUE, right? If somebody wants another seeding they can write it in their code that runs in parallel.

Oh yeah, the default should be reproducible = TRUE. I just would like to have this as a visible default arg so people will notice it directly in the help page - rather than doing it "silently" internally.

pat-s commented 4 years ago

Ping @berndbischl @mllg