psolymos / pbapply

Adding progress bar to '*apply' functions in R
https://peter.solymos.org/pbapply/

Futures: Parallel random number generation (RNG) #60

Closed HenrikBengtsson closed 1 year ago

HenrikBengtsson commented 1 year ago

To prevent statistically unsound random numbers from being produced when running in parallel, futureverse asks the developer to declare when their code needs the RNG. If this is not declared, futureverse still checks whether the RNG was used (i.e. whether .Random.seed was updated). If it was, a warning is produced.
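
The idea behind that check can be approximated in base R by comparing .Random.seed before and after evaluating some code. This is only a minimal sketch of the principle; the helper name used_rng() is hypothetical and this is not how futureverse implements the check internally.

```r
# Hypothetical helper: reports whether evaluating `expr` changed the RNG state.
used_rng <- function(expr) {
  get_seed <- function() {
    # .Random.seed does not exist until the RNG is first used in a session
    if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
      get(".Random.seed", envir = globalenv(), inherits = FALSE)
    } else {
      NULL
    }
  }
  before <- get_seed()
  force(expr)            # evaluate the (lazily passed) expression
  after <- get_seed()
  !identical(before, after)
}

used_rng(sum(1:10))  # FALSE: no random numbers drawn
used_rng(rnorm(1))   # TRUE: the RNG state advanced
```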

Here is an example:

> library(pbapply)
> future::plan("multisession")
> y <- pblapply(1:2, FUN = rnorm, cl = "future")
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s  
Warning messages:
1: UNRELIABLE VALUE: One of the 'future.apply' iterations ('future_lapply-1') unexpectedly generated random numbers without declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify 'future.seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use 'future.seed = NULL', or set option 'future.rng.onMisuse' to "ignore". 
2: UNRELIABLE VALUE: One of the 'future.apply' iterations ('future_lapply-2') unexpectedly generated random numbers without declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify 'future.seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use 'future.seed = NULL', or set option 'future.rng.onMisuse' to "ignore". 

To avoid this, a quick fix is to always pass future.seed = TRUE. That sets up a parallel RNG regardless of whether random numbers are generated. The downside is that doing so can be computationally expensive. To give the developer control, you'd have to introduce a new argument that lets them control the future.seed argument to future_lapply() and the like. One way to do that without adding a new argument could be via attributes, e.g.

y <- pblapply(1:2, FUN = rnorm, cl = structure("future", future.seed = TRUE))
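
For illustration, a function receiving such a cl value could read the attribute back roughly as follows. This is a sketch under assumptions: future_seed_from_cl() is a hypothetical helper, not part of pbapply, and the fallback of FALSE mirrors the documented default of future.seed in future_lapply().

```r
# Hypothetical helper (not pbapply API): read `future.seed` from an attribute on `cl`.
future_seed_from_cl <- function(cl, default = FALSE) {
  seed <- attr(cl, "future.seed", exact = TRUE)
  if (is.null(seed)) default else seed
}

cl <- structure("future", future.seed = TRUE)

future_seed_from_cl(cl)             # TRUE
future_seed_from_cl("future")       # FALSE (attribute absent, default used)

# Caveat: the attribute makes a strict comparison against "future" fail,
# so any such check would need to drop attributes first.
identical(cl, "future")             # FALSE
identical(as.vector(cl), "future")  # TRUE
```

One consequence of this design is that any code testing cl with identical() would have to be adjusted to ignore attributes.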

psolymos commented 1 year ago

I like the attribute for the cl argument, but it might be a bit alien for some users. How about adding it to pboptions()? I.e. have it unset (NULL) on load, but check for the existence of the future.seed option and use that value.

HenrikBengtsson commented 1 year ago

How about adding it to pboptions()?

This is something the developer should control in their code. I don't think it should be modifiable by the end-user via an option - that'll give different results depending on the option, which is probably not what the developer intended.

psolymos commented 1 year ago

I see the distinction. If the user is calling pb*apply(..., cl = "future") they should be able to set it as an attribute, but if this is being used as part of another package, it is baked in.

psolymos commented 1 year ago

One can pass the future.seed argument directly through ... because ?future.apply::future_lapply says:

For future_*apply() functions and replicate(), any future.* arguments part of \dots are passed on to future_lapply() used internally.

See:

r$> y <- pblapply(1:2, FUN = rnorm, cl = "future")
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s  
Warning messages:
1: UNRELIABLE VALUE: One of the ‘future.apply’ iterations (‘future_lapply-1’) unexpectedly generated random numbers without declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify 'future.seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use 'future.seed = NULL', or set option 'future.rng.onMisuse' to "ignore". 
2: UNRELIABLE VALUE: One of the ‘future.apply’ iterations (‘future_lapply-2’) unexpectedly generated random numbers without declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify 'future.seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use 'future.seed = NULL', or set option 'future.rng.onMisuse' to "ignore". 

r$> y <- pblapply(1:2, FUN = rnorm, cl = "future", future.seed = TRUE)
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s  
# no warnings

So developers can utilize this behaviour to set the future seed.

HenrikBengtsson commented 1 year ago

So developers can utilize this behaviour to set the future seed.

Good point. Yes, that looks like the cleanest solution. Then a rule of thumb can be to "pass any additional arguments to FUN immediately following the FUN argument, and any additional arguments to the futureverse after cl = "future"":

y <- pblapply(1:2, FUN = my_fcn, {additional my_fcn args}, cl = "future", {additional future args})
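
A concrete instance of this template might look like the following. It is a sketch that assumes pbapply, future, and future.apply are installed (it is skipped otherwise); my_fcn and its mean and sd arguments are hypothetical and only serve to show where each group of arguments goes.

```r
# Assumes pbapply, future, and future.apply are installed; skipped otherwise.
if (requireNamespace("pbapply", quietly = TRUE) &&
    requireNamespace("future.apply", quietly = TRUE)) {
  library(pbapply)
  future::plan("multisession", workers = 2)

  # Hypothetical worker function: draws n values with the given mean and sd
  my_fcn <- function(n, mean, sd) rnorm(n, mean = mean, sd = sd)

  y <- pblapply(1:2, FUN = my_fcn, mean = 10, sd = 2,  # arguments for my_fcn
                cl = "future", future.seed = TRUE)     # arguments for futureverse

  future::plan("sequential")  # shut down the background workers
}
```

Because future.seed = TRUE is declared, the call should complete without the UNRELIABLE VALUE warnings shown earlier in the thread.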