The comment in https://github.com/wlandau/crew/issues/113#issuecomment-1706703123 describes a solid approach to guarantee statistical independence. However, it's tough for reproducibility in `targets`. I want to assign a deterministic unique stream to each target, and I don't want a given target's choice of stream to interfere with the streams of other targets. What's more, dynamic branches are defined while the pipeline is running, so I cannot know all the target names in advance (or even how many there will be by the end).
(FYI @shikokuchuo)
Each target already gets a "seed", which is just an integer from `digest::digest2int()`. These `digest2int()` values could be very large and/or very negative, but they are already deterministic and unique. I wonder if I can somehow map them deterministically to independent L'Ecuyer streams.

Better yet, what if I can map a target's name string directly to a unique deterministic number of calls to `parallel::nextRNGStream()`?
This is really tricky because L'Ecuyer RNG states are recursive, so a change in the seed of one target could invalidate a bunch of seemingly unrelated targets (causing them to rerun). So unless I can pin down independent RNG states deterministically, there is no way to do this in a way that preserves `targets`' existing reproducibility guarantees. On the other hand, simulations with `tarchetypes::tar_map_rep()` really need statistically independent random numbers for each batch of simulations.
Maybe L'Ecuyer's original paper has advice about how to manually generate seeds: https://pubsonline.informs.org/doi/pdf/10.1287/opre.47.1.159. Then maybe I could use `targets`' existing `digest2int()` number somehow instead of the default recursive behavior of `nextRNGStream()`. It would be manual, but it may be the only way.
Just from a glance, I wouldn't trust myself to implement any part of https://pubsonline.informs.org/doi/pdf/10.1287/opre.47.1.159, or to manually advance streams an arbitrary number of steps.
I think `targets` needs to commit to L'Ecuyer and give up some reproducibility in the process. Reproducibility does not mean anything if the results are incorrect to begin with.
What this will look like:

- In `tar_make(use_crew = TRUE)`, `tar_make_clustermq()`, or `tar_make_future()`, you will be able to trust the statistical properties of the RNGs.
- `targets` will no longer be able to guarantee that different runs of the pipeline will replicate the RNG state for each target exactly. In other words, you may get slightly different results because the pseudo-random numbers will be slightly different.
- The metadata will store the 7-integer `.Random.seed` vector instead of just a single integer for `set.seed()`. This means you will still be able to recover the RNG state from the metadata in an interactive R session, whether with `tar_workspace()` or manually.
- Update `tarchetypes` accordingly (https://github.com/ropensci/tarchetypes/issues/111) so that the `tar_seed` column of the output has the 7-integer `.Random.seed` vector.

Plan for the main implementation:

1. Track the current L'Ecuyer RNG state in the `tar_runtime` object. Here in `utils_callr.R` seems like the best place for that.
2. Stop generating seeds in `produce_seed()`. Instead, the default seed in `target$command$seed` should be NA.
3. In `process_target()`, call `parallel::nextRNGStream()` to get the next L'Ecuyer RNG state. Then assign it as the target's seed in `target$command$seed`. Also update the state in the `tar_runtime` object. If `tar_option_get("seed")` is `NA`, then do nothing.
4. In `build_run_expr()`, if `seed` is not NA, set the RNG algorithm to L'Ecuyer and set the seed to `.GlobalEnv[[".Random.seed"]]`. No need to restore this seed on exit because (2) should take care of that. But do throw a warning if the RNGkind is not already L'Ecuyer. (A sketch follows at the end of this comment.)
5. Store `target$command$seed` in the metadata.
6. In `tar_workspace()`, load and set the new seed properly. For compatibility, check for a 7-integer seed vs a 1-integer seed.
7. In `tarchetypes`, discard the existing seed-setting behavior. Just replicate (5) and set the seed vector to the `tar_seed` column of various outputs, e.g. in `tar_map_rep()`.

Other tasks:

- Handle the `seed` argument of `tar_cue()`.
> Better yet, what if I can map a target's name string directly to a unique deterministic number of calls to `parallel::nextRNGStream()`?
Taking this idea - does it work if you simply store a hash table of the targets (names as keys) and assign each of them to an integer value? The order doesn't matter. If new targets are added, they are simply appended to the end of the hash table, adding to the integer sequence.

Then when it is actually run, you initialise the seed with `RNGkind("L'Ecuyer-CMRG"); set.seed(<user or default value>)`, take the size of the hash table, and pre-generate a list of all the required streams (random seeds) using `parallel::nextRNGStream()` in a loop. Then for each target you simply retrieve the random seed to use by looking up the hash value.
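A minimal sketch of that proposal as I read it (the table layout and helper names are hypothetical):

```r
# Hypothetical sketch of the proposal: a persistent, append-only
# name -> index table, plus streams regenerated deterministically
# from one global seed.
assign_indexes <- function(names, table = integer(0L)) {
  new <- setdiff(names, names(table)) # only new names get new indexes
  c(table, stats::setNames(seq_along(new) + length(table), new))
}

streams_for <- function(table, global_seed = 0L) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(global_seed)
  seed <- .Random.seed
  streams <- list()
  for (i in seq_len(max(table))) { # one nextRNGStream() jump per index
    seed <- parallel::nextRNGStream(seed)
    streams[[i]] <- seed
  }
  stats::setNames(streams[table], names(table)) # look up by target name
}
```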
> Taking this idea - does it work if you simply store a hash table of the targets (names as keys) and assign each of them to an integer value? The order doesn't matter. If new targets are added, they are simply appended to the end of the hash table, adding to the integer sequence.
If each successive target added gets the next integer in the sequence, then different runs of the pipeline could give different integers to the same targets. The alphabetical order of target names, as well as the topological order in the dependency graph, are race conditions. Without a way to lock each target name to a unique deterministic stream, independently of all the other targets, it seems like things would fall back to https://github.com/ropensci/targets/issues/1139#issuecomment-1707064940.
> > Taking this idea - does it work if you simply store a hash table of the targets (names as keys) and assign each of them to an integer value? The order doesn't matter. If new targets are added, they are simply appended to the end of the hash table, adding to the integer sequence.
>
> If each successive target added gets the next integer in the sequence, then different runs of the pipeline could give different integers to the same targets. The alphabetical order of target names, as well as the topological order in the dependency graph, are race conditions. Without a way to lock each target name to a unique deterministic stream, independently of all the other targets, it seems like things would fall back to #1139 (comment).
I'm suggesting that the hash map is stored as metadata. So assuming the target names are unique then it is a case of looking up the value, and if the key doesn't exist then creating one. That way each target name will always be associated with the same stream (as the table is immutable, but appendable).
I see. I do already plan to store the state of the stream for each target in the metadata, but I would rather not make guarantees about being able to recover it. Users can clear some or all of the metadata at any time.
Right. I think that's equivalent, but just storing the integer value may be not only more efficient but also more reproducible: if additional targets are added, they can be easily appended and the correct streams generated. Whereas if only the streams are stored, you would also need to separately mark the last-used stream in order to generate the next one, etc. But just a suggestion based on your earlier observation - I don't profess to know all the intricacies involved.
FYI sections 6 and 7 of the parallel package vignette (on RNG and load-balancing) are a very clear and concise summary:

```r
vignette("parallel", package = "parallel")
```
Thanks for the reference, I find it much clearer than the L'Ecuyer papers I skimmed yesterday.
From Section 6 ("Random-number generation"):

> When an R process is started up it takes the random-number seed from the object .Random.seed in a saved workspace or constructs one from the clock time and process ID when random-number generation is first used (see the help on RNG). Thus worker processes might get the same seed because a workspace containing .Random.seed was restored or the random number generator has been used before forking: otherwise these get a non-reproducible seed (but with very high probability a different seed for each worker).
>
> The alternative is to set separate seeds for each worker process in some reproducible way from the seed in the master process. This is generally plenty safe enough, but there have been worries that the random-number streams in the workers might somehow get into step.
From the above, it seems like `targets` is not in a crisis like I initially feared. In fact, it actually has advantages over what most people seem to do. The `set.seed()` seed of each target is the `digest::digest2int()` of the target name. That means seeds are guaranteed to be:
(3) in particular cuts RNG sequences short, which makes overlap less likely.
On top of all this, yesterday I had a really tough time trying to use `parallel::nextRNGStream()` in `targets` (https://github.com/ropensci/targets/tree/1139). There are so many subtle things it breaks, chief among them reproducibility: the same code with the same set of target names no longer produces the same results.
In the short term, I think it would be much better if I write a whole chapter in https://books.ropensci.org/targets about pseudo-random number generation. The default behavior is worth explaining in depth. In addition, I can offer a workaround: set `tar_option_set(seed = NA)` to prevent `targets` from setting seeds altogether, then run `targets::tar_make(use_crew = TRUE)` with `crew` to get widely-spaced L'Ecuyer streams automatically.
And I wonder if there is some way I can make it easy for `targets` users to test their pipeline seed configuration. I am thinking:
For (2), I am wondering how best to simulate forward from the seed. (Maybe `stats::runif()`? Maybe something even more basic?) And I am wondering how far forward to go. 100K steps? 10M steps? An empirical approach could give assurance to individual users for specific pipelines, and it could help understand how big a problem overlapping streams actually is in real life. For all I know, this problem may be extremely rare in practical workflows.
Agree, shouldn't be a 'crisis' situation for targets. The streams implementation in R was based on L'Ecuyer 2002 - so the 'state of the art' is still 20 years old!
Highlighting this topic in the book is a sensible approach - and it would help to disseminate the ideas - it is completely by luck that I found the (very helpful) vignette!
If you do end up developing testing, I just refer you to the end of the examples in `?RNG`:

```r
sum(duplicated(runif(1e6))) # around 110 for default generator
```

as a good starting point.
I wrote https://books.ropensci.org/targets/random.html, which was a long time coming anyway. Unless there is a non-sequential, non-recursive way to generate widely-spaced seeds for base R, this is all I can do about the issue of statistical independence in `targets`.
https://github.com/DavisVaughan/furrr/issues/265#issuecomment-1731829172 spells out the biggest reason why recursively-defined RNG streams are not feasible in `targets`.
Your diagrams seem to make it clearer - when you add 'model', as long as you know (keep track of) what the last stream is, you can assign it the next stream - and keep going as new targets are added. I'm missing the reason for why the streams have to follow any particular order. The guarantee is just that they don't overlap.
Just thinking through a bit more, it does get messy potentially, but you could assign each target a monotonically increasing integer sequence, which would remain immutable even if targets get deleted. You'd probably want to actually store the seed for each target. But this would mean at least theoretically that you could reproduce with just the initial seed (and enough `nextRNGStream()` calls). If this metadata gets deleted, then you wouldn't be able to guarantee reproducibility.
> Your diagrams seem to make it clearer - when you add 'model', as long as you know (keep track of) what the last stream is, you can assign it the next stream - and keep going as new targets are added. I'm missing the reason for why the streams have to follow any particular order. The guarantee is just that they don't overlap.
`targets` is essentially GNU Make for R. When you compile a project in Make, if the time stamp of a source file is earlier than that of its object file, then Make skips the target. `targets` does the same thing, but for data analysis. To decide whether to skip or rerun a task, `targets` looks at the R code, the input data, the output data, other settings, and the RNG seed. If any of these things change, the target reruns. Otherwise, the pipeline skips the target to save time. So it's really important that a target be able to stick with the RNG state it chooses initially. With recursive L'Ecuyer streams, the RNG state would have to depend on some kind of ordering. No matter what the ordering is initially, it changes when the pipeline changes, and the effects on the RNG streams would cause tasks to rerun which should not need to rerun. Make sense?
> Just thinking through a bit more, it does get messy potentially, but you could assign each target a monotonically increasing integer sequence, which would remain immutable even if targets get deleted. You'd probably want to actually store the seed for each target. But this would mean at least theoretically that you could reproduce with just the initial seed (and enough `nextRNGStream()` calls). If this metadata gets deleted, then you wouldn't be able to guarantee reproducibility.
As you mentioned here and in https://github.com/ropensci/targets/issues/1139#issuecomment-1707213462, this would rely on the metadata for reproducibility, and that metadata could be deleted. In fact, it is common for users to start over from scratch with `targets::tar_destroy()` and then expect reproducible results starting only with the code. So unfortunately that type of workaround is not feasible.
> In fact, it is common for users to start over from scratch with `targets::tar_destroy()` and then expect reproducible results starting only with the code.
Thanks for the explanation @wlandau, it does all make sense. What I was missing was the bit above. If targets is meant to be stateless, then I would agree that RNG streams don't work.

What I was thinking of was effectively preserving the state after the first and each subsequent run. So you could literally repeat the experiment (with the recorded state) and therefore match the previous result. But that obviously precludes starting from scratch as per the above.
@HenrikBengtsson, I think we can continue part of our discussion from https://github.com/DavisVaughan/furrr/issues/265 here. We were talking about why recursively defined L'Ecuyer streams are not feasible for `targets`.
Regarding https://github.com/DavisVaughan/furrr/issues/265#issuecomment-1731993697: if only it were that simple.
A target is very different from an ordinary task because we need to know whether it is up to date. If a target's RNG stream changes from one run to the next, it is not up to date, which means the target needs to rerun. https://github.com/DavisVaughan/furrr/issues/265#issuecomment-1731829172 attempts to explain how that can happen in a non-sequential DAG.
Maybe your example from https://github.com/DavisVaughan/furrr/issues/265#issuecomment-1731993697 would be easier because it uses independent tasks. Say we start with 3 tasks. Here are the seeds:
RNGkind("L'Ecuyer-CMRG")
tasks <- c("task_a", "task_b", "task_c")
first_seed <- .Random.seed
seeds <- list()
seeds[[tasks[1L]]] <- first_seed
for (index in seq(2L, length(tasks))) {
seeds[[tasks[index]]] <- parallel::nextRNGStream(seeds[[index - 1L]])
}
str(seeds)
#> List of 3
#> $ task_a: int [1:7] 10407 -511557822 1908736475 498441056 -1632547871 1622487086 -726144297
#> $ task_b: int [1:7] 10407 149605160 279562417 922798074 -295944740 -2095294832 757822457
#> $ task_c: int [1:7] 10407 972714371 1811980947 954238892 2133507052 965840020 1200680439
Now suppose we insert a new task between `task_a` and `task_b`. The RNG stream for `task_a` stays the same, but `task_b` and `task_c` now have different streams. The new task gets the stream `task_b` had last time, `task_b` gets the stream `task_c` had last time, and `task_c` gets a stream it never had before.
tasks <- c("task_a", "task_new", "task_b", "task_c")
seeds <- list()
seeds[[tasks[1L]]] <- first_seed
for (index in seq(2L, length(tasks))) {
seeds[[tasks[index]]] <- parallel::nextRNGStream(seeds[[index - 1L]])
}
#> str(seeds)
#> List of 4
#> $ task_a : int [1:7] 10407 -511557822 1908736475 498441056 -1632547871 1622487086 -726144297
#> $ task_new: int [1:7] 10407 149605160 279562417 922798074 -295944740 -2095294832 757822457
#> $ task_b : int [1:7] 10407 972714371 1811980947 954238892 2133507052 965840020 1200680439
#> $ task_c : int [1:7] 10407 -1556413327 -444875998 37935421 -1864914985 -341121962 -1229046830
In ordinary parallel computing, this is not a problem. You simply run the computation and note the seed that each task ran with. But in `targets`, the change in RNG seeds of `task_b` and `task_c` is highly disruptive. These tasks should have nothing to do with any of the other tasks, and yet bizarrely, they need to rerun because `task_new` was inserted. This is what I meant when I talked about inefficiency in https://github.com/DavisVaughan/furrr/issues/265#issuecomment-1731829172. It's not because of calling `nextRNGStream()`, that's actually quite fast. The inefficiency comes from rerunning `task_b` and `task_c`.
In `targets`, we should have a DAG with independent tasks. That should mean any change to `task_a` or `task_new`, or the insertion of `task_new`, should not invalidate `task_b` or `task_c`. This is the DAG we want:
```mermaid
graph LR
  task_a
  task_new
  task_b
  task_c
```
But if we were to use recursive L'Ecuyer streams, that would induce dependency relationships we don't want:
```mermaid
graph LR
  task_a --> task_new
  task_new --> task_b
  task_b --> task_c
```
In terms of how `targets` is used, here is how we would set up a pipeline with independent tasks (c.f. https://books.ropensci.org/targets/walkthrough.html). First, we create a `_targets.R` file, which is like a Makefile for data analysis tasks in R.
```r
# _targets.R file:
library(targets)
list(
  tar_target(task_a, runif(1)),
  tar_target(task_b, runif(1)),
  tar_target(task_c, runif(1))
)
```
We inspect the graph and correctly see that the tasks are independent (there are no edges).
```r
# R console:
library(targets)
tar_mermaid()
```
```mermaid
graph LR
  task_a
  task_b
  task_c
```
When we run the pipeline, the tasks run in topological order (with optional parallelism using integration with `crew`):
```r
# R console
tar_make()
#> ▶ start target task_a
#> ● built target task_a [0 seconds]
#> ▶ start target task_b
#> ● built target task_b [0 seconds]
#> ▶ start target task_c
#> ● built target task_c [0 seconds]
#> ▶ end pipeline [0.476 seconds]
```
`targets` stores the result of each task and the RNG seed it used.
```r
# R console
tar_read(task_c)
#> [1] 0.3329109
tar_meta(task_c, seed)
#> # A tibble: 1 × 2
#>   name         seed
#>   <chr>       <int>
#> 1 task_c -439103021
```
Next time we run the pipeline, all our tasks are up to date. That's good because nothing changed since the last run.
```r
# R console:
tar_make()
#> ✔ skip target task_a
#> ✔ skip target task_b
#> ✔ skip target task_c
#> ✔ skip pipeline [0.593 seconds]
```
To insert a new task, we modify `_targets.R`:
```r
# _targets.R file
library(targets)
list(
  tar_target(task_a, runif(1)),
  tar_target(task_a_new, runif(1)),
  tar_target(task_b, runif(1)),
  tar_target(task_c, runif(1))
)
```
The new task `task_a_new` should be totally independent of the other tasks because it does not use the data from those tasks. This is the graph we see:
```r
# R console
tar_mermaid()
```

```mermaid
graph LR
  task_a
  task_a_new
  task_b
  task_c
```
When we run the pipeline, only `task_a_new` runs because the others are up to date. That's good.
```r
# R console:
tar_make()
#> ✔ skip target task_a
#> ▶ start target task_a_new
#> ● built target task_a_new [0 seconds]
#> ✔ skip target task_b
#> ✔ skip target task_c
#> ▶ end pipeline [0.458 seconds]
```
However: if `targets` were to assign recursive L'Ecuyer streams in topological order, the insertion of `task_a_new` would have changed the RNG streams of `task_b` and `task_c`. We would have had to rerun `task_b` and `task_c`. This not only causes a lot more computation for the user, it also obscures the true dependency relationships among the targets in the pipeline, which decreases the user's high-level understanding of the pipeline as a whole.
```r
# What *would* have happened with recursive L'Ecuyer streams for each target:
tar_make()
#> ✔ skip target task_a
#> ▶ start target task_a_new
#> ● built target task_a_new [0 seconds]
#> ▶ start target task_b
#> ● built target task_b [0 seconds]
#> ▶ start target task_c
#> ● built target task_c [0 seconds]
#> ▶ end pipeline [0.476 seconds]
```
A workaround @shikokuchuo proposed in https://github.com/ropensci/targets/issues/1139#issuecomment-1731921850 is to cache the RNG seed of a target the first time it is assigned, then not modify it. The problem is that users often delete metadata and begin again with only the code. In other words, `targets` needs seeds to be reproducible and stateless, fixed in place using only the information given in the R code.
Advancing this conversation, I believe the desire for targets to be the 'GNU make for R' is somewhat of a red herring. I just note here that (i) GNU parallel make suffers from its own problems and (ii) it is not expected to be doing statistical analyses in these parallel processes where statistical validity is paramount.
If we were to be dogmatic that targets needs to be stateless in the way that GNU make (algorithm) is stateless (as far as I'm aware), then I would agree that this seems not to be reconcilable with recursively defined L'Ecuyer RNG streams.
However, from a practical viewpoint, I think we agree that the order of how streams are assigned is not important. Hence the proposal you mention above will guarantee that any result is reproducible ex post, having recorded the seeds used. This I imagine is important if results need to be verifiable (by re-running the computation) by a regulator etc.
The issue then is if the code/data is repeated on another machine without access to the seeds, then different ones may be assigned. However, from a statistical perspective, we should be indifferent here. Ex ante, any statistic of interest conditioned on the data should be identical to that conditioned on the data and a set of random seeds. Put another way, the seeds should have no relevant information content. That surely is the very definition of a PRNG with desirable properties, and if that's not the case then whether we use widely-spaced seeds seems moot.
Hence, even if we can't guarantee the exact same results, the statistical validity of the 2 sets of results can be guaranteed. If some finding of statistical significance under one set of seeds does not hold under another then that in itself is problematic, is it not? Just to note, I am speaking as an economist rather than a statistician, so if some of my reasoning above is off, do feel free to point out!
From another angle, the actual seeds used in targets are generated based on the target names (hashed to int, I believe), but say we replicated the code and data on a different machine, but just named the targets differently. Sure the seeds assigned will be different, and any random draws will be different, but they would both be the same experiment and both sets of results would be equally valid. Ex ante, I should have no reason to prefer one over the other.
In the same way, if you run targets recording the seeds used, and make changes along the way, the history of runs can affect the actual seeds supplied, but there should be no reason to prefer an arbitrarily defined set of seeds (by order in a DAG etc.).
From yet another angle, you could define targets as requiring a set of seeds (pre-generating them as per Henrik), and then it would just be an input of the algorithm rather than it 'having state' per se, although this seems more just semantics. It seems more a question of explaining the above and the trade-offs involved.
But I do believe reproducibility would be preserved (in the most meaningful way), and statistical validity at the same time.
You make really good points.
I can see this is going to haunt me until `targets` adopts recursively generated L'Ecuyer streams. I can see no other choice, so I am going to try again to implement it. But it will be disruptive for users, painful to implement, and an incredible loss for the package.
Other comments:

1. `?set.seed` says L'Ecuyer "...exhibits 6 clear failures in each of the TestU01 Crush and the BigCrush suite (L'Ecuyer, 2007)." Can we trust an RNG algorithm for parallel workflows which we can't even trust in sequential ones?
2. `vignette("parallel", package = "parallel")` says: "The alternative is to set separate seeds for each worker process in some reproducible way from the seed in the master process. This is generally safe enough, but there have been worries that the random-number streams in the workers might somehow get into step." It does not sound like the `set.seed()` approach is totally invalid or catastrophic, and "worries" sound like speculation to me. How often do you actually get overlapping streams in real life?
3. The point of `nextRNGStream()` is to advance the RNG far enough to decrease the chance of overlap. But every RNG is periodic, so after enough calls to `nextRNGStream()`, wouldn't the `n`'th "next stream" eventually land us back where we started? See Figure 1 of https://pubsonline.informs.org/doi/10.1287/opre.50.6.1073.358.

Using @shikokuchuo's tip about empirically assessing duplication, here is how an example `targets` pipeline currently compares to an example L'Ecuyer workflow. It doesn't rule out the possibility of bias altogether, but it can show how much extra bias is in a `targets` pipeline due to the naive `set.seed()` approach:
RNGkind("Mersenne-Twister")
library(targets)
tar_script({
list(
tar_target(x, seq_len(99L)),
tar_target(y, x, pattern = map(x))
)
})
tar_make()
seeds <- tar_meta(targets_only = TRUE)$seed
draws <- purrr::map(seeds, \(seed) withr::with_seed(seed, runif(1e6)))
100 * mean(duplicated(unlist(draws))) # percent duplicated
#> [1] 1.165394
RNGkind("L'Ecuyer-CMRG")
draws <- list()
seed <- .Random.seed
for (i in seq_len(100)) {
seed <- parallel::nextRNGStream(seed)
.Random.seed <- seed
draws[[i]] <- runif(1e6)
}
100 * mean(duplicated(unlist(draws))) # percent duplicated
#> [1] 1.154689
Update: I expanded https://books.ropensci.org/targets/random.html#measuring-independence to include comparisons like this one.
Responding to some specific comments:
> Hence, even if we can't guarantee the exact same results, the statistical validity of the 2 sets of results can be guaranteed. If some finding of statistical significance under one set of seeds does not hold under another then that in itself is problematic, is it not? Just to note, I am speaking as an economist rather than a statistician, so if some of my reasoning above is off, do feel free to point out!
It is still valuable to recapture and investigate anomalies, especially from a regulatory perspective. And this is much harder to do if reproducibility is only ex post. From the perspective of `targets`, that is the sort of damage that would come with recursive L'Ecuyer streams.
> From another angle, the actual seeds used in targets are generated based on the target names (hashed to int, I believe), but say we replicated the code and data on a different machine, but just named the targets differently. Sure the seeds assigned will be different, and any random draws will be different, but they would both be the same experiment and both sets of results would be equally valid. Ex ante, I should have no reason to prefer one over the other.
Renaming a target changes its seed, but that doesn't matter for the kind of ex ante reproducibility we want for `targets`: if a regulator or journal reviewer is trying to reproduce a set of results, they will definitely use the original set of code and target names, but the ex post seeds may not be available if the original authors deleted the metadata.
> In the same way, if you run targets recording the seeds used, and make changes along the way, the history of runs can affect the actual seeds supplied, but there should be no reason to prefer an arbitrarily defined set of seeds (by order in a DAG etc.).
That is why `targets` supports a configurable global seed. `tar_option_set(seed = 123)` will give a different set of reproducible results than `tar_option_set(seed = 456)`. In addition, `tar_option_set(seed = NA)` avoids setting seeds altogether.
Given https://github.com/ropensci/targets/issues/1139#issuecomment-1732570487, https://github.com/ropensci/targets/issues/1139#issuecomment-1732573814, and https://github.com/ropensci/targets/issues/1139#issuecomment-1732577944, I am going to pause implementation on this. From my perspective:

- It is not clear how harmful the current `set.seed()` approach really is, especially in real life.
- Switching `targets` to recursive L'Ecuyer streams would cause significant harm to the package, as far as its ability to provide strong enough reproducibility. (https://github.com/ropensci/targets/commit/31b6733d333bd9bc07a7c3c029199bdaa55be9fc#commitcomment-128223702 summarizes the current state of https://github.com/ropensci/targets/tree/1139 and what would need to be done if continued.)
I will reopen this issue if it becomes clear that `targets` needs to change RNG strategies. A switch might or might not improve statistical validity (that part is unclear to me), but it would definitely hurt reproducibility.
In ordinary statistical workflows, you expect to set a seed in your main top-level R script, and you expect that that one seed will precisely determine everything that follows. Unlike the ordinary map/reduce calls of the `future` ecosystem, this expectation would break in `targets` with recursive L'Ecuyer streams.
I think pausing is sensible. Also just to be clear I'm not advocating for any change in targets, just thinking through the implications.
I'll just leave a couple of observations for when you come back to this:
(i) It would not work to just leave the generated seeds as part of metadata that could be deleted. It has to be mandated that the seeds are retained. Implementation could range from documenting that results are not complete without a set of seeds, all the way to having targets programmatically write the seeds back into the script.

(ii) Either a pipeline has never been run, in which case there are no seeds and any two people running it will get the same results, or a pipeline has been run, in which case a set of seeds has been generated. If this is shared with the other party, then they can also get the same results. If, after running a pipeline, the seeds are purposefully deleted, then we are back to the first case, and running without seeds will still get us the same results for both parties. Publishing a set of results without the seeds would be incomplete and not reproducible, and prohibited as per (i).
Requiring the seeds explicitly just removes the dependency, nothing else. As an example: in current targets, the seeds depend on the target names. A legitimate reason for changing the names may be as simple as wishing to correct a typo. In this case this will cause a re-run even though this is not necessary. If the seed for each target were supplied then this would not have to be the case. Similarly, if using recursive L'Ecuyer streams, there is a dependency now on the ordering rather than the names. Explicitly requiring the seeds just removes this dependency.
Thanks, @shikokuchuo. I like how you laid all that out.
> A legitimate reason for changing the names may be as simple as wishing to correct a typo. In this case this will cause a re-run even though this is not necessary.
For this reason, in `drake` I tried and failed to implement a way to programmatically rename a target without changing its seed or causing it to rerun. It turned out to be hopelessly messy and infeasible for both myself and users.
On this point, I will also add that target-name-based seeds guarantee uniqueness (aside from the vanishingly small chance of hash collisions in `digest::digest2int()`), which is already more robust than the PID + clock-time method that most of the rest of the world currently uses. Also, a user trying to recapture or preserve an anomaly will not likely be renaming the affected targets. And if they follow good programming practices, the code will be under version control anyway, which means the RNG state can be easily recovered from the commit history.
> In this case this will cause a re-run even though this is not necessary.
In `targets`, this would be true regardless of the seed policy, and my exploration of this issue was part of the `drake` experiment I mention above.
> If the seed for each target were supplied then this would not have to be the case.
If the seed were independent of the target, then yes, after the target reruns the result would be the same.
> Similarly, if using recursive L'Ecuyer streams, there is a dependency now on the ordering rather than the names. Explicitly requiring the seeds just removes this dependency.
I have resisted adding a `seed` argument to `tar_target()` which would allow this (or with some modification, maybe even require it). I have tried to shield users from this burden because, as we have found, it is a surprisingly hard topic. But anyone is free to manually call `set.seed()` within an individual task.
I will add: it looks like `parallel::nextRNGStream()` advances the state by 2^127 steps, and `parallel::nextRNGSubStream()` advances it by 2^76 steps. If it is somehow possible, efficient, and helpful to advance a fixed global seed by `2^127 * digest::digest2int(x = "target_name", seed = targets_global_seed)` steps, this would break the recursion and be a much more feasible proposition for `targets`. But I suspect this just adds more epicycles without achieving the same level of parallel safety as recursive streams.
> Using @shikokuchuo's tip about empirically assessing duplication, here is how an example `targets` pipeline currently compares to an example L'Ecuyer workflow. It doesn't rule out the possibility of bias altogether, but it can show how much extra bias is in a `targets` pipeline due to the naive `set.seed()` approach: [...]
>
> Update: I expanded https://books.ropensci.org/targets/random.html#measuring-independence to include comparisons like this one.
Just a note on this before I forget - I think you read too much into my suggestion, or you are attributing too much to me. I don't in fact think it works to try to measure this empirically in this way - it's a binary event: either there is overlap or there isn't. I don't think we're capturing anything but noise in the above.
> 2. `vignette("parallel", package = "parallel")` says: "The alternative is to set separate seeds for each worker process in some reproducible way from the seed in the master process. This is generally safe enough, but there have been worries that the random-number streams in the workers might somehow get into step." It does not sound like the `set.seed()` approach is totally invalid or catastrophic, and "worries" sound like speculation to me. How often do you actually get overlapping streams in real life?
This "generally safe enough" pre-supposes that the seeds set are themselves randomly-generated. See https://www.sciencedirect.com/science/article/pii/S0378475416300829 L'Ecuyer 2017.
Quote: "A single RNG with a “random” seed for each stream. When jumping ahead is too expensive and the RNG has a huge period, it can be reasonable to just select a (pseudo)random seed for each stream, using another RNG if we want reproducibility. "
Hence I think you'd need to pair digest::hash2int
with another RNG to generate the seed. Or replacing with a cryptographic hash function could work - I'm sure some experts might disagree, but it would seem to be equivalent.
> Just a note on this before I forget - I think you read too much into my suggestion, or you are attributing too much to me.
Sorry about that. NB in https://books.ropensci.org/targets/random.html I make no attributions except to R core and L'Ecuyer.
> See https://www.sciencedirect.com/science/article/pii/S0378475416300829 (L'Ecuyer 2017).
Thanks for the reference. It seems like the authors understand the kind of challenge `targets` faces:

> There are situations where using one or more central monitor(s) to create the streams is not acceptable, e.g., because new streams have to be created dynamically and randomly within threads as a simulation goes on, without resorting to external information, and with a guarantee that the results are exactly reproducible [14,53]. In this setting, we want each stream to be able to create children streams by itself at any time, using a deterministic mechanism.
> Hence I think you'd need to pair `digest::digest2int()` with another RNG to generate the seed.
Unfortunately that's not feasible, it's the same issue as L'Ecuyer streams where they all come from the same "central monitor".
> Or replacing with a cryptographic hash function could work
That's definitely feasible to implement for `targets`. We would just need a nice way to map ordinary character hashes to 32-bit integers.
> I'm sure some experts might disagree, but it would seem to be equivalent.
Would you elaborate? I am not sure I see the connection. A substantial justification would be extremely valuable.
> > Hence I think you'd need to pair `digest::digest2int()` with another RNG to generate the seed.
>
> Unfortunately that's not feasible, it's the same issue as L'Ecuyer streams where they all come from the same "central monitor".
I understood the paper to mean using the hash as the seed value to generate a random number (e.g. using MT), to then use as the L'Ecuyer seed. Everything is then deterministic and non-recursive (and it would seem feasible).
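In code, that reading of the suggestion might look like this sketch (the helper name and the choice of Mersenne-Twister as the intermediate RNG are assumptions, not `targets` internals):

```r
# Sketch: a deterministic, non-recursive L'Ecuyer state for one target.
# digest2int(name) seeds an intermediate Mersenne-Twister generator,
# one MT draw becomes the L'Ecuyer seed, and the resulting 7-integer
# .Random.seed is the target's stream.
lecuyer_state <- function(name, global_seed = 0L) {
  mt_seed <- digest::digest2int(name, seed = global_seed)
  set.seed(mt_seed, kind = "Mersenne-Twister")
  scrambled <- sample.int(n = .Machine$integer.max, size = 1L)
  set.seed(scrambled, kind = "L'Ecuyer-CMRG")
  .Random.seed
}
```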
> > Or replacing with a cryptographic hash function could work
>
> That's definitely feasible to implement for `targets`. We would just need a nice way to map ordinary character hashes to 32-bit integers.
>
> > I'm sure some experts might disagree, but it would seem to be equivalent.
>
> Would you elaborate? I am not sure I see the connection. A substantial justification would be extremely valuable.
The paper's 'possibly good enough' solution is to use a 'random' seed for each L'Ecuyer stream. Their suggestion is to use another RNG - so map from a deterministic seed into random space. Here you have a hash function, `digest2int()`, but no assurance on its quality, so let's call the output 'correlated'. One solution would be to use an RNG to map this 'correlated' space into 'uncorrelated' or 'random' space. However a cryptographic hash, by definition, already maps into 'random' or 'uncorrelated' space. Why I anticipate potential disagreement (from statistical quarters) is that there is no guarantee on desirable statistical properties of a cryptographic hash, but I don't believe that is important here as we are not using it as an RNG and solely interested in the mapping.
> Why I anticipate potential disagreement (from statistical quarters) is that there is no guarantee on desirable statistical properties of a cryptographic hash, but I don't believe that is important here as we are not using it as an RNG and solely interested in the mapping.
I agree: I took a closer look at the paper, and I don't think the statistical properties matter (i.e. approximating U(0, 1)) for the RNG that generates the seeds. In the subsection "A single RNG with a 'random' seed for each stream" (Section 4: "How to produce parallel streams and substreams"), when the authors argue that overlap is extremely unlikely, their argument only relies on the magnitude of the period, and it does not depend on the statistical uniform-ness of that upstream RNG.
So effectively, `targets` takes this approach:

A. For the output function g(), use a hash algorithm. This defines a counter-based RNG. (Section 3: "Main classes of RNGs for simulation", last subsection: "Counter-based RNGs".)

B. For the counter `i` in the counter-based RNG, use each target's own name instead of a recursive index.

C. Use `g(i)` as the target seed (Section 4: "How to produce parallel streams and substreams", subsection "A single RNG with a 'random' seed for each stream").

D. Within each target, use a multiple recursive generator (MRG) stream which starts with seed `g(i)`.

Section 4 of the paper justifies (C) and (D).

For (B), the paper does not offer a rigorous justification. However, target name strings easily map to small integers, so I believe this is equivalent to taking small variable jumps ahead in the counter-based stream. As long as that counter-based RNG has a large period, the argument in Section 4 holds, and overlap is vanishingly unlikely in the long run.
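A compact restatement of (A)-(D) in code, with `digest2int()` standing in for g() (a sketch of the scheme, not `targets` internals):

```r
# (A) A hash as the output function g() of a counter-based RNG.
g <- function(i) digest::digest2int(i)
# (B)-(C) The "counter" is the target's name, and g(name) is its seed.
seed <- g("target_name")
# (D) Each target runs an MRG stream starting from that seed.
set.seed(seed, kind = "L'Ecuyer-CMRG")
runif(3L) # this target's draws come from its own stream
```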
(A) is the most challenging. We need the counter-based RNG to have a large period for the argument in Section 4 to hold. `digest::digest2int()` uses Jenkins' one-at-a-time hash, which is not cryptographic, and it is unclear how large the period is for `digest::digest2int(i)` as an RNG. Switching to a cryptographic hash may not be straightforward because cryptographic hashes like MD5 are more than 32 bits, and as far as I know, `set.seed()` only accepts an ordinary 32-bit integer. So I would either need to find a 32-bit cryptographic hash, which may not exist, or map it to 32-bit space (e.g. with modular arithmetic or another hash), which may invalidate the hash's original cryptographic properties.
I suppose `digest::digest2int(cryptographic_hash(target_name))` is fine since it at least goes through a cryptographic hash, which increases the period even if it does not increase the number of possible output values.
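Concretely, that two-stage mapping might look like the following sketch (`digest()` and `digest2int()` are real `digest` functions; the wrapper is hypothetical):

```r
# Sketch: route the target name through a cryptographic hash first,
# then collapse the hex digest to a 32-bit integer for set.seed().
library(digest)
seed_int <- function(name) {
  digest2int(digest(name, algo = "sha256", serialize = FALSE))
}
seed_int("target_name")
```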
Hmmm... this might also slow down `targets` a lot.
```r
library(digest)
library(microbenchmark)
name <- "a"
microbenchmark(
  int = digest2int(name),
  md5 = digest(name, algo = "md5"),
  sha256 = digest(name, algo = "sha256"),
  sha512 = digest(name, algo = "sha512"),
  mt = {
    set.seed(utf8ToInt(name))
    sample.int(n = .Machine$integer.max, size = 1L)
  }
)
#> Unit: nanoseconds
#>    expr   min      lq      mean  median      uq      max neval
#>     int   507   730.5   1105.23   811.5  1069.0    17599   100
#>     md5 17237 18389.0 133541.58 21435.0 24110.5 10555007   100
#>  sha256 19013 20269.5  24911.84 22746.0 24862.5   152663   100
#>  sha512 15633 17178.0  27703.10 19355.5 21550.5   406789   100
#>      mt 11186 13200.5  30920.48 13951.5 16907.0   475125   100
```
Created on 2023-10-12 with reprex v2.0.2
I know the `nanonext` hashes are a bit faster, but some users can't install NNG because their version of Ubuntu is too low, so I hesitate to make it a hard dependency.
We don't need serialization, and that makes things 2x faster.
```r
library(digest)
library(microbenchmark)
name <- "a"
microbenchmark(
  int = digest2int(name),
  md5 = digest(name, algo = "md5", serialize = FALSE),
  sha256 = digest(name, algo = "sha256", serialize = FALSE),
  sha512 = digest(name, algo = "sha512", serialize = FALSE),
  mt = {
    set.seed(utf8ToInt(name))
    sample.int(n = .Machine$integer.max, size = 1L)
  }
)
#> Unit: nanoseconds
#>    expr   min      lq     mean  median      uq    max neval
#>     int   493   635.5   894.52   693.0   817.0  19359   100
#>     md5 10345 10700.0 13325.78 10904.5 11450.5 234411   100
#>  sha256 12306 12677.0 13282.54 13033.5 13580.5  29117   100
#>  sha512  9043  9541.5 10088.13  9954.0 10444.5  13210   100
#>      mt 11100 12095.5 12707.84 12509.0 12898.5  29362   100

name <- "longlonglonglonglonglonglonglonglonglonglonglonglonglonglonglong"
microbenchmark(
  int = digest2int(name),
  md5 = digest(name, algo = "md5", serialize = FALSE),
  sha256 = digest(name, algo = "sha256", serialize = FALSE),
  sha512 = digest(name, algo = "sha512", serialize = FALSE),
  mt = {
    set.seed(utf8ToInt(name))
    sample.int(n = .Machine$integer.max, size = 1L)
  }
)
#> Unit: nanoseconds
#>    expr   min      lq     mean  median      uq   max neval
#>     int   558   681.0   777.75   747.5   807.5  3081   100
#>     md5 10466 10792.5 11365.36 11285.0 11840.0 13044   100
#>  sha256 12660 13078.5 13749.33 13440.0 14168.5 20253   100
#>  sha512  9021  9592.5 10440.70  9834.5 10391.0 48685   100
#>      mt 11662 12613.5 13318.57 13152.0 13564.0 33029   100
```
Created on 2023-10-12 with reprex v2.0.2
From profiling #1168, the performance hit should be less than 1% in most pipelines, and it is not likely to increase above 3.5% even with `cue = tar_cue(file = FALSE)`.
On a Mac it's a bit more overhead: 2% with `tar_cue(file = TRUE)`, 5.8% with `tar_cue(file = FALSE)`. But still not terrible and certainly not a bottleneck. The pipeline I profiled was the one below (lots of quick small targets), and I ran `proffer::pprof(tar_outdated(callr_function = NULL))` on an up-to-date pipeline.
```r
library(targets)
library(tarchetypes)
tar_option_set(cue = tar_cue(file = FALSE))
list(
  tar_target(x, 1:5000),
  tar_target(ordinary_name, x, pattern = map(x))
)
```
> I know the `nanonext` hashes are a bit faster, but some users can't install NNG because their version of Ubuntu is too low, so I hesitate to make it a hard dependency.
How so? Ubuntu focal has NNG 1.5.2, which supports `nanonext` out of the box. Older versions of Ubuntu are already EOL, but in any case will have no trouble automatically building the libs from the bundled source.
You do realise that `nanonext::sha1()` is the same order of magnitude as your original `digest2int()`, so there would be no discernible difference.
But it's very good you've implemented a 'correct' solution in any case.
Also, instead of hashing twice, have you considered just letting the value wrap around? The hash produces a hex value, which can be translated into an integer. If this overflows an R integer, then you treat it as if it were a C unsigned int, which cannot overflow, but simply wraps around (defined in the standard), so that int.max + 1 = 0.
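One way to express the wrap-around idea in R (a sketch: `digest::digest()` is real, the wrapper and the choice of taking the first 8 hex digits are assumptions):

```r
# Sketch: reinterpret the first 32 bits of a cryptographic digest as a
# signed integer via two's-complement wrap-around.
library(digest)
hash_to_int <- function(name) {
  hex <- substr(digest(name, algo = "sha256", serialize = FALSE), 1L, 8L)
  u <- as.numeric(paste0("0x", hex)) # unsigned 32-bit value held as a double
  if (u >= 2^31) u <- u - 2^32       # wrap into the signed 32-bit range
  as.integer(u)                      # caveat: exactly -2^31 is NA_integer_ in R
}
hash_to_int("target_name")
```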
> How so? Ubuntu focal has NNG 1.5.2, which supports nanonext out of the box. Older versions of Ubuntu are already EOL, but in any case will have no trouble automatically building the libs from the bundled source.
A few months ago, @MilesMcBain reported having trouble with Ubuntu 18.04: https://github.com/ropensci/targets/discussions/1069. Even with the impetus to upgrade, I wonder how many users still run these older OS's.
> You do realise that nanonext::sha1() is the same order of magnitude as your original digest2int(), so there would be no discernible difference.
Is SHA1 strong enough for this?
> Also, instead of hashing twice, have you considered just letting the value wrap around? The hash produces a hex value, which can be translated into an integer. If this overflows an R integer, then you treat it as if it were a C unsigned int, which cannot overflow, but simply wraps around (defined in the standard), so that int.max + 1 = 0.
I thought about it, but it feels like modular arithmetic, which is essentially cutting off the most significant bits in the hash value. I don't know what cryptographic properties we lose if we just take part of the hash, so I thought it would be more rigorous to supply the entire hash to `digest2int()` instead. All this assumes `set.seed()` only takes 32-bit integers. I will investigate that part.
Hmm... it turns out `seed` in `set.seed()` needs to be a 32-bit integer. I will start an issue at https://github.com/HenrikBengtsson/Wishlist-for-R.
`targets` has two major challenges with pseudo-random number generation:

1. Reproducibility: each target needs a deterministic seed so the pipeline gives the same results from one run to the next.
2. Statistical independence: the pseudo-random number sequences of different targets should not overlap.

For years, `targets` has controlled (1) in a simplistic way: set a deterministic seed for each target, where each target seed depends on the target name and an overarching global seed:

https://github.com/ropensci/targets/blob/116f1ced90e1945bc0a6888e79c8600d097f0e99/R/utils_digest.R#L35-L42
But this approach may violate (2). Different targets have different pseudo-random number generator states, and there is nothing to prevent the PRNG sequence of one target from overlapping with that of a different target.
This problem is not unique to `targets`; it is a general issue with parallel computing. C.f. https://github.com/wlandau/crew/issues/113.