ropensci / targets

Function-oriented Make-like declarative workflows for R
https://docs.ropensci.org/targets/

Statistical independence of pseudo-random numbers #1139

Closed wlandau closed 1 year ago

wlandau commented 1 year ago

targets has two major challenges with pseudo-random number generation:

  1. Statistical reproducibility. Different runs of the same pipeline must give the exact same results.
  2. Statistical independence. Pseudo-random numbers from different targets must be independent.

For years, targets has controlled (1) in a simplistic way: set a deterministic seed for each target, where each target seed depends on the target name and an overarching global seed:

https://github.com/ropensci/targets/blob/116f1ced90e1945bc0a6888e79c8600d097f0e99/R/utils_digest.R#L35-L42

But this approach may violate (2). Different targets have different pseudo-random number generator states, and there is nothing to prevent the PRNG sequence of one target from overlapping with that of a different target.

This problem is not unique to targets; it is a general issue in parallel computing.

C.f. https://github.com/wlandau/crew/issues/113.

wlandau commented 1 year ago

The comment in https://github.com/wlandau/crew/issues/113#issuecomment-1706703123 describes a solid approach to guarantee statistical independence. However, it's tough to reconcile with reproducibility in targets. I want to assign a deterministic unique stream to each target, and I don't want a given target's choice of stream to interfere with the streams of other targets. What's more, dynamic branches are defined while the pipeline is running, so I cannot know all the target names in advance (or even how many there will be by the end).

wlandau commented 1 year ago

(FYI @shikokuchuo)

wlandau commented 1 year ago

Each target already gets a "seed", which is just an integer from digest::digest2int(). These digest2int() values could be very large and/or very negative, but they are already deterministic and unique. I wonder if I can somehow map them deterministically to independent L'Ecuyer streams.
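
For reference, a minimal sketch of the current approach, with the function name and global seed default assumed (the real code is linked above):

library(digest)
produce_seed <- function(name, global_seed = 0L) {
  # digest2int() maps an arbitrary string to a deterministic 32-bit integer.
  digest2int(x = as.character(name), seed = global_seed)
}
produce_seed("task_a") # deterministic and unique up to hash collisions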

wlandau commented 1 year ago

Better yet, what if I can map a target's name string directly to a unique deterministic number of calls to parallel::nextRNGStream()?

wlandau commented 1 year ago

This is really tricky because L'Ecuyer RNG states are recursive, so a change in the seed of one target could invalidate a bunch of seemingly unrelated targets (causing them to rerun). So unless I can pin down independent RNG states deterministically, there is no way to do this in a way that preserves targets' existing reproducibility guarantees. On the other hand, simulations with tarchetypes::tar_map_rep() really need statistically independent random numbers for each batch of simulations.

wlandau commented 1 year ago

Maybe L'Ecuyer's original paper has advice about how to manually generate seeds: https://pubsonline.informs.org/doi/pdf/10.1287/opre.47.1.159. Then maybe I could use targets' existing digest2int() number somehow instead of the default recursive behavior of nextRNGStream(). It would be manual, but it may be the only way.

wlandau commented 1 year ago

Just from a glance, I wouldn't trust myself to implement any part of https://pubsonline.informs.org/doi/pdf/10.1287/opre.47.1.159, or to manually advance streams an arbitrary number of steps.

wlandau commented 1 year ago

I think targets needs to commit to L'Ecuyer and give up some reproducibility in the process. Reproducibility does not mean anything if the results are incorrect to begin with.

Here is the plan for the main implementation:

  1. Commit to L'Ecuyer as the RNG algorithm.
  2. At the start of the pipeline, initialize the L'Ecuyer state in the tar_runtime object. Here in utils_callr.R seems like the best place for that.
  3. Do not use produce_seed(). Instead, the default seed in target$command$seed should be NA.
  4. Here in process_target(), call parallel::nextRNGStream() to get the next L'Ecuyer RNG state. Then assign it as the target's seed in target$command$seed. Also update the state in the tar_runtime object (see the sketch after this list). If tar_option_get("seed") is NA, then do nothing.
  5. In build_run_expr(), if the seed is not NA, set the RNG algorithm to L'Ecuyer and assign the seed to .GlobalEnv[[".Random.seed"]]. No need to restore this seed on exit because (2) should take care of that. But do throw a warning if the RNGkind is not already L'Ecuyer.
  6. Store the seed in target$command$seed in the metadata.
  7. In tar_workspace(), load and set the new seed properly. For compatibility, check for a 7-integer seed vs a 1-integer seed.
  8. In tarchetypes, discard the existing seed setting behavior. Just replicate (5) and set the seed vector to the tar_seed column of various outputs, e.g. in tar_map_rep().
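
To make steps (4) and (5) concrete, here is a minimal sketch under assumed names; the real implementation would live in process_target() and build_run_expr():

# Track one L'Ecuyer state for the whole pipeline and give each target the next stream.
RNGkind("L'Ecuyer-CMRG")
set.seed(0L)                  # overarching pipeline seed, as in step (2)
runtime_state <- .Random.seed
run_with_next_stream <- function(command) {
  runtime_state <<- parallel::nextRNGStream(runtime_state)  # step (4)
  assign(".Random.seed", runtime_state, envir = .GlobalEnv) # step (5)
  eval(command)
}
run_with_next_stream(quote(runif(1)))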

Other tasks:

shikokuchuo commented 1 year ago

Better yet, what if I can map a target's name string directly to a unique deterministic number of calls to parallel::nextRNGStream()?

Taking this idea - does it work if you simply store a hash table of the targets (names as keys) and assign each of them to an integer value? The order doesn't matter. If new targets are added, they are simply appended to the end of the hash table, adding to the integer sequence.

Then when the pipeline actually runs, you initialise the RNG with RNGkind("L'Ecuyer-CMRG"); set.seed(<user or default value>), take the size of the hash table, and pre-generate a list of all the required streams (random seeds) using parallel::nextRNGStream() in a loop. Then for each target you simply retrieve the random seed to use by looking up the hash value.
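
A rough sketch of that idea, with all names assumed:

RNGkind("L'Ecuyer-CMRG")
set.seed(1L)                        # user or default value
keys <- c("data", "model", "plot")  # target names in first-seen order
streams <- vector("list", length(keys))
streams[[1L]] <- .Random.seed
for (i in seq_along(keys)[-1L]) {
  streams[[i]] <- parallel::nextRNGStream(streams[[i - 1L]])
}
names(streams) <- keys
streams[["model"]] # the pre-generated stream (random seed) for target "model"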

wlandau commented 1 year ago

Taking this idea - does it work if you simply store a hash table of the targets (names as keys) and assign each of them to an integer value? The order doesn't matter. If new targets are added, they are simply appended to the end of the hash table, adding to the integer sequence.

If each successive target added gets the next integer in the sequence, then different runs of the pipeline could give different integers to the same targets: both the alphabetical order of target names and the topological order of the dependency graph amount to race conditions. Without a way to lock each target name to a unique deterministic stream, independently of all the other targets, it seems like things would fall back to https://github.com/ropensci/targets/issues/1139#issuecomment-1707064940.

shikokuchuo commented 1 year ago

Taking this idea - does it work if you simply store a hash table of the targets (names as keys) and assign each of them to an integer value? The order doesn't matter. If new targets are added, they are simply appended to the end of the hash table, adding to the integer sequence.

If each successive target added gets the next integer in the sequence, then different runs of the pipeline could give different integers to the same targets: both the alphabetical order of target names and the topological order of the dependency graph amount to race conditions. Without a way to lock each target name to a unique deterministic stream, independently of all the other targets, it seems like things would fall back to #1139 (comment).

I'm suggesting that the hash map is stored as metadata. So assuming the target names are unique, it is a case of looking up the value, and if the key doesn't exist, creating one. That way each target name will always be associated with the same stream (as the table is immutable, but appendable).

wlandau commented 1 year ago

I see. I do already plan to store the state of the stream for each target in the metadata, but I would rather not make guarantees about being able to recover it. Users can clear some or all of the metadata at any time.

shikokuchuo commented 1 year ago

Right. I think that's equivalent, but just storing the integer value may be not only more efficient but also more reproducible: if additional targets are added, they can be easily appended and the correct streams generated. Whereas if only the streams are stored, you would also need to separately mark the last-used stream in order to generate the next one, and so on. But this is just a suggestion based on your earlier observation - I don't profess to know all the intricacies involved.

shikokuchuo commented 1 year ago

FYI sections 6 and 7 of the parallel package vignette (on RNG and load-balancing) are a very clear and concise summary:

vignette("parallel", package = "parallel")

wlandau commented 1 year ago

Thanks for the reference, I find it much clearer than the L'Ecuyer papers I skimmed yesterday.

From Section 6 ("Random-number generation"):

When an R process is started up it takes the random-number seed from the object .Random.seed in a saved workspace or constructs one from the clock time and process ID when random-number generation is first used (see the help on RNG). Thus worker processes might get the same seed because a workspace containing .Random.seed was restored or the random number generator has been used before forking: otherwise these get a non-reproducible seed (but with very high probability a different seed for each worker).

The alternative is to set separate seeds for each worker process in some reproducible way from the seed in the master process. This is generally plenty safe enough, but there have been worries that the random-number streams in the workers might somehow get into step.

From the above, it seems like targets is not in a crisis like I initially feared. In fact, it actually has advantages over what most people seem to do. The set.seed() seed of each target is the digest::digest2int() of the target name. That means seeds are guaranteed to be:

  1. Reproducible.
  2. Unique.
  3. Different for each task, not each worker.

(3) in particular cuts RNG sequences short, which makes overlap less likely.

On top of all this, yesterday I had a really tough time trying to use parallel::nextRNGStream() in targets (https://github.com/ropensci/targets/tree/1139). It breaks so many subtle things, chief among them reproducibility: the same code with the same set of target names no longer produces the same results.

In the short term, I think it would be much better if I write a whole chapter in https://books.ropensci.org/targets about pseudo-random number generation. The default behavior is worth explaining in depth. In addition, I can offer a workaround: set tar_option_set(seed = NA) to prevent targets from setting seeds altogether, then run targets::tar_make(use_crew = TRUE) with crew to get widely-spaced L'Ecuyer streams automatically.

And I wonder if there is some way I can make it easy for targets users to test their pipeline seed configuration. I am thinking:

  1. Take the target seeds from the metadata.
  2. For each seed, run the RNG sequence with something really simple.
  3. Look for overlap.

For (2), I am wondering how best to simulate forward from the seed. (Maybe stats::runif()? Maybe something even more basic?) And I am wondering how far forward to go. 100K steps? 10M steps? An empirical approach could give assurance to individual users for specific pipelines, and it could help understand how big a problem overlapping streams actually is in real life. For all I know, this problem may be extremely rare in practical workflows.
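
For example, a minimal sketch of (1)-(3), assuming an existing pipeline with stored metadata and a placeholder horizon:

seeds <- targets::tar_meta(targets_only = TRUE)$seed # (1)
draws <- lapply(seeds, function(seed) {
  set.seed(seed)
  runif(1e5) # (2): simulate forward; how far to go is the open question
})
any(duplicated(unlist(draws))) # (3): a crude check for overlap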

shikokuchuo commented 1 year ago

Agree, shouldn't be a 'crisis' situation for targets. The streams implementation in R was based on L'Ecuyer 2002 - so the 'state of the art' is still 20 years old!

Highlighting this topic in the book is a sensible approach - and it would help to disseminate the ideas - it is completely by luck that I found the (very helpful) vignette!

If you do end up developing tests, I just refer you to the end of the examples on ?RNG:

sum(duplicated(runif(1e6))) # around 110 for default generator

as a good starting point.

wlandau commented 1 year ago

I wrote https://books.ropensci.org/targets/random.html, which was a long time coming anyway. Unless there is a non-sequential, non-recursive way to generate widely-spaced seeds for base R, this is all I can do about the issue of statistical independence in targets.

wlandau commented 1 year ago

https://github.com/DavisVaughan/furrr/issues/265#issuecomment-1731829172 spells out the biggest reason why recursively-defined RNG streams are not feasible in targets.

shikokuchuo commented 1 year ago

Your diagrams seem to make it clearer - when you add 'model', as long as you know (keep track of) what the last stream is, you can assign it the next stream - and keep going as new targets are added. I'm missing the reason for why the streams have to follow any particular order. The guarantee is just that they don't overlap.

Just thinking through a bit more, it does get messy potentially, but you could assign each target a monotonically increasing integer sequence, which would remain immutable even if targets get deleted. You'd probably want to actually store the seed for each target. But this would mean at least theoretically that you could reproduce with just the initial seed (and enough nextRNGStream() calls). If this metadata gets deleted, then you wouldn't be able to guarantee reproducibility.

wlandau commented 1 year ago

Your diagrams seem to make it clearer - when you add 'model', as long as you know (keep track of) what the last stream is, you can assign it the next stream - and keep going as new targets are added. I'm missing the reason for why the streams have to follow any particular order. The guarantee is just that they don't overlap.

targets is essentially GNU Make for R. When you compile a project in Make, if the time stamp of a source file is earlier than that of its object file, then Make skips the target. targets does the same thing, but for data analysis. To decide whether to skip or rerun a task, targets looks at the R code, the input data, the output data, other settings, and the RNG seed. If any of these things change, the target reruns. Otherwise, the pipeline skips the target to save time. So it's really important that a target be able to stick with the RNG state it chooses initially. With recursive L'Ecuyer streams, the RNG state would have to depend on some kind of ordering. Whatever that ordering is initially, it changes when the pipeline changes, and the effects on the RNG streams would cause tasks to rerun which should not need to rerun. Make sense?

Just thinking through a bit more, it does get messy potentially, but you could assign each target a monotonically increasing integer sequence, which would remain immutable even if targets get deleted. You'd probably want to actually store the seed for each target. But this would mean at least theoretically that you could reproduce with just the initial seed (and enough nextRNGStream() calls). If this metadata gets deleted, then you wouldn't be able to guarantee reproducibility.

As you mentioned here and in https://github.com/ropensci/targets/issues/1139#issuecomment-1707213462, this would rely on the metadata for reproducibility, and that metadata could be deleted. In fact, it is common for users to start over from scratch with targets::tar_destroy() and then expect reproducible results starting only with the code. So unfortunately that type of workaround is not feasible.

shikokuchuo commented 1 year ago

In fact, it is common for users to start over from scratch with targets::tar_destroy() and then expect reproducible results starting only with the code.

Thanks for the explanation @wlandau it does all make sense. What I was missing was the bit above. If targets is meant to be stateless then I would agree that RNG streams don't work.

What I was thinking of was effectively preserving the state after the first and each subsequent run. So you could literally repeat the experiment (with the recorded state) and therefore match the previous result. But that obviously precludes starting from scratch, as per the above.

wlandau commented 1 year ago

@HenrikBengtsson, I think we can continue part of our discussion from https://github.com/DavisVaughan/furrr/issues/265 here. We were talking about why recursively defined L'Ecuyer streams are not feasible for targets.

Regarding https://github.com/DavisVaughan/furrr/issues/265#issuecomment-1731993697: if only it were that simple.

A target is very different from an ordinary task because we need to know whether it is up to date. If a target's RNG stream changes from one run to the next, it is not up to date, which means the target needs to rerun. https://github.com/DavisVaughan/furrr/issues/265#issuecomment-1731829172 attempts to explain how that can happen in a non-sequential DAG.

Maybe your example from https://github.com/DavisVaughan/furrr/issues/265#issuecomment-1731993697 would be easier because it uses independent tasks. Say we start with 3 tasks. Here are the seeds:

RNGkind("L'Ecuyer-CMRG")
tasks <- c("task_a", "task_b", "task_c")
first_seed <- .Random.seed
seeds <- list()
seeds[[tasks[1L]]] <- first_seed
for (index in seq(2L, length(tasks))) {
  seeds[[tasks[index]]] <- parallel::nextRNGStream(seeds[[index - 1L]])
}
str(seeds)
#> List of 3
#>  $ task_a: int [1:7] 10407 -511557822 1908736475 498441056 -1632547871 1622487086 -726144297
#>  $ task_b: int [1:7] 10407 149605160 279562417 922798074 -295944740 -2095294832 757822457
#>  $ task_c: int [1:7] 10407 972714371 1811980947 954238892 2133507052 965840020 1200680439

Now suppose we insert a new task between task_a and task_b. The RNG stream for task_a stays the same, but task_b and task_c now have different streams. The new task gets the stream task_b had last time, task_b gets the stream task_c had last time, and task_c gets a new stream it never had before.

tasks <- c("task_a", "task_new", "task_b", "task_c")
seeds <- list()
seeds[[tasks[1L]]] <- first_seed
for (index in seq(2L, length(tasks))) {
  seeds[[tasks[index]]] <- parallel::nextRNGStream(seeds[[index - 1L]])
}
str(seeds)
#> List of 4
#>  $ task_a  : int [1:7] 10407 -511557822 1908736475 498441056 -1632547871 1622487086 -726144297
#>  $ task_new: int [1:7] 10407 149605160 279562417 922798074 -295944740 -2095294832 757822457
#>  $ task_b  : int [1:7] 10407 972714371 1811980947 954238892 2133507052 965840020 1200680439
#>  $ task_c  : int [1:7] 10407 -1556413327 -444875998 37935421 -1864914985 -341121962 -1229046830

In ordinary parallel computing, this is not a problem. You simply run the computation and note the seed each task ran with. But in targets, the change in the RNG seeds of task_b and task_c is highly disruptive. These tasks should have nothing to do with any of the other tasks, and yet bizarrely, they need to rerun because task_new was inserted. This is what I meant when I talked about inefficiency in https://github.com/DavisVaughan/furrr/issues/265#issuecomment-1731829172. It's not because of calling nextRNGStream(), that's actually quite fast. The inefficiency comes from rerunning task_b and task_c.

wlandau commented 1 year ago

In targets, we should have a DAG with independent tasks. That should mean any change to task_a or task_new, or the insertion of task_new, should not invalidate task_b or task_c. This is the DAG we want:

graph LR
  task_a
  task_new
  task_b
  task_c

But if we were to use recursive L'Ecuyer streams, that would induce dependency relationships we don't want:

graph LR
  task_a --> task_new
  task_new --> task_b
  task_b --> task_c

wlandau commented 1 year ago

In terms of how targets is used, here is how we would set up a pipeline with independent tasks (c.f. https://books.ropensci.org/targets/walkthrough.html). First, we create a _targets.R file, which is like a Makefile for data analysis tasks in R.

# _targets.R file:
library(targets)
list(
  tar_target(task_a, runif(1)),
  tar_target(task_b, runif(1)),
  tar_target(task_c, runif(1))
)

We inspect the graph and correctly see that the tasks are independent (there are no edges).

# R console:
library(targets)
tar_mermaid()
graph LR
  task_a
  task_b
  task_c

When we run the pipeline, the tasks run in topological order (with optional parallelism using integration with crew):

# R console
tar_make()
#> ▶ start target task_a
#> ● built target task_a [0 seconds]
#> ▶ start target task_b
#> ● built target task_b [0 seconds]
#> ▶ start target task_c
#> ● built target task_c [0 seconds]
#> ▶ end pipeline [0.476 seconds]

targets stores the result of each task and the RNG seed it used.

# R console
tar_read(task_c)
#> [1] 0.3329109
tar_meta(task_c, seed)
#> # A tibble: 1 × 2
#>   name         seed
#>   <chr>       <int>
#> 1 task_c -439103021

Next time we run the pipeline, all our tasks are up to date. That's good because nothing changed since the last run.

# R console:
tar_make()
#> ✔ skip target task_a
#> ✔ skip target task_b
#> ✔ skip target task_c
#> ✔ skip pipeline [0.593 seconds]

To insert a new task, we modify _targets.R:

# _targets.R file
library(targets)
list(
  tar_target(task_a, runif(1)),
  tar_target(task_a_new, runif(1)), 
  tar_target(task_b, runif(1)),
  tar_target(task_c, runif(1))
)

The new task task_a_new should be totally independent of the other tasks because it does not use the data from those tasks. This is the graph we see:

# R console
tar_mermaid()
graph LR
  task_a
  task_a_new
  task_b
  task_c

When we run the pipeline, only task_a_new runs because the others are up to date. That's good.

# R console:
tar_make()
#> ✔ skip target task_a
#> ▶ start target task_a_new
#> ● built target task_a_new [0 seconds]
#> ✔ skip target task_b
#> ✔ skip target task_c
#> ▶ end pipeline [0.458 seconds]

However: if targets were to assign recursive L'Ecuyer streams in topological order, the insertion of task_a_new would have changed the RNG streams of task_b and task_c, and we would have had to rerun them. This would not only cause a lot more computation for the user, it would also obscure the true dependency relationships among the targets in the pipeline, which would erode the user's high-level understanding of the pipeline as a whole.

# What *would* have happened with recursive L'Ecuyer streams for each target:
tar_make()
#> ✔ skip target task_a
#> ▶ start target task_a_new
#> ● built target task_a_new [0 seconds]
#> ▶ start target task_b
#> ● built target task_b [0 seconds]
#> ▶ start target task_c
#> ● built target task_c [0 seconds]
#> ▶ end pipeline [0.476 seconds]
wlandau commented 1 year ago

A workaround @shikokuchuo proposed in https://github.com/ropensci/targets/issues/1139#issuecomment-1731921850 is to cache the RNG seed of a target the first time it is assigned, then not modify it. The problem is that users often delete metadata and begin again with only the code. In other words, targets needs seeds to be reproducible and stateless, fixed in place using only the information given in the R code.

shikokuchuo commented 1 year ago

Advancing this conversation, I believe the desire for targets to be the 'GNU make for R' is somewhat of a red herring. I just note here that (i) GNU parallel make suffers its own problems and (ii) it is not expected to run statistical analyses in its parallel processes, where statistical validity is paramount.

If we were to be dogmatic that targets needs to be stateless in the way that GNU make (algorithm) is stateless (as far as I'm aware), then I would agree that this seems not to be reconcilable with recursively defined L'Ecuyer RNG streams.

However, from a practical viewpoint, I think we agree that the order of how streams are assigned is not important. Hence the proposal you mention above will guarantee that any result is reproducible ex post, having recorded the seeds used. This I imagine is important if results need to be verifiable (by re-running the computation) by a regulator etc.

The issue then is that if the code/data is run on another machine without access to the seeds, different ones may be assigned. However, from a statistical perspective, we should be indifferent here. Ex ante, any statistic of interest conditioned on the data should be identical to one conditioned on the data and a set of random seeds. Put another way, the seeds should have no relevant information content. That surely is the very definition of a PRNG with desirable properties, and if that's not the case, then whether we use widely-spaced seeds seems moot.

Hence, even if we can't guarantee the exact same results, the statistical validity of the 2 sets of results can be guaranteed. If some finding of statistical significance under one set of seeds does not hold under another then that in itself is problematic, is it not? Just to note, I am speaking as an economist rather than a statistician, so if some of my reasoning above is off, do feel free to point out!

From another angle, the actual seeds used in targets are generated based on the target names (hashed to int, I believe), but say we replicated the code and data on a different machine, but just named the targets differently. Sure the seeds assigned will be different, and any random draws will be different, but they would both be the same experiment and both sets of results would be equally valid. Ex ante, I should have no reason to prefer one over the other.

In the same way, if you run targets recording the seeds used, and make changes along the way, the history of runs can affect the actual seeds supplied, but there should be no reason to prefer an arbitrarily defined set of seeds (by order in a DAG etc.).

From yet another angle, you could define targets as requiring a set of seeds (pre-generating them as per Henrik), and then it would just be an input of the algorithm rather than it 'having state' per se, although this seems more just semantics. It seems more a question of explaining the above and the trade-offs involved.

But I do believe reproducibility would be preserved (in the most meaningful way), and statistical validity at the same time.

wlandau commented 1 year ago

You make really good points.

I can see this is going to haunt me until targets adopts recursively generated L'Ecuyer streams. I can see no other choice, so I am going to try again to implement it. But it will be disruptive for users, painful to implement, and an incredible loss for the package.

wlandau commented 1 year ago

Other comments:

  1. ?set.seed says L'Ecuyer "...exhibits 6 clear failures in each of the TestU01 Crush and the BigCrush suite (L'Ecuyer, 2007)." Can we trust an RNG algorithm for parallel workflows which we can't even trust in sequential ones?
  2. vignette("parallel", package = "parallel") says: "The alternative is to set separate seeds for each worker process in some reproducible way from the seed in the master process. This is generally safe enough, but there have been worries that the random-number streams in the workers might somehow get into step." It does not sound like the set.seed() approach is totally invalid or catastrophic, and "worries" sound like speculation to me. How often do you actually get overlapping streams in real life?
  3. The idea of nextRNGStream() is to advance the RNG far enough to decrease the chance of overlap. But every RNG is periodic, so after enough calls to nextRNGStream(), wouldn't the n'th "next stream" eventually land us back where we started? Figure 1 of https://pubsonline.informs.org/doi/10.1287/opre.50.6.1073.358:

[Image: Figure 1 of the paper, depicting the RNG period as a wheel partitioned into streams]

wlandau commented 1 year ago

Using @shikokuchuo's tip about empirically assessing duplication, here is how an example targets pipeline currently compares to an example L'Ecuyer workflow. It doesn't rule out the possibility of bias altogether, but it can show how much extra bias is in a targets pipeline due to the naive set.seed() approach:

RNGkind("Mersenne-Twister")
library(targets)
tar_script({
  list(
    tar_target(x, seq_len(99L)),
    tar_target(y, x, pattern = map(x))
  )
})
tar_make()
seeds <- tar_meta(targets_only = TRUE)$seed
draws <- purrr::map(seeds, \(seed) withr::with_seed(seed, runif(1e6)))
100 * mean(duplicated(unlist(draws))) # percent duplicated
#> [1] 1.165394

RNGkind("L'Ecuyer-CMRG")
draws <- list()
seed <- .Random.seed
for (i in seq_len(100)) {
  seed <- parallel::nextRNGStream(seed)
  .Random.seed <- seed
  draws[[i]] <- runif(1e6)
}
100 * mean(duplicated(unlist(draws))) # percent duplicated
#> [1] 1.154689

Update: I expanded https://books.ropensci.org/targets/random.html#measuring-independence to include comparisons like this one.

wlandau commented 1 year ago

Responding to some specific comments:

Hence, even if we can't guarantee the exact same results, the statistical validity of the 2 sets of results can be guaranteed. If some finding of statistical significance under one set of seeds does not hold under another then that in itself is problematic, is it not? Just to note, I am speaking as an economist rather than a statistician, so if some of my reasoning above is off, do feel free to point out!

It is still valuable to recapture and investigate anomalies, especially from a regulatory perspective. And this is much harder to do if reproducibility is only ex post. From the perspective of targets, that is the sort of damage that would come with recursive L'Ecuyer streams.

From another angle, the actual seeds used in targets are generated based on the target names (hashed to int, I believe), but say we replicated the code and data on a different machine, but just named the targets differently. Sure the seeds assigned will be different, and any random draws will be different, but they would both be the same experiment and both sets of results would be equally valid. Ex ante, I should have no reason to prefer one over the other.

Renaming a target changes its seed, but that doesn't matter for the kind of ex ante reproducibility we want for targets: if a regulator or journal reviewer is trying to reproduce a set of results, they will definitely use the original set of code and target names, but the ex post seeds may not be available if the original authors deleted the metadata.

In the same way, if you run targets recording the seeds used, and make changes along the way, the history of runs can affect the actual seeds supplied, but there should be no reason to prefer an arbitrarily defined set of seeds (by order in a DAG etc.).

That is why targets supports a configurable global seed. tar_option_set(seed = 123) will give a different set of reproducible results than tar_option_set(seed = 456). In addition, tar_option_set(seed = NA) avoids setting seeds altogether.

wlandau commented 1 year ago

Given https://github.com/ropensci/targets/issues/1139#issuecomment-1732570487, https://github.com/ropensci/targets/issues/1139#issuecomment-1732573814, and https://github.com/ropensci/targets/issues/1139#issuecomment-1732577944, I am going to pause implementation on this. From my perspective:

  1. L'Ecuyer-CMRG has its own problems.
  2. It is not clear just how problematic the naive set.seed() approach really is, especially in real life.
  3. Switching targets to recursive L'Ecuyer streams would significantly harm the package's ability to provide strong reproducibility guarantees.

wlandau commented 1 year ago

(https://github.com/ropensci/targets/commit/31b6733d333bd9bc07a7c3c029199bdaa55be9fc#commitcomment-128223702 summarizes the current state of https://github.com/ropensci/targets/tree/1139 and what would need to be done if continued).

wlandau commented 1 year ago

I will reopen this issue if it becomes clear that targets needs to change RNG strategies. A switch might or might not improve statistical validity, that part is unclear to me, but it would definitely hurt reproducibility.

In ordinary statistical workflows, you set a seed in your main top-level R script and expect that one seed to precisely determine everything that follows. With recursive L'Ecuyer streams, targets would break this expectation, unlike the ordinary map/reduce calls of the future ecosystem.

shikokuchuo commented 1 year ago

I think pausing is sensible. Also just to be clear I'm not advocating for any change in targets, just thinking through the implications.

I'll just leave a couple of observations for when you come back to this:

  1. It would not work to just leave the generated seeds as part of metadata that could be deleted. It has to be mandated that the seeds are retained. Implementation could range from documenting that results are not complete without a set of seeds, all the way to having targets programmatically write the seeds back into the script.

  2. Either a pipeline has never been run, in which case there are no seeds and any 2 people running it will get the same results; or a pipeline has been run, in which case a set of seeds has been generated. If this is shared with the other party, then they can also get the same results. If, after running a pipeline, the seeds are purposefully deleted, then we are back to the first case, and running without seeds will still get us the same results for both parties. Publishing a set of results without the seeds would be incomplete and not reproducible, and prohibited as per (1).

  3. Requiring the seeds explicitly just removes the dependency, nothing else. As an example: in current targets, the seeds depend on the target names. A legitimate reason for changing the names may be as simple as wishing to correct a typo. In this case this will cause a re-run even though this is not necessary. If the seed for each target were supplied then this would not have to be the case. Similarly, if using recursive L'Ecuyer streams, there is a dependency now on the ordering rather than the names. Explicitly requiring the seeds just removes this dependency.

wlandau commented 1 year ago

Thanks, @shikokuchuo. I like how you laid all that out.

A legitimate reason for changing the names may be as simple as wishing to correct a typo. In this case this will cause a re-run even though this is not necessary.

For this reason, in drake I tried and failed to implement a way to programmatically rename a target without changing its seed or causing it to rerun. It turned out to be hopelessly messy and infeasible for both myself and users.

On this point, I will also add that target-name-based seeds guarantee uniqueness (aside from the vanishingly small chance of hash collisions in digest::digest2int()), which is already more robust than the PID + clock-time method that most of the rest of the world currently uses. Also, a user trying to recapture or preserve an anomaly will not likely be renaming the affected targets. And if they follow good programming practices, the code will be under version control anyway, which means the RNG state can be easily recovered from the commit history.

In this case this will cause a re-run even though this is not necessary.

In targets, this would be true regardless of the seed policy, and my exploration of this issue was part of the drake experiment I mention above.

If the seed for each target were supplied then this would not have to be the case.

If the seed were independent of the target, then yes, after the target reruns the result would be the same.

Similarly, if using recursive L'Ecuyer streams, there is a dependency now on the ordering rather than the names. Explicitly requiring the seeds just removes this dependency.

I have resisted adding a seed argument to tar_target() which would allow this (or with some modification, maybe even require it). I have tried to shield users from this burden because as we have found it is a surprisingly hard topic. But anyone is free to manually call set.seed() within an individual task.

I will add: It looks like parallel::nextRNGStream() advances the state by 2^127 steps, and parallel::nextRNGSubStream() advances it by 2^76 steps. If it is somehow possible, efficient, and helpful to advance a fixed global seed by 2^127 * digest::digest2int(x = "target_name", seed = targets_global_seed), this would break the recursion and be a much more feasible proposition for targets. But I suspect this just adds more epicycles without achieving the same level of parallel safety as recursive streams.

shikokuchuo commented 1 year ago

Using @shikokuchuo's tip about empirically assessing duplication, here is how an example targets pipeline currently compares to an example L'Ecuyer workflow. It doesn't rule out the possibility of bias altogether, but it can show how much extra bias is in a targets pipeline due to the naive set.seed() approach: [...]

Just a note on this before I forget - I think you read too much into my suggestion, or you are attributing too much to me. I don't in fact think it works to try to measure this empirically in this way: it's a binary event, either there is overlap or there isn't. I don't think we're capturing anything but noise in the above.

shikokuchuo commented 1 year ago

2. vignette("parallel", package = "parallel") says: "The alternative is to set separate seeds for each worker process in some reproducible way from the seed in the master process. This is generally safe enough, but there have been worries that the random-number streams in the workers might somehow get into step." It does not sound like the set.seed() approach is totally invalid or catastrophic, and "worries" sound like speculation to me. How often do you actually get overlapping streams in real life?

This "generally safe enough" pre-supposes that the seeds set are themselves randomly-generated. See https://www.sciencedirect.com/science/article/pii/S0378475416300829 L'Ecuyer 2017.

Quote: "A single RNG with a “random” seed for each stream. When jumping ahead is too expensive and the RNG has a huge period, it can be reasonable to just select a (pseudo)random seed for each stream, using another RNG if we want reproducibility. "

Hence I think you'd need to pair digest::digest2int() with another RNG to generate the seed. Or replacing it with a cryptographic hash function could work - I'm sure some experts might disagree, but it would seem to be equivalent.

wlandau commented 1 year ago

Just a note on this before I forget - I think you read too much into my suggestion, or you are attributing too much to me.

Sorry about that. NB in https://books.ropensci.org/targets/random.html I make no attributions except to R core and L'Ecuyer.

See L'Ecuyer et al. (2017): https://www.sciencedirect.com/science/article/pii/S0378475416300829.

Thanks for the reference. It seems like the authors understand the kind of challenge targets faces:

There are situations where using one or more central monitor(s) to create the streams is not acceptable, e.g., because new streams have to be created dynamically and randomly within threads as a simulation goes on, without resorting to external information, and with a guarantee that the results are exactly reproducible [14,53]. In this setting, we want each stream to be able to create children streams by itself at any time, using a deterministic mechanism.

wlandau commented 1 year ago

Hence I think you'd need to pair digest::digest2int() with another RNG to generate the seed.

Unfortunately that's not feasible; it's the same issue as L'Ecuyer streams, where they all come from the same "central monitor".

Or replacing with a cryptographic hash function could work

That's definitely feasible to implement for targets. We would just need a nice way to map ordinary character hashes to 32-bit integers.

I'm sure some experts might disagree, but it would seem to be equivalent.

Would you elaborate? I am not sure I see the connection. A substantial justification would be extremely valuable.

shikokuchuo commented 1 year ago

Hence I think you'd need to pair digest::digest2int() with another RNG to generate the seed.

Unfortunately that's not feasible; it's the same issue as L'Ecuyer streams, where they all come from the same "central monitor".

I understood the paper to mean using the hash as the seed value to generate a random number (e.g. using MT), to then use as the L'Ecuyer seed. Everything is then deterministic and non-recursive (and it would seem feasible).
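
A hypothetical sketch of that reading, with the choice of intermediate generator assumed:

name <- "task_a"
hash <- digest::digest2int(name)          # deterministic, possibly 'correlated'
set.seed(hash, kind = "Mersenne-Twister") # intermediate RNG maps into 'random' space
seed <- sample.int(n = .Machine$integer.max, size = 1L)
set.seed(seed, kind = "L'Ecuyer-CMRG")    # deterministic, non-recursive L'Ecuyer seed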

Or replacing with a cryptographic hash function could work

That's definitely feasible to implement for targets. We would just need a nice way to map ordinary character hashes to 32-bit integers.

I'm sure some experts might disagree, but it would seem to be equivalent.

Would you elaborate? I am not sure I see the connection. A substantial justification would be extremely valuable.

The paper's 'possibly good enough' solution is to use a 'random' seed for each L'Ecuyer stream. Their suggestion is to use another RNG - so map from a deterministic seed into random space. Here you have a hash function, digest2int(), but no assurance of its quality, so let's call the output 'correlated'. One solution would be to use an RNG to map this 'correlated' space into 'uncorrelated' or 'random' space. However a cryptographic hash, by definition, already maps into 'random' or 'uncorrelated' space. Why I anticipate potential disagreement (from statistical quarters) is that there is no guarantee of desirable statistical properties for a cryptographic hash, but I don't believe that is important here, as we are not using it as an RNG and are solely interested in the mapping.

wlandau commented 1 year ago

Why I anticipate potential disagreement (from statistical quarters) is that there is no guarantee of desirable statistical properties for a cryptographic hash, but I don't believe that is important here, as we are not using it as an RNG and are solely interested in the mapping.

I agree: I took a closer look at the paper, and I don't think the statistical properties matter (i.e. approximating U(0, 1)) for the RNG that generates the seeds. In the subsection "A single RNG with a 'random' seed for each stream" (Section 4: "How to produce parallel streams and substreams"), when the authors argue that overlap is extremely unlikely, their argument relies only on the magnitude of the period, not on the statistical uniform-ness of that upstream RNG.

wlandau commented 1 year ago

So effectively, targets takes this approach:

A. For the output function g(), use a hash algorithm. This defines a counter-based RNG (Section 3: "Main classes of RNGs for simulation", last subsection: "Counter-based RNGs").
B. For the counter i in the counter-based RNG, use each target's own name instead of a recursive index.
C. Use g(i) as the target seed (Section 4: "How to produce parallel streams and substreams", subsection "A single RNG with a 'random' seed for each stream").
D. Within each target, use a multiple recursive generator (MRG) stream which starts with seed g(i).

Section 4 of the paper justifies (C) and (D).

For (B), the paper does not offer a rigorous justification. However, target name strings easily map to small integers, so I believe this is equivalent to taking small variable jumps ahead in the counter-based stream. As long as that counter-based RNG has a large period, the argument in Section 4 holds, and overlap is vanishingly unlikely in the long run.

(A) is the most challenging. We need the counter-based RNG to have a large period for the argument in Section 4 to hold. digest::digest2int() uses Jenkins' one-at-a-time hash, which is not cryptographic, and it is unclear how large the period is for digest::digest2int(i) as an RNG. Switching to a cryptographic hash may not be straightforward because:

  1. It would invalidate everyone's targets (which we could survive, but not without pain).
  2. Cryptographic hashes like MD5 are more than 32 bits, and as far as I know, set.seed() only accepts an ordinary 32-bit integer. So I would either need to find a 32-bit cryptographic hash, which may not exist, or map it to 32-bit space (e.g. with modular arithmetic or another hash), which may invalidate the hash's original cryptographic properties.

wlandau commented 1 year ago

Cryptographic hashes like MD5 are more than 32 bits, and as far as I know, set.seed() only accepts an ordinary 32-bit integer. So I would either need to find a 32-bit cryptographic hash, which may not exist, or map it to 32-bit space (e.g. with modular arithmetic or another hash), which may invalidate the hash's original cryptographic properties.

I suppose digest::digest2int(cryptographic_hash(target_name)) is fine since it at least goes through a cryptographic hash, which increases the period even if it does not increase the number of possible output values.
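
A minimal sketch of that composition (the specific cryptographic algorithm here is an assumption):

produce_seed <- function(name, global_seed = 0L) {
  hash <- digest::digest(name, algo = "sha512", serialize = FALSE) # cryptographic step
  digest::digest2int(x = hash, seed = global_seed)                 # map to 32-bit space
}
produce_seed("task_a")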

wlandau commented 1 year ago

Hmmm.... this might also slow down targets a lot.

library(digest)
library(microbenchmark)
name <- "a"
microbenchmark(
  int = digest2int(name),
  md5 = digest(name, algo = "md5"),
  sha256 = digest(name, algo = "sha256"),
  sha512 = digest(name, algo = "sha512"),
  mt = {
    set.seed(utf8ToInt(name))
    sample.int(n = .Machine$integer.max, size = 1L)
  }
)
#> Unit: nanoseconds
#>    expr   min      lq      mean  median      uq      max neval
#>     int   507   730.5   1105.23   811.5  1069.0    17599   100
#>     md5 17237 18389.0 133541.58 21435.0 24110.5 10555007   100
#>  sha256 19013 20269.5  24911.84 22746.0 24862.5   152663   100
#>  sha512 15633 17178.0  27703.10 19355.5 21550.5   406789   100
#>      mt 11186 13200.5  30920.48 13951.5 16907.0   475125   100

Created on 2023-10-12 with reprex v2.0.2

I know the nanonext hashes are a bit faster, but some users can't install NNG because their version of Ubuntu is too old, so I hesitate to make it a hard dependency.

wlandau commented 1 year ago

We don't need serialization, and that makes things 2x faster.

library(digest)
library(microbenchmark)

name <- "a"
microbenchmark(
  int = digest2int(name),
  md5 = digest(name, algo = "md5", serialize = FALSE),
  sha256 = digest(name, algo = "sha256", serialize = FALSE),
  sha512 = digest(name, algo = "sha512", serialize = FALSE),
  mt = {
    set.seed(utf8ToInt(name))
    sample.int(n = .Machine$integer.max, size = 1L)
  }
)
#> Unit: nanoseconds
#>    expr   min      lq     mean  median      uq    max neval
#>     int   493   635.5   894.52   693.0   817.0  19359   100
#>     md5 10345 10700.0 13325.78 10904.5 11450.5 234411   100
#>  sha256 12306 12677.0 13282.54 13033.5 13580.5  29117   100
#>  sha512  9043  9541.5 10088.13  9954.0 10444.5  13210   100
#>      mt 11100 12095.5 12707.84 12509.0 12898.5  29362   100

name <- "longlonglonglonglonglonglonglonglonglonglonglonglonglonglonglong"
microbenchmark(
  int = digest2int(name),
  md5 = digest(name, algo = "md5", serialize = FALSE),
  sha256 = digest(name, algo = "sha256", serialize = FALSE),
  sha512 = digest(name, algo = "sha512", serialize = FALSE),
  mt = {
    set.seed(utf8ToInt(name))
    sample.int(n = .Machine$integer.max, size = 1L)
  }
)
#> Unit: nanoseconds
#>    expr   min      lq     mean  median      uq   max neval
#>     int   558   681.0   777.75   747.5   807.5  3081   100
#>     md5 10466 10792.5 11365.36 11285.0 11840.0 13044   100
#>  sha256 12660 13078.5 13749.33 13440.0 14168.5 20253   100
#>  sha512  9021  9592.5 10440.70  9834.5 10391.0 48685   100
#>      mt 11662 12613.5 13318.57 13152.0 13564.0 33029   100

Created on 2023-10-12 with reprex v2.0.2

wlandau commented 1 year ago

From profiling #1168, the performance hit should be less than 1% in most pipelines, and it is not likely to rise above 3.5% even with cue = tar_cue(file = FALSE).

wlandau commented 1 year ago

On a Mac there is a bit more overhead: 2% with tar_cue(file = TRUE), 5.8% with tar_cue(file = FALSE). But still not terrible, and certainly not a bottleneck. The pipeline I profiled is the one below (lots of quick small targets), and I ran proffer::pprof(tar_outdated(callr_function = NULL)) on an up-to-date pipeline.

library(targets)
library(tarchetypes)
tar_option_set(cue = tar_cue(file = FALSE))
list(
  tar_target(x, 1:5000),
  tar_target(ordinary_name, x, pattern = map(x))
)

shikokuchuo commented 1 year ago

I know the nanonext hashes are a bit faster, but some users can't install NNG because their version of Ubuntu is too old, so I hesitate to make it a hard dependency.

How so? Ubuntu focal has NNG 1.5.2 which supports nanonext out of the box. Older versions of Ubuntu are already EOL, but in any case will have no trouble automatically building the libs from the bundled source.

You do realise that nanonext::sha1() is the same order of magnitude as your original digest2int(), so there would be no discernible difference.

But it's very good you've implemented a 'correct' solution in any case.

Also instead of hashing twice, have you considered just letting the value wrap round? The hash produces a hex value, which can be translated into an integer. If this overflows an R integer, then you treat it as if it were a C unsigned int, which cannot overflow but simply wraps around (defined in the standard), so UINT_MAX + 1 = 0.
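
A sketch of that mapping, parsing the low 32 bits in two 16-bit halves because strtoi() returns NA above .Machine$integer.max:

hash <- digest::digest("task_a", algo = "sha256", serialize = FALSE)
n <- nchar(hash)
hi <- strtoi(substr(hash, n - 7L, n - 4L), base = 16L) # top half of the low 32 bits
lo <- strtoi(substr(hash, n - 3L, n), base = 16L)      # bottom 16 bits
u <- hi * 65536 + lo                # the low 32 bits, as a C unsigned int would hold them
if (u > .Machine$integer.max) {
  u <- u - 4294967296               # wrap round into the signed range
}
seed <- as.integer(u)               # caveat: the rare value -2^31 is NA in R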

wlandau commented 1 year ago

How so? Ubuntu focal has NNG 1.5.2 which supports nanonext out of the box. Older versions of Ubuntu are already EOL, but in any case will have no trouble automatically building the libs from the bundled source.

A few months ago, @MilesMcBain reported having trouble with Ubuntu 18.04: https://github.com/ropensci/targets/discussions/1069. Even with the impetus to upgrade, I wonder how many users still run these older OS's.

You do realise that nanonext::sha1() is the same order of magnitude as your original digest2int(), so there would be no discernible difference.

Is SHA1 strong enough for this?

Also instead of hashing twice, have you considered just letting the value wrap round? The hash produces a hex value, which can be translated into an integer. If this overflows an R integer, then you treat it as if it were a C unsigned int, which cannot overflow but simply wraps around (defined in the standard), so UINT_MAX + 1 = 0.

I thought about it, but it feels like modular arithmetic, which essentially cuts off the most significant bits of the hash value. I don't know what cryptographic properties we lose if we just take part of the hash, so I thought it would be more rigorous to supply the entire hash to digest2int() instead. All this assumes set.seed() only takes 32-bit integers. I will investigate that part.

wlandau commented 1 year ago

Hmm... it turns out the seed in set.seed() needs to be a 32-bit integer. I will start an issue at https://github.com/HenrikBengtsson/Wishlist-for-R.