Closed dtolnay closed 3 years ago
I feel that this somewhat defeats the point of constrandom
which is to have a different value in every build. One option may be to use an environment variable to control the seed of the RNG, which would allow for reproducible builds.
Sure, an environment variable seed would be fine; I've added one. Combining it with the Span is still necessary though, because we don't want every const_random invocation in the same crate to emit the same number.
What I had in mind was to switch between span and getrandom depending on the presence of the environment variable, rather than making it a cfg
switch.
Ah okay. That approach would work less well for my use case. It requires the environment variable to be set during compilation of the crate containing the const_random invocation, whereas the approach in the PR can be set by passing a flag during compilation of const_random itself. With your approach, every Buck target that depends on const_random would need to remember to set up the right variable and nondeterminism can leak in if any forget, whereas with my approach we can enforce that no target in the monorepo is allowed to rely on const_random's nondeterminism.
@Amanieu
I feel that this somewhat defeats the point of constrandom which is to have a different value in every build
When it that useful? I can see wanting a different value per invocation site, and even (maybe) factoring in the compiler version to make it different across toolchain updates, but when is it useful from build to build?
If CONST_RANDOM_SEED
is changed and contains random data this would work well.
Is that actually easier though? It requires setting a env variable in a way that is deterministic for reproduction but also has entropy.
Perhaps there is a way to read more of the build environment. In the extreme, if the seed were the hash of all the files on the build path this would work well. Obviously that won't work exactly, but I wonder if there isn't some way to incorporate information, that will be consistent, but not predictable.
We specifically do not want entropy from the build environment.
For example, what about using args? If the build is being reproduced that should stay the same.
It contains the full path of all dependencies. Also a future version of may run proc macros in a separate process. If this is done, it may be quite possible that a non-deterministic path would be passed as argument. For example if it uses ipc-channel
.
Is the goal just to make it reproducible if invoked multiple times in identical situations, or is there some broader criteria? For example, what about using args?
The goal is to produce a consistent artifact whenever a build runs with identical inputs. Reproducible build systems like Bazel and Buck reason about all the "inputs" and "outputs" of every build step. Roughly speaking a rustc invocation takes one or more source files + some outputs of any dependencies + a runnable compiler, and produces some Rust artifact (.rmeta or .rlib or .so or binary, depending on what is being built).
One of many reasons that looking at args wouldn't serve this purpose is that the filesystem location of the compiler or linker shouldn't factor into the output. Running the same toolchain on the same inputs should produce the same artifact in different environments regardless of how the toolchain is being accessed.
@dtolnay I pushed a change to this branch, which I believe neatly resolves all of the above discussion. If this looks good to you, I'll merge and release.
Thank you! This mostly looks good. One thing that makes it not work so well for my use case is that it reads the environment variable during execution of the proc macro, rather than during compilation of const-random. That means we'd be relying on everyone to set it in all their Buck targets that depend on const-random, rather than being able to set it centrally in the one Buck target responsible for compiling the const-random crate. Using option_env!
instead of std::env
would solve this.
@dtolnay Done.
This crate has been causing mischief in my Buck-based codebase that uses distributed caching of build artifacts. In particular, distributed build caches incorporate an assumption that build outputs are a deterministic function of the inputs. When this is not the case, it can cause a mismatch between what we think should be in cache vs what can actually be retrieved from the cache by content hash.
I am interested in fixing this while still preserving the intent of the crate, rather than e.g.
return 4;
. So this PR introduces a cfg calleddeterministic_based_on_span
which causes the value emitted by const_random to be a deterministic hash based on the call site's proc_macro::Span.This is implemented using a rustc cfg rather than a Cargo feature because it isn't something that downstream code should be allowed to enable/disable. It is a property that the build environment decides.
Tested with:
which prints: