tkaitchuck / constrandom

Macro to generate random constants in Rust https://xkcd.com/221/
Apache License 2.0
73 stars 14 forks source link

Add option to randomize only based on span #13

Closed dtolnay closed 3 years ago

dtolnay commented 4 years ago

This crate has been causing mischief in my Buck-based codebase that uses distributed caching of build artifacts. In particular, distributed build caches incorporate an assumption that build outputs are a deterministic function of the inputs. When this is not the case, it can cause a mismatch between what we think should be in cache vs what can actually be retrieved from the cache by content hash.

I am interested in fixing this while still preserving the intent of the crate, rather than e.g. return 4;. So this PR introduces a cfg called deterministic_based_on_span which causes the value emitted by const_random to be a deterministic hash based on the call site's proc_macro::Span.

This is implemented using a rustc cfg rather than a Cargo feature because it isn't something that downstream code should be allowed to enable/disable. It is a property that the build environment decides.


Tested with:

use const_random::const_random;

fn main() {
    const RANDOM: (u32, u32) = (const_random!(u32), const_random!(u32));
    println!("{:?}", RANDOM);
}

which prints:

(3560837894, 4282915628)
Amanieu commented 4 years ago

I feel that this somewhat defeats the point of constrandom which is to have a different value in every build. One option may be to use an environment variable to control the seed of the RNG, which would allow for reproducible builds.

dtolnay commented 4 years ago

Sure, an environment variable seed would be fine; I've added one. Combining it with the Span is still necessary though, because we don't want every const_random invocation in the same crate to emit the same number.

Amanieu commented 4 years ago

What I had in mind was to switch between span and getrandom depending on the presence of the environment variable, rather than making it a cfg switch.

dtolnay commented 4 years ago

Ah okay. That approach would work less well for my use case. It requires the environment variable to be set during compilation of the crate containing the const_random invocation, whereas the approach in the PR can be set by passing a flag during compilation of const_random itself. With your approach, every Buck target that depends on const_random would need to remember to set up the right variable and nondeterminism can leak in if any forget, whereas with my approach we can enforce that no target in the monorepo is allowed to rely on const_random's nondeterminism.

jsgf commented 4 years ago

@Amanieu

I feel that this somewhat defeats the point of constrandom which is to have a different value in every build

When it that useful? I can see wanting a different value per invocation site, and even (maybe) factoring in the compiler version to make it different across toolchain updates, but when is it useful from build to build?

tkaitchuck commented 4 years ago

If CONST_RANDOM_SEED is changed and contains random data this would work well. Is that actually easier though? It requires setting a env variable in a way that is deterministic for reproduction but also has entropy. Perhaps there is a way to read more of the build environment. In the extreme, if the seed were the hash of all the files on the build path this would work well. Obviously that won't work exactly, but I wonder if there isn't some way to incorporate information, that will be consistent, but not predictable.

dtolnay commented 4 years ago

We specifically do not want entropy from the build environment.

dtolnay commented 4 years ago
bjorn3 commented 3 years ago

For example, what about using args? If the build is being reproduced that should stay the same.

It contains the full path of all dependencies. Also a future version of may run proc macros in a separate process. If this is done, it may be quite possible that a non-deterministic path would be passed as argument. For example if it uses ipc-channel.

dtolnay commented 3 years ago

Is the goal just to make it reproducible if invoked multiple times in identical situations, or is there some broader criteria? For example, what about using args?

The goal is to produce a consistent artifact whenever a build runs with identical inputs. Reproducible build systems like Bazel and Buck reason about all the "inputs" and "outputs" of every build step. Roughly speaking a rustc invocation takes one or more source files + some outputs of any dependencies + a runnable compiler, and produces some Rust artifact (.rmeta or .rlib or .so or binary, depending on what is being built).

One of many reasons that looking at args wouldn't serve this purpose is that the filesystem location of the compiler or linker shouldn't factor into the output. Running the same toolchain on the same inputs should produce the same artifact in different environments regardless of how the toolchain is being accessed.

tkaitchuck commented 3 years ago

@dtolnay I pushed a change to this branch, which I believe neatly resolves all of the above discussion. If this looks good to you, I'll merge and release.

dtolnay commented 3 years ago

Thank you! This mostly looks good. One thing that makes it not work so well for my use case is that it reads the environment variable during execution of the proc macro, rather than during compilation of const-random. That means we'd be relying on everyone to set it in all their Buck targets that depend on const-random, rather than being able to set it centrally in the one Buck target responsible for compiling the const-random crate. Using option_env! instead of std::env would solve this.

tkaitchuck commented 3 years ago

@dtolnay Done.