unisonweb / unison

A friendly programming language from the future
https://unison-lang.org
Other
5.69k stars, 265 forks

Deterministic way to pull codebase #4811

Open ceedubs opened 4 months ago

ceedubs commented 4 months ago

Tools like Bazel and Nix ensure reproducible builds by constraining IO at build time. One way that Nix enforces this (I assume Bazel too?) is by only allowing builds to perform network activity if the result has a fixed output hash. Unfortunately, a pull from Share does not result in a file with a fixed hash. I suspect that two culprits are timestamps (like in the reflog) and fetches happening in parallel, but for all I know it could be that SQLite is just completely incompatible with deterministic file hashes (unlike a git codebase).
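To make the fixed-output-hash constraint concrete, here is a minimal Python sketch of the idea, not Nix's or Bazel's actual implementation. The helper name `check_fixed_output` and the sample payload are hypothetical; the point is only that the fetched bytes are accepted when they match a hash declared ahead of time, which is why a byte-for-byte unstable codebase file cannot pass.

```python
import hashlib


def check_fixed_output(data: bytes, expected_sha256: str) -> bytes:
    """Accept fetched bytes only if they match the hash declared up front."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"hash mismatch: expected {expected_sha256}, got {actual}")
    return data


# Hypothetical artifact; in a real build the hash would be pinned in the build file.
payload = b"example artifact"
pinned = hashlib.sha256(payload).hexdigest()

check_fixed_output(payload, pinned)  # accepted: bytes match the pinned hash
# check_fixed_output(payload + b"\n", pinned) would raise ValueError,
# which is what a single changed timestamp in the codebase file amounts to.
```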

So far in Nix builds I have gotten around this by only saving the result of compile and not the whole codebase. But this isn't an ideal solution for a couple of reasons:

Some notes on the properties I care about:

Related (but more helpful for Docker than Bazel/Nix): #3892

Side note: it seems a bit ironic that this is hard in Unison, a language premised on code being content-addressed, when it comes for free(ish) in just about any language that uses text files and traditional source control 😬.

aryairani commented 3 months ago

Yeah this is a bummer. I was surprised to be reminded that git reflog doesn't include timestamps.

I did a basic sqlite test (create a table and add two rows), and that did produce identical results in two trials.
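For anyone who wants to repeat that kind of check, a rough Python equivalent might look like the sketch below. The table name, columns, and row values are made up for illustration and have nothing to do with the real codebase schema; it only builds the same tiny database twice and compares file hashes.

```python
import hashlib
import os
import sqlite3
import tempfile


def build_db_and_hash(rows):
    """Create a fresh SQLite file, insert rows in the given order, return its SHA-256."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "trial.sqlite")
        conn = sqlite3.connect(path)
        # Hypothetical schema, chosen only to have something to insert into.
        conn.execute("CREATE TABLE object (id INTEGER PRIMARY KEY, name TEXT)")
        conn.executemany("INSERT INTO object (name) VALUES (?)", [(r,) for r in rows])
        conn.commit()
        conn.close()
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()


first = build_db_and_hash(["alpha", "beta"])
second = build_db_and_hash(["alpha", "beta"])
print("identical files:", first == second)
```

This only covers the same statements run in the same order with the same SQLite version; it says nothing about what happens when the order of writes varies.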

It would be nice to know if anyone is using reflog timestamps. They seem nice, but I'm not sure I've used them. They also are a culprit in some nondeterministic transcript outputs, which cause CI to fail.

ceedubs commented 3 months ago

@aryairani is it really just reflog timestamps? I assumed that if I did a pull or clone it would fetch a bunch of stuff in parallel which would result in different orders of rows in my SQLite tables.
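A small Python sketch of that concern, under the assumption that rows get auto-assigned ids in arrival order (I have not checked how the real codebase tables are keyed). The table and the hash strings are hypothetical. Two "pulls" that receive the same entities in a different order end up as different files, while inserting in a canonical (sorted) order gives the same file.

```python
import hashlib
import os
import sqlite3
import tempfile


def pull_into_db(arrival_order):
    """Insert entity hashes in arrival order and return the SHA-256 of the db file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "codebase.sqlite")
        conn = sqlite3.connect(path)
        # Hypothetical table: ids are assigned by SQLite in insertion order.
        conn.execute("CREATE TABLE object (id INTEGER PRIMARY KEY, hash TEXT)")
        for h in arrival_order:
            conn.execute("INSERT INTO object (hash) VALUES (?)", (h,))
        conn.commit()
        conn.close()
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()


# Same three entities, different arrival order (as parallel fetches might produce).
run_one = ["a1", "b2", "c3"]
run_two = ["c3", "a1", "b2"]

print("arrival order:", pull_into_db(run_one) == pull_into_db(run_two))
print("sorted order: ", pull_into_db(sorted(run_one)) == pull_into_db(sorted(run_two)))
```

So even with timestamps out of the picture, the write order would have to be pinned down somehow, whether by fetching serially or by ordering writes before they hit the database.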

aryairani commented 3 months ago

@ceedubs I'm not sure about the parallel fetches, but I would guess that you're right.

I think that fetching stuff in parallel may not be that useful though, so we might consider turning the number of concurrent fetches down to 1 or something, which should then help.

Side note, I just talked to @rlmark who definitely uses the reflog timestamps.

aryairani commented 3 months ago

Related: https://reproducible-builds.org/docs/timestamps/
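That page recommends the `SOURCE_DATE_EPOCH` environment variable as the way for tools to pick up a fixed timestamp. A minimal Python sketch of how a tool could honor it is below; the function name `build_timestamp` is hypothetical, and this is not something UCM is known to support today.

```python
import os
import time


def build_timestamp(environ=os.environ) -> int:
    """Return SOURCE_DATE_EPOCH (Unix seconds) if set, otherwise the current time."""
    value = environ.get("SOURCE_DATE_EPOCH")
    if value is not None:
        return int(value)
    return int(time.time())


# With the variable pinned, every run records the same timestamp.
print(build_timestamp({"SOURCE_DATE_EPOCH": "1700000000"}))
```

A reflog entry written with a value like this would still carry a timestamp for people who use them, but it would be stable across rebuilds whenever the variable is set.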

aryairani commented 3 months ago

https://github.com/wolfcw/libfaketime