Open sxa opened 2 years ago
@joyeecheung do you have any thoughts/suggestions on the snapshot front?
@sxa I assume that it should be possible to update https://github.com/nodejs/node/blob/main/deps/v8/tools/gen-postmortem-metadata.py
so that it generates data in a deterministic manner (maybe sorting them at some point) and it's really just a matter of doing that versus there being some fundamental reason whey it might not be possible?
That would be my assumption yes. The previous issues and PRs suggest this was resolved, so we'd just need to understand what happened and potentially reimplement a fix for it.
Pinging @warpfork for this thread. He's currently heads-down on building Warpforge and has a passion for reproducible builds and I was telling him that there's interest in making progress on this front for Node.js so there might be some potential for collaboration and the production of some tooling to both generate reproducible builds and also document & provide mechanisms for people to do it for themselves without all the pain that usually comes from doing it manually.
So @sxa, meet the most excellent and clever @warpfork; maybe you two should chat.
Thanks @rvagg - I've worked on reproducible build environments for another project, and the Node.js build is close other than those two issues above, so it should just require a little bit of knowledge about the Node processes involved in producing those two things to be able to resolve it, then we can possibly put some more robust things in place for making our builds reproducible by default, which I think should be the aim here.
@warpfork DO you have much experience on non-Linux reproducible build work (I've so far only looked at Linux for Node.js)
1. The production of `debug-support.cc` is nondeterministic - the definitions are not always in the same order within the file. It is generated by https://github.com/nodejs/node/blob/main/deps/v8/tools/gen-postmortem-metadata.py - If I copy in a consistent one the builds can be reproducible.
One source of indeterminism is here:
Instead of a set a dict can be used here (with None
values), as since Python 3.7 dict preserves order. Then use for line in dict.fromkeys(lines):
to output the lines in order.
Is there a todo on this issue or can it be closed?
Definitely still work to do and I still plan to look at it.
Just tried four consecutive builds with the latest code and the debug-support.cc
came out the same each time. There have been four V8 version bumps since I last tested (10.1.124.6 ->11.3.244.4) so either I'm lucky today or something in there has made things better, but potentially the fix in the earlier comment will no longer required.
This was tested with ./configure --without-node-snapshot
and setting SOURCE_DATE_EPOCH=0
in the environment.
With snapshots enabled the builds are still non-reproducible. due to node_snapshot.cc
(Note that node_code_cache.cc
referred to in https://github.com/nodejs/node/issues/29108 is no longer a problem on the Linux system being used.
Looks like the break was between these two commits, so https://github.com/nodejs/node/commit/1faf6f459f likely made it non-reproducible again.
* 1faf6f459f 2019-04-21 | src: snapshot Environment upon instantiation (HEAD, refs/bisect/bad) [Joyee Cheung]
* f04538761f 2020-04-30 | tools: enable Node.js command line flags in node_mksnapshot (refs/bisect/good-f04538761f5bb3c334d3c8d16d093ac0916ff3bc) [Joyee Cheung]
FYI @joyeecheung (PR) @bnoordhuis (PR)
I don't think I have enough knowledge of this code to be able to propose a safe solution here so looking for advice.
@joyeecheung Would you be able to assist with the reproducibility of the snapshots now? I'm not sure who else might have good knowledge of the snapshot support in Node so anyone else might have to start from scratch.
Thanks for the ping, not sure how I missed the earlier one. I diff'ed the generated snapshot locally and found ~20 lines of differences in the snapshot.cc generated (with --predictable
, which we should also set for mksnapshot), most notably the binding data are initialized in different order, not sure why but they are supposed to be initialized in the same order. I'll look into why that happens.
With this patch the snapshot.cc difference is down to 13 lines. Still need to figure out the differences in the blob though.
Sounds like good progress - thanks @joyeecheung!
Checking the snapshots again, another source of indeterminism comes from performance data (milestones, time origin etc.), the easiest way to fix it is probably discarding it before snapshot generation. But I'll need to check if/how they should be synchronized.
With https://github.com/nodejs/node/pull/48702 and https://github.com/nodejs/node/pull/48708 and --predictable the differences are down to 8 places:
We probably need some V8 patches to make it deterministic (my guess is, v8 doesn't actually need to copy the exact values of those slots into the snapshot, they are only there as place holders, so v8 should probably just copy the same amount of 0s for those - still working locally to see if this is correct)
EDIT: we can just return some non-empty data for both BaseObject slots to fix 1. 2 probably need a V8 patch for us to customize how the slots should be serialized. Trying to work out a prototype.
I've put https://ci.nodejs.org/job/reproducibility-test/ in place to test how reproducible the builds are on Linux/aarch64 - it will run weekly:
--without-node-snapshot
and fail the job if they are not identicalnode_snapshot.o
to differ) or a red failure status if there are more differences.Thanks, I think when https://github.com/nodejs/node/pull/48851 lands, we could also add another flag to display the node_snapshot.cc
in a more human readable way, then the CI can just diff the two out/Release/gen/node_snapshot.cc
for a nicer output.
Thanks, I think when nodejs/node#48851 lands, we could also add another flag to display the
node_snapshot.cc
in a more human readable way, then the CI can just diff the twoout/Release/gen/node_snapshot.cc
for a nicer output.
Yeah that's probably more useful for the future :-)
Job updated to it will diff the .cc
file and store the output e.g. https://ci.nodejs.org/job/reproducibility-test/lastSuccessfulBuild/artifact/snapshot.cc-diff.log
I am adding a flag to make the diff output more readable (it would easier to calculate offsets of the differences in the data): https://github.com/nodejs/node/pull/49312
With the flag the diff output is currently
9c9
< static const char v8_snapshot_blob_data[] = {4,0,0,0,1,0,0,0,52,90,-119,-9,49,49,46,51,46,50,52,52,46,56,45,110,111,100,101,46,49,52,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 0
---
> static const char v8_snapshot_blob_data[] = {4,0,0,0,1,0,0,0,-125,90,-8,1,49,49,46,51,46,50,52,52,46,56,45,110,111,100,101,46,49,52,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 0
13318c13318
< 16,75,98,0,0,0,0,16,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,44,3,-39,3,-128,93,68,102,0,0,0,0,0,0,0,0,49,72,-16,-115,-31,32,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 13309
---
> 16,75,98,0,0,0,0,16,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,44,3,-39,3,-128,93,68,102,0,0,0,0,0,0,0,0,-55,17,-16,-115,-31,32,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 13309
13321c13321
< 0,0,0,40,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,44,-108,-128,93,68,102,0,0,0,0,0,0,0,0,-31,100, // 13312
---
> 0,0,0,40,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,44,-108,-128,93,68,102,0,0,0,0,0,0,0,0,73,98, // 13312
13963c13963
< -63,49,-89,112,47,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 13954
---
> -71,6,-16,-115,-31,32,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 13954
15751c15751
< 0,0,0,12,56,68,96,0,-66,2,56,1,0,0,0,68,71,68,71,98,0,0,0,0,0,0,0,0,-128,125,112,55,1,0,0,0,-28,80,-13,5,1,0,0,0,68,1,28,3,76,-128,93,5,-7,8,-107,6,-51,11,8,-15,9,1,28,-105, // 15742
---
> 0,0,0,12,56,68,96,0,-66,2,67,1,0,0,0,68,71,68,71,98,0,0,0,0,0,0,0,0,-128,125,112,66,1,0,0,0,36,-72,-59,3,1,0,0,0,68,1,28,3,76,-128,93,5,-7,8,-107,6,-51,11,8,-15,9,1,28,-105, // 15742
With https://github.com/nodejs/node/pull/48749 it's down to 2 differences (technically just one, the first one is the checksum being different), which comes from the Environment pointer in the context embedder data. I am still working on a V8 API to address this (we could also do a dumb workaround by setting it to nullptr temporarily before serialization, though)
9c9
< static const char v8_snapshot_blob_data[] = {4,0,0,0,1,0,0,0,-16,89,88,-80,49,49,46,51,46,50,52,52,46,56,45,110,111,100,101,46,49,52,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 0
---
> static const char v8_snapshot_blob_data[] = {4,0,0,0,1,0,0,0,45,91,5,18,49,49,46,51,46,50,52,52,46,56,45,110,111,100,101,46,49,52,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 0
15751c15751
< 0,0,0,12,56,68,96,0,68,-127,81,1,0,0,0,68,71,68,71,98,0,0,0,0,0,0,0,0,0,105,96,81,1,0,0,0,100,-112,39,2,1,0,0,0,68,1,28,3,76,-128,93,5,-7,8,-107,6,-51,11,8,-15,9,1,28,-105, // 15742
---
> 0,0,0,12,56,68,96,0,-118,3,29,1,0,0,0,68,71,68,71,98,0,0,0,0,0,0,0,0,-16,-98,-32,28,1,0,0,0,100,80,-100,6,1,0,0,0,68,1,28,3,76,-128,93,5,-7,8,-107,6,-51,11,8,-15,9,1,28,-105, // 15742
FYI, just landed the --write-snapshot-as-array-literals flag to the configure script
With https://github.com/nodejs/node/pull/50983 locally I can generate reproducible snapshots, though the binary still differ
I have been trying and failing to get any version of node to build bit-for-bit reproducible.
What is the most recent version of node anyone has managed to reproduce, with what exact build steps?
Most recently I attempted to build 20.11.1 as follows:
python configure.py \
--prefix=/usr \
--dest-cpu=x86_64 \
--shared-openssl \
--shared-zlib \
--write-snapshot-as-array-literals \
--openssl-use-def-ca-store
make BUILDTYPE=Release
Here is just a portion of the massive diff that I get:
I only succeeded with --without-node-snapshot
I had no luck with that. What version?
that was 18.18.2
Without https://github.com/nodejs/node/pull/50983 The snapshot would not be reproducible. The V8 patch just landed last week in the upstream, I plan to backport it soon once I fix a small regression the patch caused.
Just tried 18.18.2 with:
python configure.py \
--prefix=/usr \
--shared-openssl \
--shared-zlib \
--without-node-snapshot \
--without-npm \
--without-corepack \
--openssl-use-def-ca-store
make BUILDTYPE=Release
Here is the full Containerfile if anyone wants to test.
Still can't get a match, though I would note I am building against musl rather than glibc as with nix.
Saw some diffs were ICU related so I also just tried with --with-intl=system-icu
with similar results.
I have been trying and failing to get any version of node to build bit-for-bit reproducible. What is the most recent version of node anyone has managed to reproduce, with what exact build steps? Most recently I attempted to build 20.11.1 as follows:
I would suggest that with all of the configure options you're using (plus using a musl instead of glibc) it may take a bit of work to identify the source of your problems from this.
Similar to @haraldh I can confirm that the tip of the nodejs/node repository as of today on Linux/aarch64 can run two consecutive identical builds by setting SOURCE_DATE_EPOCH=0
and with no configure parameters other than --without-node-snapshot
. You may want to try with that first on your system, and then look at "bisecting" the other parameters you've got in your container file to identify whether any of them are causing your issues.
My tests were on Fedora 39 with gcc 13.2.1 but I've used earlier compilers when doing this in the past with the same results, and I have a CI job that does a regular check of the tip with gcc 9.3.0 on Ubuntu 20.04 which still seems to be working ok.
18.18.2 didn't produce identical builds for me, although --without-snapshot
on that release is giving a message WARNING: building --without-snapshot is no longer possible
so I suspect it didn't take effect.
The V8 patch just landed last week in the upstream, I plan to backport it soon once I fix a small regression the patch caused.
That's great news @joyeecheung! Sounds like we're really close now.
Okay so when I build 20.11.1 twice with only --without-node-snapshot
on a Ryzen 2990WX I still get a diff, but at least a tiny one of only a few characters.
Will try again with latest tip and see if that one difference is corrected and share if not.
I was able to conclude the big diffs I provided above, were only from cases where I am doing parallel builds with identical containerized build environments except for the CPU.
In one case I am building with 20 cores of a Threadripper 2990WX 32 core, and in another I was building with ~all 32 cores of a AMD EPYC 7502P 32 core.
In stagex (the deterministic Linux distro we are trying to add nodejs to) we have seen this behavior a few times in other packages, and was always because the compilation system was either making optimizations based on the CPU sub-architecture, or embedding information about the CPU or build environment into early headers.
In either case we end up with a huge shift in headers as per my diffs above.
Anyone more familiar with nodejs building have ideas on how I can maybe hardcode away any CPU-specific behavior or metadata beyond "generic x86_64"?
Yeah I get different output on different CPU types too (Most notably in the deps/base64
)
My .a file differences in a build between Intel and AMD based systems are in the following .a
files libnghttp2
libnode
libicuucx
libucii18n
libicutools
Interesting. My biggest differences above also all seemed ICU related. Did you try comparing with using system icu?
No I haven't tried that
I landed the regression fix for my V8 patch, looking into backporting the V8 patches and rebasing https://github.com/nodejs/node/pull/50983 - with the patches the snapshot part should be reproducible, though it seems there are some bits beyond snapshots that are not reproducible. (It seems https://github.com/nodejs/build/issues/3043#issuecomment-2002783252 matches my findings in https://github.com/nodejs/build/issues/3043#issuecomment-1834403032, I was using Ubuntu (don't remember the version now))
In https://github.com/nodejs/node/pull/50983 the snapshots are made reproducible with a test that diffs the snapshots passing in the CI. Would need some reviews to push it forward though.
With https://github.com/nodejs/node/pull/50983 landed on Ubuntu I am able to get identical builds with SOURCE_DATE_EPOCH=0
and ./configure --ninja
With nodejs/node#50983 landed on Ubuntu I am able to get identical builds with
SOURCE_DATE_EPOCH=0
and./configure --ninja
This is an awesome result @joyeecheung! Thank you! I've just done a test on Linux/aarch64 and confirmed that two consecutive builds on Ubuntu 22.04 come out identical on the same machine (I was running without --ninja
and without ccache
). I'll run a few extra tests but this feels like a great achievement and hopefully we can do these verifications continuously to trap if we get any regressions with it in the future. It looks like by reproducibility test job isn't running properly just now but I'll look at getting that working again.
It looks like by reproducibility test job isn't running properly just now but I'll look at getting that working again.
From a quick glance, the job needs to be using a newer compiler than gcc 9.4.0.
It looks like by reproducibility test job isn't running properly just now but I'll look at getting that working again.
From a quick glance, the job needs to be using a newer compiler than gcc 9.4.0.
Yep - sorted and the build is now good.
I've also verified that it's reproducible whether or not you use ccache
so that's good too (I'm aware of some issues seen at other projects with ccache enabled, so I needed to check it)
@joyeecheung Would it be possible to put these fixes back to earlier Node versions, or will the V8 levels in those not be easy to convert? It'll be a good story for the next major release regardless though!
For v22.x the changes should land cleanly or don't need too much work to backport. For 20.x, the necessary V8 API changes would be ABI breaking although I think it can still be backportable by either backporing the V8 API changes in a non-ABI breaking way (adding new signatures instead of appending parameters), or nulling out the pointers on our end before snapshot serialization (might be easier that way anyway?). For 18.x that might be harder since it's already quite different.
Are the reproducibility tests only comparing doing a build twice on identical hardware?
Node 22.4.0 is not reproducible in our testing on stagex, when building on two different ryzen machines with different specs.
See my comment here: https://github.com/nodejs/build/issues/3043#issuecomment-2002783252
I tried CFLAGS="-march=x86-64 -mtune=generic -O2"
to try to avoid any cpu-specific optimizations but it seems this was not enough.
Hmm, I am pretty sure it is related to the CPU features hash V8 adds to the code cache, which is encoded as a bit field including the following properties: https://chromium.googlesource.com/v8/v8/+/7c7c6b475c6edcd7ef863e1f668be3df55c56307/src/codegen/cpu-features.h#15 - so it's not strictly required that the hardware must be identical, but the differences need to be out of the probed set.
We also add the cached version tag (which includes this bitfield) to the custom snapshot blob, but for the built-in snapshot at least we can remove it since we don't check it for built-in snapshot anyway. But the ones in the code cache would be harder to get rid of as V8 has some concerns about skipping the CPU feature test (see the discussions in https://chromium-review.googlesource.com/c/v8/v8/+/4905290). I wonder if we can convince the upstream to change the bits set in the code cache to be a combination of "CPU features required by this code cache" instead of "CPU features of the platform that compiles this code cache" (the two aren't necessarily the same and at least for now, the former is basically "none" because V8 doesn't include any optimized code in the code cache, though they want to reserve the ability to make it CPU-specific. But at least making it more specific about the actual requirement instead of doing a blind equality check would help our use case).
Oh wait you are already using --without-node-snapshot
, then I am not quite sure what else is varying - though according to https://github.com/nodejs/build/issues/3043#issuecomment-2003790335 other than libnode it comes from libnghttp2
, libicuucx
and libucii18n
and libicutools
Confirmed I am finally able to build nodejs 22.7.0 reproducibly across multiple different systems/cpus:
https://codeberg.org/stagex/stagex/pulls/95/files
--without-node-snapshot is no longer an option in this release so it builds with it enabled, however the issue (as suspected earlier) was nodejs currently does not build some deps (icu, openssl, brotli, libev, nghttp2, c-ares) deterministically.
I had to package all those myself separately and get those deterministic, then have node build against the system versions instead of in-tree versions, but it works!
Are the reproducibility tests only comparing doing a build twice on identical hardware?
Yes. And it's done on aarch64 (chosen due to the fact we have machines with very high number of cores there which means it's feasible to run the test without ccache without them taking up too much machine time!)
nodejs currently does not build some deps (icu, openssl, brotli, libev, nghttp2, c-ares) deterministically.
That's useful to know - thank you. Although it's interesting that it's only showing up when building on different machines. Do you know if it's definitely a problem for all of those that you mentioned?
In the dim and distant past before global lockdowns there was some previous work to make the build process binary reproducible.
I had a play recently with the process (on Linux/aarch64 because they are the quickest machines I have access to) and I hit two issues:
debug-support.cc
is nondeterministic - the definitions are not always in the same order within the file. It is generated by https://github.com/nodejs/node/blob/main/deps/v8/tools/gen-postmortem-metadata.py - If I copy in a consistent one the builds can be reproducible.configure --without-node-snapshot
), which suggests that the changes made in https://github.com/nodejs/node/issues/29108 are no longer valid for the current release.Tagging @ChALkeR who put the original PR in (albeit three years ago!)