pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0

`archive` is not reproducible #18669

Open huonw opened 1 year ago

huonw commented 1 year ago

Describe the bug

The archive target's output can change at the byte level even when its inputs are identical. This appears to be due to changing timestamps, but there may be other factors too (e.g. file sort order, permissions, groups/owners).

This reproducer runs two builds of .zip and .tar archives, using a fixed file on disk. The problem may be worse when the inputs to archive are themselves generated (e.g. packages=[...] or the output of shell_command), but I haven't tested that in detail.

The --no-pantsd --no-local-cache args ensure that the archiving process definitely runs, as might happen when running on different machines (i.e. no shared cache).

cd $(mktemp -d)

cat > pants.toml <<EOF
[GLOBAL]
pants_version = "2.16.0rc0"

backend_packages = []

[anonymous-telemetry]
enabled = false
EOF

echo 1 > foo.txt

cat > BUILD <<EOF
file(name="file", source="foo.txt")
for format in ["zip", "tar"]:
    archive(name=format, format=format, files=[":file"])
EOF

pants --no-pantsd --no-local-cache package ::
cp -r dist first

sleep 3 # ensure the invocations are separated

pants --no-pantsd --no-local-cache package ::
cp -r dist second

# BUG: these are different
shasum first/* second/*

# BUG: for tar, it is specifically the (octal) timestamp that differs
diff -U3  <(hexdump -C first/tar.tar) <(hexdump -C second/tar.tar)

If I run that, I see (nondeterministic) output like:

9c0f0769e85cc4a6acc490f92ea32b79862a953a  first/tar.tar
669f4c618dac7c3590f392f9c5f23670731456d6  first/zip.zip
81b0b4dba3c99855642ce468500c89afa8c9f862  second/tar.tar
9be595741557454a409d243aa0bf84ac91f7b5e2  second/zip.zip
--- /dev/fd/11  2023-04-04 14:53:36.000000000 +1000
+++ /dev/fd/12  2023-04-04 14:53:36.000000000 +1000
@@ -4,7 +4,7 @@
 00000060  00 00 00 00 30 30 30 36  34 34 20 00 30 30 30 37  |....000644 .0007|
 00000070  36 35 20 00 30 30 30 30  32 34 20 00 30 30 30 30  |65 .000024 .0000|
 00000080  30 30 30 30 30 30 32 20  31 34 34 31 32 37 32 36  |0000002 14412726|
-00000090  35 30 30 20 30 31 32 35  30 37 00 20 30 00 00 00  |500 012507. 0...|
+00000090  35 31 30 20 30 31 32 35  31 30 00 20 30 00 00 00  |510 012510. 0...|
 000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
 *
 00000100  00 75 73 74 61 72 00 30  30 68 75 6f 6e 00 00 00  |.ustar.00huon...|
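
To make the diff above easier to read: in a ustar header the modification time is a 12-byte octal ASCII field at offset 136 (0x88), followed by the header checksum at offset 148. The sketch below is only an illustration, not part of Pants; `header_mtime` and `tiny_tar` are hypothetical helper names, and the two octal values are copied from the hexdump. It decodes that field and shows that two otherwise identical archives differ only in the mtime and checksum bytes.

```python
import io
import tarfile


def header_mtime(tar_bytes: bytes) -> int:
    """Decode the mtime field of the first 512-byte tar header."""
    # Offset 136, length 12: octal ASCII digits, padded/terminated by NUL or space.
    field = tar_bytes[136:148]
    return int(field.rstrip(b"\0 ").decode("ascii"), 8)


def tiny_tar(mtime: int) -> bytes:
    """Build a one-file tar in memory with the given mtime (hypothetical helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        data = b"1\n"
        info = tarfile.TarInfo("foo.txt")
        info.size = len(data)
        info.mtime = mtime
        tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# The two octal timestamps visible in the hexdump above.
first = tiny_tar(int("14412726500", 8))
second = tiny_tar(int("14412726510", 8))

# Offsets at which the two archives differ: only mtime (136..147) and checksum (148..155).
diff_offsets = [i for i, (a, b) in enumerate(zip(first, second)) if a != b]

print(header_mtime(first), header_mtime(second), diff_offsets)
```

The two decoded values are 8 seconds apart, which is consistent with the sleep 3 plus the time taken by the second build.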

Pants version 2.16.0rc0

OS macOS

Additional info N/A

huonw commented 1 year ago

https://tanzu.vmware.com/content/blog/barriers-to-deterministic-reproducible-zip-files suggests for zip files:

  1. touching everything on disk to a specific timestamp
  2. passing -X to the zip invocation
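
As a rough illustration of the same two ideas without the zip binary, the sketch below uses Python's zipfile module: every entry gets one fixed timestamp, and the entries carry no extra fields (the kind of metadata that -X strips). The names `build_zip` and `FIXED_DATE_TIME` are hypothetical, and this is a sketch under those assumptions rather than what Pants does.

```python
import io
import zipfile
from typing import Mapping

# Assumed fixed timestamp; 1980-01-01 is the earliest date the zip format can store.
FIXED_DATE_TIME = (1980, 1, 1, 0, 0, 0)


def build_zip(files: Mapping[str, bytes]) -> bytes:
    """Write files into a zip with normalized metadata and return its bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Sort names so the entry order does not depend on the caller.
        for name in sorted(files):
            info = zipfile.ZipInfo(filename=name, date_time=FIXED_DATE_TIME)
            info.create_system = 3  # pin "Unix" so the header is the same on every OS
            info.external_attr = 0o100644 << 16  # regular file, rw-r--r--
            # Entries are stored uncompressed (the default) to keep the sketch simple.
            zf.writestr(info, files[name])
    return buf.getvalue()


archive_a = build_zip({"foo.txt": b"1\n", "bar.txt": b"2\n"})
archive_b = build_zip({"bar.txt": b"2\n", "foo.txt": b"1\n"})
print(archive_a == archive_b)
```

Building each ZipInfo by hand, rather than calling ZipFile.write on a path, is what keeps the on-disk mtime and permissions out of the archive.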

https://reproducible-builds.org/docs/archives/ has some suggestions for tar, like --mtime..., but these are potentially GNU-only features that BSD tar (e.g. on macOS) does not support:

$ tar --sort=name \
      --mtime="@${SOURCE_DATE_EPOCH}" \
      --owner=0 --group=0 --numeric-owner \
      --pax-option=exthdr.name=%d/PaxHeaders/%f,delete=atime,delete=ctime \
      -cf product.tar build

It looks like bazel's pkg_tar and pkg_zip don't invoke the tar/zip system binaries, but rather have dedicated Python scripts that use https://docs.python.org/3/library/tarfile.html or https://docs.python.org/3/library/zipfile.html.
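
For a sense of what a tarfile-based script might look like, here is a minimal sketch that mirrors the GNU tar flags above (sorted names, fixed mtime, owner and group zeroed). It is not the bazel implementation; `build_tar` and its `mtime` parameter are hypothetical, with `mtime` standing in for SOURCE_DATE_EPOCH.

```python
import io
import tarfile
from typing import Mapping


def build_tar(files: Mapping[str, bytes], mtime: int = 0) -> bytes:
    """Write files into a tar with normalized metadata and return its bytes."""
    buf = io.BytesIO()
    # USTAR_FORMAT avoids pax extended headers entirely; it rejects very long names.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        for name in sorted(files):  # like --sort=name
            data = files[name]
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime  # like --mtime=@${SOURCE_DATE_EPOCH}
            info.mode = 0o644
            info.uid = info.gid = 0  # like --owner=0 --group=0
            info.uname = info.gname = ""  # like --numeric-owner
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


tar_a = build_tar({"foo.txt": b"1\n", "bar.txt": b"2\n"})
tar_b = build_tar({"bar.txt": b"2\n", "foo.txt": b"1\n"})
print(tar_a == tar_b)
```

Because the metadata comes from TarInfo objects built by hand, nothing from the local filesystem or the current user leaks into the header.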

thejcannon commented 1 year ago

We could definitely use the Python stdlib, now that we reliably have a Python interpreter available to rule code.

Alternatively, we could port this to Rust and use an intrinsic, which would likely be very fast.

huonw commented 1 year ago

For brainstorming options, another would be invoking an explicitly downloaded tool as a 'normal' external tool backend. This has advantages like being versioned independently from pants, so fixes can be pulled in more easily, similar to upgrading pex or other tools out-of-cycle. I don't know of any particular tool for this, though.

Summary of some options so far:

  1. carefully call system utilities to give the desired result, may require non-portable extensions
  2. packaged up Python script that's distributed with pants but called via the normal Process
  3. an ExternalTool with some handy pre-built reproducible archive maker binaries (i.e. replacement for the system tar and zip), that may or may not be highly optimised
  4. a Rust intrinsic
  5. wild idea: a distributed macro for 2 or 3 (e.g. 2 might be a macro something like python_source(source="./archive_builder.py"); adhoc_tool(runnable="./archive_builder.py", ...), and 3 might be something like file(name="archiver", source=http_source(...)); adhoc_tool(runnable=":archiver", ...))

thejcannon commented 1 year ago

For 5, I don't think we can assume all pants users use the Python backend 😉

tdyas commented 1 year ago

Will the code using Python stdlib be used in a script invoked from a Process?

Because with non-local environments (i.e., remote execution and Docker), the rules in question would otherwise have to download a potentially large blob to the local executor and then re-upload. Better to operate on large archives directly in the non-local execution environment.

thejcannon commented 1 year ago

With the current implementation, yes. Although they are currently part of the immutable_input_digests.

We could consider switching that to a named cache, but that'd take work and validation.