Open huonw opened 1 year ago
https://tanzu.vmware.com/content/blog/barriers-to-deterministic-reproducible-zip-files suggests, for zip files, adding -X to the zip invocation.
https://reproducible-builds.org/docs/archives/ has some suggestions for tar like --mtime ... but these are potentially GNU-only features, not supported by BSD tar (e.g. on macOS):
$ tar --sort=name \
--mtime="@${SOURCE_DATE_EPOCH}" \
--owner=0 --group=0 --numeric-owner \
--pax-option=exthdr.name=%d/PaxHeaders/%f,delete=atime,delete=ctime \
-cf product.tar build
It looks like bazel pkg_tar and pkg_zip don't invoke the tar/zip system binaries, but rather have dedicated Python scripts that use https://docs.python.org/3/library/tarfile.html or https://docs.python.org/3/library/zipfile.html.
We could definitely use the Python stdlib, now that we will reliably have a Python interpreter in rule code to run Python on.
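As a rough illustration of the stdlib route, here is a minimal sketch of a tarfile-based equivalent of the GNU tar flags above (sorted names, fixed mtime, zeroed owner/group). It is not code from Pants or Bazel; the function names build_reproducible_tar and _normalize are hypothetical, and the choice of GNU_FORMAT and of leaving file modes untouched are assumptions.

```python
import os
import tarfile


def _normalize(info: tarfile.TarInfo, mtime: int) -> tarfile.TarInfo:
    """Strip metadata that varies between machines or runs."""
    info.mtime = mtime              # like --mtime
    info.uid = info.gid = 0         # like --owner=0 --group=0
    info.uname = info.gname = ""    # like --numeric-owner
    return info


def build_reproducible_tar(src_dir: str, out_path: str, mtime: int = 0) -> None:
    """Write src_dir to an uncompressed tar with stable ordering and metadata."""
    entries = []
    for root, dirs, files in os.walk(src_dir):
        for name in dirs + files:
            entries.append(os.path.relpath(os.path.join(root, name), src_dir))

    # GNU_FORMAT is an assumption: it avoids pax extended headers entirely.
    with tarfile.open(out_path, "w", format=tarfile.GNU_FORMAT) as tar:
        # Sorting replaces --sort=name; os.walk order is filesystem dependent.
        for rel in sorted(entries):
            tar.add(
                os.path.join(src_dir, rel),
                arcname=rel,
                recursive=False,  # every entry is added explicitly, in order
                filter=lambda ti: _normalize(ti, mtime),
            )
```

File permission bits are still copied from disk in this sketch, so they would also need normalizing if they can differ between machines.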
Alternatively we could port to rust, and use an intrinsic, which will likely be very fast.
For brainstorming options, another would be invoking an explicitly downloaded tool as a 'normal' external tool backend. This gives advantages like being independently versioned from pants, so fixes can be pulled in more easily, similar to upgrading pex or other tools out-of-cycle. I don't know of any particular tool for this, though.
Summary of some options so far:
Process
ExternalTool with some handy pre-built reproducible archive maker binaries (i.e. replacement for the system tar and zip), that may or may not be highly optimised
python_source(source="./archive_builder.py"); adhoc_tool(runnable="./archive_builder.py", ...), and 3 might be something like file(name="archiver", source=http_source(...)); adhoc_tool(runnable=":archiver", ...))
For 5, I don't think we can assume all pants users use the Python backend 😉
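For concreteness, a script like the archive_builder.py named above might contain something along these lines for the zip case. This is only a sketch under assumptions: the function name build_reproducible_zip, the fixed 0o644 permissions and the 1980-01-01 timestamp are all hypothetical choices, not anything Pants does today.

```python
import os
import zipfile

# Earliest timestamp the zip format can represent; used for every entry.
FIXED_DATE_TIME = (1980, 1, 1, 0, 0, 0)


def build_reproducible_zip(src_dir: str, out_path: str) -> None:
    """Zip the files under src_dir with stable order, timestamps and modes."""
    files = []
    for root, _dirs, names in os.walk(src_dir):
        for name in names:
            rel = os.path.relpath(os.path.join(root, name), src_dir)
            # Zip entry names always use forward slashes.
            files.append(rel.replace(os.sep, "/"))

    with zipfile.ZipFile(out_path, "w") as zf:
        for arcname in sorted(files):  # os.walk order is not stable
            info = zipfile.ZipInfo(arcname, date_time=FIXED_DATE_TIME)
            # Assumption: one fixed mode for all files (regular file, rw-r--r--).
            info.external_attr = (0o100644 & 0xFFFF) << 16
            # Compressed bytes may still vary with the zlib version in use.
            info.compress_type = zipfile.ZIP_DEFLATED
            with open(os.path.join(src_dir, *arcname.split("/")), "rb") as f:
                zf.writestr(info, f.read())
```

Building each ZipInfo by hand, rather than calling ZipFile.write, means the on-disk mtime and mode never reach the archive. Reading each file fully into memory keeps the sketch short but would not suit very large inputs.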
Will the code using Python stdlib be used in a script invoked from a Process?
Because with non-local environments (i.e., remote execution and Docker), the rules in question would otherwise have to download a potentially large blob to the local executor and then re-upload. Better to operate on large archives directly in the non-local execution environment.
With the current implementation, yes. Although they are currently part of the immutable_input_digests.
We could consider switching that to a named cache, but that'd take work and validation.
Describe the bug
The archive target's output can change at a byte level even when its inputs are identical. This appears to be due to timestamps changing, but there may be other factors too (e.g. file sort order, permissions, groups/owners).
This reproducer runs two builds of .zip and .tar archives, using a fixed file on disk. This may be worse if generating inputs to archive too (e.g. packages=[...] or using the output of shell_command), but I haven't tested in detail.
The --no-pantsd --no-local-cache args ensure that the archiving process definitely runs, as might happen when running on different machines (i.e. no shared cache).
If I run that, I see (nondeterministic) output like:
Pants version 2.16.0rc0
OS macOS
Additional info N/A