Open marcosbc opened 1 year ago
Thanks for surfacing this issue. I had noticed that vendor.tar.gz changes on every run and would like to eliminate that if we have the necessary controls to do so. Considerations to investigate:
Is this limited to tarballs or all supported archive formats? I'll check, but expect that this is due to the behaviour of go mod vendor, see next item.
This may be a good time to mention that with the move to use libarchive, we support more archive formats than I think we need. If we had to choose, e.g. because special-case handling per format is needed for archive reproducibility, I would prioritize .tar.gz, .obscpio and .zstd.
go mod vendor does, at a minimum, write files with new time metadata on each run. We may not be able to do anything about that behaviour. There are no command options to do a dry run and discard the result if there are no changes:
go help mod vendor
usage: go mod vendor [-e] [-v] [-o outdir]
Vendor resets the main module's vendor directory to include all packages
needed to build and test all the main module's packages.
It does not include test code for vendored packages.
The -v flag causes vendor to print the names of vendored
modules and packages to standard error.
The -e flag causes vendor to attempt to proceed despite errors
encountered while loading packages.
The -o flag causes vendor to create the vendor directory at the given
path instead of "vendor". The go command can only use a vendor directory
named "vendor" within the module root directory, so this flag is
primarily useful for other tools.
See https://golang.org/ref/mod#go-mod-vendor for more about 'go mod vendor'.
FWIW, I would be very happy to see upstream go expose the module and vendoring machinery as a library. The functions are there, but the interface is not stable/public at this time.
Setting file times under vendor/ to the file times of go.mod is one idea that comes to mind. That would be stable when the source archive is stable.
This is not entirely straightforward to implement: the most common packaging pattern is to build the source archive via obs_scm / tar_scm. This works well across the git-centric Go ecosystem and the changelog generation is valuable to packagers. Thus the file times of go.mod may also change across repeated service runs. I'll check to confirm this behavior.
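The idea of aligning file times under vendor/ with those of go.mod can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not code from the service; the function name `align_mtimes` and its parameters are hypothetical. It only helps if the mtime of go.mod is itself stable across runs, which is exactly the open question above.

```python
import os


def align_mtimes(vendor_dir: str, reference_file: str) -> int:
    """Copy the mtime of reference_file onto every entry under vendor_dir.

    Hypothetical helper for illustration. Returns the number of entries touched.
    """
    ref_ns = os.stat(reference_file).st_mtime_ns
    touched = 0
    for root, dirs, files in os.walk(vendor_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                # Skip symlinks so we never modify a target outside the tree.
                continue
            os.utime(path, ns=(ref_ns, ref_ns))
            touched += 1
    # The top-level directory itself is not yielded as a child by os.walk.
    os.utime(vendor_dir, ns=(ref_ns, ref_ns))
    return touched + 1
```

A call such as `align_mtimes("vendor", "go.mod")` would run after go mod vendor and before the archive is created.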
We don't want to presume that checking any metadata under vendor/, e.g. modules.txt, would be a sufficient reason to reuse a previous version of the contents of vendor/. The correctness of the contents of vendor/ must take the highest priority. I'm willing to take whatever steps we can to normalize/quantize superfluous metadata, e.g. fine-grained file times on inputs to archive creation, if that helps and if we have those controls. I've not previously worked with libarchive to control these aspects of archive reproducibility. If it has applicable controls via function arguments, I'm open to using them to address this issue.
A strategy of preserving two directories, vendor-new/ and vendor/, then comparing them seems like it might have numerous failure modes or correctness errors. I'm open to ideas for strategies that can work reliably.
One way to handle that is to set the file times in the archive to the mtime of go.mod (or would it be better to use go.sum here?).
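As a rough illustration of that approach, the sketch below writes a .tar.gz whose bytes depend only on file names, modes and contents. It uses Python's standard tarfile and gzip modules rather than libarchive, which is an assumption made to keep the example self-contained; `write_reproducible_targz` is a hypothetical name, and the caller would pass the mtime of go.mod (or go.sum) as `mtime`.

```python
import gzip
import os
import tarfile


def write_reproducible_targz(src_dir: str, out_path: str, mtime: int) -> None:
    """Archive src_dir so that repeated runs give byte-identical output."""

    def normalize(info: tarfile.TarInfo) -> tarfile.TarInfo:
        # Replace per-run metadata with fixed values.
        info.mtime = int(mtime)
        info.uid = info.gid = 0
        info.uname = info.gname = ""
        return info

    src_dir = os.path.abspath(src_dir)
    parent = os.path.dirname(src_dir)

    # Collect every path and sort it, since directory listing order can vary.
    paths = [src_dir]
    for root, dirs, files in os.walk(src_dir):
        paths.extend(os.path.join(root, name) for name in dirs + files)

    with open(out_path, "wb") as raw:
        # filename="" and mtime=0 keep the output name and the current time
        # out of the gzip header.
        with gzip.GzipFile(filename="", mode="wb", fileobj=raw, mtime=0) as gz:
            with tarfile.open(fileobj=gz, mode="w", format=tarfile.GNU_FORMAT) as tar:
                for path in sorted(paths):
                    tar.add(
                        path,
                        arcname=os.path.relpath(path, parent),
                        recursive=False,
                        filter=normalize,
                    )
```

The gzip header is handled separately from the tar members because it carries its own timestamp and, optionally, the original file name.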
More things are needed for gz and obscpio, as these formats have other variants embedded:
--reproducible option)
I created #55 to do this, so it works for at least the .tar.gz, .tar.xz and .tar.zst formats.
Cpio is out of scope of this PR as it requires things that are not doable with libarchive as of today.
We are observing that tarballs generated by obs-service-go_modules are not idempotent, i.e. the archives are not bit-identical after execution even if there are no changes in the file contents, and therefore their checksums differ. For us, having idempotent archives is useful, as re-executing the source service with identical file contents would then avoid the file being stored again in our repositories.
For example, executing this service twice gives different results even if the file contents are the same:
In this example we were using the tarsum script (you can find it here) to calculate the checksum of each individual file inside the archive. As you can see, the checksums are identical in both cases, so the actual contents of the archive are the same.
Note that in other plugins such as obs-service-node_modules, this does not seem to happen, since a re-execution generates bit-identical archives.
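For readers who want to reproduce the comparison without the tarsum script, a per-member checksum can be approximated as below. This is a hypothetical stand-in written against Python's standard library, not the script linked above; `member_digests` is an illustrative name, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib
import tarfile


def member_digests(archive_path: str) -> dict:
    """Map each regular file in a tar archive to the SHA-256 of its contents."""
    digests = {}
    # "r:*" lets tarfile detect the compression (gz, xz, bz2 or none).
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            stream = tar.extractfile(member)
            digest = hashlib.sha256()
            for chunk in iter(lambda: stream.read(65536), b""):
                digest.update(chunk)
            digests[member.name] = digest.hexdigest()
    return digests
```

If `member_digests` returns equal dictionaries for two archives whose overall checksums differ, the difference lies in metadata such as timestamps, ownership, member order or compression headers rather than in file contents.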