spearce closed this issue 5 years ago
Downloading this URL: https://go.googlesource.com/net/+archive/d75b190.tar.gz?dummy=/golang-net-d75b190.tgz
always produces a different tar archive. Contents are the same, but tar metadata is
different. This prevents using such URLs as fingerprinted distfiles in FreeBSD.
It might be that https://go.googlesource.com uses a BSD tar that suffers from this bug:
libarchive/libarchive#623 (in that case the fix is switching to GNU tar).
Reported by None
on 2015-12-13 16:27:54
Thanks for filing this bug. It's not Gerrit, but Gitiles, that generates these archives (and even
then I think it's JGit's fault). It's certainly not Gerrit, so I'm going to move
this over to the Gitiles project, where it's more likely to get attention.
Reported by None
on 2015-12-14 07:03:42
FWIW, this makes it somewhat annoying to follow Bazel best practices when using http_archive
for external dependencies hosted on googlesource.com. Specifically, the best practices are to use http_archive rather than git_repository, and to include a sha256 checksum.
FYI, this is pretty unlikely to be fixed, since it would require breaking a public API inside JGit, which in turn requires a major version bump, and that doesn't happen often.
The best practice to get files from Git is to use the Git wire protocol to git clone the repository. If only a single version is needed, use --depth 1 to get a shallow clone.
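A minimal local sketch of that shallow-clone workflow. The throwaway repository created under mktemp stands in for a real remote; the file:// form is needed so the clone goes through the fetch transport rather than the local-path shortcut, which ignores --depth.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# A source repo with two commits stands in for the real remote.
git init -q src
git -C src -c user.email=a@b -c user.name=a commit -q --allow-empty -m one
git -C src -c user.email=a@b -c user.name=a commit -q --allow-empty -m two

# --depth 1 fetches only the most recent commit.
git clone -q --depth 1 "file://$tmp/src" dst
git -C dst rev-list --count HEAD   # prints 1
```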
If Bazel doesn't want to use Git to fetch source files from Git, then best practice should be to export the files as a tarball and store that tarball in another, non-Git persistent location where the exact bytes of that stream are unlikely to change.
Attempting to checksum a dynamically created .tar.bz2 or .tar.gz stream is not a good idea, as the compressor can change over time and produce different compressed stream results that still inflate to the same original files.
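A local illustration of that point, assuming GNU gzip and sha256sum are available: compressing the same bytes with two different input mtimes yields two different .gz streams (gzip records the input's mtime in its header), while the decompressed content, and therefore its checksum, stays identical.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
printf 'same content\n' > f

# Compress the identical file twice, with different mtimes.
touch -t 202001010000 f
gzip -c f > a.gz
touch -t 202001020000 f
gzip -c f > b.gz

# The compressed streams differ...
cmp -s a.gz b.gz || echo "compressed streams differ"

# ...but both inflate to the same bytes, so checksum those instead.
gunzip -c a.gz | sha256sum
gunzip -c b.gz | sha256sum   # same digest as a.gz
```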
Bazel can use git directly, but it doesn't support shallow clones and therefore unnecessarily fetches all of the history for a repo. Their suggestion is to use http_archive to fetch a tarball for this use case.
IMHO there should be a feature request against Bazel to support shallow clones. It should be trivial to add the --shallow flag.
As Shawn says, using a dynamically generated compressed file is still a bad idea for this use case. Even if we fix JGit/Gitiles to generate a deterministic sequence of bytes at a given server version, we have no way to ensure that the given sequence of bytes remains deterministic across server versions.
We may depend on the JDK's zlib implementation for compressing objects, and there is no guarantee that that implementation is going to always produce the same byte sequence across JDK versions. Similarly, we use Apache Commons Compress for generating the archives, and we have no guarantee that a given list of archive entries is always going to contain the same bytes of metadata even if the compressed content is the same.
The upshot is that callers really should not depend on the sequence of bytes in an archive being stable in the long term, which is what the Bazel use case is asking for.
You could write a custom repository rule that runs a git clone/fetch of a specific revision to implement Shawn's suggestion. Beyond fixing the direct issue, I think that would also be a good direction for Bazel to take, so Bazel can stop depending on JGit.
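One sketch of what such a rule could execute: a shallow fetch of a single, pinned revision. It is demonstrated here against a throwaway local repository; the path and revision are placeholders, and a real server must permit fetching by commit SHA (e.g. uploadpack.allowAnySHA1InWant).

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Throwaway "remote" with two commits; record the first commit's SHA.
git init -q src
git -C src -c user.email=a@b -c user.name=a commit -q --allow-empty -m one
rev=$(git -C src rev-parse HEAD)
git -C src -c user.email=a@b -c user.name=a commit -q --allow-empty -m two

# Allow fetching an arbitrary SHA from this test remote.
git -C src config uploadpack.allowAnySHA1InWant true

# Fetch exactly that revision, depth 1, and check it out.
git init -q dst
git -C dst remote add origin "file://$tmp/src"
git -C dst fetch -q --depth 1 origin "$rev"
git -C dst checkout -q FETCH_HEAD
git -C dst rev-parse HEAD   # the pinned revision
```

This gets the single-version download without any history, and without depending on a dynamically generated archive.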
This is now tracked as: [1]. The change under review is: [2].
Thanks @msohn, it is fixed now, as of JGit 5.1.9.
@dborowitz, @jrn, @hanwen Can this be closed?
Has this been deployed to googlesource.com?
$ curl -s https://boringssl.googlesource.com/boringssl/+archive/ae223d6138807a13006342edfeef32e813246b39.tar.gz | shasum
470f928f1c27777450b35cc6bf7cdce604ffe9af -
$ curl -s https://boringssl.googlesource.com/boringssl/+archive/ae223d6138807a13006342edfeef32e813246b39.tar.gz | shasum
ec8cd3acabbc7ff12df97064248823be0372a869 -
Unfortunately, it has not, and it doesn't seem like it will be :/
Whom do we need to contact to get that fixed?
googlesource.com runs JGit from master, so if this is still non-deterministic, something else is going on.
> if this is still non-deterministic
It is: note different Content-Length on different runs of trying to fetch the same commit:
$ curl -I 'https://chromium.googlesource.com/chromium/tools/depot_tools/+archive/5664586374b9a80af397354523e93b9ef9333f16.tar.gz'
HTTP/1.1 200 OK
Cache-Control: private, max-age=7200, stale-while-revalidate=604800
Content-Disposition: attachment; filename=depot_tools-5664586374b9a80af397354523e93b9ef9333f16.tar.gz
Content-Length: 1669011
Content-Security-Policy-Report-Only: script-src 'nonce-LMJfW5Qngj9T28V+Qzc5dw' 'unsafe-inline' 'strict-dynamic' https: http: 'unsafe-eval';object-src 'none';base-uri 'self';report-uri https://csp.withgoogle.com/csp/gerritcodereview/1
Content-Type: application/x-gzip
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Xss-Protection: 0
Date: Thu, 24 Sep 2020 15:29:01 GMT
Alt-Svc: h3-Q050=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
$ curl -I 'https://chromium.googlesource.com/chromium/tools/depot_tools/+archive/5664586374b9a80af397354523e93b9ef9333f16.tar.gz'
HTTP/1.1 200 OK
Cache-Control: private, max-age=7200, stale-while-revalidate=604800
Content-Disposition: attachment; filename=depot_tools-5664586374b9a80af397354523e93b9ef9333f16.tar.gz
Content-Length: 1668975
Content-Security-Policy-Report-Only: script-src 'nonce-IbkxLKtQPmSfur5zBvL4lg' 'unsafe-inline' 'strict-dynamic' https: http: 'unsafe-eval';object-src 'none';base-uri 'self';report-uri https://csp.withgoogle.com/csp/gerritcodereview/1
Content-Type: application/x-gzip
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Xss-Protection: 0
Date: Thu, 24 Sep 2020 15:29:05 GMT
Alt-Svc: h3-Q050=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
This is still happening.
> Attempting to checksum a dynamically created .tar.bz2 or .tar.gz stream is not a good idea, as the compressor can change over time and produce different compressed stream results that still inflate to the same original files.
This has been true from the start. Unless we
a. Store the tarball when a user downloads it (this is what GitHub does), or
b. Keep around historical versions of commons-compress and record which one was used to produce the tarball
we cannot make a long term deterministic tarball download. All the requests I have seen are for use cases that require long term determinism. In that spirit, it would be misleading to pretend we intend to provide that; it is expensive to do and not part of what Gitiles is meant for.
If you don't need determinism, you can use the Gitiles tarball. If you do need determinism, I recommend storing the tarball somewhere (e.g. a cloud storage provider or an ftp host).
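A small sketch of that pin-and-verify pattern, assuming sha256sum from GNU coreutils. The tarball here is generated locally as a stand-in for one exported once and uploaded to stable storage; once the stored bytes never change, the recorded checksum stays valid.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for the one-time export of the source tree.
mkdir proj
printf 'hello\n' > proj/file.txt
tar -czf proj.tar.gz proj

# Record the checksum once, at export time...
sha256sum proj.tar.gz > proj.tar.gz.sha256

# ...and verify it on every later download of the stored bytes.
sha256sum -c proj.tar.gz.sha256   # prints proj.tar.gz: OK
```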
(a) Can we make this a hosting config option? I get that storing archives for every project and every commit is a ton of space and would be pretty wasteful (especially if crawlers fire). I wonder if a middle ground of doing it only for tags would work.
(b) How big of a problem is this approach? Gitiles doesn't seem to change that much (for better or worse). What if we did this? Not entirely unrelated, but the gzip project has an rsync option so compressed files are stable and easy to transfer.
I'm still seeing the timestamp in the tar metadata when downloading from googlesource.com. So this is not yet resolved. It looks like it was already fixed in JGit. I added more info in #217
Why is this issue closed? The problem was never fixed. Please reopen.
Originally reported on Google Code with ID 92