google / gitiles

A simple browser for Git repositories.
https://gerrit.googlesource.com/gitiles/

Gerrit tarballs for the base packages aren't deterministic #84

Closed spearce closed 5 years ago

spearce commented 7 years ago

Originally reported on Google Code with ID 92

Downloading this URL: https://go.googlesource.com/net/+archive/d75b190.tar.gz?dummy=/golang-net-d75b190.tgz
always produces a different tar archive. Contents are the same, but tar metadata is
different. This prevents using such URLs as fingerprinted distfiles in FreeBSD.

It might be that https://go.googlesource.com uses BSD tar that suffers from this bug:
libarchive/libarchive#623 (in this case the fix is switching to GNU tar).

Reported by None on 2015-12-13 16:27:54

spearce commented 7 years ago
Thanks for filing this bug. It's not Gerrit but Gitiles that generates these archives (and even then I think it's JGit's fault). It's certainly not Gerrit, so I'm going to move this over to the Gitiles project, where it's more likely to get attention.

Reported by None on 2015-12-14 07:03:42

shahms commented 7 years ago

FWIW, this makes it somewhat annoying to follow Bazel best practices when using http_archive for external dependencies hosted on googlesource.com. Specifically, the best practices are to use http_archive rather than git_repository and to include a sha256sum.

spearce commented 7 years ago

FYI, this is pretty unlikely to be fixed, since it would require breaking a public API inside JGit, and that needs a major version bump, which doesn't happen often.

The best practice to get files from Git is to use the Git wire protocol to git clone the repository. If only a single version is needed, use --depth 1 to get a shallow clone.
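The shallow-clone approach can be sketched as follows; this uses a throwaway local repository as a stand-in for the real remote, so the clone command is the only part you would change:

```shell
# Sketch: fetch a single revision with a shallow clone instead of a tarball.
# The local "remote-repo" stands in for a real remote URL.
set -e
git init -q remote-repo
git -C remote-repo -c user.email=a@b -c user.name=a commit -q --allow-empty -m one
git -C remote-repo -c user.email=a@b -c user.name=a commit -q --allow-empty -m two
# --depth 1 transfers only the tip commit, not the full history.
git clone -q --depth 1 "file://$PWD/remote-repo" shallow-copy
git -C shallow-copy rev-list --count HEAD   # prints 1
```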

If Bazel doesn't want to use Git to fetch source files from Git, then best practice should be to export the files as a tarball and store that tarball in another, non-Git persistent location where the exact bytes of that stream are unlikely to change.
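One way to produce such an export is `git archive`: unlike a tarball generated on the fly by a server, its uncompressed tar output for a given commit is stable under the same git version (entry mtimes come from the commit timestamp), so the exported bytes can then be parked in stable hosting. A minimal sketch:

```shell
# Sketch: export a commit once with git archive, then store that exact stream.
set -e
git init -q repo
git -C repo -c user.email=a@b -c user.name=a commit -q --allow-empty -m init
# Two exports of the same commit with the same git version are byte-identical.
git -C repo archive --format=tar HEAD > export1.tar
git -C repo archive --format=tar HEAD > export2.tar
cmp export1.tar export2.tar && echo "identical"
# export1.tar is what you would upload to persistent, non-Git storage.
```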

Attempting to checksum a dynamically created .tar.bz2 or .tar.gz stream is not a good idea, as the compressor can change over time and produce different compressed stream results that still inflate to the same original files.
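The compression layer alone is enough to break checksums: the gzip container embeds the input's modification time, so compressing identical bytes at different times yields different streams. A small demonstration (GNU gzip's `-n` omits the timestamp and name, making output repeatable):

```shell
# Sketch: identical content, different gzip header mtimes, different archives.
set -e
printf 'same content\n' > a.txt
touch -t 202001010000 a.txt
gzip -c a.txt > one.gz
touch -t 202101010000 a.txt          # bump the mtime stored in the gzip header
gzip -c a.txt > two.gz
cmp -s one.gz two.gz || echo "archives differ"
# With -n the timestamp is omitted and the output is stable:
gzip -nc a.txt > three.gz
gzip -nc a.txt > four.gz
cmp -s three.gz four.gz && echo "with -n: identical"
```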

shahms commented 7 years ago

Bazel can use git directly, but it doesn't support shallow clones and therefore unnecessarily fetches all of the history for a repo. Their suggestion is to use http_archive to fetch a tarball for this use case.
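The Bazel pattern in question looks roughly like the following WORKSPACE fragment (URLs and checksum are hypothetical); the `sha256` pin is exactly what a dynamically generated gitiles archive breaks, which is why a mirrored copy is needed:

```python
# WORKSPACE sketch (hypothetical URL and checksum): pin a mirrored tarball
# rather than a gitiles +archive URL, whose bytes can change between fetches.
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "golang_net",
    # A copy exported once and stored in stable hosting.
    urls = ["https://example.com/mirror/golang-net-d75b190.tar.gz"],
    sha256 = "<checksum of the mirrored copy>",
)
```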

dborowitz commented 7 years ago

IMHO there should be a feature request against Bazel to support shallow clones. It should be trivial to add a shallow-clone option.

As Shawn says, using a dynamically generated compressed file is still a bad idea for this use case. Even if we fix JGit/Gitiles to generate a deterministic sequence of bytes at a given server version, we have no way to ensure that the given sequence of bytes remains deterministic across server versions.

We may depend on the JDK's zlib implementation for compressing objects, and there is no guarantee that that implementation is going to always produce the same byte sequence across JDK versions. Similarly, we use Apache Commons Compress for generating the archives, and we have no guarantee that a given list of archive entries is always going to contain the same bytes of metadata even if the compressed content is the same. The upshot is that callers really should not depend on the sequence of bytes in an archive being stable in the long term, which is what the Bazel use case is asking for.

hanwen commented 7 years ago

You could write a custom repository rule that runs a git clone/fetch of a specific revision to implement Shawn's suggestion. Beyond fixing the direct issue, I think that would also be a good direction for Bazel to take, so Bazel can stop depending on JGit.
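Such a rule might look like the sketch below (the rule name and attributes are hypothetical); it shells out to plain git via `repository_ctx.execute`, shallow-fetching exactly one revision, so neither a full clone nor a dynamic tarball is needed:

```python
# Sketch of a custom Bazel repository rule (hypothetical name/attrs) that
# shallow-fetches one commit with the system git.
def _git_revision_impl(ctx):
    ctx.execute(["git", "init", "."])
    # --depth 1 fetches only the requested commit.
    ctx.execute(["git", "fetch", "--depth", "1", ctx.attr.remote, ctx.attr.commit])
    ctx.execute(["git", "checkout", "--detach", "FETCH_HEAD"])

git_revision = repository_rule(
    implementation = _git_revision_impl,
    attrs = {
        "remote": attr.string(mandatory = True),
        "commit": attr.string(mandatory = True),
    },
)
```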

davido commented 5 years ago

This is now tracked as: [1]. The change under review is: [2].

davido commented 5 years ago

Thanks @msohn, it is fixed now, as of JGit 5.1.9.

@dborowitz, @jrn, @hanwen Can this be closed?

jheiss commented 4 years ago

Has this been deployed to googlesource.com?

$ curl -s https://boringssl.googlesource.com/boringssl/+archive/ae223d6138807a13006342edfeef32e813246b39.tar.gz | shasum
470f928f1c27777450b35cc6bf7cdce604ffe9af  -

$ curl -s https://boringssl.googlesource.com/boringssl/+archive/ae223d6138807a13006342edfeef32e813246b39.tar.gz | shasum
ec8cd3acabbc7ff12df97064248823be0372a869  -

vapier commented 4 years ago

unfortunately, it has not, and it doesn't seem like it will be :/

ryandesign commented 4 years ago

Whom do we need to contact to get that fixed?

hanwen commented 4 years ago

googlesource.com runs JGit from master, so if this is still non-deterministic, something else is going on.

ryandesign commented 4 years ago

> if this is still non-deterministic

It is: note different Content-Length on different runs of trying to fetch the same commit:

$ curl -I 'https://chromium.googlesource.com/chromium/tools/depot_tools/+archive/5664586374b9a80af397354523e93b9ef9333f16.tar.gz'
HTTP/1.1 200 OK
Cache-Control: private, max-age=7200, stale-while-revalidate=604800
Content-Disposition: attachment; filename=depot_tools-5664586374b9a80af397354523e93b9ef9333f16.tar.gz
Content-Length: 1669011
Content-Security-Policy-Report-Only: script-src 'nonce-LMJfW5Qngj9T28V+Qzc5dw' 'unsafe-inline' 'strict-dynamic' https: http: 'unsafe-eval';object-src 'none';base-uri 'self';report-uri https://csp.withgoogle.com/csp/gerritcodereview/1
Content-Type: application/x-gzip
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Xss-Protection: 0
Date: Thu, 24 Sep 2020 15:29:01 GMT
Alt-Svc: h3-Q050=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"

$ curl -I 'https://chromium.googlesource.com/chromium/tools/depot_tools/+archive/5664586374b9a80af397354523e93b9ef9333f16.tar.gz'
HTTP/1.1 200 OK
Cache-Control: private, max-age=7200, stale-while-revalidate=604800
Content-Disposition: attachment; filename=depot_tools-5664586374b9a80af397354523e93b9ef9333f16.tar.gz
Content-Length: 1668975
Content-Security-Policy-Report-Only: script-src 'nonce-IbkxLKtQPmSfur5zBvL4lg' 'unsafe-inline' 'strict-dynamic' https: http: 'unsafe-eval';object-src 'none';base-uri 'self';report-uri https://csp.withgoogle.com/csp/gerritcodereview/1
Content-Type: application/x-gzip
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Xss-Protection: 0
Date: Thu, 24 Sep 2020 15:29:05 GMT
Alt-Svc: h3-Q050=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"

mohd-akram commented 3 years ago

This is still happening.

jrn commented 3 years ago

> Attempting to checksum a dynamically created .tar.bz2 or .tar.gz stream is not a good idea, as the compressor can change over time and produce different compressed stream results that still inflate to the same original files.

This has been true from the start. Unless we

a. Store the tarball when a user downloads it (this is what GitHub does), or

b. Keep around historical versions of commons-compress and record which one was used to produce the tarball

we cannot make tarball downloads deterministic in the long term. All the requests I have seen are for use cases that require long-term determinism. In that spirit, it would be misleading to pretend we intend to provide that; it is expensive to do and not part of what Gitiles is meant for.

If you don't need determinism, you can use the Gitiles tarball. If you do need determinism, I recommend storing the tarball somewhere (e.g. a cloud storage provider or an ftp host).
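The store-it-yourself workflow amounts to downloading once, recording a checksum, and verifying that checksum on every later use. A minimal sketch, using a local file as a stand-in for an archive fetched once from gitiles (the upload step to real hosting is omitted):

```shell
# Sketch: mirror the archive once, record its checksum, verify on later use.
set -e
mkdir -p mirror
# Stand-in for a tarball fetched once via curl from gitiles.
printf 'frozen archive bytes\n' > mirror/depot_tools.tar.gz
( cd mirror && sha256sum depot_tools.tar.gz > depot_tools.tar.gz.sha256 )
# Later consumers verify against the recorded checksum before unpacking:
( cd mirror && sha256sum -c depot_tools.tar.gz.sha256 )
```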

vapier commented 3 years ago

(a) Can we make this a hosting config option? I get that storing archives for every project and every commit is a ton of space and would be pretty wasteful (especially if crawlers hit every archive URL). I wonder if a middle ground of doing it only for tags would work.

(b) How big a problem is this approach? Gitiles doesn't seem to change that much (for better or worse), so what if we did this? Not entirely unrelated: the gzip project has an --rsyncable option so that compressed files are stable and easy to transfer.

eighthave commented 2 years ago

I'm still seeing the timestamp in the tar metadata when downloading from googlesource.com. So this is not yet resolved. It looks like it was already fixed in JGit. I added more info in #217

ryandesign commented 2 years ago

Why is this issue closed? The problem was never fixed. Please reopen.