golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

cmd/distpack: offer Zstandard-compressed archives in addition to gzip #62446

Open cespare opened 1 year ago

cespare commented 1 year ago

This is inspired by #62445, where @dsnet proposes using zopfli to create ~6% smaller .gz downloads for Go release downloads.

As he writes in that issue:

zstd is well-positioned to take over as the de facto compression format, but that probably won't happen for another decade.

This proposal is to help usher in that future by offering zstd downloads in addition to gzip.

Here's a very quick'n'dirty comparison of compression performance on the same go1.21.0.linux-amd64.tar.gz archive Joe looked at:

| file | size | cmp ratio | vs. orig | CPU time |
| --- | --- | --- | --- | --- |
| .tar | 223.0 MB | | | |
| orig .gz | 66.7 MB | 29.9% | | 5-7s[^1] |
| gzip -9 | 65.6 MB | 29.4% | -1.6% | 20s |
| zopfli .gz | 62.6 MB | 28.1% | -6.1% | 15min on Joe's machine |
| zstd 3 | 63.6 MB | 28.5% | -4.6% | 800ms |
| zstd 7 | 58.4 MB | 26.2% | -12.4% | 2.4s |
| zstd 12 | 52.0 MB | 23.3% | -22.0% | 7.7s |
| zstd 19 | 44.5 MB | 20.0% | -33.3% | 64s |
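
For readers who want to check the two percentage columns, here is a minimal Go sketch of the arithmetic, assuming "cmp ratio" means compressed size divided by the .tar size and "vs. orig" means the size change relative to the original .gz. The function names `ratio` and `vsOrig` are made up for illustration; the sizes are the ones quoted in the comparison.

```go
package main

import "fmt"

// ratio returns the compressed size as a percentage of the uncompressed size.
func ratio(compressedMB, rawMB float64) float64 {
	return 100 * compressedMB / rawMB
}

// vsOrig returns the size change relative to the original .gz, in percent.
// A negative value means the archive is smaller than the original .gz.
func vsOrig(compressedMB, origMB float64) float64 {
	return 100 * (compressedMB - origMB) / origMB
}

func main() {
	// Sizes in MB, copied from the comparison in this comment.
	const tarMB, origGzMB = 223.0, 66.7
	rows := []struct {
		name string
		mb   float64
	}{
		{"gzip -9", 65.6},
		{"zopfli .gz", 62.6},
		{"zstd 3", 63.6},
		{"zstd 7", 58.4},
		{"zstd 12", 52.0},
		{"zstd 19", 44.5},
	}
	for _, r := range rows {
		fmt.Printf("%-10s %5.1f MB  %5.1f%%  %+6.1f%%\n",
			r.name, r.mb, ratio(r.mb, tarMB), vsOrig(r.mb, origGzMB))
	}
}
```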

Also, decompressing the .zst archives takes about 4x less CPU time than decompressing the .gz archives on my machine.

If we offered .gz and .zst, people who care at all about size and speed can just use .zst and get a much bigger benefit than if we had zopfli-encoded .gzs.

[^1]: This is an estimate based on the fact that the file size falls between gzip -5 and gzip -6. I think that the actual release process uses compress/gzip which is quite a bit slower.

ianlancetaylor commented 1 year ago

CC @golang/release

heschi commented 1 year ago

Those are pretty compelling numbers. At least on my machine, with tar 1.34, tar -xf works just as well on .tar.zst, so I don't see any downsides to doing this other than some UI clutter on go.dev/dl.

The implementation detail is not so trivial. Creating release archives is now the responsibility of https://cs.opensource.google/go/go/+/master:src/cmd/distpack/pack.go, and we want them to be completely deterministic, which means using a compression algorithm that we can hold constant for the lifetime of a Go release. (See the associated blog post). We'd need to pull a zstd implementation into the distribution, either as a standard library package (unlikely), an internal package we own (time-consuming to write, unless someone wants to contribute it), or a vendored third-party package that looks solid (seems fine?).

Overall I'm in favor of this, it seems like a moderate amount of effort and pretty much a pure win for users.

dsnet commented 1 year ago

We'd need to pull a zstd implementation into the distribution, either as a standard library package (unlikely), an internal package we own

Alternatively, rather than freezing it at the Go package layer, you could rely on os/exec, and freeze it at the binary level of which zstd (or zopfli for #62445) binary you use.
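
A minimal sketch of what this could look like, assuming a `zstd` binary is on PATH. `zstdCommand` is a hypothetical helper; `-19`, `-q`, `-f`, `-o` and `--version` are standard zstd CLI flags. Recording the `--version` output is one way to note which binary produced an archive, though a later comment in this thread rules out shelling out to a tool that is not versioned in the Go repo.

```go
package main

import (
	"fmt"
	"os/exec"
)

// zstdCommand builds (but does not run) a zstd invocation that compresses
// src into dst at the given level. bin is the path to the zstd binary.
func zstdCommand(bin string, level int, src, dst string) *exec.Cmd {
	return exec.Command(bin,
		fmt.Sprintf("-%d", level), // compression level, e.g. -19
		"-q",                      // quiet
		"-f",                      // overwrite dst if it exists
		"-o", dst,
		src,
	)
}

func main() {
	bin, err := exec.LookPath("zstd")
	if err != nil {
		fmt.Println("zstd not found on PATH; nothing to run")
		return
	}
	// Record which binary would produce the archive.
	out, err := exec.Command(bin, "--version").Output()
	if err != nil {
		fmt.Println("could not query zstd version:", err)
		return
	}
	fmt.Printf("%s", out)

	// The file names here are placeholders; calling cmd.Run() would
	// perform the actual compression.
	cmd := zstdCommand(bin, 19, "go.tar", "go.tar.zst")
	fmt.Println(cmd.String())
}
```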

ianlancetaylor commented 1 year ago

@heschi Just a note that there is a package that we could vendor if we go that route: github.com/klauspost/compress/zstd.

klauspost commented 1 year ago

FWIW github.com/klauspost/compress/zstd compresses it to 43873902 bytes with the best compression setting. That is 43.87MB in ~8.3s.

But to be fair, it does use a bigger window size. Without the bigger window it is 49.83MB, but there isn't much reason to use the small window; if you are that resource constrained, just use gz.

rsc commented 1 year ago

As Heschi notes, the relevant code needs to live or be vendored into the Go tree so that we can reproduce the archives bit-for-bit even far into the future. We could do that, but it increases the cost. Shelling out to a separate tool that isn't versioned in the Go repo is not an option. We'd also have to update gorebuild to verify zstd as well.

In the long term we may end up with zstd vendored anyway, or perhaps even added to the standard library. I'm OK with vendoring it for use in cmd/dist.

That said, it will require work on the release team's part, and we may not have bandwidth for reviewing and deploying such a change in the near future. But in the abstract it sounds reasonable to me.

heschi commented 1 year ago

If someone's interested in moving this forward, I think the steps are to vendor a zstd implementation, add support to cmd/distpack, and update our release automation to also publish the new files. If someone does the first two pieces I think the release team can find the time to do the latter.

There are two other kinds of artifacts not covered by this proposal: Windows distribution archives and toolchain module files, both .zip files. Wikipedia says that zip standardized zstd support a few years ago, so it's theoretically possible to make this change to both.

For Windows, it would be interesting to survey implementations and see how usable a more advanced compression would be.

For the toolchain module files, we'd need to teach the Go command to understand them, and (per discussion with Russ) probably start publishing a second series of archives, v0.0.2 rather than v0.0.1. Since toolchain upgrades will increasingly be done via the Go command, these are arguably the most important to optimize. But perhaps we should start by getting experience with the release archives.

klauspost commented 1 year ago

Wikipedia says that zip standardized zstd support a few years ago, so it's theoretically possible to make this change to both.

Yeah; no. Using the Windows 11 built-in extraction tools with zstd in a ZIP file just gives an Error 0x80004005: Unspecified error. 90% of users will use that for extraction.

rsc commented 10 months ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

rsc commented 10 months ago

Are there any objections to adding this?

rsc commented 10 months ago

Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group

Add .tar.zst archives anywhere we generate .tar.gz archives in cmd/distpack. We would not add zstd-enabled zip files because Windows zip readers can’t handle them.

In the longer term, this could be a step toward zstd-compressed modules, but that would require changing many more moving parts and is not in scope for this specific proposal.

rsc commented 10 months ago

No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group

Add .tar.zst archives anywhere we generate .tar.gz archives in cmd/distpack. We would not add zstd-enabled zip files because Windows zip readers can’t handle them.

In the longer term, this could be a step toward zstd-compressed modules, but that would require changing many more moving parts and is not in scope for this specific proposal.

mvdan commented 8 months ago

In the longer term, this could be a step toward zstd-compressed modules, but that would require changing many more moving parts and is not in scope for this specific proposal.

Out of curiosity, would the thinking there be to keep the module archives as ZIP, but swap the compression algorithm to zstd, or to switch to something else entirely like .tar.zst?

The latter is more standard in terms of zstd compression, and will give a better compression ratio since all files are compressed together, but we would lose the ability to seek through files without decompressing. I suspect that's not a problem, given that GOPROXY serves go.mod files separately, and GOMODCACHE already extracts the entire module archives for use in cmd/go.