Migrate to alternative compression for crates.io crates #2526

Open alexcrichton opened 8 years ago

alexcrichton commented 8 years ago

Why?

tl;dr: unless you have a 52 MB/s internet connection, xz is probably faster, but my math should be checked!

The statistics here are computed over the entirety of crates.io at this time, including all crates and all versions ever published. Currently we use flate2, which is backed by miniz, at the best compression level for tarballs (level 9). The "zlib" numbers here are generated by compiling flate2 against zlib instead of miniz. The xz numbers are generated with the xz2 crate, also using compression level 9.

First up, let's take a look at what we're storing on S3. This is just the total size of all published crates.

| stat | val (bytes) | % smaller |
| --- | --- | --- |
| miniz | 3776697673 | 0.0 |
| zlib | 3776147960 | 0.01 |
| xz | 2411082764 | 36.16 |

Next, let's multiply each version's size by the number of times it's been downloaded. This is, in theory, the number of bytes that have been transferred out of S3.

| stat | val (bytes) | % smaller |
| --- | --- | --- |
| miniz | 3502228200434 | 0.0 |
| zlib | 3501544526571 | 0.02 |
| xz | 2373770137784 | 32.22 |

Next up is how long it took (in nanoseconds) in total to decompress all crates on my local computer.

| stat | val (ns) | ns per byte | % slower |
| --- | --- | --- | --- |
| miniz | 118891793644 | 31.484 | 0.0 |
| zlib | 118952441288 | 31.501 | 0.05 |
| xz | 144860343353 | 60.081 | 90.83 |

Ok, so the claims of xz are correct in that it's about 30% smaller than gzip, but the decompression time is much larger! If we assume that these numbers are true in the average, however, let's do some math to figure out how fast your bandwidth needs to be to break even.

First up we've got:

time = bytes / BW + bytes * time_per_byte

So if we assume that xz crates are on average 36.16% smaller and use the timings we found above for decompressing, we have:

bytes / BW + bytes * 31.484 = bytes * (1 - .3616) / BW + bytes * (1 - .3616) * 60.081
            1 / BW + 31.484 = (1 - .3616) / BW + (1 - .3616) * 60.081                
            1 + BW * 31.484 = (1 - .3616) + (1 - .3616) * 60.081 * BW                
            1 + BW * 31.484 = .6384 + 38.36 * BW                                     
                      .3616 = 6.876 * BW                                             
                    0.05258 = BW                                                     

Now that's 0.05258 bytes per nanosecond, which translates to 52,580,000 bytes per second which is 52.58 MB per second.

So... if my math is right, xz is faster for download + decompression unless you have a 52 MB/s link to crates.io. I rather doubt anyone really has that kind of bandwidth to our bucket, so on average the smaller download size of xz-compressed crates should more than make up for the longer decompression times.
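
For anyone who wants to re-check the arithmetic, here's a minimal sketch in Rust; the constants are the byte and nanosecond totals from the tables above.

```rust
// Recompute the break-even bandwidth from the measured totals above.
fn main() {
    let gz_bytes = 3_776_697_673.0_f64; // total compressed size, miniz/gzip
    let xz_bytes = 2_411_082_764.0_f64; // total compressed size, xz -9
    let gz_ns = 118_891_793_644.0_f64; // total decompression time, gzip (ns)
    let xz_ns = 144_860_343_353.0_f64; // total decompression time, xz (ns)

    let gz_ns_per_byte = gz_ns / gz_bytes; // ~31.5 ns per compressed byte
    let xz_ns_per_byte = xz_ns / xz_bytes; // ~60.1 ns per compressed byte
    let size_ratio = xz_bytes / gz_bytes;  // ~0.6384, i.e. xz is ~36% smaller

    // Solve  1/bw + gz_ns_per_byte = size_ratio/bw + size_ratio * xz_ns_per_byte
    // for bw, the break-even bandwidth in bytes per nanosecond.
    let bw = (1.0 - size_ratio) / (size_ratio * xz_ns_per_byte - gz_ns_per_byte);
    println!("break-even bandwidth: {:.2} MB/s", bw * 1_000.0);
}
```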

Note that I didn't bother gathering statistics about compression times. xz is massively slower than gzip, but I don't think that matters much for most people, as cargo publish is relatively rare.

How?

Ok, so I haven't actually thought much about this. I was basically curious recently and just wanted to make sure that these numbers and statistics didn't disappear. For backwards compatibility we need to continue to publish gzip crates for quite a while (maybe forever). Our S3 bill can take the hit though, that's fine. This means that cargo publish will create both tarballs and upload them.

The index would grow another field to store the sha256 hash of the xz crate, and new publishes would fill that in.
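
As a rough sketch of what that could look like (the cksum_xz field name is purely hypothetical, not a settled design):

```rust
use serde::{Deserialize, Serialize};

// Hypothetical shape of an index entry with an extra, optional checksum for
// an xz tarball. `cksum_xz` is an invented name; only `cksum` exists today.
#[derive(Serialize, Deserialize)]
struct IndexEntry {
    name: String,
    vers: String,
    /// sha256 of the existing gzip-compressed .crate file
    cksum: String,
    /// sha256 of the xz-compressed tarball, filled in by new publishes only
    #[serde(default, skip_serializing_if = "Option::is_none")]
    cksum_xz: Option<String>,
    // ...the other existing index fields (deps, features, yanked, ...) omitted
}
```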

Next, when downloading a crate, Cargo could start saying "I'm xz compatible", at which point crates.io would redirect to the xz URL (it currently redirects to a gz URL) for any crate which has an xz-compressed tarball available (crates.io would grow a flag for this on all published versions).
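
Sketching the client side of that negotiation (everything here is hypothetical: the header, the media types, and the assumption that crates.io keys the redirect off them), it could be as simple as an Accept-style header:

```rust
// Hypothetical "I'm xz compatible" negotiation, sketched with reqwest.
// The media types and server behaviour are invented for illustration only.
fn download_crate(dl_url: &str) -> reqwest::Result<Vec<u8>> {
    let response = reqwest::blocking::Client::new()
        .get(dl_url)
        .header("Accept", "application/x-xz, application/gzip;q=0.5")
        .send()?;
    // After following the redirect, the final URL would tell us whether we got
    // the xz tarball or the legacy gzip one, and thus which decompressor to use.
    println!("resolved to {}", response.url());
    Ok(response.bytes()?.to_vec())
}
```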

I... think that would cover our bases? Certainly a lot of work so I probably won't be able to get around to this any time soon, but wanted to put this out there in case others were interested :)

alexcrichton commented 8 years ago

Also as another note, I don't think that platform compatibility is relevant at all. The xz2 crate I've got going works on all our platforms and we're not relying on external tools so we've got the bases covered.

larsbergstrom commented 8 years ago

The one place where you might see that sort of bandwidth is on EC2 machines. In my experience, their connection to S3 buckets is pretty unreal.

I like the strategy of having both formats and letting the cargo client choose (with a reasonable default) given that the S3 cost is negligible. I'm not worried about our platforms, but for people trying to stand up rust+cargo on some wonky new embedded platform with a strange linux variant, every new package can be a nightmare --- I'm thinking of my experiences with these "custom" ARM linux builds for PandaBoard, etc. that I've had to play with for Servo. Though, admittedly, one could argue that you should always be cross-compiling to target them instead.

julienw commented 8 years ago

Did you look at brotli as well? I admit I don't know if it's usable like this, but it was specifically designed to be fast at decompression while achieving a good ratio.

alexcrichton commented 8 years ago

Nah, unfortunately I only tried out gzip + xz. I figure brotli would be pretty similar to gzip (not drastically smaller, but perhaps faster), but it'd be good to investigate just to make sure!

julienw commented 8 years ago

I read a little more and brotli has a predefined dictionary that's suitable mostly for html and text files. So it might not be ideal for rust crates -- yet rust crates are also mostly text after all.

zopfli could also be worth looking at, as it compresses into a gzip-compatible format.

raphaelcohn commented 8 years ago

When I was writing swaddle, I took a bit of a look at this. zopfli is usually beaten, slightly, by pigz -11 -T (-I can't be used, as the older versions still used on Ubuntu LTS releases don't have it). [Try pigz -11 -c -q -f -n -T -- "$FILE" >"$FILE".gz, for example.] Sticking to gzip also has the slight advantage that the HTTP client can decompress on the fly, rather than after downloading, if the right HTTP headers are used; Amazon's CDN supports this too, IIRC. That by itself can often be a major win. Does crates.io do this? I don't know. Even better, however, are the rzip derivatives, particularly lrzip. It's available now on most major platforms (although CentOS requires the EPEL repo).

raphaelcohn commented 8 years ago

Or we could support multiple compression formats, including none at all (e.g. for the EC2 case above). This is effectively what the Debian apt repositories do. It's why Ubuntu et al. could migrate from gzip to bzip2 to lzma to xz over several major releases, without having to worry about repacking or whether the necessary, more modern compression tooling was available to all users.

That said, I'd recommend choosing 2 or 3:-

luser commented 8 years ago

FYI, there's a short writeup comparing a few modern compression implementations here: https://cran.r-project.org/web/packages/brotli/vignettes/brotli-2015-09-22.pdf

Brotli seems to be the current leader for balancing compression ratio and decompression speed.

luser commented 8 years ago

Apparently Dropbox wrote a pure-Rust brotli implementation: https://github.com/dropbox/rust-brotli

gyscos commented 7 years ago

zstd is another potential choice, offering very high decompression speed at a near-lzma compression ratio at the high levels. It supports custom dictionaries, but they need to be shared by the sender and the receiver, so it'd require a specific client (making it cargo-only), which is maybe not ideal.

As for pipelining download and decompression, most compression formats already allow that.

raphaelcohn commented 7 years ago

Brotli's a great compression algorithm, but it works best on textual and HTML encoded data; it's not ideal for binaries. I think the debate really boils down to:-

eternaleye commented 7 years ago

On a related note, is there any chance that the crate download filenames could include the compression algorithm? Format sniffing is error-prone and potentially risky, and I would greatly prefer foo-1.2.3.crate.xz as distinct from foo-1.2.3.crate (legacy gzip), and so forth for future formats.
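
For example, here is a minimal sketch of what an explicit extension buys the client: the decompressor is chosen from the file name alone, with no sniffing (the extension scheme here is illustrative, not a decided format).

```rust
use std::path::Path;

// Choose the decompressor from the download's file name instead of sniffing
// magic bytes. The mapping is illustrative only.
#[derive(Debug, PartialEq)]
enum Codec {
    Gzip, // legacy foo-1.2.3.crate
    Xz,   // hypothetical foo-1.2.3.crate.xz
}

fn codec_for(file_name: &str) -> Option<Codec> {
    match Path::new(file_name).extension().and_then(|ext| ext.to_str()) {
        Some("crate") => Some(Codec::Gzip),
        Some("xz") => Some(Codec::Xz),
        _ => None,
    }
}
```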

kornelski commented 7 years ago

I've tested using bindgen-0.19.0.crate as an example:

| codec | compressed size (bytes) |
| --- | --- |
| zopfli | 5369537 |
| xz -9 | 3944776 |
| xz default | 3951300 |
| brotli default | 4025797 |

BTW: I wouldn't mind if support for the new format was tied to a particular Rust version (i.e. when publishing I could choose to opt in to the new format). If I were to publish a package that uses language features from Rust 1.n, then I could just as well require Cargo 1.n for decompressing it.

eternaleye commented 7 years ago

@pornel

I'd really, really, really rather not. Especially as there's currently no way for Cargo to tell that any given crate requires Rust 1.n.

kornelski commented 7 years ago

My reasoning is that the package won't work either way, so whether it fails to decompress or fails to build, it's still incompatible. A way for cargo to know the required Rust version could be added.

gyscos commented 7 years ago

@pornel: How did you get these numbers? The bindgen-0.19.0.crate uncompressed is 269312 bytes here, maybe I have the wrong file?

kornelski commented 7 years ago

Oops, I had a dirty local directory containing a massive libclang.dylib.

gyscos commented 7 years ago

For small files, brotli wins thanks to its dictionary. The bindgen-0.19.0.crate file (as built by cargo package) is a good example:

| Compression used | Size in bytes |
| --- | --- |
| No compression | 269312 |
| gzip -9 | 45969 |
| xz -1 | 44020 |
| xz -9 | 40132 |
| zstd --ultra -22 | 41406 |
| bro | 38831 |

EDIT: This brings a good point: what is the average size of crates on crates.io? With the trend of having many smaller crates, compressing small files looks like a common case.

llogiq commented 7 years ago

I recently blogged about this stuff. Though a bit more complicated (especially on the server side), I strongly suspect that delta compression (e.g. BSDiff or Courgette) will give us even bigger wins than using brotli.

gyscos commented 7 years ago

Delta updates don't apply to the first download though, which is quite frequent.

adam-azarchs commented 2 years ago

Because I was curious, I checked how different compressors/decompressors worked out on what's currently in my $CARGO_HOME/registry/cache, which is probably a reasonably representative sample[^1]. The totals, across all of those crates:

| Codec | version | Compressed size (MB) | Compression time (s) | Decompression time (s) |
| --- | --- | --- | --- | --- |
| Uncompressed | N/A | 967 | N/A | N/A |
| As downloaded (gzip -9) | N/A | 182 | N/A | 8 |
| zopfli (default settings) | 1.0.3 | 173 | 6937 | 483 |
| brotli -9 | 1.0.9 | 146 | 287 | 7 |
| zstd --ultra -22 | 1.5.0 | 139 | 2984 | 9 |
| xz -9e | 5.2.5 | 133 | 1448 | 22 |

Things like diff compression might make sense for saving space on the server side, but are not terribly useful on the user end of things, particularly for e.g. CI systems which are frequently set up to download everything from scratch every time.

Conclusions:

  1. zopfli offers small improvements in compression ratio, but at a very significant cost in both compression and decompression times. It does have the (significant) advantage of backwards compatibility, however, which is really the only reason it exists.
  2. brotli does significantly better than gzip (including zopfli) in terms of compression ratio[^2]. While it does not achieve the highest compression ratios, it is very significantly faster for compression and edges out the others in decompression speed as well.
  3. zstd tends to be a very good trade-off between speed and compression ratio in a lot of situations. However, when you push its compression settings to the max like this, it still can't catch up to xz and becomes rather expensive to compress. It does, however, beat xz in decompression speed, at the cost of only a little size[^3].
  4. xz achieves the highest compression ratio, while still being relatively reasonable in terms of the amount of time it takes to compress things. However, it is a little slower on the decompression end. For most reasonably expected internet connections, people would probably be better off with slightly larger zstd or brotli archives which decompress faster.

[^1]: For each of the 1303 crate archives in that directory, I decompressed it, then re-compressed it with each of the codecs being discussed, and then timed decompression from there as well. The benchmark was performed on a laptop with an Intel Core i9-9980HK and 64GB of memory running Linux.

[^2]: This shouldn't be much of a surprise, since crate archives are mostly source code, and brotli was optimized for web content which is also mostly source code - in particular, javascript looks awfully similar to rust sources, as far as a compression codec is concerned anyway.

[^3]: One could probably eke a little more out of zstd by using its functionality for "training" a dictionary that can be pre-shared, borrowing a trick from brotli but training on a dataset that is specific to your problem domain. The significant downside of doing that, of course, is that people wouldn't be able to inspect the crate archives using standard tools without finding the specified dictionary file somewhere.

julienw commented 2 years ago

I'm surprised about the decompression time for zopfli; shouldn't it be close to gzip?

adam-azarchs commented 2 years ago

I was also surprised. In my past experience, decompressing from zopfli has usually been, if anything, faster. It's possible the issue is with my benchmarking script, which for convenience used the same executable for compression and decompression; zopfli may just be bad at decompression, and I'd have gotten different results with gunzip.

soloturn commented 1 year ago

Related is Dropbox's re-implementation of brotli in Rust under its new name, broccoli, and the broccoli protocol they use around it to get concatenable brotli streams:

@llogiq, @adam-azarchs: when you write "diff compression", would this solve it? @alexcrichton, what do you think?

adam-azarchs commented 1 year ago

Brotli is unrelated to diff compression. "Diff compression" refers to computing the diff between one version of something and another and only sending that difference (which could itself be compressed, of course). However there are several problems with using diff compression for this use case:

  1. If you skip a version, you need to download, decompress, and apply several diffs sequentially. Or the server needs to maintain diffs between a quadratically-growing number of different version combinations.
  2. Checksum verification becomes more complicated. Either you need to maintain the checksum for the baseline and every diff, or you need to apply the diff before computing the checksum.
  3. Decompressing a baseline archive and one or more diffs, and then applying those diffs, is guaranteed to be slower than simply decompressing an archive of the final content.
  4. The whole thing adds significant complexity. That's not an argument against it in itself, but does mean that the benefits need to be compelling.
  5. When I update a crate on my dev machine, there will be the earlier version to diff against, which could improve size for that crate, but there's also a lot of other crates which haven't been touched at all. However a CI system that is doing a clean build every time (e.g. GitHub actions) will need to download every crate fresh, with no baseline. Because it's downloading every crate, I strongly suspect that those kinds of things will constitute the vast majority of total downloads. In those cases, downloading a baseline + one or more diffs would actually increase the total amount downloaded in most cases.

Particularly due to that last point, download of complete archives needs to be considered the "normal" case, and needs to be well-supported. Diff compression might be a fun optimization, but can't really have priority over optimizing the common case.

epage commented 9 months ago

Since a lot of this early discussion, we've gained the package.rust-version field, and we could upload the new format only when the MSRV is high enough.

adam-azarchs commented 9 months ago

I redid the above comparison with somewhat newer tools and a bit more care to insulate things from the rest of what was going on on the system. I omitted zopfli this time because it's a bit awkward to include (it doesn't work on streams, only files on disk, because it needs to make repeated passes).

| Codec | Compressed size (MB) | Ratio | Decompress time (s) | Compress time (s) |
| --- | --- | --- | --- | --- |
| Uncompressed | 3252.5 | 1.00 | N/A | N/A |
| lz4 | 802.4 | 0.25 | 6.5 | 14.9 |
| zstd --fast | 630.0 | 0.19 | 8.2 | 15.3 |
| lz4 -9 | 609.2 | 0.19 | 6.4 | 96.2 |
| gzip | 525.8 | 0.16 | 18.9 | 88.6 |
| gzip -9 | 517.0 | 0.16 | 18.8 | 249.4 |
| zstd | 493.4 | 0.15 | 8.9 | 20.1 |
| brotli -9 | 398.2 | 0.12 | 10.4 | 316.2 |
| zstd -19 | 380.3 | 0.12 | 9.7 | 1527.2 |
| zstd -22 | 376.0 | 0.12 | 10.6 | 3599.5 |
| xz | 372.5 | 0.11 | 35.1 | 853.1 |
| xz -9e | 360.2 | 0.11 | 34.9 | 1785.2 |
| brotli[^1] | 357.2 | 0.11 | 11.3 | 7094.0 |

These results were streaming the content of each crate file (one at a time) from an in-memory buffer[^2], on an Amazon EC2 instance (m5, Intel(R) Xeon(R) Platinum 8259CL CPU).

Software versions:

  1. gzip 1.5
  2. zstd 1.5.4
  3. brotli 1.0.9
  4. xz 5.2.7

Compression inputs were $CARGO_HOME/registry/cache/*/*.crate[^3], for a large and probably representative sample.

I didn't redo the zopfli benchmark, but it looks likely that it would end up close to brotli -11 in terms of time to compress[^4], and I'd expect its compressed size to come in a little bigger than zstd at default settings. The reason to consider zopfli as an option would be to preserve compatibility with existing clients, since its output is still gzip; it could be implemented with a relatively isolated change to cargo publish (there's even a pure-Rust re-implementation of zopfli already, though I haven't done any performance comparisons for it). A little over 2 seconds per MB is slow as far as compression codecs go in general, but given that the vast majority of crates (in my $CARGO_HOME at least) come in under 500kb, it wouldn't have a meaningful impact on the amount of time it takes to publish a new crate version.

If switching codecs is on the table, it looks to me like brotli (at the default, highest compression level of 11) is probably the way to go. The decompression speed is close enough to zstd (and quite an improvement over gzip, let alone over xz), and it wins on size as well, likely because it's able to take advantage of its pre-initialized dictionary[^5].

(edited to clarify that the statistics are computed compressing each crate's content separately, and to add a few more codecs / settings to the party)

[^1]: brotli's documented command line options only allow levels 0 through 9, however it actually goes to 11, which is the default.

[^2]: zstd's compression speed is improved quite substantially by passing --size-hint or --stream-size, especially at higher compression levels - that 3449 seconds for zstd --ultra -22 is reduced to 2069 seconds, for example.

[^3]: Because of the recent switch to sparse registry, there's some duplication between github.com-1ecc6299db9ec823 and index.crates.io-6f17d22bba15001f, which makes the statistics biased towards more recently-used crates, which I figured is probably a good thing.

[^4]: This makes sense given that brotli's source code refers to quality level 10 as ZOPFLIFICATION and 11 as HQ_ZOPFLIFICATION.

[^5]: Brotli's dictionary (RFC) was pre-trained on internet content at the time. That includes a lot of JavaScript, which is mostly English-ish words with a lot of '});, just like most rust source code. It would be interesting to see how that dictionary might be different if it was re-trained today, where a lot more of the downloaded content on the internet is JavaScript, and that JavaScript is more frequently minified, but it's unlikely that we'll see a "brotli 2" any time soon, since the existing one remains "good enough."

epage commented 9 months ago

It sounds like you are compressing / decompressing your entire cache into one file. How representative is that compared to compressing / decompressing individual .crate files? I know some things scale poorly but I don't know enough about compression algorithms to know if it includes them. For example, while compression time is the lowest priority for me, it makes it harder to evaluate the user impact when the numbers are that large and unrepresentative of what will happen for the user.

Is there much of a difference between implementations we'd use within cargo and the standalone programs tested?

Something else for us to consider is how easy it is to index into the "tarball" if we decide to support rustc reading directly from compressed files. This isn't decided yet and so not a hard requirement but it is useful to consider.

Again, I know very little about compression: someone brought up LZ4 at RustConf Unconf. Is that worth testing?

adam-azarchs commented 9 months ago

> It sounds like you are compressing / decompressing your entire cache into one file.

No, I was not. That was compressing/decompressing one crate at a time. Otherwise I would expect zstd to have significantly outperformed brotli. Also zopfli (including the implementation within brotli as far as I know) is not capable of compressing files larger than 2GB.

> Is there much of a difference between implementations we'd use within cargo and the standalone programs tested?

That I can't answer without knowing which implementation you'd use. If you used C / bindgen then I'd expect the performance to be the same (or maybe slightly better since you wouldn't have the process start/stop overhead) but if it's an alternative implementation then who knows?

> Something else for us to consider is how easy it is to index into the "tarball"

For that, you'd probably be best off using zip rather than tar for the archive format, since that allows indexing to individual files within the archive. Though, you could use something like stargz if you wanted to retain backwards compatibility. Either way, your compression ratios won't be as good because you'll lose state between files. However, I'd expect the impact to be lower for brotli due to the pre-initialized dictionary.
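
For illustration, assuming crates were repackaged as zip archives (which, to be clear, is not what happens today), pulling a single file out without touching the rest could look roughly like this with the zip crate:

```rust
use std::fs::File;
use std::io::Read;

// Rough sketch of "indexing into the archive": read just Cargo.toml out of a
// hypothetical zip-packaged crate, leaving the other entries untouched.
fn read_manifest(archive_path: &str) -> zip::result::ZipResult<String> {
    let mut archive = zip::ZipArchive::new(File::open(archive_path)?)?;
    let mut manifest = String::new();
    archive.by_name("Cargo.toml")?.read_to_string(&mut manifest)?;
    Ok(manifest)
}
```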

> Again, I know very little about compression: someone brought up LZ4 at RustConf Unconf. Is that worth testing?

Not really. lz4 is very fast for both compression and decompression but doesn't get good compression ratios. It's designed for things like transparently compressing data in an advanced filesystem like zfs or btrfs; it's not really appropriate for longer-term archival. zstd is almost as fast for decompression.

For completeness, however, here is a comparison on a subset of the crate list:

| Codec | Compressed size (MB) | Ratio | Decompress time (s) | Compress time (s) |
| --- | --- | --- | --- | --- |
| Uncompressed | 2238.2 | 1.00 | N/A | N/A |
| zstd --fast | 447.1 | 0.20 | 6.2 | 11.4 |
| lz4 -9 | 433.1 | 0.19 | 4.8 | 69.9 |
| zstd | 350.8 | 0.16 | 6.7 | 15.7 |

adam-azarchs commented 9 months ago

For the curious, I created this comparison using the little Go "script" I'm attaching here (Go, because it's quick to iterate with, easy to inject parallelism into, and doesn't require any dependencies beyond the standard library for something like this). Disclaimer: this was just something I whipped up quickly to generate these comparisons; I make no warranty as to its robustness or maintainability. It's named .txt here due to limitations GitHub places on the extensions of files attached to comments: comp_bench.go

epage commented 8 months ago

> The index would grow another field to store the sha256 hash of the xz crate, and new publishes would fill that in.

Alternatively, we could checksum before compression, as requested in #7212; then we could support as many different compression algorithms as we want without adding a new field each time.
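
A minimal sketch of that idea, assuming the flate2 and sha2 crates (the function and flow are illustrative only): hash the uncompressed tar stream, so the same checksum can verify the content no matter which compression it was downloaded under.

```rust
use flate2::read::GzDecoder;
use sha2::{Digest, Sha256};
use std::io::Read;

// Illustrative only: checksum the *uncompressed* tar stream, so one value can
// verify a crate downloaded as gzip today or as some other codec later.
fn uncompressed_sha256(gz_crate: &[u8]) -> std::io::Result<String> {
    let mut tar_bytes = Vec::new();
    GzDecoder::new(gz_crate).read_to_end(&mut tar_bytes)?;
    let digest = Sha256::digest(&tar_bytes);
    Ok(digest.iter().map(|byte| format!("{:02x}", byte)).collect())
}
```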

llogiq commented 8 months ago

That might risk decompression bombs (or malware exploiting a decompressor vulnerability), because there would be no way to validate the download before decompressing it.

Eh2406 commented 8 months ago

Yes. We are going to need more than one hash. The index is going to need to provide "the actual artifact you're about to download has this hash". Separately, we will need the ability for lock files to check "this package is the same as the one we built last time, even if we downloaded it under different compression".

epage commented 8 months ago

I take it our existing zip bomb mitigations aren't sufficient?

adam-azarchs commented 8 months ago

Security vulnerabilities in decompression codecs are not unheard of, so it's nice to be able to verify the integrity of the downloaded artifact before decompressing it. It also makes it easier to implement things like content-addressable caches, since that way they don't actually need to be able to understand the compression format.

llogiq commented 8 months ago

@epage it would at least raise the security level by being able to verify the authenticity of the download before doing anything with it. Of course crackers could still compromise the server, replacing both binaries and checksums, but then we'll have bigger fish to fry anyway.

adam-azarchs commented 8 months ago

Checksums are stored in Cargo.lock, so I'd think a compromise of that sort for the registry would be detected very quickly.

wbcat commented 2 weeks ago

The author of lzip went to great lengths to make its integrity checking extremely robust, which makes it more suitable for long-term storage than xz. See xz inadequate.

I have been using lzip for quite some time especially the tarlz tool.

Compressed size and speed are on par with xz, but outperformed by zstd.

My current usage pattern:

kornelski commented 2 weeks ago

For crate sizes, see https://lib.rs/stats#crate-sizes — almost all crates are under 50KB. Majority is under 10KB.

The really large ones typically contain binary executables, libraries, ./target dir, or a lot of autogenerated code for bindings or parsers.

adam-azarchs commented 2 weeks ago

This is true. Looking at my own $CARGO_HOME cache as maybe a representative sampling, the 10 largest (which is really only 6, because there were multiple versions of some) are:

| crate | compressed size (MB) | decompressed (MB) | notes |
| --- | --- | --- | --- |
| windows | 12 | 157 | I think this falls into the "a lot of autogenerated code" category (including some multi-megabyte .rs files). That's frankly never going to perform well. |
| image | 8.8 | 14 | seems mainly down to including its tests and examples directories. This is the kind of thing that gives https://github.com/the-lean-crate/criner a reason to exist. |
| plotters | 8.2 | 42 | nearly all of the size is in a plotters-doc-data directory that appears to be full of images referenced in markdown/html from docstrings, and may not even have been intentionally checked in to the repo at the time the crate was packaged. Also includes e.g. Cargo.toml.orig... |
| lapack-sys | 7.1 | 73 | includes a lot of little files from the lapack C code, including a few which I don't think ought to be there, like for example both Makefile and CMakeLists.txt (please, just pick one...). But also 24M worth of directories named testing or TESTING. |
| openssl-src (version 111.25.2) | 4.9 | 27 | mostly in openssl/crypto, because elliptic curve definitions are unavoidably big. |
| ring | 4.9 | 15 | 1.9M of tests, but most of the size is in third_party/NIST/SHAVS, which is again unavoidable. |

So if your point was that, rather than looking at the compression algorithm, there should be more emphasis on reducing unnecessarily included files, I'd mostly agree with you. But only mostly, because of those 6 crates, one was windows and another was openssl-src, both of which are fairly critical crates in the Rust ecosystem, and neither of which has a straightforward route to being reduced in size. I think size isn't such a big concern, but moving to a more modern compression algorithm such as zstd could improve performance considerably.

As for what compression algorithm to use, I think the argument for zstd is pretty strong (even ignoring the recent tarnish to xz's reputation security-wise): zstd -22 compresses just about as well as xz but is 3x faster to decompress. Compared to gzip, it's 72% of the size and decompresses in half the time.

A recent golang issue to add zstd support does a decent job of summarizing why it meets the fairly high bar for inclusion:

[^1]: which xz is not, as the lzip authors point out. lzip is well-specified, at least.

lzip is fine, I guess, but it's definitely not seeing the kind of backing and adoption that zstd is, e.g. browser Content-Encoding or .deb archives in the repos for recent-ish major versions of Debian and Ubuntu. The robustness checks and recoverability are nice, but I think their usefulness is limited in a context where the archives are stored in a major cloud service with redundancy across multiple geographic zones. Integrity checking is useful, of course; zstd does optionally include an integrity check in its blocks, using xxhash, but if an archive is corrupted, realistically you're just going to download a fresh copy.

I'd probably also take the opportunity to rejigger the format a little bit to make it easier to extract Cargo.toml and maybe the license file (if applicable) without decompressing the whole archive (possibly as simple as requiring those to be the first elements in the archive). Not sure there's much of a use case for being able to extract other arbitrary files from the archive.
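
A sketch of that idea against the current tar+gzip format, using the tar and flate2 crates: if Cargo.toml were guaranteed to be among the first entries, a reader could stop decompressing as soon as it has been read.

```rust
use flate2::read::GzDecoder;
use std::fs::File;
use std::io::Read;
use tar::Archive;

// Sketch: stream the archive and stop once Cargo.toml has been read, rather
// than unpacking everything. This only pays off if the manifest is ordered
// near the front of the archive, which is what's being proposed above.
fn manifest_only(crate_path: &str) -> std::io::Result<Option<String>> {
    let mut archive = Archive::new(GzDecoder::new(File::open(crate_path)?));
    for entry in archive.entries()? {
        let mut entry = entry?;
        if entry.path()?.ends_with("Cargo.toml") {
            let mut manifest = String::new();
            entry.read_to_string(&mut manifest)?;
            return Ok(Some(manifest));
        }
    }
    Ok(None)
}
```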

kornelski commented 2 weeks ago

My conclusions from crate sizes are:

I think this favors Brotli as the compressor, which performs especially well on small text files (and still fine on binary files).

Other arguments in favor of Brotli:

adam-azarchs commented 2 weeks ago

Theory is nice, but brotli was included in my benchmarks above. The results speak for themselves.

kornelski commented 2 weeks ago

I don't see zstd being Pareto-optimal on all three variables, so the conclusion depends on which criteria you pick.

If prioritizing only file size, then brotli -11 wins by a significant margin, while still being faster than gzip to decode, and its encode time is less than 2x zstd's maximum. Using the most expensive compression setting isn't irrational, given that the majority of crates are below 50KB. Publishing is a one-time operation, while the data is downloaded many more times and stored indefinitely.

For a certain balance of file size & decoding time, zstd -19 has a slight advantage in this set, but this comes at a 5x increase in encoding time compared to brotli -9, and it's inconclusive whether a higher brotli level could catch up on file size with less than 5x increase in encoding time.


The file sizes involved here differ by multiple orders of magnitude, which makes proper analysis much harder. Datasets with such large differences are likely to show Simpson's paradox (you may get different conclusions depending on whether you compare compression on pairs of files, per bucket of similar file sizes, or just add up the sizes of all files first and then compare).

A single total/average is likely affected strongly by outliers — de/compression time increases superlinearly with window size (which is forced to be small on small files), so a handful of the largest files can dominate the timings. A single total also can't disprove that Brotli compresses small files better: Brotli could be better on all files except the largest few, and still lose on the summed total because of them.

folkertdev commented 2 weeks ago

I was curious what impact a custom dictionary would have on the zstd numbers. The main reason that brotli is able to compress so well is its dictionary. zstd makes it easy to train a custom dictionary on a collection of files.

Then especially for small crates, zstd appears to achieve better compression:

```
zstd           : 17.62%   ( 22528 =>   3969 bytes, ./mimalloc-0.1.42.crate.raw.zst)
zstd with dict : 11.39%   ( 22528 =>   2567 bytes, ./mimalloc-0.1.42.crate.raw.zst)
brotli         : 15.16%   ( 22528 =>   3416 bytes, ./mimalloc-0.1.42.crate.raw.br)
===
zstd           : 16.06%   (323072 =>  51874 bytes, base64-0.13.1.crate.raw.zst)
zstd with dict : 14.20%   (323072 =>  45869 bytes, base64-0.13.1.crate.raw.zst)
brotli         : 14.39%   (323072 =>  46481 bytes, base64-0.13.1.crate.raw.br)
===
zstd           : 13.45%   (381952 =>  51376 bytes, borsh-1.5.0.crate.raw.zst)
zstd with dict : 11.88%   (381952 =>  45369 bytes, borsh-1.5.0.crate.raw.zst)
brotli         : 12.58%   (381952 =>  48052 bytes, borsh-1.5.0.crate.raw.br)
```

Note: it is quite plausible that at least for mimalloc there is some overfitting happening here. The dictionary is about 100k so for the others that is less likely.

As the inputs get larger, this gap narrows and eventually brotli wins out

```
zstd           : 12.23%   (4192256 => 512558 bytes, tokio-1.34.0.crate.raw.zst)
zstd with dict : 13.12%   (4192256 => 550107 bytes, tokio-1.34.0.crate.raw.zst)
brotli         : 11.89%   (4192256 => 498616 bytes, tokio-1.34.0.crate.raw.br)
```

In this case the dictionary version also gives worse results than zstd without a dictionary. That's weird because at least in gzip, not providing a dictionary is equal to providing a dictionary of all zeros.

In any case, the dictionary provides another level of tuning to zstd, that appears to be beneficial wrt. size for the majority of crates.

scripts to generate this output from a folder of .crate files are provided below:

Hacky bash scripts:

```sh
for file in *.crate; do
    # Rename the file to .crate.gz
    mv "$file" "$file.gz"
    # Decompress the .gz file, keeping the original
    gzip -d -c "$file.gz" > "${file%.crate}.crate.raw"
done
```

Train a dictionary on all `.crate.raw` files in the current directory:

```sh
zstd --train -o raw-crates.dict *.crate.raw
```

This runs zstd and brotli on the `.crate.raw` files:

```sh
for file in $(find . -maxdepth 1 -name "*.crate.raw" -exec stat --format='%s %n' {} + | sort -n | cut -d' ' -f2-); do
    echo "==="
    # Original file size
    original_size=$(stat --format='%s' "$file")
    zstd --ultra -22 "$file" -f
    zstd --ultra -22 "$file" -f -D raw-crates.dict
    brotli -Z "$file" -f
    brotli_compressed_file="${file}.br"
    brotli_size=$(stat --format='%s' "$brotli_compressed_file")
    brotli_ratio=$(awk "BEGIN {printf \"%.2f\", $brotli_size / $original_size * 100}")
    echo "$file : $brotli_ratio% ($original_size => $brotli_size bytes, $brotli_compressed_file)"
done
```

adam-azarchs commented 2 weeks ago

For a fair comparison you shouldn't train the dictionary on the same set of crates you're using to measure compression ratio, though it would be fair for there to be some overlap, since presumably the dictionary would be trained on a sample of the most popular crates.

A zstd dictionary is basically just initializing the stream state. It shouldn't make much difference at all for larger files; once you're past the context window, the initial state is irrelevant.

adam-azarchs commented 2 weeks ago

There's a major downside to using a shared dictionary with zstd, which is that crates are no longer convenient to inspect or manipulate with standard tools other than cargo itself. That little benchmark script is a case in point: today you can just do gzip -dc < foo.crate, but if you used a shared dictionary with zstd, you'd need to find that shared dictionary and pass it on the command line. Brotli's shared dictionary isn't going to be as well optimized for Rust code, but it's baked into the specification for the format, so there's no question about where to find it.

I also think it's a bit misleading to say that "almost all crates are under 50KB. Majority is under 10KB." Yes, this is true. It's also true that a majority haven't been updated in over a year, and any metric for "popularity" is dominated by a very tiny minority. I also think that there's a strong argument that the compression ratio for small crates doesn't matter in most practical senses.

The only case where all crates should be considered equally is if you care about the storage costs for hosting the crates.io registry, in which case you should probably round everything up to a floor of 128 KB, which is the minimum billable size for an S3 object. I don't think anyone is arguing that this is what should be prioritized, especially given that, for a good long while after such an update was made, crates.io would likely need to host crates in both the gzip format and whatever was settled on to replace it (either that, or convert on the fly for older clients, but that costs compute and has to be done very carefully to ensure that checksums remain stable). If you really want to cut storage costs on crates.io, then zopfli is probably the way to go, at least for the medium term, because it doesn't pose such backwards-compatibility issues.

If you care about the space taken up by $CARGO_HOME/registry/*/cache on developers' machines, you should round everything up to the next 4 KiB block boundary (some filesystems use larger blocks, but not many use smaller blocks), and you should weight by all-time downloads (the ordering of which is pretty similar to that for recent downloads).

If you care about network bandwidth (either for developers or egress charges for crates.io's hosting) then you should weight by recent downloads (and if you want to be a pedant, also consider the fixed per-download overhead, which will dominate for the smallest crates). The top recently-downloaded crates on crates.io are

| crate | recent downloads (millions) | compressed size (KiB) |
| --- | --- | --- |
| syn | 75 | 259 |
| hashbrown | 57 | 138 |
| bitflags | 57 | 42 |
| base64 | 49 | 80 |
| regex-syntax | 49 | 339 |
| proc-macro2 | 47 | 48 |
| indexmap | 46 | 81 |
| quote | 44 | 28 |
| regex-automata | 44 | 603 |
| libc | 43 | 726 |
| serde | 41 | 76 |
| itertools | 39 | 143 |
| memchr | 39 | 94 |

If you care about impact on build times, then you should again weight by recent downloads and consider the compressed size (assuming some download speed, and again probably assuming some fixed per-crate overhead) and decompression speed.

kornelski commented 2 weeks ago

Good point about the weighting. I've added tarball_size * downloads to the stats: https://lib.rs/stats#crate-bandwidth

adam-azarchs commented 2 weeks ago

> Good point about the weighting. I've added tarball_size * downloads to the stats: https://lib.rs/stats#crate-bandwidth

Nice! That's very interesting data! From there we can see that, for example, the majority of crates have 1-10MB worth of downloads per month, totaling O(100GB) for that bucket. I wouldn't be surprised if a lot of those are crates which basically no one but their author uses. Meanwhile, there are 482 crates which total at least that much each, including 8 which are downloaded > 10TB each (which is pretty impressive, really). The list of crates in that bucket isn't terribly surprising given what I've noted above: libc is one of the most-downloaded crates and is relatively large, and windows, openssl-src, and ring are all quite large and still popular.

Breaking it down by bandwidth used per bucket (very approximately, given that I don't know the distribution of sizes within each bucket, but assuming they follow the same trend, the average is probably about 2x the bucket minimum):

| bucket | crates | total (TB) |
| --- | --- | --- |
| 1-10MB | 48635 | 0.1 |
| 10-100MB | 25706 | 0.5 |
| 100MB-1GB | 11062 | 2 |
| 1-10GB | 4487 | 9 |
| 10-100GB | 1413 | 30 |
| 100GB-1TB | 406 | 80 |
| 1-5TB | 53 | 80 |
| 5-10TB | 15 | 100 |
| >10TB | 8 | 100 |

Put another way, there's only around 25 crates that are responsible for around half of the total bandwidth on crates.io. Impressive! (though, not necessarily in a good way. I hope those crates at least have put some effort into trimming out anything unnecessary but I'd be willing to bet that at least some of them have unnecessary content)

Of course, a lot of downloads are in CI (e.g. GitHub actions, if the user hasn't set up caching properly), where bandwidth is plentiful and latency isn't so critical. So if I had time to work up another representative benchmark (I don't, at least not this week), I'd probably be looking at

  1. Crates with at least 5M "recent" downloads (~500)
  2. Crates with at least 30M all-time downloads (~500, with obviously a lot of overlap with 1)
  3. Crates with >100GB worth of monthly downloads (482, probably also with a lot of overlap with 1)

kornelski commented 1 week ago

Here's bandwidth from recent downloads and the size of the latest release:

top-bandwidth.csv

zleyyij commented 1 week ago

Just wanted to throw my chip in: for zstd, the pure-Rust crate doesn't have an encoding implementation, because the specification doesn't directly specify the compression algorithm, only the different ways to encode chunks of data. Implementing it would therefore require a lot of work (referring to the source code of Facebook's library if you wanted to come anywhere close in terms of speed and compression ratios).