Why?

tl;dr: unless you have a 52MB/s internet connection, xz is probably faster, but my math should be checked!
The statistics here cover the entirety of crates.io at this time, including all crates and all versions ever published. Currently we use flate2, which is backed by miniz, compressing tarballs at the best compression level (9). The "zlib" numbers here are generated by compiling flate2 against zlib instead of miniz. The xz numbers are generated with the xz2 crate, also at compression level 9.
First up, let's take a look at what we're storing on S3: the total size of all published crates.
| stat  | val        | % smaller |
|-------|------------|-----------|
| miniz | 3776697673 | 0.0       |
| zlib  | 3776147960 | 0.01      |
| xz    | 2411082764 | 36.16     |
Next, let's multiply each version's size by how many times it's been downloaded. This is, in theory, the total number of bytes that have been transferred out of S3.
| stat  | val           | % smaller |
|-------|---------------|-----------|
| miniz | 3502228200434 | 0.0       |
| zlib  | 3501544526571 | 0.02      |
| xz    | 2373770137784 | 32.22     |
Next up is how long it took (in nanoseconds) in total to decompress all crates on my local computer.
| stat  | val          | ns per byte | % slower |
|-------|--------------|-------------|----------|
| miniz | 118891793644 | 31.484      | 0.0      |
| zlib  | 118952441288 | 31.501      | 0.05     |
| xz    | 144860343353 | 60.081      | 90.83    |
OK, so the claims about xz are correct in that it's about 30% smaller than gzip, but the decompression time is much larger! If we assume these numbers hold on average, though, let's do some math to figure out how fast your bandwidth needs to be to break even.
First up we've got:
time = bytes / BW + bytes * time_per_byte
So if we assume that xz crates are on average 36.16% smaller (i.e. 0.6384x the gzip size) and use the per-byte decompression timings we found above, break-even is when the two totals are equal. Per gzip byte that's:

1 / BW + 31.484 = 0.6384 / BW + 0.6384 * 60.081

Solving for BW gives about 0.05258 bytes per nanosecond, which translates to 52,580,000 bytes per second, which is 52.58 MB per second.
So... if my math is right, xz is faster for download + decompression unless you have a 52MB/s uplink to crates.io. I kinda doubt anyone really has that kind of bandwidth to our bucket, so that should mean that on average the smaller download size of xz-compressed crates will more than make up for the longer decompression times.
Note that I didn't bother getting statistics about compression times. xz is massively slower than gzip, but I don't think it matters much for most users, as cargo publish is relatively rare.
How?
OK, so I haven't actually thought much about this. I was basically curious recently and just wanted to make sure that these numbers and statistics didn't disappear. For backwards compatibility we'd need to continue publishing gzip crates for quite a while (maybe forever). Our S3 bill can take the hit though, that's fine. This means that cargo publish would create both tarballs and upload them.
The index would grow another field to store the sha256 hash of the xz crate, and new publishes would fill that in.
Next, when downloading a crate, Cargo could start saying "I'm xz compatible", at which point crates.io would redirect to the xz URL (currently it redirects to a gzip URL) for any crate which has an xz-compressed tarball available (crates.io would grow a flag for this for all published versions).
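A hypothetical sketch of that selection logic: none of the names below are real crates.io code, and the URL suffixes are illustrative only; the idea is just that the index entry's optional xz sha256 doubles as the "xz tarball exists" flag:

```rust
// Hypothetical index entry: the sha256 of the xz tarball would only be
// present for versions published after xz support lands.
struct IndexEntry {
    name: String,
    version: String,
    xz_sha256: Option<String>,
}

// Pick the redirect target based on whether the client advertised xz
// support and whether an xz tarball exists for this version.
fn download_url(entry: &IndexEntry, client_supports_xz: bool) -> String {
    let base = format!(
        "https://crates.io/api/v1/crates/{}/{}/download",
        entry.name, entry.version
    );
    match (&entry.xz_sha256, client_supports_xz) {
        // Both sides support xz: serve the smaller tarball.
        (Some(_), true) => format!("{base}.xz"),
        // Old client or old crate: keep serving gzip forever.
        _ => format!("{base}.gz"),
    }
}

fn main() {
    let entry = IndexEntry {
        name: "example".to_string(),
        version: "1.0.0".to_string(),
        xz_sha256: Some("deadbeef".to_string()),
    };
    println!("{}", download_url(&entry, true));
    println!("{}", download_url(&entry, false));
}
```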
I... think that would cover our bases? Certainly a lot of work so I probably won't be able to get around to this any time soon, but wanted to put this out there in case others were interested :)