
Migrate to alternative compression for crates.io crates #2526

alexcrichton commented 8 years ago

Why?

tl;dr: unless you have a 52 MB/s connection to crates.io, xz is probably faster overall, but my math should be checked!

The statistics here are computed over the entirety of crates.io at this time, including all crates and all versions ever published. Currently we compress tarballs with flate2, which is backed by miniz, using the best compression setting (level 9). The "zlib" numbers here were generated by compiling flate2 against zlib instead of miniz, and the xz numbers were generated with the xz2 crate, also at compression level 9.
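For reference, a minimal sketch of how the per-crate sizes could be reproduced, assuming `flate2` and `xz2` as dependencies (this isn't the exact harness used for the numbers above):

```rust
use std::io::Write;

use flate2::write::GzEncoder;
use flate2::Compression;
use xz2::write::XzEncoder;

// Compress the same (already-tarred) crate contents at level 9 with both
// codecs and return the two compressed sizes for comparison.
fn compressed_sizes(tar_bytes: &[u8]) -> std::io::Result<(usize, usize)> {
    let mut gz = GzEncoder::new(Vec::new(), Compression::new(9));
    gz.write_all(tar_bytes)?;
    let gz_bytes = gz.finish()?;

    let mut xz = XzEncoder::new(Vec::new(), 9);
    xz.write_all(tar_bytes)?;
    let xz_bytes = xz.finish()?;

    Ok((gz_bytes.len(), xz_bytes.len()))
}
```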

First up, let's take a look at what we're just storing on S3. This is just the size of all published crates.

| stat  | size (bytes) | % smaller |
|-------|--------------|-----------|
| miniz | 3776697673   | 0.00      |
| zlib  | 3776147960   | 0.01      |
| xz    | 2411082764   | 36.16     |

Next, let's multiply each version's size by the number of times it's been downloaded. This is, in theory, the number of bytes that have been transferred out of S3.

| stat  | bytes transferred | % smaller |
|-------|-------------------|-----------|
| miniz | 3502228200434     | 0.00      |
| zlib  | 3501544526571     | 0.02      |
| xz    | 2373770137784     | 32.22     |

Next up is how long it took (in nanoseconds) in total to decompress all crates on my local computer.

| stat  | total time (ns) | ns per compressed byte | % slower |
|-------|-----------------|------------------------|----------|
| miniz | 118891793644    | 31.484                 | 0.00     |
| zlib  | 118952441288    | 31.501                 | 0.05     |
| xz    | 144860343353    | 60.081                 | 90.83    |

Ok, so the claims about xz are correct in that it's roughly 30-36% smaller than gzip, but the decompression time is much larger! Assuming these numbers hold on average, though, let's do some math to figure out how fast your bandwidth needs to be to break even.

First up, the total time to download and decompress a crate is:

time = bytes / BW + bytes * time_per_byte

where `bytes` is the compressed size of the crate, `BW` is download bandwidth in bytes per nanosecond, and `time_per_byte` is the decompression cost per compressed byte from the table above.

So if we assume that xz crates are on average 36.16% smaller and plug in the per-byte decompression timings measured above, the break-even point is where the time for the gzip crate (left side, size `bytes`) equals the time for the xz crate (right side, size `bytes * (1 - .3616)`):

bytes / BW + bytes * 31.484 = bytes * (1 - .3616) / BW + bytes * (1 - .3616) * 60.081
            1 / BW + 31.484 = (1 - .3616) / BW + (1 - .3616) * 60.081                
            1 + BW * 31.484 = (1 - .3616) + (1 - .3616) * 60.081 * BW                
            1 + BW * 31.484 = .6384 + 38.36 * BW                                     
                      .3616 = 6.876 * BW                                             
                    0.05258 = BW                                                     

Now that's 0.05258 bytes per nanosecond, which translates to 52,580,000 bytes per second, or about 52.58 MB per second.
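Redoing the same arithmetic in code, as a quick sanity check (this is just the calculation above, not part of any proposed change), lands at roughly 52.6 MB/s, i.e. the same answer up to rounding:

```rust
fn main() {
    // Measured above: xz tarballs are 36.16% smaller, and decompression costs
    // (per compressed byte) are 31.484 ns for miniz and 60.081 ns for xz.
    let xz_ratio = 1.0 - 0.3616;
    let gz_ns_per_byte = 31.484;
    let xz_ns_per_byte = 60.081;

    // gzip: b / bw + b * gz_ns_per_byte
    // xz:   b * r / bw + b * r * xz_ns_per_byte
    // Set the two equal, divide by b, and solve for bw (bytes per nanosecond).
    let bw = (1.0 - xz_ratio) / (xz_ratio * xz_ns_per_byte - gz_ns_per_byte);

    // bytes/ns * 1e9 = bytes/s; print as MB/s.
    println!("break-even bandwidth: {:.2} MB/s", bw * 1e9 / 1e6);
}
```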

So... if my math is right, xz is faster for download + decompression unless you can pull from crates.io at more than ~52 MB/s. I kinda doubt anyone really has that kind of bandwidth to our bucket, so on average the smaller download size of xz-compressed crates should more than make up for the longer decompression times.

Note that I didn't bother gathering statistics about compression times. xz compression is massively slower than gzip, but I don't think it matters much in practice as `cargo publish` is relatively rare.

How?

Ok, so I haven't actually thought much about this. I was basically curious recently and just wanted to make sure that these numbers and statistics didn't disappear. For backwards compatibility we need to continue to publish gzip crates for quite a while (maybe forever). Our S3 bill can take the hit though, that's fine. This means that `cargo publish` would create both tarballs and upload both.

The index would grow another field to store the sha256 hash of the xz crate, and new publishes would fill that in.
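As a sketch of what that could look like (the field name `cksum_xz` below is just a placeholder, not a decided name), each index entry would gain an optional second checksum alongside the existing one:

```rust
// Hypothetical shape of an index entry with a second checksum; the real
// field name and representation would be decided as part of the change.
struct IndexEntry {
    name: String,
    vers: String,
    cksum: String,            // sha256 of the gzip .crate tarball (as today)
    cksum_xz: Option<String>, // sha256 of the xz tarball; None for old publishes
    // ... the remaining existing index fields are elided here
}
```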

Next, when downloading a crate, Cargo could start saying "I'm xz compatible", at which point crates.io would redirect to the xz URL instead of the current gz URL for any crate that has an xz-compressed tarball available (crates.io would grow a per-version flag recording whether one exists).
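A rough sketch of the client side of that negotiation, assuming an `Accept`-style header (the exact mechanism is made up here; it could just as easily be a query parameter) and the `curl` crate Cargo already uses:

```rust
use curl::easy::{Easy, List};

// Hypothetical: advertise xz support when asking the registry for a crate,
// and let crates.io decide which tarball URL to redirect to.
fn download_crate(dl_url: &str) -> Result<Vec<u8>, curl::Error> {
    let mut handle = Easy::new();
    handle.url(dl_url)?;
    handle.follow_location(true)?; // follow the registry's redirect to S3

    // Made-up header meaning "I can handle xz"; old Cargos that don't send it
    // would keep getting redirected to the gzip tarball.
    let mut headers = List::new();
    headers.append("Accept: application/x-xz, application/gzip")?;
    handle.http_headers(headers)?;

    let mut body = Vec::new();
    {
        let mut transfer = handle.transfer();
        transfer.write_function(|data| {
            body.extend_from_slice(data);
            Ok(data.len())
        })?;
        transfer.perform()?;
    }
    Ok(body)
}
```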

I... think that would cover our bases? It's certainly a lot of work, so I probably won't be able to get around to this any time soon, but I wanted to put this out there in case others were interested :)