njsmith / posy


PyBi: consider tar instead of zip #1

Open lordmauve opened 1 year ago

lordmauve commented 1 year ago

PyBi artifacts are specified to be zip files.

One problem with zip files is that the TOC is at the end of the file, which means that they do not support streaming decompression/extraction. This will limit how fast they can be installed even on fast networks.

Tar files are designed for streaming extraction and don't have this problem. They also support symlinks natively instead of needing the workaround in the spec. And they use stream-level compression which means they support arbitrary compression schemes.

(Full disclosure: I just implemented something very similar to PyBi internally at my firm, and I get sub-second installs with something like curl https://.../ | zstd -d | tar x. Zstd is right for us with on-prem caches, but over the Internet small sizes are preferable.)

njsmith commented 1 year ago

Hey, how's it going :-)

Zip files do have a local header in front of each file that lets you process them in a single pass: https://docs.rs/zip/latest/zip/read/fn.read_zipfile_from_stream.html I've never tried it, but it sounds like the only thing you're missing is info on symlinks and executable bits, and those ought to be easy to fix up at the end when you see the "central directory". And I think pipelining the download + extraction only gives you a 2x speedup at best, since the most it can do is fully overlap the two stages? Certainly nice to have, but when we're talking about sub-second times then the absolute speedup is pretty small.
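To make the single-pass idea concrete, here is a rough Python sketch that walks the local file headers of a zip stream front to back. It is an illustration, not what the linked Rust function does internally: `iter_zip_stream` is a hypothetical name, and the sketch assumes stored or deflated entries whose sizes are written in the local header (no trailing data descriptor, no ZIP64).

```python
import struct
import zlib

LOCAL_SIGNATURE = b"PK\x03\x04"
# Fields after the 4-byte signature: version, flags, method, time, date,
# crc32, compressed size, uncompressed size, name length, extra length.
LOCAL_FIELDS = struct.Struct("<HHHHHIIIHH")


def _read_exact(stream, count):
    data = stream.read(count)
    if len(data) != count:
        raise EOFError("zip stream ended mid-entry")
    return data


def iter_zip_stream(stream):
    """Yield (name, content) pairs from a forward-only zip byte stream."""
    while True:
        if stream.read(4) != LOCAL_SIGNATURE:
            return  # reached the central directory or the end of the data
        (_version, flags, method, _time, _date, crc, packed_size,
         _size, name_len, extra_len) = LOCAL_FIELDS.unpack(
            _read_exact(stream, LOCAL_FIELDS.size))
        if flags & 0x08:
            raise NotImplementedError("sizes are in a trailing data descriptor")
        encoding = "utf-8" if flags & 0x800 else "cp437"
        name = _read_exact(stream, name_len).decode(encoding)
        _read_exact(stream, extra_len)  # skip the extra field
        packed = _read_exact(stream, packed_size)
        if method == 0:        # stored
            content = packed
        elif method == 8:      # raw deflate, hence the negative wbits
            content = zlib.decompress(packed, -15)
        else:
            raise NotImplementedError(f"compression method {method}")
        if zlib.crc32(content) != crc:
            raise ValueError(f"CRC mismatch for {name}")
        yield name, content
```

Unix mode bits and symlink flags live only in the central directory, so a real installer would still need the fix-up pass at the end that the comment mentions.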

Also, I think symlinks are equally native in both zips and tars, i.e., they both have a standard way to represent them that the normal command-line tools already support. The draft spec goes into detail because historically wheel tools like pip haven't bothered implementing this and it's slightly annoying to go look up the details, but it's not a big deal.

It's also very handy that pybis support random access, e.g. so you can extract the METADATA file before installing. Also, this lets you do some cute tricks to fetch METADATA without downloading the whole file – which will hopefully stop being useful once https://peps.python.org/pep-0658/ is deployed, but for now it's kind of important.
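As a small illustration of the random-access point, Python's standard `zipfile` module can pull one member out of an archive without decompressing the rest. The helper name `read_member` is hypothetical, and the member path used in the usage line is an assumption about the archive layout rather than something stated in this thread.

```python
import zipfile


def read_member(archive_file, member_name):
    """Return the bytes of one member; other members are never decompressed."""
    # ZipFile reads the central directory at the end of the file, then
    # seeks straight to the requested member's data.
    with zipfile.ZipFile(archive_file) as archive:
        return archive.read(member_name)
```

A call might look like `read_member("some.pybi", "pybi-info/METADATA")`, with both names being placeholders.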

The worst part about zip files is the poor compression ratios. The spec allows fancier algorithms like zstd or lzma, but in practice most tools don't support this – and even if they did, the compression ratio would still be poor compared to tarballs, because each file is compressed separately and so can't share redundancy with its neighbours. The best of both worlds would be if the zip spec had a way to include a zstd dictionary to use when decompressing, so you can still do random access but with whole-archive-like compression ratios... but alas, this isn't standardized at all.
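The per-file versus whole-archive difference is easy to measure. The sketch below packs the same files once as a deflate zip and once as a gzip-compressed tarball and returns both sizes; `archive_sizes` is a hypothetical helper, and the exact numbers depend entirely on the input files.

```python
import io
import tarfile
import zipfile


def archive_sizes(files):
    """Return (zip_bytes, tar_gz_bytes) for a {name: content} mapping."""
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            # Each member gets its own independent deflate stream.
            archive.writestr(name, content)

    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w:gz") as archive:
        for name, content in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(content)
            # All members share one gzip stream, so repeated text
            # across files is compressed away.
            archive.addfile(info, io.BytesIO(content))

    return len(zip_buf.getvalue()), len(tar_buf.getvalue())
```

With many small, similar files (typical of a Python standard library tree) the tarball tends to come out much smaller; with a few large, unrelated files the gap narrows.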

Anyway... overall I feel like both options have some advantages, but none of them are overwhelming; both ways can work. And given that wheels are already committed to the zip format, and we have lots of existing tooling around that, I think it's best to keep things consistent. And if we want to come up with a better format for both wheels and pybis, then that might be a great idea, but probably better to factor it off into its own project instead of trying to do everything at once :-)

rbtcollins commented 1 year ago

I strongly suspect any bottlenecks on installation are going to be something other than the lack of streaming. For instance, we had massive performance challenges with Rustup on Windows until we both moved all syscalls to a threadpool and also got rustup whitelisted by Microsoft Defender to avoid thrashing the CPU during doc extraction. Rustup installs from tar files, FWIW, and our current performance challenge is tar's requirement for serial processing: our packages are size-optimised.

So I'd suggest looking at reading the directory then parallel unpacking all the files from the archive, and looking closely at IO effects and the like.
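A minimal Python sketch of that suggestion follows: read the central directory once, then hand batches of members to a thread pool. This is not how Rustup or posy does it. The names `parallel_extract` and `_extract_batch` are hypothetical, each worker opens its own `ZipFile` handle so no file position is shared, and the sketch does not restore symlinks or executable bits.

```python
import os
import zipfile
from concurrent.futures import ThreadPoolExecutor


def _extract_batch(archive_path, names, dest):
    # A private handle per worker avoids contention on one file object.
    with zipfile.ZipFile(archive_path) as archive:
        for name in names:
            archive.extract(name, path=dest)
    return len(names)


def parallel_extract(archive_path, dest, workers=4):
    """Extract all files using `workers` threads; return the file count."""
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()  # comes from the central directory

    files = []
    for name in names:
        if name.startswith("/") or ".." in name.split("/"):
            raise ValueError(f"unsafe member name: {name}")
        # Create directories up front, serially, so workers never race
        # to create the same parent directory.
        is_dir = name.endswith("/")
        directory = name if is_dir else os.path.dirname(name)
        os.makedirs(os.path.join(dest, directory), exist_ok=True)
        if not is_dir:
            files.append(name)

    batches = [files[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(
            lambda batch: _extract_batch(archive_path, batch, dest), batches))
```

Whether threads help depends on where the time goes; per the comment above, filesystem and antivirus overhead can dominate, so this is worth measuring rather than assuming.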