Stream the parallel xz/gz tarball generation

This melds the serial-Tee and parallel-batched approaches used before and after commit adea17e, respectively. We now get the same multithreaded speedup without first having to build the entire uncompressed tarball in memory.

The new impl Write for RayonTee uses rayon::join to split the compression work for each buffer across separate threads. Because the join is scoped, it can be fully zero-copy: both compressors borrow the input buffer directly. This is all wrapped in a 1 MiB BufWriter so that the cost of thread wake-ups and synchronization is spread over larger writes.
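
As a rough illustration of the scoped, zero-copy tee described above, here is a minimal sketch. It is not the code from this PR: std::thread::scope stands in for rayon::join so that the example needs no external crates, and ScopedTee and demo are hypothetical names. Plain Vec<u8> sinks stand in for the xz and gz encoders.

```rust
use std::io::{self, BufWriter, Write};
use std::thread;

/// Hypothetical stand-in for RayonTee: forwards every buffer to two writers.
struct ScopedTee<A, B>(A, B);

impl<A: Write + Send, B: Write + Send> Write for ScopedTee<A, B> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Both sinks always consume the whole buffer, so report it fully written.
        self.write_all(buf)?;
        Ok(buf.len())
    }

    fn write_all(&mut self, buf: &[u8]) -> io::Result<()> {
        let ScopedTee(a, b) = self;
        // The scope guarantees the spawned thread finishes before `buf` goes
        // out of scope, so both writers can borrow the same bytes (no copy).
        let (ra, rb) = thread::scope(|s| {
            let handle = s.spawn(|| a.write_all(buf));
            let rb = b.write_all(buf);
            (handle.join().expect("writer thread panicked"), rb)
        });
        ra.and(rb)
    }

    fn flush(&mut self) -> io::Result<()> {
        // Flushed serially here for brevity.
        self.0.flush()?;
        self.1.flush()
    }
}

/// Hypothetical usage: a 1 MiB BufWriter batches small writes so the
/// per-buffer threading cost is paid once per large chunk.
fn demo() -> io::Result<(Vec<u8>, Vec<u8>)> {
    let mut out = BufWriter::with_capacity(1 << 20, ScopedTee(Vec::new(), Vec::new()));
    out.write_all(b"tarball ")?;
    out.write_all(b"bytes")?;
    // into_inner flushes the buffered data through the tee.
    let ScopedTee(a, b) = out.into_inner()?;
    Ok((a, b))
}
```

Unlike rayon::join, which reuses a worker pool, this sketch spawns a fresh thread for each buffer, which makes batching writes through a large BufWriter even more important.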

Net performance is unchanged at around 125% CPU, with time split approximately 4:1 between xz and gz. Overall memory use is much reduced and is now independent of the tarball size: just a few MiB on top of the fixed 674 MiB of compressor memory that xz -9 requires.

Fixes #75.

rust-lang/rust-installer#76