More details at http://kmkeen.com/gz-sort/
Performs a merge sort over a multi-GB gzip-compressed file.
git clone https://github.com/keenerd/gz-sort; cd gz-sort; make; ./gz-sort -h
Needs the zlib headers and probably only builds on GNU/Linux.
use: gz-sort [-u] [-S n] [-P n] source.gz dest.gz
options:
-h: help
-u: unique
-S n: size of the presort (a traditional in-memory sort),
supports k/M/G suffix (default n=1M)
-P n: use multiple threads (experimental, default disabled)
-T: pass through (debugging/benchmarks)
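To illustrate how the -S presort and the merge fit together, here is a minimal Python sketch. It is not the gz-sort implementation (which is written in C); the function name gz_sort_sketch and its parameters are made up for this example. It also swaps in a single k-way merge via heapq.merge rather than repeated merge passes, so it keeps one open file per presorted run and is only suitable for small demonstrations.

```python
import gzip
import heapq
import os
import tempfile


def gz_sort_sketch(source, dest, presort_bytes=1 << 20, unique=False):
    """Sort the lines of a gzip file into another gzip file.

    Hypothetical illustration only: presort_bytes plays the role of -S
    and unique the role of -u.
    """
    tmpdir = tempfile.mkdtemp()
    run_paths = []

    def flush(lines):
        # Presort: a plain in-memory sort of one chunk, written as a gzip run.
        lines.sort()
        path = os.path.join(tmpdir, "run%d.gz" % len(run_paths))
        with gzip.open(path, "wb") as out:
            out.writelines(lines)
        run_paths.append(path)

    # Phase 1: cut the input into sorted runs of roughly presort_bytes each.
    with gzip.open(source, "rb") as src:
        chunk, size = [], 0
        for line in src:
            if not line.endswith(b"\n"):
                line += b"\n"  # normalize a missing final newline
            chunk.append(line)
            size += len(line)
            if size >= presort_bytes:
                flush(chunk)
                chunk, size = [], 0
        if chunk:
            flush(chunk)

    # Phase 2: merge all sorted runs into the destination.
    runs = [gzip.open(path, "rb") for path in run_paths]
    try:
        with gzip.open(dest, "wb") as out:
            previous = None
            for line in heapq.merge(*runs):
                if unique and line == previous:
                    continue  # duplicates are adjacent after merging
                out.write(line)
                previous = line
    finally:
        for run in runs:
            run.close()
        for path in run_paths:
            os.remove(path)
        os.rmdir(tmpdir)
```

Calling gz_sort_sketch("source.gz", "dest.gz", unique=True) would roughly correspond to gz-sort -u source.gz dest.gz.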
estimating run time, crudely:
time gzip -dc data.gz | gzip > /dev/null
unthreaded: seconds * entropy * (log2(uncompressed_size/S)+2)
(where 'seconds' is the wall time of the command above,
'entropy' is a fudge-factor between 1.5 for an
already sorted file and 3 for a shuffled file,
and S and P are the corresponding settings)
multithreaded: maybe unthreaded/sqrt(P) ?
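The rule of thumb above can be written out as a small Python sketch. The function names are made up for this example, and treating the default S of 1M as 2**20 bytes is an assumption; the multithreaded figure is as speculative as the formula it comes from.

```python
import math


def estimate_unthreaded_seconds(seconds, uncompressed_size,
                                presort_size=2**20, entropy=3.0):
    """Crude run-time estimate from the formula above.

    seconds: wall time of `gzip -dc data.gz | gzip > /dev/null`
    uncompressed_size, presort_size: bytes (presort_size is the -S setting;
        the 2**20 default is an assumption about what "1M" means)
    entropy: fudge factor, 1.5 (already sorted) to 3 (shuffled)
    """
    passes = math.log2(uncompressed_size / presort_size) + 2
    return seconds * entropy * passes


def estimate_threaded_seconds(unthreaded_seconds, threads):
    """Speculative multithreaded estimate: unthreaded / sqrt(P)."""
    return unthreaded_seconds / math.sqrt(threads)


# Example: 60 s baseline, 16 GiB uncompressed, default presort, shuffled input.
unthreaded = estimate_unthreaded_seconds(60, 2**34)
print(unthreaded)                                # 2880.0 seconds (48 minutes)
print(estimate_threaded_seconds(unthreaded, 4))  # 1440.0 seconds
```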
estimated disk use:
about 2x the size of source.gz
Email me if you are using gz-sort and any of these omissions are causing you trouble. For that matter, email me if you find something not on this list too.