keenerd / gz-sort

A utility for sorting really big files. http://kmkeen.com/gz-sort/
GNU General Public License v3.0
91 stars 5 forks source link

More details at http://kmkeen.com/gz-sort/

perform a merge sort over a multi-GB gz compressed file

Quickstart

git clone https://github.com/keenerd/gz-sort; cd gz-sort; make; ./gz-sort -h

Needs the zlib headers and probably only builds on GNU/Linux.

use: gz-sort [-u] [-S n] [-P n] source.gz dest.gz

options:
   -h: help
   -u: unique
   -S n: size of presort, supports k/M/G suffix
         a traditional in-memory sort (default n=1M)
   -P n: use multiple threads (experimental, default disabled)
   -T: pass through (debugging/benchmarks)

estimating run time, crudely:
    time gzip -dc data.gz | gzip > /dev/null
    unthreaded: seconds * entropy * (log2(uncompressed_size/S)+2)
    (where 'entropy' is a fudge-factor between 1.5 for an
    already sorted file and 3 for a shuffled file)
    S and P are the corresponding settings
    multithreaded: maybe unthreaded/sqrt(P) ?

estimated disk use:
    2x source.gz

Minimum requirements to sort a terabyte:

Known bugs to fix

Email me if you are using gz-sort and any of these omissions are causing you trouble. For that matter, email me if you find something not on this list too.

Performance tweaks to try