inikep / lzbench

lzbench is an in-memory benchmark of open-source LZ77/LZSS/LZMA compressors
A way to get summary potential compressibility information for an entire dataset #116

Open XXtreem11 opened 2 years ago

XXtreem11 commented 2 years ago

I would like to use lzbench to get a single summary compression figure for an entire dataset.

Use case: I have a directory with hundreds, thousands, or millions of files (a dataset) and would like to see which compression algorithm works best on that dataset as a whole. I don't care about the compressibility of individual files, only about the compressibility of the entire dataset.

Current issue: lzbench runs through every single file in the dataset and reports compressibility along with compression/decompression throughput for each one. At some point I may care about throughput, but for now I only care about the overall summary compressibility of the entire dataset.

The summary should be reported per algorithm.

Speed is also a factor, since the tool runs through every file individually. I'm willing to wait a while for results, but I would need some kind of progress indicator.
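To make the request concrete, here is a rough sketch of the behaviour described above: walk a directory tree, compress every file in memory, and keep only dataset-wide totals per algorithm, with a progress callback. It is not lzbench code. The codecs are stand-ins from the Python standard library (zlib, bz2, lzma), and the names `CODECS`, `iter_files`, `summarize`, and `format_summary` are made up for illustration.

```python
import bz2
import lzma
import os
import zlib

# Stand-in codecs from the Python standard library (assumption: these are
# only placeholders, not the codec set that lzbench benchmarks).
CODECS = {
    "zlib -6": lambda data: zlib.compress(data, 6),
    "bz2 -9": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def iter_files(root):
    """Yield the path of every regular, non-symlink file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                yield path


def summarize(root, codecs=CODECS, progress=None):
    """Return (original_bytes, {codec_name: compressed_bytes}) for the tree."""
    paths = list(iter_files(root))
    original = 0
    compressed = {name: 0 for name in codecs}
    for done, path in enumerate(paths, start=1):
        with open(path, "rb") as fh:
            data = fh.read()  # whole file in memory, one file at a time
        original += len(data)
        for name, compress in codecs.items():
            compressed[name] += len(compress(data))
        if progress is not None:
            progress(done, len(paths))  # hypothetical hook for a progress bar
    return original, compressed


def format_summary(original, compressed):
    """Render one line per codec: name, total compressed size, ratio in %."""
    lines = []
    for name, size in compressed.items():
        ratio = 100.0 * size / original if original else 0.0
        lines.append(f"{name:<10} {size:>14} {ratio:7.2f}")
    return "\n".join(lines)
```

Calling `original, compressed = summarize(".")` and then `print(format_summary(original, compressed))` would print one summary row per codec for the current directory. Each file is still compressed independently, so the totals reflect per-file compression rather than compressing the dataset as one stream.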

Example of potential output: the current directory contains 1000 files plus a few subdirectories with more files under them.

lzbench -ezstd -r .

Compressor name     Compress.  Decompress.  Compr. size   Ratio  Filename
memcpy              1348 MB/s    2687 MB/s   1698448384  100.00  /dir/data/set/is/in/
zstd 1.5.0 -1        177 MB/s    1000 MB/s   1094580176   64.45  /dir/data/set/is/in/
zstd 1.5.0 -2         61 MB/s     658 MB/s   1065403069   62.73  /dir/data/set/is/in/
zstd 1.5.0 -3        175 MB/s    1063 MB/s   1085968586   63.94  /dir/data/set/is/in/
zstd 1.5.0 -4         58 MB/s     656 MB/s   1057966516   62.29  /dir/data/set/is/in/
zstd 1.5.0 -5        208 MB/s    1208 MB/s   1085740326   63.93  /dir/data/set/is/in/
zstd 1.5.0 -6        210 MB/s    1199 MB/s   1083948608   63.82  /dir/data/set/is/in/
zstd 1.5.0 -7        197 MB/s     661 MB/s   1082068109   63.71  /dir/data/set/is/in/
zstd 1.5.0 -8        151 MB/s    1063 MB/s   1078084969   63.47  /dir/data/set/is/in/
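As a possible stopgap, summary rows like the ones above could be derived by post-processing per-file results. The sketch below assumes those results are available as CSV text with one row per file and compressor. The column names `Compressor name`, `Original size`, and `Compressed size` are assumptions rather than a confirmed lzbench format, so they are parameters, and `summarize_rows` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def summarize_rows(csv_text,
                   name_col="Compressor name",   # assumed column name
                   orig_col="Original size",     # assumed column name
                   comp_col="Compressed size"):  # assumed column name
    """Collapse per-file rows into one (original, compressed, ratio %) per compressor."""
    totals = defaultdict(lambda: [0, 0])
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = totals[row[name_col]]
        entry[0] += int(row[orig_col])
        entry[1] += int(row[comp_col])
    # The ratio is computed from the summed sizes, so large files weigh more
    # than small ones; this is not the mean of the per-file ratios.
    return {
        name: (orig, comp, 100.0 * comp / orig if orig else 0.0)
        for name, (orig, comp) in totals.items()
    }
```

Computing the ratio from the summed byte counts, rather than averaging per-file ratios, matches the "entire dataset" number requested here, because a dataset dominated by a few large files should be dominated by how well those files compress.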