KirillKryukov opened this issue 4 years ago
Firstly, thank you for such a well written and detailed report!
I am afraid this behavior of the --optimal level is to be expected. In the current version it uses a binary tree match finder, which works well on many types of data, but has worst-case quadratic performance on certain types of data. The homo sapiens set in question contains parts that exhibit this.
The usual way to avoid such behavior is to limit the match search in some way (for instance stop when a certain number of potential matches have been checked, or when a match of acceptable length has been found -- this is what level 9 does). But in order to find the bit-optimal compressed representation for the BriefLZ format, it has to consider all possible matches.
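To make the trade-off concrete, here is a deliberately naive sketch in C (my illustration only -- a plain linear scan, not the binary tree match finder BriefLZ actually uses): capping the number of candidates checked bounds the work per position, while an uncapped search on highly repetitive data can degrade to quadratic time overall.

/* Illustration only: a brute-force match search, not BriefLZ code.
 * With max_depth capped (as the greedy levels do) the cost per position
 * is bounded; with max_depth effectively unlimited (as an optimal parse
 * requires) highly repetitive input can make the total work quadratic. */
#include <stddef.h>
#include <stdio.h>

static size_t match_len(const unsigned char *a, const unsigned char *b, size_t max)
{
    size_t n = 0;
    while (n < max && a[n] == b[n]) {
        ++n;
    }
    return n;
}

static size_t find_match(const unsigned char *data, size_t pos, size_t size,
                         size_t max_depth, size_t *best_pos)
{
    size_t best_len = 0;
    size_t depth = 0;

    /* Walk backwards over earlier positions, stopping after max_depth tries. */
    for (size_t i = pos; i-- > 0 && depth < max_depth; ++depth) {
        size_t len = match_len(&data[i], &data[pos], size - pos);
        if (len > best_len) {
            best_len = len;
            *best_pos = i;
        }
    }
    return best_len;
}

int main(void)
{
    const unsigned char data[] = "abcabcabcabc";
    size_t best_pos = 0;
    size_t len = find_match(data, 6, sizeof(data) - 1, 16, &best_pos);

    printf("match of %zu bytes at %zu\n", len, best_pos);
    return 0;
}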
I would suggest sticking to levels 1-9 with the current version of BriefLZ if your data may trigger this.
The function description of blz_pack_level() in the include file, as well as the command-line help of the blzpack utility, warn that the optimal level may be very slow. Would it help if I add that it may be very slow depending on the type of data?
Very glad to see BriefLZ performing in Kirill's SCB; I would just like to see its effectiveness in action, so Joergen, please share the options you want to see used. For instance, the default block size (by the way, what is it?) is obviously far from what BriefLZ can deliver, and Kirill's machine handles memory hogs like Nakamichi just fine (~45 GB for a 2 GB block in BriefLZ's case):
timer64 BriefLZ_130_Intel_v19_64bit.exe --optimal -b2g "%1" "%1.blz"
And one quick console dump:
E:\TEXTUAL_MADNESS_bare-minimum_2020_Apr_28>timer64 bcrush_ICL150_64bit.exe --optimal ""Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar"" ""Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar".optimal.bcrush"
Kernel Time = 0.250 = 1%
User Time = 23.375 = 97%
Process Time = 23.625 = 98% Virtual Memory = 1420 MB
Global Time = 23.911 = 100% Physical Memory = 844 MB
E:\TEXTUAL_MADNESS_bare-minimum_2020_Apr_28>timer64 crush_xezz.exe -9 ""Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar"" ""Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar".9.crush_xezz"
Compressing Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar:
40303104 -> 9163198 in 18.43s
Kernel Time = 0.234 = 1%
User Time = 18.250 = 97%
Process Time = 18.484 = 98% Virtual Memory = 852 MB
Global Time = 18.728 = 100% Physical Memory = 545 MB
E:\TEXTUAL_MADNESS_bare-minimum_2020_Apr_28>timer64 BriefLZ_130_Intel_v19_64bit.exe --optimal -b2g ""Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar"" ""Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar".blz"
Kernel Time = 1.078 = 3%
User Time = 30.312 = 95%
Process Time = 31.390 = 99% Virtual Memory = 45402 MB
Global Time = 31.650 = 100% Physical Memory = 2829 MB
40,303,104 Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar
9,163,198 Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar.9.crush_xezz
8,851,573 Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar.blz
7,021,419 Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar.bro
7,038,202 Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar.L22W2GB.zst
8,619,461 Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar.method211.zpaq
5,726,983 Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar.method511.zpaq
7,123,784 Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar.MX9.bzip2
10,079,339 Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar.MX9.gzip
6,943,680 Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar.MX9Dict1024.7z
9,155,961 Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar.Nakamichi
8,502,992 Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar.optimal.bcrush
5,937,587 Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar.rz
5,315,954 Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar.ST0Block1024.bsc
6,849,828 Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar.ST6Block1024.bsc
5,446,644 Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar.zv
As for the 'SILVA' ~1GB DNA:
Compressor Size(B) Ratio(times) C-Time(s) D-Time(s) C-Mem(MB) TD-Time(s)
copy-cat 1,108,994,702 1.00 0.666 0.234 0.690 88.95
lz4-9 233,573,222 4.75 155.0 0.592 5.290 19.28
nakamichi 137,427,694 8.07 9,751 1.105 88,408 12.10
brieflz-opt 168,241,177 6.59 428.6 2.741 21.97 16.20
On an i5-7200U @ 2.7 GHz, my old test (from about a year ago) shows:
D:\TEXTORAMIC_benchmarking_2019-Aug-06>lzbench173 -c4 -i1,15 -o3 -etornado,16/blosclz,9/brieflz/crush,2/csc,5/density,3/fastlz,2/gipfeli/lzo1b,999/libdeflate,1,12/lz4hc,1,12/lizard,19,29,39,49/lzf,1/lzfse/lzg,9/lzjb/lzlib,9/lzma,9/lzrw,5/lzsse2,17/lzsse4,17/lzsse8,17/lzvn/pithy,9/quicklz,3/snappy/slz_zlib,3/ucl_nrv2b,9/ucl_nrv2d,9/ucl_nrv2e,9/xpack,1,9/xz,9/yalz77,12/yappy,99/zlib,1,5,9/zling,4/shrinker/wflz/lzmat SILVA_132_SSURef_Nr99_tax_silva.fasta
lzbench 1.7.3 (64-bit Windows) Assembled by P.Skibinski
Compressor name Compress. Decompress. Orig. size Compr. size Ratio Filename
lzlib 1.8 -9 0.93 MB/s 140 MB/s 1108994702 58323686 5.26 SILVA_132_SSURef_Nr99_tax_silva.fasta
tornado 0.6a -16 1.28 MB/s 419 MB/s 1108994702 60275681 5.44 SILVA_132_SSURef_Nr99_tax_silva.fasta
lzma 16.04 -9 1.38 MB/s 248 MB/s 1108994702 62229404 5.61 SILVA_132_SSURef_Nr99_tax_silva.fasta
xz 5.2.3 -9 1.57 MB/s 236 MB/s 1108994702 62230769 5.61 SILVA_132_SSURef_Nr99_tax_silva.fasta
csc 2016-10-13 -5 2.53 MB/s 184 MB/s 1108994702 73401349 6.62 SILVA_132_SSURef_Nr99_tax_silva.fasta
Nakamichi 'Ryuugan-ditto-1TB' 1725 MB/s 105649099 ! outside lzbench !
lizard 1.0 -29 1.00 MB/s 2063 MB/s 1108994702 105745551 9.54 SILVA_132_SSURef_Nr99_tax_silva.fasta
lizard 1.0 -49 0.92 MB/s 2047 MB/s 1108994702 110132321 9.93 SILVA_132_SSURef_Nr99_tax_silva.fasta
crush 1.0 -2 0.18 MB/s 570 MB/s 1108994702 143260107 12.92 SILVA_132_SSURef_Nr99_tax_silva.fasta
xpack 2016-06-02 -9 7.17 MB/s 979 MB/s 1108994702 168644300 15.21 SILVA_132_SSURef_Nr99_tax_silva.fasta
zling 2016-01-10 -4 46 MB/s 291 MB/s 1108994702 190047636 17.14 SILVA_132_SSURef_Nr99_tax_silva.fasta
libdeflate 0.7 -12 3.26 MB/s 944 MB/s 1108994702 196295435 17.70 SILVA_132_SSURef_Nr99_tax_silva.fasta
ucl_nrv2e 1.03 -9 0.37 MB/s 416 MB/s 1108994702 205317109 18.51 SILVA_132_SSURef_Nr99_tax_silva.fasta
zlib 1.2.11 -9 2.16 MB/s 385 MB/s 1108994702 206566505 18.63 SILVA_132_SSURef_Nr99_tax_silva.fasta
lzsse2 2016-05-14 -17 2.00 MB/s 4319 MB/s 1108994702 206623630 18.63 SILVA_132_SSURef_Nr99_tax_silva.fasta
...
D:\TEXTORAMIC_benchmarking_2019-Aug-06>zstd-v1.4.2-win64.exe -b1e22 -i9 --priority=rt "SILVA_132_SSURef_Nr99_tax_silva.fasta"
Note : switching to real-time priority .fasta...
1#9_tax_silva.fasta :1108994702 -> 210161592 (5.277), 299.5 MB/s , 743.8 MB/s
2#9_tax_silva.fasta :1108994702 -> 233260454 (4.754), 273.4 MB/s , 645.7 MB/s
3#9_tax_silva.fasta :1108994702 -> 177601790 (6.244), 208.5 MB/s , 692.4 MB/s
4#9_tax_silva.fasta :1108994702 -> 175148431 (6.332), 174.1 MB/s , 703.7 MB/s
5#9_tax_silva.fasta :1108994702 -> 222428075 (4.986), 113.7 MB/s , 643.0 MB/s
6#9_tax_silva.fasta :1108994702 -> 201332037 (5.508), 78.1 MB/s , 694.3 MB/s
7#9_tax_silva.fasta :1108994702 -> 186684821 (5.940), 51.5 MB/s , 774.1 MB/s
8#9_tax_silva.fasta :1108994702 -> 179492127 (6.179), 43.7 MB/s , 807.5 MB/s
9#9_tax_silva.fasta :1108994702 -> 165048155 (6.719), 23.8 MB/s , 861.9 MB/s
10#9_tax_silva.fasta :1108994702 -> 161026970 (6.887), 25.1 MB/s , 881.0 MB/s
11#9_tax_silva.fasta :1108994702 -> 159628337 (6.947), 21.4 MB/s , 880.2 MB/s
12#9_tax_silva.fasta :1108994702 -> 146761536 (7.556), 11.3 MB/s , 931.6 MB/s
13#9_tax_silva.fasta :1108994702 -> 121097075 (9.158), 3.95 MB/s ,1025.9 MB/s
14#9_tax_silva.fasta :1108994702 -> 116047146 (9.556), 4.36 MB/s ,1057.0 MB/s
Nakamichi 'Ryuugan-ditto-1TB' 105649099 1725 MB/s ! outside zstdbench !
15#9_tax_silva.fasta :1108994702 -> 102816474 (10.79), 2.05 MB/s ,1050.9 MB/s
16#9_tax_silva.fasta :1108994702 -> 88817385 (12.49), 2.35 MB/s ,1001.7 MB/s
17#9_tax_silva.fasta :1108994702 -> 80163392 (13.83), 2.03 MB/s ,1014.6 MB/s
18#9_tax_silva.fasta :1108994702 -> 78822655 (14.07), 1.87 MB/s , 978.7 MB/s
19#9_tax_silva.fasta :1108994702 -> 73759356 (15.04), 1.36 MB/s ,1010.7 MB/s
20#9_tax_silva.fasta :1108994702 -> 64128802 (17.29), 1.26 MB/s ,1050.8 MB/s
21#9_tax_silva.fasta :1108994702 -> 60540485 (18.32), 1.13 MB/s ,1093.3 MB/s
22#9_tax_silva.fasta :1108994702 -> 58543278 (18.94), 0.99 MB/s ,1118.4 MB/s
So, it would be nice to see the best of BriefLZ juxtaposed with the other performers, especially with my toy.
Firstly, thank you for such a well written and detailed report!
It's my pleasure, and thank you for the reply!
I am afraid this behavior of the --optimal level is to be expected. In the current version it uses a binary tree match finder, which works well on many types of data, but has worst-case quadratic performance on certain types of data. The homo sapiens set in question contains parts that exhibit this. The usual way to avoid such behavior is to limit the match search in some way (for instance stop when a certain number of potential matches have been checked, or when a match of acceptable length has been found -- this is what level 9 does). But in order to find the bit-optimal compressed representation for the BriefLZ format, it has to consider all possible matches.
Thank you for the explanation. (Maybe this would be worth mentioning in the readme?)
I would suggest sticking to levels 1-9 with the current version of BriefLZ if your data may trigger this.
The problem is that I did not know that my data would trigger it. Testing with smaller data did not reveal it until I tried a 3 GB dataset. I'm afraid someone else might run into the same situation and not have the patience (or time budget) to wait for compression to finish.
The function description of blz_pack_level() in the include file, as well as the command-line help of the blzpack utility, warn that the optimal level may be very slow. Would it help if I add that it may be very slow depending on the type of data?
I'm afraid that would not be enough. Strictly speaking, it is correct to say "very slow" or "very slow depending on the type of data", but it still would not let the user build an accurate model of what to expect.
The slowdown is ~100 times compared to its normal performance on other data. From the user's viewpoint, it is indistinguishable from blzpack being frozen.
If I hit this slowdown while using blzpack for actual data compression (rather than benchmarking), I would certainly kill the process (rather than wait 27 hours) and conclude that it is broken. (And reconsider whether I can rely on this broken compressor at all, even at other levels.)
I would recommend clearly explaining this possible slowdown in a way that's hard to miss. E.g. something like:
"Important note about --optimal mode: Please use it only for testing, and never in production. It's compression speed is unpredictable and can get 100 times slower depending on data. It should be only used with unlimited time budget."
I fully understand being curious about the ultimate limit of the compression ratio, thus in the --optimal mode. I just think the user normally will not expect the slowdown, and it will help to add warning about the slowdown.
In addition, here is a chart comparing the compression speed of several compressors:
I included only general-purpose compressors, and all datasets. Only the strongest (single-thread) level of each compressor is shown.
You can highlight each compressor by clicking on its marker in the legend.
This may help compare the variation in BriefLZ's compression speed with that of other popular compressors.
Your SCB is the best in many regards; it serves quite well for estimating effectiveness in real scenarios.
These old Nakamichi results are bettered by the 2020-May-09 revision. Personally, I would love to see a showdown of Lizard 49 vs Nakamichi vs BriefLZ (Joergen's favorite options) vs pigz-11-4t, plotted with X axis: test data size (MB) and Y axis: transfer + decompression speed (MB/s), at a link speed of 100 Mbit/s (for estimating transfer time):
Also, when decompressing locally rather than over the network (an SSD on SATA 3 at 6 Gbit/s), i.e. a link speed of 6000 Mbit/s (for estimating transfer time):
Thanks, Georgi.
A quick note about the BriefLZ decompression speed results in these charts: currently it is handicapped by not supporting streaming output, i.e., blzpack always outputs to a file (as far as I understand). However, in my benchmark I always measure streaming decompression speed. Therefore, for compressors without such a streaming mode (such as blzpack), I let them decompress to a file, then stream that file, and measure the total time.
If blzpack gets support for streaming mode, its decompression speed results will improve. (This happened with Nakamichi after Georgi added such a streaming mode recently.)
Streaming mode is important for many practical applications where you don't want to store decompressed data in a file, but instead process it on the fly with another tool as part of a data analysis pipeline.
(I also use only streaming mode during compression, but compression is much slower, so the impact is comparatively smaller for blzpack. However, its compression speed results should also improve with the addition of a streaming mode.)
(Note that only uncompressed data is streamed in my benchmark. For compressed data it is fine to work with a file directly.)
Thank you both for the comments.
What is the default block size?
The example program blzpack uses a default block size of 1 MiB.
What is the maximum block size, and roughly how much memory does it use, like in 22N?
The example program is limited to a block size of roughly 3.6 GiB (-b3600m) -- this is due to storing the compressed size in a 32-bit value in the block header, and the maximum expansion of incompressible data.
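For a rough sense of where that limit comes from, here is a back-of-the-envelope check (my own arithmetic; I am assuming the worst case for incompressible data is roughly one tag bit per literal, i.e. about N + N/8 plus a small constant):

/* Back-of-the-envelope check of the block size cap (my assumption: packed
 * data can expand to roughly N + N/8 plus a small constant, and the packed
 * size must fit in the 32-bit field of the block header). */
#include <stdio.h>

int main(void)
{
    unsigned long long limit = 0xFFFFFFFFULL; /* largest 32-bit packed size */
    unsigned long long n = limit * 8 / 9;     /* solve N + N/8 <= limit     */

    /* Prints roughly 3,817,748,706 bytes, i.e. ~3640 MiB, hence "~3.6 GiB". */
    printf("max block ~ %llu bytes (~%llu MiB)\n", n, n >> 20);
    return 0;
}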
The compression library itself is limited by the size of the type used to store offsets and lengths (blz_word), which is currently a 32-bit value to reduce memory usage.
The optimal level uses 5 * block size words of work memory, which for 32-bit values is 20N bytes. Add to that the current input and output blocks which are also held in memory and you get 22N.
For reference (in 1.3.0) levels 1-4 use a fixed amount of workmem, levels 5-7 use 12N bytes of workmem, and levels 8-10 use 20N bytes of workmem.
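Plugging in the -b3600m case as a quick sanity check (just arithmetic on the figures above, not an API call from brieflz.h):

/* Rough memory estimate for the optimal level with -b3600m, using the
 * 22N figure above (20N of workmem plus the input and output blocks). */
#include <stdio.h>

int main(void)
{
    unsigned long long block = 3600ULL << 20;   /* 3600 MiB block size            */
    unsigned long long workmem = 5 * 4 * block; /* 5 words per byte * 4 B = 20N   */
    unsigned long long buffers = 2 * block;     /* input + output blocks, ~2N     */

    /* Prints 79200 MiB, i.e. roughly the ~80 GiB mentioned below for -9 -b3600m. */
    printf("~%llu MiB total\n", (workmem + buffers) >> 20);
    return 0;
}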
Does using -b4g make any sense with, say, level 1?
Usually not. Lower compression levels limit the search depth (at level 1 only the most recent match is checked), so using larger block sizes may have less of an impact than on higher levels. That being said, you could of course construct inputs where it would matter hugely, like 1GiB of random data repeated.
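For illustration, a small generator for that kind of input (my own sketch, not part of BriefLZ) -- with a 1 GiB block each half looks like incompressible noise, while a block covering both copies lets the second GiB match the first:

/* Write the same pseudo-random 1 GiB twice to repeated_random.bin. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t chunk = 1 << 20;           /* emit in 1 MiB pieces */
    const size_t total = (size_t)1 << 30;   /* 1 GiB per copy       */
    unsigned char *buf = malloc(chunk);
    FILE *f = fopen("repeated_random.bin", "wb");

    if (buf == NULL || f == NULL) {
        return 1;
    }

    for (int copy = 0; copy < 2; ++copy) {
        srand(12345);                       /* same seed => identical copies */
        for (size_t done = 0; done < total; done += chunk) {
            for (size_t i = 0; i < chunk; ++i) {
                buf[i] = (unsigned char)rand();
            }
            fwrite(buf, 1, chunk, f);
        }
    }

    fclose(f);
    free(buf);
    return 0;
}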
Playing around with some genome data today, I must admit sometimes it gives surprising results compared to other types of data.
Is -9 -b4g currently the best combination for the human genome (~3.3 GB)? What options do you personally want to see?
If you have the ~80 GiB of RAM it would take, then -9 -b3600m should give the highest compression avoiding the behavior of --optimal on certain areas. -6 -b3600m might be a better trade-off between compression ratio and time.
From a few quick tests, it does not appear that the block size has a huge effect on the compression ratio on the genome data.
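For anyone wanting to reproduce this, the suggested settings would be invoked along these lines (the filenames are placeholders of mine; the flags are the ones discussed above, in the same command form used elsewhere in this thread):

./blzpack -9 -b3600m genome.fa genome.fa.blz    (highest ratio, needs ~80 GiB RAM)
./blzpack -6 -b3600m genome.fa genome.fa.blz    (better trade-off between ratio and time)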
The problem is that I did not know that my data would trigger it. Testing with smaller data did not reveal it until I tried a 3 GB dataset. I'm afraid someone else might run into the same situation and not have the patience (or time budget) to wait for compression to finish.
That is a good point, I will try to come up with a suitable warning.
Streaming mode is important for many practical applications where you don't want to store decompressed data in a file, but instead process it on the fly with another tool as part of a data analysis pipeline.
If by streaming you mean the ability to read blocks from stdin and write blocks to stdout, then that would be possible to add. blzpack is really just a simple example to let people play around with the BriefLZ library.
If you have the ~80 GiB of RAM it would take, then -9 -b3600m should give the highest compression avoiding the behavior of --optimal on certain areas. -6 -b3600m might be a better trade-off between compression ratio and time.
I'll try to add these settings to the benchmark in the future.
I will try to come up with a suitable warning.
Thanks.
If by streaming you mean the ability to read blocks from stdin and write blocks to stdout, then that would be possible to add. blzpack is really just a simple example to let people play around with the BriefLZ library.
Yes, this is exactly what I mean, sorry that it was unclear.
BriefLZ decompression is fast enough for this to matter in my tests, so this would be a very welcome addition.
Joergen, you are welcome to take a look at my latest post (my old web browser won't let me post attachments and pictures here); I am looking forward to seeing how BriefLZ performs with bigger blocks. For now, I have included the default use: https://community.centminmod.com/posts/84082/
Regarding streaming, if you are on Linux you could try simply specifying stdin and stdout as filenames, like:
./blzpack -9 /dev/stdin /dev/stdout <foo >foo.blz
BriefLZ compression speed with "--optimal" varies wildly depending on input data.
I used "blzpack --optimal" to compress several biological datasets. Most of time it compresses at speed of about 1 to 3 MB/s (sometimes 6). However, when compressing human genome, its speed dropped to 33 kB/s. I tried it twice, both time it took ~27 hours to complete.
In case if this is normal and expected behaviour, I think it should be documented, so that users know the risks of using "--optimal" mode.
blzpack seems to works fine with other settings. I used BriefLZ 1.3.0, commit 0ab07a5, built using instructions from readme. I used only default block size so far, though I plan to test it with other block sizes too.
Test machine: Ubuntu 18.04.1 LTS, dual Xeon E5-2643v3, 128 GB RAM, no other tasks running.
This test was a part of Sequence Compression Benchmark: http://kirr.dyndns.org/sequence-compression-benchmark/ - this website includes all test data, commands, and measurements.
In particular, all test data is available at: http://kirr.dyndns.org/sequence-compression-benchmark/?page=Datasets
Please let me know if you need any other details or help reproducing this issue.