inikep / lizard

Lizard (formerly LZ5) is an efficient compressor with very fast decompression. It achieves a compression ratio comparable to zip/zlib and to zstd/brotli (at low and medium compression levels) at decompression speeds of 1000 MB/s and faster.

Tight, but not that much! #1

Closed Sanmayce closed 8 years ago

Sanmayce commented 8 years ago

Hi, inikep, happy new 2016!

Your lz5 fills the gap between lz4 and Zstd, nice. However, your next statement is valid mainly for mixed/binary data, yes? Or, to be more precise, for compressors using up to a 4 MB dictionary, yes?

In my experiments there is no open-source bytewise compressor that gives better ratio than lz5hc.

For textual data, e.g. English texts, it does not hold in my quick tests: my latest-and-last Nakamichi 'Goldenboy' yields a better ratio while being faster in decompression (try Skylake):

Note: for all the files below I used the command line '>lz5 -18 infile'.


D:\_Deathship_textual_corpus\lz5>dir
 Volume in drive D is S640_Vol5
 Volume Serial Number is 5861-9E6C

 Directory of D:\_Deathship_textual_corpus\lz5

01/06/2016  11:14 PM    <DIR>          .
01/06/2016  11:14 PM    <DIR>          ..
09/11/2015  05:25 AM        33,258,496 Agatha_Christie_89-ebooks_TXT.tar
01/06/2016  10:40 PM        11,894,354 Agatha_Christie_89-ebooks_TXT.tar.lz5
10/14/2015  02:29 PM        10,434,674 Agatha_Christie_89-ebooks_TXT.tar.Nakamichi
09/11/2015  05:25 AM        13,713,275 Complete_Works_of_Fyodor_Dostoyevsky.txt
01/06/2016  09:42 PM         4,981,395 Complete_Works_of_Fyodor_Dostoyevsky.txt.lz5
10/12/2015  09:00 AM         4,544,039 Complete_Works_of_Fyodor_Dostoyevsky.txt.Nakamichi
09/11/2015  05:25 AM        10,192,446 dickens
01/06/2016  09:35 PM         3,936,253 dickens.lz5
10/11/2015  04:14 PM         3,722,075 dickens.Nakamichi
01/06/2016  10:26 PM           544,256 lz5-1.3.3.tar
01/06/2016  10:27 PM           112,315 lz5-1.3.3.tar.lz5
01/06/2016  11:14 PM           145,822 lz5-1.3.3.tar.Nakamichi
01/05/2016  07:47 PM           217,556 lz5.exe
01/01/2002  04:41 AM           120,832 Nakamichi_Kintaro_Intel_15.0_64bit.exe
01/01/2002  05:50 AM         3,028,450 Nakamichi_Kintaro_source_booklet_executables_32bit_64bit_GCC510_Intel150.zip
11/02/2015  10:33 PM       132,728,832 New_Shorter_Oxford_English_Dictionary_fifth_edition.tar
01/06/2016  10:20 PM        24,712,948 New_Shorter_Oxford_English_Dictionary_fifth_edition.tar.lz5
12/01/2015  08:56 AM        25,592,405 New_Shorter_Oxford_English_Dictionary_fifth_edition.tar.Nakamichi
09/11/2015  05:25 AM        12,030,464 New_York_Times_Bestsellers_-_August_2015_-_20_ebooks.tar
01/06/2016  09:47 PM         4,519,211 New_York_Times_Bestsellers_-_August_2015_-_20_ebooks.tar.lz5
10/14/2015  05:18 PM         4,328,336 New_York_Times_Bestsellers_-_August_2015_-_20_ebooks.tar.Nakamichi
09/11/2015  05:25 AM        14,613,183 The_Book_of_The_Thousand_Nights_and_a_Night.txt
01/06/2016  10:26 PM         5,502,069 The_Book_of_The_Thousand_Nights_and_a_Night.txt.lz5
10/12/2015  03:26 AM         5,228,912 The_Book_of_The_Thousand_Nights_and_a_Night.txt.Nakamichi
01/06/2016  10:59 PM            92,096 The_Little_Prince_-_Antoine_de_Saint-Exupery.epub.txt
01/06/2016  10:59 PM            37,008 The_Little_Prince_-_Antoine_de_Saint-Exupery.epub.txt.lz5
10/11/2015  02:12 PM            43,944 The_Little_Prince_-_Antoine_de_Saint-Exupery.epub.txt.Nakamichi
01/06/2016  10:56 PM         4,445,260 The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt
01/06/2016  10:22 PM         1,435,859 The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt.lz5
10/11/2015  11:15 AM         1,420,630 The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt.Nakamichi
              30 File(s)    337,577,395 bytes
               2 Dir(s)  83,927,027,712 bytes free

D:\_Deathship_textual_corpus\lz5>Nakamichi_Kintaro_Intel_15.0_64bit.exe The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt.Nakamichi /bench
Nakamichi 'Kintaro', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note: This compile can handle files up to 1711MB.
Current priority class is HIGH_PRIORITY_CLASS.
Decompressing 1420630 bytes ...
RAM-to-RAM performance: 704 MB/s.
Compression Ratio (bigger-the-better): 3.13:1

D:\_Deathship_textual_corpus\lz5>Nakamichi_Kintaro_Intel_15.0_64bit.exe dickens.Nakamichi /bench
Nakamichi 'Kintaro', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note: This compile can handle files up to 1711MB.
Current priority class is HIGH_PRIORITY_CLASS.
Decompressing 3722075 bytes ...
RAM-to-RAM performance: 512 MB/s.
Compression Ratio (bigger-the-better): 2.74:1

D:\_Deathship_textual_corpus\lz5>Nakamichi_Kintaro_Intel_15.0_64bit.exe The_Book_of_The_Thousand_Nights_and_a_Night.txt.Nakamichi /bench
Nakamichi 'Kintaro', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note: This compile can handle files up to 1711MB.
Current priority class is HIGH_PRIORITY_CLASS.
Decompressing 5228912 bytes ...
RAM-to-RAM performance: 448 MB/s.
Compression Ratio (bigger-the-better): 2.79:1

D:\_Deathship_textual_corpus\lz5>Nakamichi_Kintaro_Intel_15.0_64bit.exe Agatha_Christie_89-ebooks_TXT.tar.Nakamichi /bench
Nakamichi 'Kintaro', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note: This compile can handle files up to 1711MB.
Current priority class is HIGH_PRIORITY_CLASS.
Decompressing 10434674 bytes ...
RAM-to-RAM performance: 320 MB/s.
Compression Ratio (bigger-the-better): 3.19:1

D:\_Deathship_textual_corpus\lz5>lz5.exe -h
LZ5 command line interface 64-bit v1.3.3 by Y.Collet & P.Skibinski (Jan  5 2016)
Usage :
      lz5.exe [arg] [input] [output]

input   : a filename
          with no FILE, or when FILE is - or stdin, read standard input
Arguments :
 -0       : Fast compression (default)
 -1...-18 : High compression; higher number == more compression but slower
 -d       : decompression (default for .lz5 extension)
 -z       : force compression
 -f       : overwrite output without prompting
 -h/-H    : display help/long help and exit

Advanced arguments :
 -V     : display Version number and exit
 -v     : verbose mode
 -q     : suppress warnings; specify twice to suppress errors too
 -c     : force write to standard output, even if it is the console
 -t     : test compressed file integrity
 -m     : multiple input files (implies automatic output filenames)
 -l     : compress using Legacy format (Linux kernel compression)
 -B#    : Block size [4-7](default : 7)
 -BD    : Block dependency (improve compression ratio)
--no-frame-crc : disable stream checksum (default:enabled)
--content-size : compressed frame includes original size (default:not present)
--[no-]sparse  : sparse mode (default:enabled on file, disabled on stdout)
Benchmark arguments :
 -b     : benchmark file(s)
 -i#    : iteration loops [1-9](default : 3), benchmark mode only

D:\_Deathship_textual_corpus\lz5>\lz5 -18 -b --no-frame-crc The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt
The_Project_Gute :   4445260 ->   1435836 (32.30%),    0.1 MB/s ,  455.6 MB/s

D:\_Deathship_textual_corpus\lz5>\lz5 -18 -b --no-frame-crc dickens
dickens          :  10192446 ->   3936226 (38.62%),    0.1 MB/s ,  390.3 MB/s

D:\_Deathship_textual_corpus\lz5>\lz5 -18 -b --no-frame-crc The_Book_of_The_Thousand_Nights_and_a_Night.txt
The_Book_of_The_ :  14613183 ->   5502038 (37.65%),    0.1 MB/s ,  400.1 MB/s

D:\_Deathship_textual_corpus\lz5>\lz5 -18 -b --no-frame-crc Agatha_Christie_89-ebooks_TXT.tar
Agatha_Christie_ :  33258496 ->  11894307 (35.76%),    0.1 MB/s ,  410.8 MB/s

D:\_Deathship_textual_corpus\lz5>

Note: The results were obtained on my laptop with Core 2 Q9550s, Win7 64bit.

As a final note, I wonder why Hamid didn't implement LzTurbo 19 with a bigger window as well?! I fully expect his state-of-the-art parser to kick our asses right away, hee-hee!

xcrh commented 8 years ago

Not that I care much about text books, but when compressing a (previously uncompressed) x86-64 Linux kernel, LZ5 proved to be quite an interesting tradeoff.

It compresses much better than LZ4 does, though it is somewhat slower to decompress. But still fast. It also beats LZO, both in terms of ratio and decompression speed, actually improving both at once. Thanks to @Inikep's hints I've also given some "custom" levels a shot and chopped off an extra ~120K without losing decompression speed, first by setting match finding to much more aggressive settings and then also giving a larger dictionary a try, which improves things a bit on a 22M chunk of data. It is easy, though, to turn compression into a slow memory hog: the gain is modest versus a sharp increase in resource consumption. But that can be okay if one needs a relatively small amount of precompressed data. In fact, it beats Crush level 2 in many cases, which isn't bad at all, since Crush is bit-aligned, not byte-aligned, and therefore slower to decompress.

TBH I have some trouble finding proper "competitors" to LZ5. Most "fast LZ" things are just nowhere close in terms of ratio (and with LZ5 it is possible to get a bit extra at the cost of compression speed). The closest in terms of ratio was LZO1Z-999, but it lost by a noticeable margin and was slower to decompress. Slower things usually go for entropy coding and/or at least bit-aligned streams, which tends to kill decompression speed, typically 2x or worse.
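To illustrate that cost difference, a quick Python caricature of my own (not code from any codec mentioned here): a byte-aligned decoder just advances an index, while a bit-aligned one must keep a bit cursor and shift/mask for every field it extracts:

def read_byte(buf, pos):
    # Byte-aligned token: one index increment, no shift/mask bookkeeping.
    return buf[pos], pos + 1

def read_bits(buf, bitpos, n):
    # Bit-aligned token: a bit cursor plus a shift and mask per bit read.
    val = 0
    for _ in range(n):
        val = (val << 1) | ((buf[bitpos >> 3] >> (7 - (bitpos & 7))) & 1)
        bitpos += 1
    return val, bitpos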

In the case of the kernel it works like this: first you read it (slow storage gains from a decrease in the amount read), then you decompress it (here you can lose all the gains if decompression is slow). Stronger things like LZMA introduce a noticeable delay, like 1 second on a more or less fast PC, or several times more on weaker systems (e.g. ARM boards). The next major competitor in terms of ratio is zlib, but it decompresses about 2x slower...
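To put numbers on that intuition, a back-of-the-envelope model in Python (the ratios and speeds below are rough placeholders, not measurements):

def load_time(size_mb, ratio, read_mbps, decomp_mbps):
    # Total boot cost = read the compressed image + decompress it.
    return (size_mb / ratio) / read_mbps + size_mb / decomp_mbps

# Hypothetical 22 MB kernel image on 100 MB/s storage:
for name, ratio, dspeed in [("lz4", 2.1, 2000), ("lz5", 2.6, 700),
                            ("zlib", 2.9, 300), ("lzma", 3.6, 60)]:
    print(f"{name:5s} {load_time(22, ratio, 100, dspeed) * 1000:6.0f} ms")

With these placeholder figures LZMA's decompression term dominates the total, which is exactly the "noticeable delay" above.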

If you want to see the best one can get from "just LZ" without "real" entropy coding, take a look at https://github.com/alef78/lzoma - on the mentioned kernel it got me between Brotli level 6 and 7. But unlike Brotli it lacks entropy coding, which makes it rather impressive: just ultimate match finding and some tweaks of the data format which do not slow down decompression too much. There is a price, though: it is very slow and hogs a lot of memory during compression...

inikep commented 8 years ago

Happy New Year 2016!

  1. In my experiments (lzbench, TurboBench, squash) with my files there was no better open-source bytewise compressor. It means "lz5 is really good, check it for yourself" :)
  2. You can always find files that will not suit a given compressor. It seems that your files need a more-than-4-MB dictionary, but that's unusual. As you know, a bigger dictionary hurts other types of files.
  3. lz5.exe divides the input file into independent 16 MB blocks. In lz5 v1.4 (dev branch) this can be fixed with the -B option (a sketch of the block-independence effect follows the numbers below):
>lz5 -15 -b -i1 win81
d:\Ocarina-pliki : 104857600 ->  47070215 (44.89%),    1.8 MB/s ,  709.9 MB/s
>lz5 -15 -B7 -b -i1 win81
d:\Ocarina-pliki : 104857600 ->  45767126 (43.65%),    2.2 MB/s ,  707.8 MB/s
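The block-independence penalty is easy to reproduce with any LZ-family codec; here is a small Python sketch using zlib as a stand-in (zlib's window is only 32 KB, so the block size is scaled down to make the boundary losses visible; with lz5's multi-MB window the same effect shows up at 16 MB blocks):

import zlib

data = open("win81", "rb").read()          # the test file from the run above
whole = len(zlib.compress(data, 9))

# Re-compress in independent blocks: each block restarts with an
# empty history, so matches cannot cross block boundaries.
block = 64 * 1024
chunked = sum(len(zlib.compress(data[i:i + block], 9))
              for i in range(0, len(data), block))

print(f"whole file        : {whole:>12,} bytes")
print(f"independent blocks: {chunked:>12,} bytes (+{chunked - whole:,})")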
inikep commented 8 years ago

A bigger dictionary means much slower decompression. LZ4 -15 is similar to lzturbo 19. Currently LZ5 v1.4 -15 is close to lzturbo 29:

C Size        ratio%   C MB/s   D MB/s    Name
54606867      52.1     2.08     2723.54   lzturbo 19
45425792      43.3     1.92     868.88    lzturbo 29
54741827      52.21    14       2007      lz4hc r131 -15
45767126      43.65    2.19     636       lz5hc v1.4 level 15
Sanmayce commented 8 years ago

Currently LZ5 v1.4 -15 is close to lzturbo 29:

Very good!

  1. You can always find files that will not suit a given compressor. It seems that your files need a more-than-4-MB dictionary, but that's unusual. As you know, a bigger dictionary hurts other types of files.

To make the quick comparison I just used some classics from my Deathship corpus, no cherry-picking at all. Here is another bunch of ~4 MB textual files. Yesterday I saw one programmer praising Brandon Sanderson's The_Way_of_Kings novel a lot, so I included it as well. To have a 'baseline', Igor's zip is added using maximum compression:

7za a -tgzip -mx9 Brandon_Sanderson_-_The_Way_of_Kings.txt.zip Brandon_Sanderson_-_The_Way_of_Kings.txt
01/06/2016  11:17 PM        33,258,496 Agatha_Christie_89-ebooks_TXT.tar
01/06/2016  10:40 PM        11,894,354 Agatha_Christie_89-ebooks_TXT.tar.lz5
10/14/2015  02:29 PM        10,434,674 Agatha_Christie_89-ebooks_TXT.tar.Nakamichi
01/07/2016  07:29 PM        11,173,195 Agatha_Christie_89-ebooks_TXT.tar.zip
01/07/2016  06:45 PM         2,315,596 Brandon_Sanderson_-_The_Way_of_Kings.txt
01/07/2016  06:47 PM           868,298 Brandon_Sanderson_-_The_Way_of_Kings.txt.lz5
01/07/2016  07:06 PM           855,905 Brandon_Sanderson_-_The_Way_of_Kings.txt.Nakamichi
01/07/2016  07:22 PM           796,999 Brandon_Sanderson_-_The_Way_of_Kings.txt.zip
09/11/2015  05:25 AM        13,713,275 Complete_Works_of_Fyodor_Dostoyevsky.txt
01/06/2016  09:42 PM         4,981,395 Complete_Works_of_Fyodor_Dostoyevsky.txt.lz5
10/12/2015  09:00 AM         4,544,039 Complete_Works_of_Fyodor_Dostoyevsky.txt.Nakamichi
01/07/2016  07:23 PM         4,617,360 Complete_Works_of_Fyodor_Dostoyevsky.txt.zip
01/06/2016  11:16 PM        10,192,446 dickens
01/06/2016  09:35 PM         3,936,253 dickens.lz5
10/11/2015  04:14 PM         3,722,075 dickens.Nakamichi
01/07/2016  07:28 PM         3,681,828 dickens.zip
01/07/2016  07:43 PM         5,245,293 Ian_Fleming_-_The_James_Bond_Anthology_(complete_collection).epub.txt
01/07/2016  07:45 PM         2,014,525 Ian_Fleming_-_The_James_Bond_Anthology_(complete_collection).epub.txt.lz5
10/11/2015  06:02 PM         1,929,859 Ian_Fleming_-_The_James_Bond_Anthology_(complete_collection).epub.txt.Nakamichi
01/07/2016  07:43 PM         1,869,849 Ian_Fleming_-_The_James_Bond_Anthology_(complete_collection).epub.txt.zip
01/06/2016  10:26 PM           544,256 lz5-1.3.3.tar
01/06/2016  10:27 PM           112,315 lz5-1.3.3.tar.lz5
01/06/2016  11:14 PM           145,822 lz5-1.3.3.tar.Nakamichi
01/07/2016  07:42 PM           109,343 lz5-1.3.3.tar.zip
11/02/2015  10:33 PM       132,728,832 New_Shorter_Oxford_English_Dictionary_fifth_edition.tar
01/06/2016  10:20 PM        24,712,948 New_Shorter_Oxford_English_Dictionary_fifth_edition.tar.lz5
12/01/2015  08:56 AM        25,592,405 New_Shorter_Oxford_English_Dictionary_fifth_edition.tar.Nakamichi
01/07/2016  07:34 PM        25,418,601 New_Shorter_Oxford_English_Dictionary_fifth_edition.tar.zip
09/11/2015  05:25 AM        12,030,464 New_York_Times_Bestsellers_-_August_2015_-_20_ebooks.tar
01/06/2016  09:47 PM         4,519,211 New_York_Times_Bestsellers_-_August_2015_-_20_ebooks.tar.lz5
10/14/2015  05:18 PM         4,328,336 New_York_Times_Bestsellers_-_August_2015_-_20_ebooks.tar.Nakamichi
01/07/2016  07:30 PM         4,192,756 New_York_Times_Bestsellers_-_August_2015_-_20_ebooks.tar.zip
01/06/2016  11:16 PM        14,613,183 The_Book_of_The_Thousand_Nights_and_a_Night.txt
01/06/2016  10:26 PM         5,502,069 The_Book_of_The_Thousand_Nights_and_a_Night.txt.lz5
10/12/2015  03:26 AM         5,228,912 The_Book_of_The_Thousand_Nights_and_a_Night.txt.Nakamichi
01/07/2016  07:23 PM         5,198,949 The_Book_of_The_Thousand_Nights_and_a_Night.txt.zip
09/11/2015  05:25 AM         3,714,387 The_Complete_Sherlock_Holmes_-_Doyle_Arthur_Conan.txt
01/07/2016  06:34 PM         1,369,532 The_Complete_Sherlock_Holmes_-_Doyle_Arthur_Conan.txt.lz5
01/07/2016  05:51 PM         1,331,298 The_Complete_Sherlock_Holmes_-_Doyle_Arthur_Conan.txt.Nakamichi
01/07/2016  07:23 PM         1,285,462 The_Complete_Sherlock_Holmes_-_Doyle_Arthur_Conan.txt.zip
09/11/2015  05:25 AM         3,158,453 THE_COMPLETE_WORKS_OF_LEWIS_CARROLL.epub.txt
01/07/2016  06:37 PM         1,123,931 THE_COMPLETE_WORKS_OF_LEWIS_CARROLL.epub.txt.lz5
01/07/2016  06:30 PM         1,180,322 THE_COMPLETE_WORKS_OF_LEWIS_CARROLL.epub.txt.Nakamichi
01/07/2016  07:23 PM         1,085,503 THE_COMPLETE_WORKS_OF_LEWIS_CARROLL.epub.txt.zip
09/11/2015  05:25 AM         4,087,444 The_Encyclopedia_of_Psychoactive_Plants.txt
01/07/2016  06:44 PM         1,437,010 The_Encyclopedia_of_Psychoactive_Plants.txt.lz5
01/07/2016  05:59 PM         1,513,508 The_Encyclopedia_of_Psychoactive_Plants.txt.Nakamichi
01/07/2016  07:23 PM         1,486,501 The_Encyclopedia_of_Psychoactive_Plants.txt.zip
01/06/2016  10:59 PM            92,096 The_Little_Prince_-_Antoine_de_Saint-Exupery.epub.txt
01/06/2016  10:59 PM            37,008 The_Little_Prince_-_Antoine_de_Saint-Exupery.epub.txt.lz5
10/11/2015  02:12 PM            43,944 The_Little_Prince_-_Antoine_de_Saint-Exupery.epub.txt.Nakamichi
01/07/2016  07:23 PM            30,329 The_Little_Prince_-_Antoine_de_Saint-Exupery.epub.txt.zip
09/11/2015  05:25 AM         4,596,124 The_Oxford_Thesaurus_an_A-Z_Dictionary_of_Synonyms.txt
01/07/2016  06:45 PM         1,783,669 The_Oxford_Thesaurus_an_A-Z_Dictionary_of_Synonyms.txt.lz5
01/07/2016  08:05 PM         1,757,778 The_Oxford_Thesaurus_an_A-Z_Dictionary_of_Synonyms.txt.Nakamichi
01/07/2016  07:23 PM         1,728,347 The_Oxford_Thesaurus_an_A-Z_Dictionary_of_Synonyms.txt.zip
01/07/2016  07:46 PM         7,137,280 The_Project_Gutenberg_12_Fairy_Books_by_Andrew_Lang.tar
01/07/2016  07:49 PM         2,565,163 The_Project_Gutenberg_12_Fairy_Books_by_Andrew_Lang.tar.lz5
10/11/2015  05:16 PM         2,438,374 The_Project_Gutenberg_12_Fairy_Books_by_Andrew_Lang.tar.Nakamichi
01/07/2016  07:50 PM         2,418,599 The_Project_Gutenberg_12_Fairy_Books_by_Andrew_Lang.tar.zip
09/11/2015  05:25 AM         2,347,772 The_Project_Gutenberg_EBook_of_Don_Quixote_996.txt
01/07/2016  06:43 PM           909,000 The_Project_Gutenberg_EBook_of_Don_Quixote_996.txt.lz5
01/07/2016  07:11 PM           912,824 The_Project_Gutenberg_EBook_of_Don_Quixote_996.txt.Nakamichi
01/07/2016  07:23 PM           842,984 The_Project_Gutenberg_EBook_of_Don_Quixote_996.txt.zip
01/06/2016  11:15 PM         4,445,260 The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt
01/06/2016  10:22 PM         1,435,859 The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt.lz5
10/11/2015  11:15 AM         1,420,630 The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt.Nakamichi
01/07/2016  07:23 PM         1,320,100 The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt.zip

As you may know, my world is English texts; they are in no way rare or a special case, and in my experiments such results are not unusual even with a weak parser like mine. My suggestion was to strengthen both the window size and the parser. In case you are interested, another suggestion: could you make a dedicated parser tuned for general English texts (mainly novels)? It would be a nifty addition to your four strong ones.

As for now, decompression suffers from stalled RAM reads, yes. However, let us look to the future and imagine a decompressor disregarding the status quo. This notion has two interesting aspects: first, to benchmark how RAM progresses, and second, to have, say, 32 threads (each with a 256 MB sliding window) bombarding the memory controller with such outside-the-caches reads; I was curious to see how such a mix would play out, i.e. what the masking of waiting threads delivers. One OCN fellow helped me see that forcing the many cores to work (instead of relying solely on the SSD's burst/linear reads, even an Intel SSD 750 series with 2500 MB/s linear speed) has a future. If I can upload Enwiki (~52,000 MB as a whole) in 21 s at 2500 MB/s it is cool, but it is even cooler if 8 cores/16 threads do the upload in 21/3 s + 52,000/5,200 s = 17 s (reading a 3:1-compressed image, then decompressing at 5,200 MB/s). With some SATA III drive at 520 MB/s, the original 100 s become 100/3 s + 52,000/5,200 s = 33 s + 10 s, a 2x boost. It is the old idea: speed up full-text processing by utilizing the given CPU power.

Using 5960x core/uncore/memory 4.5/4.0/2666c12:

Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 2,942,857,440 bytes...
Allocating 8,748,875,776 bytes...
Source&Target buffers are allocated.
Simulating we have 32 blocks for decompression...
Enforcing 32 thread(s).
omp_get_num_procs( ) = 16
omp_get_max_threads( ) = 16
All threads finished.
Decompression time: 6,391,237,566 ticks.
TPI (Ticks_Per_Instruction_during_branchless_decompression) performance: 0.146
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 6.855

Kernel  Time =     5.078 =  142%
User    Time =    34.437 =  963%
Process Time =    39.515 = 1105%    Virtual  Memory =  11176 MB
Global  Time =     3.575 =  100%    Physical Memory =  11154 MB

Let's see how many bytes per second of decompression speed those 6.855 IPC equal:

(32 threads * 273,401,856 bytes) / (6,391,237,566 ticks / 4,500,000,000 ticks per second) = 6,159,975,569 bytes/second, or 5,874 MB/s decompression speed for English texts. :p
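The same arithmetic as a quick sanity check (a trivial Python sketch; all constants are the figures from the log above):

bytes_total = 32 * 273_401_856    # 32 threads, one 273,401,856-byte block each
ticks = 6_391_237_566             # decompression ticks from the run
hz = 4_500_000_000                # 4.5 GHz core clock
bps = bytes_total / (ticks / hz)
print(f"{bps:,.0f} B/s = {bps / 1024**2:,.1f} MB/s")   # ~5,874.6 MB/s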

When 32 cores go mainstream, this "mumbo-jumbo" will make more sense. Note that those 273,401,856 bytes represent 411 English novels, which are quite indicative of how general English texts behave.

Source: http://www.overclock.net/t/1510328/asus-x99-motherboard-series-official-support-thread-north-american-users-only/9150_50#post_24340866 http://www.overclock.net/t/1564781/cpu-ram-subsystem-benchmark-freaky-dreamer-reporting-ipc-instructions-per-clock/0_50#post_24165464

Sanmayce commented 8 years ago

Just read: http://encode.ru/threads/2166-FreeArc-Next?p=43304&viewfull=1#post43304

Currently LZ5 v1.4 -15 is close to lzturbo 29:

Does this goody come as a result of LzTurbo knowhow utilization?

In my eyes, the crazy world we live in is open for all kinds of ripping/pirating, no problema, NO JUDGING; the only thing that I hate and cannot accept is not telling the best to the best, and the lack of due appreciation for an author's work. To me, LzTurbo is one of those phenomenal things that transcended current stereotypes, in an outstanding manner. Best!

xcrh commented 8 years ago

Hmm, thanks to the new matchfinder in -dev I've got my new high score on compressing the kernel. Nice! :)

Does this goody come as a result of LzTurbo knowhow utilization?

This is very unlikely. If you take a look around, you'll soon figure out that Przemyslaw Skibinski (Inikep) is a serious data compression expert in his own right, and his compressors have scored high ranks in famous benchmarks several times. Overall, the LZ5 code derives from LZ4, but the bitstream format has been altered to allow better compression; the format is rather trivial/logical and hardly the most advanced thing around, and you can find a bunch of similar ideas or different tradeoffs in dozens of other compressors. But this format is easy and fast to parse while not being the worst enemy of ratio. So I bet he only applied his own expertise, e.g. added stronger matchfinders, etc. There are dozens of matchfinders in the world, some stronger, some faster, some something else. LZ5 overall just happens to be a more or less adequate balance for some use cases. E.g., IMHO it looks sensible for compressing a Linux kernel or a RAM disk, since it somewhat beats LZO in terms of ratio while also being somewhat faster to decompress.

Taking something from LZTurbo is quite hard because... there is no source to begin with, to the best of my knowledge. Btw, I can't and wouldn't use LZTurbo or its algos for use cases like compressing the kernel; there, proprietary licensing just would not do, no matter what. Also, the LZTurbo author does not have a monopoly on innovations. Furthermore, most techniques used are more or less known to those dealing with compression algos, and the ideas themselves aren't intellectual property either; if they were, the LZTurbo author would be in big trouble on his own, e.g. he didn't invent LZ compression in the first place. That was Abraham Lempel and Jacob Ziv, as "LZ" suggests. In some jurisdictions an exact algo implementation can be patented, but when that happens it only stifles innovation, and people usually have to find a dozen and a half ways to do similar things differently, like what happened to arithmetic coding. While I don't really understand why someone should be allowed to patent numbers/math at all, I can admit there were a dozen or so subflavours which did not violate any patent; only the most straightforward implementations got under fire.

Btw, if we're up for pirates: if I remember correctly, the LZTurbo author has been caught pirating GPLed source and refused to open the source of the resulting binary, thereby violating the GPL and hence qualifying as a "pirate". So if you're so inclined toward piracy and licensing violations, you can find some better targets, to begin with...

inikep commented 8 years ago

Currently LZ5 v1.4 -15 is close to lzturbo 29:

Does this goody come as a result of LzTurbo knowhow utilization?

A good catch :) Yes, I have a working lzturbo -29 decompressor (an easy task for any bytewise compressor), because I was interested in whether the good ratio comes from the parser/finder or from the codewords. And in my opinion it comes from the parser/finder, because there is nothing special about the lzturbo -29 codewords (lz5 uses similar but different codewords). I know nothing, however, about the parser/finder used in lzturbo.
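For the curious, this is what "bytewise codewords" look like in practice: a minimal Python decoder for LZ4's block format, the baseline that LZ5 extended. Illustrative only; LZ5's and lzturbo's actual codewords differ:

def lz4_block_decompress(src: bytes) -> bytes:
    dst = bytearray()
    i = 0
    while i < len(src):
        token = src[i]; i += 1
        lit = token >> 4                  # literal length: high nibble...
        if lit == 15:                     # ...extended by runs of 255
            while True:
                b = src[i]; i += 1
                lit += b
                if b != 255:
                    break
        dst += src[i:i + lit]; i += lit   # copy literals verbatim
        if i >= len(src):                 # last sequence carries no match
            break
        offset = src[i] | (src[i + 1] << 8); i += 2   # 2-byte LE offset
        mlen = (token & 15) + 4           # match length: low nibble + minimum 4
        if (token & 15) == 15:
            while True:
                b = src[i]; i += 1
                mlen += b
                if b != 255:
                    break
        start = len(dst) - offset
        for j in range(mlen):             # byte-by-byte: matches may overlap
            dst.append(dst[start + j])
    return bytes(dst)

Everything is read at byte granularity, which is why such a decoder is both trivial to write and fast to run.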

But you should also notice that:

  1. There is a lack of levels -23 to -28 in lzturbo, which would be very helpful, and LZ5 fills this gap. Moreover, LZ4 fills the gap of levels -13 to -18 :)
  2. LZ5 is open-source, and Hamid does not make any claims about it ripping off lzturbo.

Actually I developed a compressor called LZMAX for Dell, and it has a better ratio than lzturbo -29 (using bytewise codewords and similar compression/decompression speed).

Sanmayce commented 8 years ago

Thank you, inikep. That was what interested me. I am careless about pirating, since my world is much larger than the pimpish games most people play. The thing that inspires me and makes me happy is watching superspeedy textual processing.

And in my opinion it comes from the parser/finder...

Guess that can be called the 'heart' of 19/29: the state-of-the-art parser.

Actually I developed a compressor called LZMAX for Dell, and it has a better ratio than lzturbo -29 (using bytewise codewords and similar compression/decompression speed).

Excellent! I guess the BEST LZ then is LZMAX; how come even a fan like me didn't know who the King was?! I guess LzTurbo is the king in the 39/19 realms:

C Size          ratio%  C MB/s   D MB/s  Name
32798929        32.8    2.79     65.49   lzma 9
32922377        32.9    1.61     69.65   lzturbo 49
33761620        33.7    2.64     277.04  lzham 4
34109576        34.1    2.17     1318.56 lzturbo 39
35638896        35.6    1.19     950.41  zstd 20
36944243        36.9    69.99    1411.77 lzturbo 32
37313258        37.3    2.40     2149.57 lzturbo 29
41668560        41.6    0.19     246.43  brotli 11
...
50337788        50.3    6.73     1428.58 lz5 9
52597358        52.5    262.30   2068.57 lzturbo 21
52928477        52.9    69.17    276.75  zlib 1
53112430        53.1    298.70   442.42  zstd 1
54265487        54.2    2.01     3883.96 lzturbo 19

inikep, please excuse me for flooding your project; my query about the decompression king simply went that way... or as the Diaz bros. used to say: it is whatever!

@xcrh Man, you just don't get it, do you? I hesitated for a moment whether to answer your dump of thoughts, but seeing your openness, well:

Also, the LZTurbo author does not have a monopoly on innovations.

Why so? Who says this? On the contrary, the LZTurbo author does have a monopoly on innovations, muahahaha!

Btw, if we're up for pirates: if I remember correctly, the LZTurbo author has been caught pirating GPLed source and refused to open the source of the resulting binary, thereby violating the GPL and hence qualifying as a "pirate". So if you're so inclined toward piracy and licensing violations, you can find some better targets, to begin with...

Beware when talking piracy: once it was more valuable to the Queen/King than knighthood, and the two overlapped at one time.

'British Pirates in Print and Performance' https://books.google.co.uk/books?id=a6-_BwAAQBAJ&pg=PA42&lpg=PA42&dq=pirate+knightship&source=bl&ots=x_fpgdmV_Z&sig=qKzG4SBIkp2vGjRGFWf2UG8_liY&hl=en&sa=X&ved=0ahUKEwigs_Xyzp3KAhUITBQKHZx3C0kQ6AEIMzAG#v=onepage&q=pirate%20knightship&f=false

As for piracy/licenses, you project a lot while missing the point altogether: it is not about policy, it is about SPEED. Speed is a religion in itself. For serious users (especially corporations) it is even more: it is a matter of prestige. Not that the two come hand in hand. For me, top-speed performers inspire and bring the tomorrow-that-never-comes closer.

It seems you are unfamiliar with the notion of freedom. Before you start citing what dictionaries and pseudo-laws dictate, let me give you my understanding: freedom means one is free to do everything. E.g. one country decides to attack another country - this is aggression, some say; others say it is defending world peace by using preventive strikes; if you ask me, it is freedom - even the freedom to do the wrong thing; to what end is another matter.

If it sounds too abstract, no worries, it is hardcore Laoism. It states: "For the world is a divine vessel: It cannot be shaped; Nor can it be insisted upon. He who shapes it damages it; He who insists upon it loses it." It means that although our world appears hellish, it is not; it is not harmonic either; it is free. That's why most people cannot fathom why the Spirit/Divinity allows atrocities and whatnot: the vessel (i.e. the material world) is a battleground/testbench for expressing freedom.

Quickly tried Alexandr Efimov's LZOMA:

01/07/2016  06:45 PM         2,315,596 Brandon_Sanderson_-_The_Way_of_Kings.txt
01/07/2016  06:47 PM           868,298 Brandon_Sanderson_-_The_Way_of_Kings.txt.lz5
01/09/2016  10:27 PM           720,735 Brandon_Sanderson_-_The_Way_of_Kings.txt.lzoma
01/07/2016  07:06 PM           855,905 Brandon_Sanderson_-_The_Way_of_Kings.txt.Nakamichi
01/07/2016  07:22 PM           796,999 Brandon_Sanderson_-_The_Way_of_Kings.txt.zip

09/11/2015  05:25 AM        13,713,275 Complete_Works_of_Fyodor_Dostoyevsky.txt
01/06/2016  09:42 PM         4,981,395 Complete_Works_of_Fyodor_Dostoyevsky.txt.lz5
01/09/2016  11:44 PM         3,928,197 Complete_Works_of_Fyodor_Dostoyevsky.txt.lzoma
10/12/2015  09:00 AM         4,544,039 Complete_Works_of_Fyodor_Dostoyevsky.txt.Nakamichi
01/07/2016  07:23 PM         4,617,360 Complete_Works_of_Fyodor_Dostoyevsky.txt.zip

01/06/2016  10:59 PM            92,096 The_Little_Prince_-_Antoine_de_Saint-Exupery.epub.txt
01/06/2016  10:59 PM            37,008 The_Little_Prince_-_Antoine_de_Saint-Exupery.epub.txt.lz5
01/09/2016  10:21 PM            31,757 The_Little_Prince_-_Antoine_de_Saint-Exupery.epub.txt.lzoma
10/11/2015  02:12 PM            43,944 The_Little_Prince_-_Antoine_de_Saint-Exupery.epub.txt.Nakamichi
01/07/2016  07:23 PM            30,329 The_Little_Prince_-_Antoine_de_Saint-Exupery.epub.txt.zip

01/06/2016  11:15 PM         4,445,260 The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt
01/06/2016  10:22 PM         1,435,859 The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt.lz5
01/09/2016  10:59 PM         1,189,496 The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt.lzoma
10/11/2015  11:15 AM         1,420,630 The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt.Nakamichi
01/07/2016  07:23 PM         1,320,100 The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt.zip

Good idea of his, worth pirating, hah-hah.

xcrh commented 8 years ago

how come even a fan like me didn't know who the King was?!

TBH I'm still surprised to keep learning how the idea of LZ can be improved more and more. It looks so simple at first glance.

modes 19/39 left me in awe;

Well, when I stumbled on LZOMA I hadn't thought entropy-less LZ was going to do so well. The good thing about it is its simple decoder, which lacks reasons to be slow.

doesn't kick asses in decompression department;

Maybe I wasn't clear enough, but I think I stated I do not really care much about texts, so I never evaluated the performance of various things on data like this. TBH I do not get the point of compressing Wikipedia data with proprietary compression. Wikipedia is world knowledge, valuable due to its universal availability; proprietary compression is the opposite of this. Time goes on, CPUs change, systems change. Being unable to decode old human knowledge due to an undocumented proprietary algo sounds stupid and sad to me.

A binary-only thing also does not kick ass at decompressing a Linux kernel for me. In no way would a proprietary algo fit here; it is a very specific environment, and nobody keeps code capable of working there in their backpacks. But if the thing is open source, someone can get the source and adapt it appropriately. Just to give an idea: imagine an empty world. There is only a piece of compressed data in memory, the memory itself, and a CPU. No files, no threads, no programs, not even malloc() unless one is going to create it. They all appear... later.

Real-world use of compression algos IMHO takes multi-point evaluation; decompression speed and ratio are important, sure, but they are not the only things to care about. If something just does not run on the platforms I care about, it's a bummer. What is the point of ratio if you can't decompress it where you need it?

Why so? Who says this? On the contrary, the LZTurbo author does have a monopoly

Sounds like either marketing bullshit or praying to high-tech gods. Either way, it sounds stupid. I think LZTurbo is better at marketing and questionable "showcases". I've easily managed to put up some other "showcases" for a few other codecs; it's just a matter of doing the benchmarks "right". Even the simplistic brieflz can show us who is the king of compression, lol. Just give it the right data.

I guess LZTurbo could be using some optimizations, e.g. a carefully crafted assembly-based decoder to speed things up. Compilers can generate crappy code, and it really matters in a tight decompression loop. When someone takes care of it, as the LZ4 author did, the result can be quite good. But it took him learning about compilers and their stupid attitudes, and how to persuade them to behave. I guess the LZTurbo author possibly did something like this, or even wrote the decoder in assembly, since he supports only a really small set of platforms. But I'm pretty sure there is nothing magical inside.

And if speed is so important, you may want to take a look at a post from one MS programmer, who told us nobody gives a fuck about speed optimizations, and that you're not going to get extra payment for optimizing some part of Windows. It can be quite the opposite: one can face punishment for doing something other than the assigned task. So, when you see some corporation loudly declare something, take it with a grain of salt. Or, sometimes, with a whole truck of salt.

As for "pirating" mentioned thing, not really sure what you want to pirate. And actually, opensource is about sharing. But respecting each other in the process. Those who does not respect licenses of open programs are IMHO showing utter disrespect to their authors. Human has opened mind, and those violating license shown disrespect. Needless to say, I'm getting rather negative opinion about such attitude.

Btw, Inikep's description of his algo tells how he selected the bitstream format, in the "The codewords description" section. And it is also clearly not the worst tradeoff I've ever seen. Isn't it nice when someone does not mind sharing some bits of his expertise in a complicated area?

akdjka commented 8 years ago

" Those who does not respect licenses of open programs are IMHO showing utter disrespect to their authors." No, these are entirely different things. I don't respect copyright law itself. From this, I don't respect any content license. But I do respect many authors.

xcrh commented 8 years ago

I don't respect copyright law itself. From this, I don't respect any content license.

Feel free to use CC0/Public Domain things, etc.; their authors have explicitly stated they do not care what you do with their creations. Actually, looking at those who only want to consume and never bother to even try to give anything back, I'm starting to think copyright can have a point, though it is clearly being abused to pad the interests of large corporations rather than anything else. But it seems large corporations got a rather unexpected outcome: uhm, I mean, things like "GPL tarballs" look quite funny to my taste; I bet that was not part of the plan :)

Furthermore, if you dare to actually violate copyright law here on GitHub, you can count on the offending things getting removed in a timely manner. We may or may not like it, but that's how it works.

But I do respect many authors.

And so, why do you have to disregard their wish for proper attribution (BSD license) or their wish to get contributions back (GPL), etc.? I do not see why that is supposed to be something bad.

Sure, in an ideal world people would just care to say "thanks", attribute authors, and try to contribute back. But in the real world it often happens that you face a bunch of greedy, careless sharks instead. Besides everything else, licenses can also be used to limit the harm from greedy/selfish/uncooperative attitudes.

akdjka commented 8 years ago

You're mixing up two things: respect and obedience. I usually obey licenses... though that's mostly because the stuff that I do violates few of them. If I respect the author, I am sometimes willing to go out of my way to obey their rules. But it has happened that I mixed GPL and CDDL sources despite my respect for the authors. Illegal? Sure. Wrong? Not for me. [OT] The GPL is not about wishing for contributions; nearly all open source projects want them too. GPL ones just wave legal guns to extort them from bodies who don't want to share. [/OT]

xcrh commented 8 years ago

Respect and obedience.

In fact both have the very same root cause. The freedom of one being ends where the freedom of another begins; failure to recognize and respect this fact leads to anarchy. Anarchy usually transforms into wild tyranny, because the stronger one eventually wins, and since there are no laws and other silly crap left, nothing restricts the strongest entity from doing whatever they want. Anarchy fans seem to miss this point all the time, and then it plays a bad joke on them when some Sony, Hollywood, and suchlike are suing some unlucky bastards or just pushing hardware trojans/backdoors right into the hardware. Why do you think Secure Boot appeared? As you can guess, major copyright holders have a bit more money than you can afford. Are you ready for a world where you are locked out and allowed to run only trusted code? Uhm, no, they are not going to trust you. And in fact, as you've shown us, there was some reason to let it happen this way. When rampant piracy takes place... sure, you can expect a wild DRM/SecureBoot/etc. carnage as a response. If you want war, you get one. So, are you ready for a digital nuclear winter? Can you diffuse your own CPU to run your code? Because those copyright holders with the private keys could be in the mood to lock you out, so you either pay for their stuff or enjoy the empty screen of a system which does not even obey you.

Wrong? Not for me.

Oh, come on, good luck pushing in this direction. This way you'll be both hardly welcome among those writing code and good prey for major copyright holders. I wonder if you have a splendid plan and can diffuse your own CPUs, in case your CPU obeys the manufacturer's will (and since copyright holders have more money, guess what that will is going to be).

GPL ones just wave legal guns to extort them from bodies who don't want to share.

Right. And isn't it funny to turn draconian laws against those who created them? Everyone has to obey laws? Okay, let's see if the major copyright holders and the companies pushing draconian laws are ready to get what it takes. It's also funny that someone was smart enough to create a self-spreading algo shaped as a license. A very strange way to write down self-spreading algos, but it works, and it gives me a lot of lulz when greedy commercial corps are either turned into strong contributors or get grilled on legal grounds. It is a virus, and it only hurts greedy/uncooperative entities. I think it is a very funny and smart idea, and it greatly changed the whole software landscape. Uhm, I remember times when it was hard even to unpack an archive in legal ways in a sane amount of time. Nice that this tricky plan worked and now I can enjoy a powerful, open system. Whatever; as the Linux kernel development example has shown, these processes kick ass.

NB: I've implemented very tough, hardware-based DRM schemes in the past to protect some things from free-riders like some of the commenters here. Yet, after taking a look at open-source communities, I got the idea that there are other ways around, without stupid greed and attempts to pwn each other. And the GPL is clearly not the worst approach I've seen. Though for compression algos, more liberal licensing can make sense if someone cares about widespread data-format usage more than about contributions, etc.

rdp commented 7 years ago

Since this seems to be the default "mailing list" (it might be nice to have a mailing list, BTW)... :) ... I wonder if LZ5 could be incorporated into LZ4 itself as an "option", so to speak; have you asked them about it? Then it could get widespread use more easily, I think. Just throwing it out there :)

mappu commented 7 years ago

I ran a comparison to find optimal speed-per-ratio compressors on a small corpus. LZ5 performed well in the range between cat and zstd. There are some graphs and scripts at https://code.ivysaur.me/compression-performance-test/ .

inikep commented 7 years ago

Thanks for your tests. I wonder why you use cygwin64? All compressors should be faster with MinGW-w64.

For 64-bit you may try: https://sourceforge.net/projects/mingw-w64/files/Toolchains%20targetting%20Win64/Personal%20Builds/mingw-builds/6.2.0/threads-posix/seh/

For 32-bit you may try: https://sourceforge.net/projects/mingw-w64/files/Toolchains%20targetting%20Win32/Personal%20Builds/mingw-builds/6.2.0/threads-posix/dwarf/