inikep / lizard

Lizard (formerly LZ5) is an efficient compressor with very fast decompression. It achieves compression ratio that is comparable to zip/zlib and zstd/brotli (at low and medium compression levels) at decompression speed of 1000 MB/s and faster.

Format description #2

Closed JoernEngel closed 8 years ago

JoernEngel commented 8 years ago

I want to implement an lz5 encoder. Sadly there doesn't seem to be any format description beyond the source code. Is the lz5 format stable enough for you to document things already?

inikep commented 8 years ago

I think that the format is stable now. I wanted to make as few changes as possible from LZ4 to LZ5. I have updated the following files to give a proper LZ5 format description: https://github.com/inikep/lz5/blob/dev/lz5_Block_format.md and https://github.com/inikep/lz5/blob/dev/lz5_Frame_format.md

JoernEngel commented 8 years ago

I think I mostly figured it out. Start by writing an lz4 encoder. That format is simpler and you can easily verify the result. Then turn 4-4-16 lz4 matches into 3-3-16 lz5 matches. You waste some compression, but you can test against the lz5 decompressor.

When those basics are done, you can start using the new features of lz5: shorter encodings, a longer window, and repmatch.
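For the first step, here is a minimal sketch of emitting one LZ4 sequence (token, literals, 2-byte little-endian offset, extra match-length bytes), following the documented LZ4 block format; the function name is made up for this example, and bounds checks plus the final literal-only sequence are omitted:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Emit one LZ4 sequence into dst and return the new write position.
 * Caller guarantees matchlen >= 4 (MINMATCH in LZ4) and enough room in dst. */
static uint8_t *lz4_emit_sequence(uint8_t *dst,
                                  const uint8_t *literals, size_t litlen,
                                  uint16_t offset, size_t matchlen)
{
    uint8_t *token = dst++;
    size_t ml = matchlen - 4;                 /* matchlen field is biased by MINMATCH */

    /* literal length: 4-bit field, 15 means "more length bytes follow" */
    if (litlen < 15) {
        *token = (uint8_t)(litlen << 4);
    } else {
        size_t rest = litlen - 15;
        *token = 15 << 4;
        for (; rest >= 255; rest -= 255) *dst++ = 255;
        *dst++ = (uint8_t)rest;
    }
    memcpy(dst, literals, litlen);
    dst += litlen;

    /* 16-bit little-endian offset */
    *dst++ = (uint8_t)(offset & 0xff);
    *dst++ = (uint8_t)(offset >> 8);

    /* match length: 4-bit field, 15 means "more length bytes follow" */
    if (ml < 15) {
        *token |= (uint8_t)ml;
    } else {
        size_t rest = ml - 15;
        *token |= 15;
        for (; rest >= 255; rest -= 255) *dst++ = 255;
        *dst++ = (uint8_t)rest;
    }
    return dst;
}
```

Once that round-trips against the reference LZ4 decoder, the same emitter can be reshaped for the 3-3-16 codewords described in lz5_Block_format.md.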

inikep commented 8 years ago

Why do you want to write your own compressor? Do you want to make it faster? Do you want to test your own abilities? Is it the license?

JoernEngel commented 8 years ago

A couple of reasons. The most interesting one for you is that lz5 compression speed is horrible for block-based compression. The main source is the large memory allocation plus memset every time compression starts.

I suspect that overhead doesn't matter as much when compressing large files of many megabytes. But making many calls to LZ5_compress_HC(src, dst, slen, dlen, 13); will just kill performance. That is especially annoying because you otherwise shred lzo: lz5 compresses better and decompresses faster.
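To put that first point in code, a sketch of the workload I mean (the 4 KB block size, helper name and header path below are placeholders): every call re-allocates and clears the HC state internally, so at small block sizes that setup cost dominates.

```c
#include <stddef.h>
#include "lz5hc.h"          /* assumed header name for the HC API */

#define BLOCK_SIZE (4 * 1024)

/* Compress the input as many independent small blocks, the pattern that
 * makes the per-call alloc + memset so painful. Returns total output size,
 * or 0 on failure. */
static size_t compress_blocks(const char *src, size_t srcSize,
                              char *dst, size_t dstCapacity)
{
    size_t in = 0, out = 0;
    while (in < srcSize) {
        int slen = (srcSize - in < BLOCK_SIZE) ? (int)(srcSize - in) : BLOCK_SIZE;
        /* each call sets up and clears its internal tables from scratch */
        int clen = LZ5_compress_HC(src + in, dst + out,
                                   slen, (int)(dstCapacity - out), 13);
        if (clen <= 0) return 0;   /* output buffer too small */
        in  += (size_t)slen;
        out += (size_t)clen;
    }
    return out;
}
```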

The second reason is that I wanted to play around with using suffix arrays for LZ compressors. People have been talking about this for years, but I haven't found any code. So I wrote my own for lz4. It works reasonably well and, combined with optimal parsing, beats lz4 on compression density. But it is much slower than lz4 compression.

So the obvious next target is lz5. Thanks to your alloc+memset bug my compressor is faster than yours, giving me just the excuse to work on it and see if I can create a fast lz5 compressor on my own.

My compressor is also faster than the lzo compressor with greedy/lazy parsing and a little slower than lzo with optimal parsing. There are still some optimizations to be had, so staying faster than lzo is perfectly possible. Being better than lzo and faster at both compression and decompression looks appealing, so that is what I'm shooting for. ;)

Compression times in ns/byte in my private benchmark:

      0.073   memcpy
      0.990   lz4
     20.613   lz4hc level 16
     50.506   lz4sa lazy parser
    131.418   lz4sa optimal parser
      3.303   lz5
    828.739   lz5hc level 13
      0.893   lzo1
     94.529   lzo999

JoernEngel commented 8 years ago

Looks like I have something that beats lzo now. It compresses 0.6% better, encode is 32% faster, and your decode is 36% faster. Nice!

Repmatch was the key.
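(By repmatch I mean the encoder remembers the offset of the previous match and can signal "same offset again" with a cheaper codeword instead of re-sending the full 16-bit offset. A generic sketch of the bookkeeping, not LZ5's actual codeword layout:)

```c
/* Generic repeat-offset ("repmatch") bookkeeping on the encoder side.
 * Real formats pair this with a dedicated, shorter codeword. */
typedef struct {
    unsigned last_offset;        /* offset of the previously emitted match */
} rep_state;

/* Returns 1 if the new match can be coded as "repeat last offset",
 * 0 if the offset has to be emitted explicitly. */
static int use_rep_code(rep_state *st, unsigned offset)
{
    if (offset == st->last_offset)
        return 1;
    st->last_offset = offset;    /* remember it for the next match */
    return 0;
}
```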

inikep commented 8 years ago

Congrats. Your compressor seems interesting. Are you going to publish it?

LZ5 and its 22-bit offsets were not designed for small files, but you are right. I've changed calloc to malloc in LZ5_alloc_mem_HC. I was worried that without memset(ctx->chainTable) in LZ5HC_init() we could get much slower compression at higher levels (searchNum >= 8), but in today's experiments the speed was the same.

JoernEngel commented 8 years ago

I should publish it, yes. Probably under the GPL. The aim currently isn't to be a product for general use; it is mainly me playing around with the concept and trying to learn things. So if I would fault you for ignoring small-block compression, you could fault me in return for ignoring far more use cases.

The important points, imo are:

Suffix arrays work well for the particular niche I am interested in, but not in the generic case. They will give you every possible match, in particular the longest match, and never suffer from hash collisions. That is nice for high compression ratios (see the sketch after these points).

Generating the suffix array is expensive. Prohibitively expensive for fast compressors. And the cost per byte goes up as block sizes go down. At 32k blocksize the suffix array costs ~40ns/byte to give a ballpark number.

So unless you are dealing with large blocks, want good compression ratio and are willing to pay the runtime cost, avoid suffix arrays. But if compression can be 100x slower than decompression, they might be a good option. If you want optimal parsing, the cost of suffix arrays no longer dominates and the lack of hash collisions makes them really interesting.

Memory cost for blocksizes <64k is 6n, plus another 8n for optimal parsing. Blocksizes >64k cost 12n plus 16n, as you move from 16bit to 32bit pointers.
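To make the match-finding side concrete, a rough illustrative sketch (not the compressor discussed here): build the suffix array and its inverse, then for any position the longest match against an earlier position sits at the nearest earlier-position neighbour in suffix-array order. The naive qsort-based suffix sort below is exactly the expensive step mentioned above.

```c
#include <stdlib.h>
#include <string.h>

static const unsigned char *g_buf;
static size_t g_len;

/* Compare two suffixes of g_buf lexicographically (for qsort). */
static int suffix_cmp(const void *a, const void *b)
{
    size_t pa = *(const size_t *)a, pb = *(const size_t *)b;
    size_t la = g_len - pa, lb = g_len - pb;
    int r = memcmp(g_buf + pa, g_buf + pb, la < lb ? la : lb);
    return r ? r : (la < lb ? -1 : 1);
}

/* Length of the common prefix of the suffixes starting at a and b. */
static size_t lcp(const unsigned char *buf, size_t len, size_t a, size_t b)
{
    size_t n = 0, max = len - (a > b ? a : b);
    while (n < max && buf[a + n] == buf[b + n]) n++;
    return n;
}

/* sa[i]   = starting position of the i-th smallest suffix
 * rank[p] = where the suffix starting at p ended up in sa */
void build_suffix_index(const unsigned char *buf, size_t len,
                        size_t *sa, size_t *rank)
{
    g_buf = buf; g_len = len;
    for (size_t i = 0; i < len; i++) sa[i] = i;
    qsort(sa, len, sizeof *sa, suffix_cmp);            /* the expensive step */
    for (size_t i = 0; i < len; i++) rank[sa[i]] = i;  /* inverse permutation */
}

/* Longest match for position p against any earlier position: the best
 * candidate is the nearest earlier-position suffix just below or just
 * above rank[p] in suffix-array order -- no hash, no collisions. */
size_t longest_prev_match(const unsigned char *buf, size_t len,
                          const size_t *sa, const size_t *rank,
                          size_t p, size_t *match_pos)
{
    size_t best = 0, r = rank[p];
    *match_pos = 0;
    for (size_t i = r; i-- > 0; )                      /* scan downwards */
        if (sa[i] < p) {
            best = lcp(buf, len, sa[i], p);
            *match_pos = sa[i];
            break;
        }
    for (size_t i = r + 1; i < len; i++)               /* scan upwards */
        if (sa[i] < p) {
            size_t l = lcp(buf, len, sa[i], p);
            if (l > best) { best = l; *match_pos = sa[i]; }
            break;
        }
    return best;
}
```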

xcrh commented 8 years ago

In my experiments (I've compressed a previously uncompressed Linux kernel and quite a few other things), LZ5 has already been beating LZO for quite some time, especially the version in -dev. At least in terms of the ratio I can achieve AND decompression speed, both at once. Needless to say, I like it :)

Though these were relatively large chunks of data, where the larger window pays for itself, and I cared little about compression speed: I guess the -HC versions of both LZ4 and LZ5 were never meant for "realtime" compression, rather for generating "precompressed" data. Still, if one is after better ratios, the LZ5 data format is better than LZ4's. The LZ4 design is all about speed, even if it comes at the cost of ratio. OTOH the LZ5 format both lacks reasons to be slow and can afford quite a good ratio.

...and on a side note: trying to fiddle with MINMATCH in LZ5 causes compilation to fail. I've been curious to try it, but setting anything other than 3 just breaks compilation, at least in -dev:

lz5/lz5hc.c: In function ‘LZ5HC_compress_price_fast’:
lz5/lz5hc.c:1541:21: error: ‘HashTable3’ undeclared (first use in this function)
         HashPos3 = &HashTable3[LZ5HC_hash3Ptr(ip, ctx->params.hashLog3)];

JoernEngel commented 8 years ago

When compressing large files, a large sliding window obviously helps. LZO only has 48k, making it an easy target to beat.

Compressing small blocks is a different matter. When LZO, LZ4 and LZ5 all have sliding windows larger than the blocksize, short match encodings and a small MINMATCH are what matter. LZ4 loses with its 3-byte match encoding and MINMATCH=4. LZO vs. LZ5 is more interesting. LZO lacks repmatch, which really hurts. But it has MINMATCH=2, which is surprisingly interesting.

Encoding a 2-byte match in 2 bytes looks like a silly thing to do. But it helps break up long strings of literals. To encode 6 literals with LZ5 you usually need an extra byte for litlen. But if you get 2 literals, a 2-byte match and 2 more literals, you can skip the extra byte for litlen. On the other hand, 9-byte matches will now require an extra byte to encode matchlen.

So it might be interesting to try MINMATCH=2 for LZ5 and quantify how much it helps/hurts. It probably hurts more than it helps. Judging by the LZO format, it is used maybe 5% of the time. 9-byte matches may well be more common than that.

inikep commented 8 years ago

IMHO the best improvement for small files can be achieved by designing new codewords. The compressor can detect small files and encode them more efficiently with new codewords; one could, for example, add a special codeword for MINMATCH=2.

@xcrh: LZ5 -dev now also works with MINMATCH=4.

JoernEngel commented 8 years ago

Btw, what is the best way to provide patches? I ran into compiler warnings when using LZ5 due to missing const propagation.

commit e73bc934e7c1
Author: Joern Engel <joern@logfs.org>
Date:   Thu Dec 31 10:41:45 2015 -0800

    fix lz5 compiler warnings

    Signed-off-by: Joern Engel <joern@logfs.org>

diff --git a/lz5/lz5common.h b/lz5/lz5common.h
index 392c7f570ece..440946fd0ad3 100644
--- a/lz5/lz5common.h
+++ b/lz5/lz5common.h
@@ -246,7 +246,7 @@ struct LZ5HC_Datas
     const BYTE* end;       /* next block here to continue on current prefix */
     const BYTE* base;      /* All index relative to this position */
     const BYTE* dictBase;  /* alternate base for extDict */

inikep commented 8 years ago

Thanks, the warnings are fixed now. I think that the best way to provide patches is a pull request.

xcrh commented 8 years ago

@JoernEngel

> When compressing large files, a large sliding window obviously helps. LZO only has 48k, making it an easy target to beat.

Sure, but LZO1-{X,Y,Z} provide quite good ratios at levels like 999. So they are easy prey only on some kinds of data, where the larger window pays for itself. Still, I've easily found a ~200KiB file of compressible, real-world data where LZO1Z-999 puts on a show against the other "fast LZ"s, LZ5HC included. Actually, on this file LZO1Z-999 scored better than zstd HC level 20 in terms of ratio (!!!). And anything else in lzbench with a better ratio is >=2.5x slower to decompress, being a heavyweight monster. Easy prey, eh? Is it MINMATCH=2 in effect?

And thanks for the insights on small blocks, I see it is really a different tradeoff. Btw, it seems your approach with suffixes is somehow similar to what the LZOMA thing does, but that targets "monster" ratios while staying simple/fast to decompress. Its compressor is quite slow/memory hungry though, and I guess it wasn't meant for really small blocks. But on large blocks the ratio can be "jawdropping" for an entropyless LZ. I suspect your thing could do something similar on large chunks of data. So feel free to release it, etc. Maybe it would be LZ5_HC_HC on large blocks :). And sometimes one may want to get the best possible ratio within some pre-existing format, even at the cost of the compressor being slow/memory hungry, etc. Though it's not what you want for a filesystem, except maybe precompressed squashfs- or initrd-like things.

@inikep yeah, I see. I've tested it, but it actually killed the ratio on most data I've tested. Sounds like I'm not a big fan of MINMATCH=4, at least on the data I've tested so far.

JoernEngel commented 8 years ago

On Mon, Feb 08, 2016 at 05:13:15AM -0800, xcrh wrote:

> @JoernEngel

> > When compressing large files, a large sliding window obviously helps. LZO only has 48k, making it an easy target to beat.

> Sure, but LZO1-{X,Y,Z} provide quite good ratios at levels like 999. So they are easy prey only on some kinds of data, where the larger window pays for itself. Still, I've easily found a ~200KiB file of compressible, real-world data where LZO1Z-999 puts on a show against the other "fast LZ"s, LZ5HC included. Actually, on this file LZO1Z-999 scored better than zstd HC level 20 in terms of ratio (!!!). And anything else in lzbench with a better ratio is >=2.5x slower to decompress, being a heavyweight monster. Easy prey, eh? Is it MINMATCH=2 in effect?

While I don't know the file, I might be able to generate an artificial one with these attributes. To defeat the entropy coding in zstd you want mostly random literals. You should avoid repeated offsets to defeat the repmatch in lz5. Now have lots of 3-8 byte matches within 1k to 2k offsets. Below 1k would help lz5, above 2k would hurt lzo.

Having long streams of 1-3 literals and 2-byte matches would hurt everyone else while not hurting lzo. But I think the 3-8 byte matches are dominant in your file.
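Something like this should do it (a quick sketch; the sizes, RNG and exact mix are arbitrary):

```c
/* Generate an artificial file with the properties described above:
 * mostly random literals (defeats entropy coding), short 3-8 byte matches
 * at offsets of 1k..2k (inside lzo's window), and no repeated offsets
 * (so lz5's repmatch doesn't help). */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    static unsigned char buf[200 * 1024];
    size_t n = sizeof(buf), i = 0;
    srand(42);
    while (i < n) {
        size_t lits = 1 + (size_t)(rand() % 8);      /* short literal run */
        while (lits-- && i < n)
            buf[i++] = (unsigned char)rand();
        if (i > 2048 && i + 8 <= n) {                /* 3-8 byte match */
            size_t mlen = 3 + (size_t)(rand() % 6);
            size_t off  = 1024 + (size_t)(rand() % 1024);
            for (size_t k = 0; k < mlen; k++, i++)
                buf[i] = buf[i - off];
        }
    }
    fwrite(buf, 1, n, stdout);
    return 0;
}
```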

> And thanks for the insights on small blocks, I see it is really a different tradeoff. Btw, it seems your approach with suffixes is somehow similar to what the LZOMA thing does, but that targets "monster" ratios while staying simple/fast to decompress. Its compressor is quite slow/memory hungry though, and I guess it wasn't meant for really small blocks. But on large blocks the ratio can be "jawdropping" for an entropyless LZ. I suspect your thing could do something similar on large chunks of data. So feel free to release it, etc. Maybe it would be LZ5_HC_HC on large blocks :). And sometimes one may want to get the best possible ratio within some pre-existing format, even at the cost of the compressor being slow/memory hungry, etc. Though it's not what you want for a filesystem, except maybe precompressed squashfs- or initrd-like things.

Will happen, but don't hold your breath. I am busy elsewhere for a while and don't want to publish the royal mess I currently have.

Jörn

Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it. -- Perlis's Programming Proverb #58, SIGPLAN Notices, Sept. 1982