trietptm / lz4

Automatically exported from code.google.com/p/lz4

lz4 limited output issues #31

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Compress data.
2. Recompress the same input with output limited to exactly the length of the 
previous run.

What is the expected output? What do you see instead?
Compression sometimes fails when there is exactly enough space.

What version of the product are you using? On what operating system?
svn r77, Arch Linux x86_64

Please provide any additional information below.
The calculation of the encoded run length was not exact, and could differ from 
the actual encoded length by a small amount in some cases. op + (encoded length) 
was also being compared as greater than or equal to oend, but a result exactly 
equal to oend means that the encoded data will exactly fill the buffer. I 
believe the attached patch corrects all length tests in lz4.c properly. lz4demo 
-b shows a small cost in compression speed: about 2.5% for larger data and 
about 7% for the 64k algorithm.
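
For reference, a minimal sketch of the exact size of an encoded literal run and of a 
bounds test that accepts an exact fit. This is not the actual lz4.c code: RUN_MASK is 
assumed to be 15 as in the LZ4 format, and the helper names are illustrative.

#include <stddef.h>

#define RUN_MASK 15

/* Exact output size of a literal run of 'len' bytes: one token byte, plus
   one extra length byte per 255 once the length no longer fits in the token
   (the token stores RUN_MASK and the remainder follows), plus the literals
   themselves. */
static size_t literalRunSize(size_t len)
{
    size_t lengthBytes = (len >= RUN_MASK) ? 1 + (len - RUN_MASK) / 255 : 0;
    return 1 + lengthBytes + len;
}

/* The run fits as long as it ends at or before oend: a result exactly equal
   to oend means the output buffer is filled completely, which is valid. */
static int runFits(const char* op, const char* oend, size_t len)
{
    return literalRunSize(len) <= (size_t)(oend - op);
}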

Original issue reported on code.google.com by andrew.m...@gmail.com on 24 Aug 2012 at 4:00

Attachments:

GoogleCodeExporter commented 9 years ago
I've gone over this again - did I miss something about how the lengths 
translate to encoded data? The >>8 actually appears to work just fine, which 
brings encoding back to basically the original speed. Using >= oend anywhere 
eventually causes a failure, though. A new patch and the extended fuzzer, which 
checks for correct behavior when the output buffer is both exactly sized and 
substantially too small, are attached.
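
To illustrate the comparison point, here is a sketch of the fast check with the >>8 
estimate kept and only the rejection condition tightened. The function name is 
illustrative, not an lz4.c identifier; the exact per-255 count appears in the earlier sketch.

#include <stddef.h>

/* Fast approximate bound check for the hot path: length>>8 stands in for the
   extra length bytes, which the thread's fuzzing suggests is good enough for
   the intermediate checks (the final run still needs the exact per-255 count). */
static int approxRunFits(const char* op, const char* oend, size_t length)
{
    /* Reject only when the estimated end goes strictly past oend: a result
       landing exactly on oend fills the buffer and must be accepted. */
    return !(op + length + (length >> 8) > oend);
}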

Original comment by andrew.m...@gmail.com on 24 Aug 2012 at 4:50

Attachments:

GoogleCodeExporter commented 9 years ago
Hi Andrew

Thanks for the very detailed analysis.
Yes, this behavior is correct.

A fast encoder such as LZ4 has many trade-offs everywhere in its code. It is 
tuned for speed, and as a consequence, a lot of "approximations" (lost 
opportunities to win a few bytes here and there) are accepted in favor of speed.

This behavior falls into that category:
if the compressed data would normally fit into XXX bytes, and the encoder is 
given an output buffer of exactly XXX bytes, then it may fail, 
because its "write check" algorithm is only a fast approximation.

Obviously, a critical "write error" is not acceptable, 
but missing an opportunity to compress data by a few bytes is: it just makes 
the compression ratio slightly worse. That's part of the trade-off.

The proposed approximation in r77 is fairly close to optimal: if the output 
buffer is in the range of XXX+N bytes, with N a very small number, then the 
check will always work.

Now, of course, should a better check be possible without compromising speed, I 
will gladly welcome it, and it will be integrated into the main source trunk.

Best Regards

Original comment by yann.col...@gmail.com on 24 Aug 2012 at 8:59

GoogleCodeExporter commented 9 years ago
I added tests for nearly enough space to the fuzzer. The more complex math is 
needed when encoding lastRun to ensure that it doesn't run out of buffer, but 
after some time compressing random data it appears the other two checks are 
safe with the approximation. This patch has much better performance as a result 
- only lastRun gets the complex check, and so far it never overruns the buffer 
or fails when the compressed output should fit.

Original comment by andrew.m...@gmail.com on 24 Aug 2012 at 12:18

Attachments:

GoogleCodeExporter commented 9 years ago
Great! Thanks Andrew.
I'll look into it.

Original comment by yann.col...@gmail.com on 24 Aug 2012 at 12:26

GoogleCodeExporter commented 9 years ago
Please find attached a proposed RC78,
which takes your proposed modification
for the exact output buffer check into consideration.

I've used this opportunity to also include your proposed fuzzer test tool,
which has been modified in order to become (hopefully) more portable.

Obviously, the proposed RC78 successfully passes the fuzzer test.

Would you mind having a look and telling me if it works for you?

Regards

Original comment by yann.col...@gmail.com on 4 Sep 2012 at 2:08

Attachments:

GoogleCodeExporter commented 9 years ago
The final test for 64K also needs to use (lastRun-RUN_MASK+255)/255 to avoid 
overrun, or else use the >= condition and occasionally fail for the case of 
exactly enough space. The fuzzer won't catch this failure as-is, but if you add 
a compression that uses just slightly under len bytes for output, it will 
overrun.
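
For concreteness, a quick worked check of that formula (RUN_MASK is 15 in the LZ4 
format; the lengths below are just example values):

lastRun = 15  : (15 - 15 + 255)/255 = 1 length byte, so 1 token + 1 + 15 = 17 output bytes
lastRun = 270 : (270 - 15 + 255)/255 = 2 length bytes, so 1 token + 2 + 270 = 273 output bytes

With exactly that many bytes left before oend, a "fits if <= oend" test accepts and fills 
the buffer to the last byte, while a ">= oend" rejection refuses a compression that 
actually fits.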

It's a shame there's no easily portable way to get the time or some other data 
to use as a seed for the PRNG. Keeping the PRNG in the fuzzer itself is a big 
help for portability, but with a static init every fuzzer run uses the same 
data. Other than that and the extra test, the improved/portable fuzzer looks 
fine to me - but I can only test on Linux.
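
For what it's worth, a minimal sketch of one way to vary the seed between runs while 
keeping the PRNG inside the fuzzer. The LCG constants and names here are illustrative 
and not the shipped fuzzer code; time() from <time.h> is assumed as the automatic 
fallback, with a manual override for reproducible runs.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static unsigned int prngState;

/* Simple 32-bit LCG kept inside the fuzzer so results do not depend on the
   platform's rand() implementation; the low bits of such a generator are
   weak, so they are discarded. */
static unsigned int fuzzerRand(void)
{
    prngState = prngState * 1664525u + 1013904223u;
    return prngState >> 5;
}

int main(int argc, char** argv)
{
    /* Seed from the command line when given (reproducible runs), otherwise
       from the clock so successive runs exercise different data. */
    unsigned int seed = (argc > 1) ? (unsigned int)strtoul(argv[1], NULL, 10)
                                   : (unsigned int)time(NULL);
    prngState = seed;
    printf("fuzzer seed = %u, first value = %u\n", seed, fuzzerRand());
    return 0;
}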

Original comment by andrew.m...@gmail.com on 6 Sep 2012 at 9:01

GoogleCodeExporter commented 9 years ago
Hi Andrew

I finally had a bit of time to work again on the issue.
The following zip package contains a few modifications which try to address 
your latest comments:

- The fuzzer is upgraded to test both:
==> Strict output size (which must work)
==> Output size - 1 (which must fail)
Both conditions are verified on each loop of the test (a sketch of such a loop 
follows this list).

- The input seed can either be selected manually or determined automatically by 
the program (not exactly random, but close enough by the standard of this test).

- The LZ4 64K end condition test has been modified according to your recommendation.
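
A minimal sketch of the dual check mentioned in the first item above, assuming the 
compressed size has already been measured by an unconstrained pass over the same input. 
LZ4_compress_limitedOutput() is the limited-output entry point this issue is about; the 
helper and buffer names are purely illustrative.

#include <assert.h>

/* Limited-output compressor under test (declared in lz4.h): it returns the
   number of bytes written, or 0 when the data does not fit in maxOutputSize. */
extern int LZ4_compress_limitedOutput(const char* source, char* dest,
                                      int inputSize, int maxOutputSize);

/* compressedSize is assumed to come from a previous, unconstrained compression
   of the same input. */
static void checkExactAndShort(const char* input, int inputSize,
                               char* work, int compressedSize)
{
    /* Exactly enough room: must succeed and produce the same size. */
    assert(LZ4_compress_limitedOutput(input, work, inputSize,
                                      compressedSize) == compressedSize);

    /* One byte less than needed: must fail. */
    assert(LZ4_compress_limitedOutput(input, work, inputSize,
                                      compressedSize - 1) == 0);
}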

With all these elements in place, I've been testing the updated LZ4 with the 
fuzzer many times (with each run meaning ~250K tests), and all tests have been 
successful so far. 
It seems LZ4 now detects output buffer boundaries precisely, instead 
of being a bit over-cautious as it has been up to now.

Regards

Original comment by yann.col...@gmail.com on 18 Oct 2012 at 8:08

Attachments:

GoogleCodeExporter commented 9 years ago
Corrected in r79.
Thanks for the clear report, detailed correction, and test instructions.

Original comment by yann.col...@gmail.com on 24 Oct 2012 at 1:03