lz4 / lz4-java

LZ4 compression for Java
Apache License 2.0

Implements ByteBuffer support for all LZ4 methods #49

Closed · blambov closed this 9 years ago

blambov commented 10 years ago

Addresses Issue #9.

Covers all implementations:

Additional changes applied:

The efficiency of the JNI and Unsafe methods on byte buffers is on par with those on arrays, with some additional benefit in the case of JNI, direct buffers and active garbage collection. The Safe methods are slower due to the ByteBuffer access overheads.
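For illustration, here is a minimal round-trip sketch using the positional ByteBuffer overloads this pull request adds, wired through the existing LZ4Factory API (buffer sizes and data are arbitrary):

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    import net.jpountz.lz4.LZ4Compressor;
    import net.jpountz.lz4.LZ4Factory;
    import net.jpountz.lz4.LZ4FastDecompressor;

    public class ByteBufferRoundTrip {
      public static void main(String[] args) {
        LZ4Factory factory = LZ4Factory.fastestInstance();
        LZ4Compressor compressor = factory.fastCompressor();
        LZ4FastDecompressor decompressor = factory.fastDecompressor();

        byte[] data = "some data to compress".getBytes(StandardCharsets.UTF_8);

        ByteBuffer src = ByteBuffer.allocateDirect(data.length);
        src.put(data);
        src.flip();

        // Positional overloads: the positions of both buffers are advanced.
        ByteBuffer compressed = ByteBuffer.allocateDirect(compressor.maxCompressedLength(data.length));
        compressor.compress(src, compressed);
        compressed.flip();

        // The fast decompressor needs the exact decompressed size, given here by dest.remaining().
        ByteBuffer restored = ByteBuffer.allocateDirect(data.length);
        decompressor.decompress(compressed, restored);
        restored.flip();
      }
    }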

belliottsmith commented 10 years ago

@jpountz any chance this might be integrated some time soon? We're keen to use it in Apache Cassandra.

jpountz commented 9 years ago

Hi @blambov, thanks for the pull request, it looks great in general and I know how much work such changes require! I have some questions/comments:

blambov commented 9 years ago

API-wise, maybe we don't need methods with offsets on byte buffers such as int decompress(ByteBuffer src, int srcOff, int srcLen, ByteBuffer dest, int destOff)? void decompress(ByteBuffer src, ByteBuffer dest) should be enough since ByteBuffers already carry an offset and a length?

The reason I added a separate version with offsets is to permit thread-safe concurrent access to the same buffer. If you use the position/limit in a buffer, the buffer immediately becomes a mutable resource that needs exclusive access. To use such a buffer from multiple threads (which we need, e.g. to compress multiple sections of the same large memory space concurrently), one usually duplicates the buffer, which is extra overhead we could do without.
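As an illustrative sketch of the two approaches (the helper names are made up): option 1 duplicates the shared buffer so each thread owns a private position/limit, while option 2 uses the offset-based overload and never touches the shared buffer's state:

    import java.nio.ByteBuffer;

    import net.jpountz.lz4.LZ4Compressor;
    import net.jpountz.lz4.LZ4Factory;

    final class ConcurrentSections {
      static final LZ4Compressor COMPRESSOR = LZ4Factory.fastestInstance().fastCompressor();

      // Option 1: duplicate the shared source so this thread owns its own position/limit.
      static int compressSectionWithDuplicate(ByteBuffer sharedSrc, int off, int len,
                                              ByteBuffer privateDest) {
        ByteBuffer slice = sharedSrc.duplicate();   // one extra object per call
        slice.position(off);
        slice.limit(off + len);
        int before = privateDest.position();
        COMPRESSOR.compress(slice, privateDest);    // positions of slice/privateDest move
        return privateDest.position() - before;
      }

      // Option 2: offset-based overload; the shared buffer's position/limit are never
      // touched, so many threads can compress different sections of it concurrently.
      static int compressSection(ByteBuffer sharedSrc, int off, int len,
                                 ByteBuffer privateDest, int destOff) {
        return COMPRESSOR.compress(sharedSrc, off, len, privateDest, destOff,
                                   privateDest.capacity() - destOff);
      }
    }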

I think it would be nice to document the impact of compression/decompression on the ByteBuffers (is the position moved?)

There are "Neither buffer's position is moved."/"The positions in both buffers are moved to reflect the bytes read/written." sentences in the JavaDoc in LZ4Compressor, LZ4FastDecompressor, LZ4SafeDecompressor. Is there a specific place in code that you have in mind, or do you want more verbose explanations? Do you want me to add something about reusing buffers from multiple threads as well?

Did you measure the improvement on 32-bits hardware with your specialization? I assumed that the method signatures with an int just forwarded to the impls with a long? Or maybe the improvement comes from another place that I missed?

The int/long versions exist only in the unsafe implementations. Whenever they access memory they apply an offset to a pointer base (where the pointer base can change at any point in the execution due to GC). The pointer base changes size between 32-bit and 64-bit JVMs, but the offset does not, as Java has no pointer-sized integer type. That normally isn't a problem, since arrays never exceed 32-bit sizes. Once direct buffers come into play, however, things change: the pointer base has to be null and the offset has to hold the buffer's whole address in memory, so it has to be a long on a 64-bit JVM. The code does a lot of arithmetic on these offsets, and I think that is what ends up making it almost twice as slow with longs on a 32-bit JVM (~220 MB/s vs. ~400 MB/s with the included test). There are other ways to fix this (e.g. pointer base + long base + int offset addressing, or using long offsets only for direct buffers), but this seems to be the fastest option.
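For readers unfamiliar with the addressing involved, here is a condensed sketch (not the actual template code) of the difference, using sun.misc.Unsafe; the helper names are illustrative:

    import java.lang.reflect.Field;
    import java.nio.Buffer;
    import java.nio.ByteBuffer;

    import sun.misc.Unsafe;

    final class UnsafeAddressing {
      static final Unsafe UNSAFE;
      static {
        try {
          Field f = Unsafe.class.getDeclaredField("theUnsafe");
          f.setAccessible(true);
          UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
          throw new ExceptionInInitializerError(e);
        }
      }

      // Heap array: non-null base object plus a small offset. An int offset is enough
      // because array indices never exceed 2^31-1, and the arithmetic stays 32-bit.
      static byte readFromArray(byte[] base, int off) {
        return UNSAFE.getByte(base, Unsafe.ARRAY_BYTE_BASE_OFFSET + off);
      }

      // Direct buffer: null base object, and the "offset" is the buffer's absolute
      // address in memory, which needs a long on a 64-bit JVM. All offset arithmetic
      // then has to be done in longs, which is where the ~2x cost on a 32-bit JVM
      // mentioned above comes from.
      static byte readFromDirect(ByteBuffer directBuf, long off) throws ReflectiveOperationException {
        Field addressField = Buffer.class.getDeclaredField("address");
        addressField.setAccessible(true);
        long address = addressField.getLong(directBuf);
        return UNSAFE.getByte(null, address + off);
      }
    }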

The fact that tests are duplicated makes them harder to maintain; maybe we could just add ByteBuffer testing to testRoundTrip(byte[], int, int, LZ4Compressor, LZ4FastDecompressor, LZ4SafeDecompressor) in LZ4Test? All test cases end up delegating to that method. This would also make it possible to check that, e.g., byte arrays and byte buffers are compressed the same way.

I'll see if I can improve this today.

blambov commented 9 years ago

The fact that tests are duplicated makes them harder to maintain; maybe we could just add ByteBuffer testing to testRoundTrip(byte[], int, int, LZ4Compressor, LZ4FastDecompressor, LZ4SafeDecompressor) in LZ4Test? All test cases end up delegating to that method. This would also make it possible to check that, e.g., byte arrays and byte buffers are compressed the same way.

This is now done.
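For reference, a hedged sketch of the kind of consolidated check described above; the real LZ4Test uses a Tester abstraction, so the names below are illustrative, but the idea is one round-trip helper that also verifies the ByteBuffer path produces exactly the same bytes as the array path:

    import static org.junit.Assert.assertArrayEquals;
    import static org.junit.Assert.assertEquals;

    import java.nio.ByteBuffer;
    import java.util.Arrays;

    import net.jpountz.lz4.LZ4Compressor;
    import net.jpountz.lz4.LZ4FastDecompressor;

    public class RoundTripSketch {

      static void roundTrip(byte[] data, LZ4Compressor compressor, LZ4FastDecompressor decompressor) {
        // Array path.
        byte[] compressed = new byte[compressor.maxCompressedLength(data.length)];
        int compressedLen = compressor.compress(data, 0, data.length, compressed, 0, compressed.length);

        // ByteBuffer path (direct buffers here; heap buffers would be exercised the same way).
        ByteBuffer src = ByteBuffer.allocateDirect(data.length);
        src.put(data);
        src.flip();
        ByteBuffer dest = ByteBuffer.allocateDirect(compressed.length);
        int bbCompressedLen = compressor.compress(src, 0, data.length, dest, 0, dest.capacity());

        // Byte arrays and byte buffers must be compressed the same way.
        assertEquals(compressedLen, bbCompressedLen);
        byte[] bbCompressed = new byte[bbCompressedLen];
        dest.get(bbCompressed);   // offset-based overload leaves dest's position at 0
        assertArrayEquals(Arrays.copyOf(compressed, compressedLen), bbCompressed);

        // And the result must decompress back to the original input.
        byte[] restored = new byte[data.length];
        decompressor.decompress(compressed, 0, restored, 0, data.length);
        assertArrayEquals(data, restored);
      }
    }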

jpountz commented 9 years ago

Is there a specific place in code that you have in mind, or do you want more verbose explanations?

No thanks, that's great. I don't know how I missed it!

Thanks for the changes to the tests, it looks great.

I think that's what ends up making it almost twice slower with longs on a 32-bit JVM (~220MB/s vs ~400 with the included test)

I don't know how important it is for Cassandra, but if you don't care too much about performance on 32-bit systems, I am personally OK with not even trying to optimize for them. I tend to be more scared of the potential bugs this could create, which I would not find since I almost always run a 64-bit JVM, than I am tempted by the performance improvements. It could also make the code easier to maintain.

I just played with your changes but I am getting some test failures because of LZ4HCJavaUnsafeLongCompressor:

java.lang.AssertionError: -535436463
    at __randomizedtesting.SeedInfo.seed([A8CD5DCB9C22D61B:7A2D888E3FFC4FB5]:0)
    at net.jpountz.lz4.LZ4HCJavaUnsafeLongCompressor$HashTable.addHash(LZ4HCJavaUnsafeLongCompressor.java:56)
    at net.jpountz.lz4.LZ4HCJavaUnsafeLongCompressor$HashTable.insert(LZ4HCJavaUnsafeLongCompressor.java:66)
    at net.jpountz.lz4.LZ4HCJavaUnsafeLongCompressor$HashTable.insertAndFindBestMatch(LZ4HCJavaUnsafeLongCompressor.java:76)
    at net.jpountz.lz4.LZ4HCJavaUnsafeLongCompressor.compressUnchecked(LZ4HCJavaUnsafeLongCompressor.java:172)
    at net.jpountz.lz4.LZ4HCJavaUnsafeLongCompressor.compress(LZ4HCJavaUnsafeLongCompressor.java:327)
    at net.jpountz.lz4.LZ4Test$ByteBufferTester.compress(LZ4Test.java:310)
    at net.jpountz.lz4.LZ4Test$ByteBufferTester.compress(LZ4Test.java:1)
    at net.jpountz.lz4.LZ4Test.testRoundTrip(LZ4Test.java:117)
    at net.jpountz.lz4.LZ4Test.testRoundTrip(LZ4Test.java:361)
    at net.jpountz.lz4.LZ4Test.testRoundTrip(LZ4Test.java:368)
    at net.jpountz.lz4.LZ4Test.testRoundTrip(LZ4Test.java:377)
    at net.jpountz.lz4.LZ4Test.testRoundTrip(LZ4Test.java:382)
    at net.jpountz.lz4.LZ4Test.testAllEqual(LZ4Test.java:482)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1559)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.access$600(RandomizedRunner.java:79)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:737)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:358)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:782)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:442)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:746)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$3.evaluate(RandomizedRunner.java:648)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$4.evaluate(RandomizedRunner.java:682)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:693)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:358)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:782)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$2.evaluate(ThreadLeakControl.java:385)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSuite(RandomizedRunner.java:556)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.access$200(RandomizedRunner.java:79)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$1.run(RandomizedRunner.java:492)

I think the cast to an int is not correct in addHash and things should work just fine by removing this cast and making delta a long?
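As an illustration of that suspicion (not the actual HashTable code): once a direct buffer lives above the 2 GB address boundary, narrowing a long offset to an int produces a negative value of the kind shown in the assertion message:

    public class IntCastOverflow {
      public static void main(String[] args) {
        // Absolute address of a hypothetical direct buffer on a 64-bit JVM, above
        // the 2 GB boundary, so its low 32 bits have the sign bit set.
        long off = 0x1_8000_0000L;

        // Narrowing to int yields a negative value, which is the kind of number the
        // assertion above tripped on; keeping the arithmetic in longs avoids it.
        int narrowed = (int) off;
        System.out.println(narrowed);   // -2147483648
      }
    }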

blambov commented 9 years ago

Thanks for catching this. Apparently Windows will not hand out large addresses until you actually use that much memory. This is now fixed, and a new "large" test has been added to make sure this and similar problems are also caught on Windows. I moved it to a separate test target as it requires 6 GB of memory (to test both negative ints and truncation).

I don't know how important it is for Cassandra, but if you don't care too much about performance on 32-bit systems, I am personally OK with not even trying to optimize for them. I tend to be more scared of the potential bugs this could create, which I would not find since I almost always run a 64-bit JVM, than I am tempted by the performance improvements. It could also make the code easier to maintain.

Well, the int offset versions of the templates will be pretty well tested on a 64-bit JVM as they will be used for the Safe implementations. These will always require 32-bit array indices.

Even without special 32-bit unsafe code there will still have to be 32-bit (for Safe) and 64-bit (for Unsafe with direct buffers) instances of the templates; all the template complexity and maintenance pain will still be there, we will only instantiate it one less time. I personally don't think that's enough reason to accept a 2x hit. I don't think that is a rare target either, as people still often use 32-bit JVMs to save memory.

jpountz commented 9 years ago

I merged the change, thanks!

people still often use 32-bit JVMs to save memory

Is it still true with the compressed oops optimization on?

blambov commented 9 years ago

Thank you for resyncing and merging this. One small thing: in unsafe_utils the new readLongLE method takes an int offset while it should be @{OffsetType}. This is not currently a problem as it is not used by any of the Long code, but you may want to fix it as it might cause an unpleasant surprise the first time someone tries to use it.

Is it still true with the compressed oops optimization on?

Probably not as much, but as an example Google was using 32-bit JVMs by default until fairly recently.

jpountz commented 9 years ago

One small thing: in unsafe_utils the new readLongLE method takes an int offset while it should be @{OffsetType}.

Oops, I merged too quickly with the xxh64 change, thanks for the catch!

Probably not as much, but as an example Google was using 32-bit JVMs by default until fairly recently.

Interesting.

jpountz commented 9 years ago

@blambov FYI I was playing with compressing ByteBuffers and it seems possible to make the JVM segfault by providing a read-only heap ByteBuffer. The problem seems to be that we assume that buffers that don't have an array are direct:

    if (srcBuf.hasArray()) {
      src = srcBuf.array();
      srcOff += srcBuf.arrayOffset();
    } else {
      src = null;
      srcOff += getBufferOffsetFromNull(srcBuf);
    }

Here is a patch that makes the tests segfault:

diff --git a/src/java/net/jpountz/lz4/LZ4Factory.java b/src/java/net/jpountz/lz4/LZ4Factory.java
index 7478c41..d8d7487 100644
--- a/src/java/net/jpountz/lz4/LZ4Factory.java
+++ b/src/java/net/jpountz/lz4/LZ4Factory.java
@@ -15,6 +15,7 @@ package net.jpountz.lz4;
  */

 import java.lang.reflect.Field;
+import java.nio.ByteBuffer;
 import java.util.Arrays;

 import net.jpountz.util.Native;
diff --git a/src/test/net/jpountz/lz4/AbstractLZ4RoundtripTest.java b/src/test/net/jpountz/lz4/AbstractLZ4RoundtripTest.java
index 1728cb4..8f2e6a3 100644
--- a/src/test/net/jpountz/lz4/AbstractLZ4RoundtripTest.java
+++ b/src/test/net/jpountz/lz4/AbstractLZ4RoundtripTest.java
@@ -90,7 +90,11 @@ public abstract class AbstractLZ4RoundtripTest extends AbstractLZ4Test {
       abstract ByteBuffer allocate(int len);

       ByteBuffer copy(byte[] src) {
-        return allocate(src.length).put(src);
+        ByteBuffer copy = allocate(src.length).put(src);
+        if (randomBoolean()) {
+          copy = copy.asReadOnlyBuffer();
+        }
+        return copy;
       }

       ByteBuffer slice(ByteBuffer src, int start, int end) {
blambov commented 9 years ago

Sorry that I didn't clear this up. I will make a separate patch for read-only buffers. How would you prefer to handle them:

jpountz commented 9 years ago

Good questions... I wanted to add ByteBuffer support to the xxhash APIs yesterday and faced a similar issue. I ended up falling back to the plain Java impl when the buffer is not direct and doesn't expose an array (see XXHash32JNI for instance), so that it uses the ByteBuffer APIs. Maybe we should do the same here? I'm also wondering whether we should add checks that the destination buffer is not read-only before starting to (de)compress.
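A sketch of that dispatch, with illustrative names standing in for the real generated (de)compressor paths:

    import java.nio.ByteBuffer;
    import java.nio.ReadOnlyBufferException;

    final class BufferDispatch {

      static void checkWritable(ByteBuffer dest) {
        // Fail fast instead of crashing or failing halfway through (de)compression.
        if (dest.isReadOnly()) {
          throw new ReadOnlyBufferException();
        }
      }

      static int compress(ByteBuffer src, ByteBuffer dest) {
        checkWritable(dest);
        if (src.hasArray() && dest.hasArray()) {
          return compressHeap(src, dest);          // array-backed path
        } else if (src.isDirect() && dest.isDirect()) {
          return compressDirect(src, dest);        // JNI / Unsafe path on raw addresses
        } else {
          // Read-only heap buffers (and other exotic buffers) expose neither an array
          // nor a direct address, so fall back to an implementation that only uses the
          // ByteBuffer API.
          return compressWithByteBufferApi(src, dest);
        }
      }

      // Illustrative stand-ins for the three concrete paths.
      private static int compressHeap(ByteBuffer src, ByteBuffer dest) { return 0; }
      private static int compressDirect(ByteBuffer src, ByteBuffer dest) { return 0; }
      private static int compressWithByteBufferApi(ByteBuffer src, ByteBuffer dest) { return 0; }
    }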

jpountz commented 9 years ago

I just ran the tests on a 32-bit VM and got failures because of an assertion error:

[junit4:junit4]    > Throwable #1: java.lang.AssertionError: -1279983627
[junit4:junit4]    >    at __randomizedtesting.SeedInfo.seed([92358F656E1C083D:324ADA26DFF33C73]:0)
[junit4:junit4]    >    at net.jpountz.lz4.LZ4HCJavaUnsafeCompressor$HashTable.addHash(LZ4HCJavaUnsafeCompressor.java:65)
[junit4:junit4]    >    at net.jpountz.lz4.LZ4HCJavaUnsafeCompressor$HashTable.insert(LZ4HCJavaUnsafeCompressor.java:75)
[junit4:junit4]    >    at net.jpountz.lz4.LZ4HCJavaUnsafeCompressor$HashTable.insertAndFindBestMatch(LZ4HCJavaUnsafeCompressor.java:85)
[junit4:junit4]    >    at net.jpountz.lz4.LZ4HCJavaUnsafeCompressor.compressUnchecked(LZ4HCJavaUnsafeCompressor.java:181)
[junit4:junit4]    >    at net.jpountz.lz4.LZ4HCJavaUnsafeCompressor.compress(LZ4HCJavaUnsafeCompressor.java:337)
[junit4:junit4]    >    at net.jpountz.lz4.AbstractLZ4RoundtripTest$ByteBufferTester.compress(AbstractLZ4RoundtripTest.java:118)
[junit4:junit4]    >    at net.jpountz.lz4.AbstractLZ4RoundtripTest$ByteBufferTester.compress(AbstractLZ4RoundtripTest.java:92)
[junit4:junit4]    >    at net.jpountz.lz4.AbstractLZ4RoundtripTest.testRoundTrip(AbstractLZ4RoundtripTest.java:227)
[junit4:junit4]    >    at net.jpountz.lz4.AbstractLZ4RoundtripTest.testRoundTrip(AbstractLZ4RoundtripTest.java:310)
[junit4:junit4]    >    at net.jpountz.lz4.AbstractLZ4RoundtripTest.testRoundTrip(AbstractLZ4RoundtripTest.java:318)
[junit4:junit4]    >    at net.jpountz.lz4.AbstractLZ4RoundtripTest.testRoundTrip(AbstractLZ4RoundtripTest.java:327)
[junit4:junit4]    >    at net.jpountz.lz4.AbstractLZ4RoundtripTest.testRoundTrip(AbstractLZ4RoundtripTest.java:332)
[junit4:junit4]    >    at net.jpountz.lz4.AbstractLZ4RoundtripTest.testRoundTrip(AbstractLZ4RoundtripTest.java:337)
[junit4:junit4]    >    at net.jpountz.lz4.LZ4Test.testRoundtripPic(LZ4Test.java:118)

I will try to understand what happens here. It seems to be due to an overflow of the current address in the byte buffer.

drcrallen commented 9 years ago

@jpountz: any news on whether this is stable enough to get into a release? I'm interested in using it with our ByteBuffers in Druid's CompressedObjectStrategy.