lz4 / lz4-java

LZ4 compression for Java
Apache License 2.0

Deadlock in LZ4Factory #152

Open patelh opened 4 years ago

patelh commented 4 years ago

We are using version 1.5.1 and have seen multiple instances of a deadlock in LZ4Factory. The Spark pipeline hangs and we end up killing and restarting it. It doesn't happen every time. In this case, we see 8 threads blocked.

"shuffle-server-5-4" #183 daemon prio=5 os_prio=0 tid=0x00007f45e5369000 nid=0x1fe6 waiting for monitor entry [0x00007f45910d4000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at net.jpountz.lz4.LZ4Factory.nativeInstance(LZ4Factory.java:83)
    - waiting to lock <0x00000003c2f8ddf8> (a java.lang.Class for net.jpountz.lz4.LZ4Factory)
    at net.jpountz.lz4.LZ4Factory.fastestInstance(LZ4Factory.java:157)
    at net.jpountz.lz4.LZ4BlockOutputStream.<init>(LZ4BlockOutputStream.java:138)
    at org.apache.spark.io.LZ4CompressionCodec.compressedOutputStream(CompressionCodec.scala:117)
    at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:156)
    at org.apache.spark.serializer.SerializerManager.dataSerializeWithExplicitClassTag(SerializerManager.scala:193)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:610)
    at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:585)
    at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:585)
"shuffle-server-5-3" #173 daemon prio=5 os_prio=0 tid=0x00007f45e53a0000 nid=0x1eee waiting for monitor entry [0x00007f45924d9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at net.jpountz.lz4.LZ4Factory.nativeInstance(LZ4Factory.java:83)
    - waiting to lock <0x00000003c2f8ddf8> (a java.lang.Class for net.jpountz.lz4.LZ4Factory)
    at net.jpountz.lz4.LZ4Factory.fastestInstance(LZ4Factory.java:157)
    at net.jpountz.lz4.LZ4BlockOutputStream.<init>(LZ4BlockOutputStream.java:138)
    at org.apache.spark.io.LZ4CompressionCodec.compressedOutputStream(CompressionCodec.scala:117)
    at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:156)
    at org.apache.spark.serializer.SerializerManager.dataSerializeWithExplicitClassTag(SerializerManager.scala:193)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:610)
    at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:585)
    at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:585)
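
Both traces are blocked at the same place: LZ4Factory.nativeInstance() waiting on the LZ4Factory class monitor. Below is a minimal sketch of that pattern (simplified and renamed, not the actual lz4-java sources) showing why every new LZ4BlockOutputStream funnels through a single class-level lock:

```java
// A sketch of the lazy, class-synchronized factory lookup the stack traces
// point at. Names are illustrative; this is not a copy of LZ4Factory.
class FactorySketch {

  private static FactorySketch NATIVE_INSTANCE;

  // "static synchronized" locks the Class object -- the same monitor the
  // blocked shuffle-server threads are waiting for in the dump above.
  static synchronized FactorySketch nativeInstance() {
    if (NATIVE_INSTANCE == null) {
      NATIVE_INSTANCE = new FactorySketch(); // the real code resolves the JNI bindings here
    }
    return NATIVE_INSTANCE;
  }

  // fastestInstance() funnels into nativeInstance(), so every new
  // compressed stream takes the class lock at least once.
  static FactorySketch fastestInstance() {
    try {
      return nativeInstance();
    } catch (Throwable t) {
      return new FactorySketch(); // pure-Java fallback in the real library
    }
  }

  public static void main(String[] args) {
    // Many threads resolving the factory concurrently all queue on the class lock.
    for (int i = 0; i < 8; i++) {
      new Thread(FactorySketch::fastestInstance, "shuffle-server-sketch-" + i).start();
    }
  }
}
```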
patelh commented 4 years ago

For the time being, we've worked around this by caching the instance via a backport of https://github.com/apache/spark/pull/24905/files to Spark 2.3.2.
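
A rough Java sketch of the idea behind that workaround, i.e. resolving the factories once and reusing them for every stream (the actual backport is a Scala change inside Spark's LZ4CompressionCodec; the class and method names here are illustrative):

```java
import java.io.OutputStream;
import net.jpountz.lz4.LZ4BlockOutputStream;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.xxhash.XXHashFactory;

final class CachedLz4Streams {

  // Resolved once per JVM, so steady-state stream creation never goes through
  // the synchronized LZ4Factory / XXHashFactory lookups again.
  private static final LZ4Compressor COMPRESSOR =
      LZ4Factory.fastestInstance().fastCompressor();
  private static final XXHashFactory XXHASH = XXHashFactory.fastestInstance();

  static OutputStream wrap(OutputStream out, int blockSize) {
    return new LZ4BlockOutputStream(
        out, blockSize, COMPRESSOR,
        // A checksum is stateful, so a fresh one is created per stream;
        // only the factory lookups are hoisted to class initialization.
        XXHASH.newStreamingHash32(0x9747b28c).asChecksum(),
        /* syncFlush = */ false);
  }

  private CachedLz4Streams() {}
}
```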

odaira commented 4 years ago

I have not yet understood what is happening. Could you share the full thread dump from the time a deadlock happened?
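
If jstack isn't convenient at the moment of the hang, a full dump with lock-owner information can also be captured in-process via the standard ThreadMXBean API (a sketch; how it gets triggered is up to you):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class ThreadDumper {

  static String fullDump() {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    StringBuilder sb = new StringBuilder();
    // lockedMonitors/lockedSynchronizers = true shows which thread *holds*
    // the LZ4Factory class monitor, not just which threads are waiting on it.
    for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
      sb.append(info); // note: ThreadInfo.toString() truncates very deep stacks
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.print(fullDump());
  }
}
```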

patelh commented 4 years ago

The threads are all waiting on the same monitor. We haven't been able to root-cause this yet. It is possible the JVM was in a bad state due to heap corruption. We've upgraded Java to the latest release to see if we get different behavior.