linkedin / migz

Multithreaded, gzip-compatible compression and decompression, available as a platform-independent Java library and command-line utilities.

hang with certain block sizes and data sets #3

Closed · angainor closed this issue 4 years ago

angainor commented 4 years ago

I have a case where MiGz hangs when I set a block size below ~120*1024 bytes. It happens for one particular data set I found, not always, so it is input-specific. The code hangs inside os.write() with CPU usage at 0, so it seems nothing is being done. I suspect it hangs inside scheduleCurrentBlock(false);

Are there any constraints on the block size that can be used? Or is this a bug?

jeffpasternack commented 4 years ago

Hi Marcin,

Sorry you're having trouble, and thanks for the report. There shouldn't be any constraints on the block size used (except for extremely large sizes close to 2^31, which might run up against the JVM's array-size limits); the most likely cause of this is a race condition causing a deadlock somewhere in the code. It could be exacerbated by a low block size (which would result in more frequent inter-thread communication/synchronization) and, possibly, by your specific data set (by virtue of being larger or easier/harder to compress).

We have tests to guard against such concurrency bugs (and I've never previously observed a deadlock), but they are rather insidious and could be specific to your architecture, OS, JVM, or even specific hardware. Could you please let me know your operating system, your Java version/JVM, and how you're invoking MiGz?

angainor commented 4 years ago

Thanks for your answer, Jeff. The trick is that I run your library from inside MATLAB.

Operating System: Linux 5.3.0-45-generic #37-Ubuntu SMP Thu Mar 26 20:41:27 UTC 2020 x86_64
Java Version: Java 1.8.0_202-b08 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode

This is how I use MiGz:

            os = java.io.ByteArrayOutputStream();
            zos = com.linkedin.migz.MiGzOutputStream(os, 1, 50*1024);  % 1 compression thread, 50*1024-byte blocks
            zos.setCompressionLevel(3);
            zos.write(bytes);

As you can see, the problem also happens when using only 1 thread. With this data set it happens every time, so it doesn't seem to be related to thread synchronization or a deadlock. Getting a stack dump might be tricky, though, because MATLAB starts plenty of other threads. But I will try to have a look tomorrow. I guess I can also add some printf / debug statements in the code to see exactly where it happens. If you have any suggestions, please let me know.

jeffpasternack commented 4 years ago

Thanks, Marcin; this is very helpful.

When only a single compression thread is used, there is still potential for deadlock, because that thread is new and independent and the calling thread must communicate/synchronize with it. The fact that the hang happens consistently is very interesting, but since CPU usage is 0% while it hangs, a deadlock is still the most likely cause in my opinion (the code is clearly not trapped in a loop or similar).

There's a possibility that executing within MATLAB has something to do with the issue (for all I know, it hangs if Java throws an OOM exception :) ), but I doubt it. For now I can try to reproduce the issue using your configuration; if it's possible for you to share your data, that would also be helpful.

In terms of getting the stack traces, the easiest way is to attach a Java debugger to the hung process; I wouldn't suggest printfs or similar, as that would be far too time-consuming and cumbersome.
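
For example, something along these lines (a general sketch, not MiGz-specific; Thread.getAllStackTraces() is standard Java) prints every live thread's name and stack, so a stuck writer thread can be spotted among the many threads MATLAB starts:

    import java.util.Map;

    class DumpStacks {
        // Print every live thread's name and stack trace to stderr
        // (stderr auto-flushes, so the output survives a subsequent hang).
        static void dumpAllStacks() {
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                System.err.println(e.getKey());
                for (StackTraceElement frame : e.getValue()) {
                    System.err.println("    at " + frame);
                }
            }
        }
    }

If no code can be run inside the hung process, the JDK's jstack utility, given the process ID, dumps the same information from outside.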

angainor commented 4 years ago

Hi, Jeff! MATLAB starts ~100 threads on its own. Any idea how to get the TID of the affected Java thread?

In the meantime, the code seems to be hanging in _currentBuffer = _writeBufferPool.take();. These are the local variables at that point:

    System.out.println(" scheduleCurrentBlock " + block + " " + buff + " " + length);
    scheduleCurrentBlock 23 [B@59d2103b 51200

I get a printf after _threadPool.submit(), but I do not get a printf after _writeBufferPool.take();. Maybe that helps?
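
(For context: if _writeBufferPool is a BlockingQueue-style buffer pool, which the take() call suggests, then take() on an empty pool parks the calling thread indefinitely at 0% CPU. A minimal standalone illustration:)

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    class TakeBlocksForever {
        public static void main(String[] args) throws InterruptedException {
            // Not MiGz's actual code: take() on an empty BlockingQueue simply
            // parks the caller until some other thread put()s an element.
            BlockingQueue<byte[]> pool = new ArrayBlockingQueue<>(4);
            byte[] buffer = pool.take();  // hangs here, CPU usage ~0%
        }
    }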

jeffpasternack commented 4 years ago

Thanks, Marcin. System.out.println(...) is suspect because the output may not be flushed if, e.g., the program hangs or crashes (you may want to use System.err, which [by default] auto-flushes on each print). Still, this suggests that the hang occurs during compression rather than while "cleaning up" writer threads at the end, which is helpful.
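
As a general Java note (not MiGz-specific), an explicit flush makes stdout debug prints reliable even if the process hangs right afterwards:

    class FlushDemo {
        public static void main(String[] args) {
            System.out.println("scheduleCurrentBlock reached");
            System.out.flush();  // push buffered stdout out before any potential hang
            System.err.println("scheduleCurrentBlock reached");  // stderr auto-flushes by default
        }
    }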

What I've been able to do (using incompressible, random data and your MiGz configuration) is identify an error where the size of the DEFLATE-d incompressible block appears to be larger than it should be (per the relevant RFC), which causes an exception because MiGz (fortunately) checks for this problem "just in case".

The exception is then passed to the caller. If your data is, in part, incompressible, this may be the cause of the hang you're seeing (and would mesh with the issue being consistently reproducible, which a race condition typically wouldn't be). Since you're not using a try-with-resources or try-finally to close the stream when an exception occurs (closing shuts down all extant threads), the leftover writer thread could, depending on how MATLAB handles exceptions, result in a hang.
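
For reference, a closing pattern along these lines (a sketch reusing the constructor arguments from the MATLAB snippet above; the byte array is just a stand-in for real input) guarantees the writer threads are shut down even when write(...) throws:

    import com.linkedin.migz.MiGzOutputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    class MiGzCloseExample {
        public static void main(String[] args) throws IOException {
            byte[] bytes = new byte[200 * 1024];  // stand-in for the real input
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            // try-with-resources calls close() even if write(...) throws, which
            // shuts down MiGz's writer threads instead of leaving them dangling
            try (MiGzOutputStream zos = new MiGzOutputStream(os, 1, 50 * 1024)) {
                zos.setCompressionLevel(3);
                zos.write(bytes);
            }
            byte[] compressed = os.toByteArray();
        }
    }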

I'll update once we have a patch for the unexpectedly-large-block problem and hopefully that will prove to be a solution for your issue :)

angainor commented 4 years ago

Thanks, Jeff! This is very likely: part of my data is already compressed, so your scenario seems very plausible. Please let me know when you have a patch; I'll gladly test it!

jeffpasternack commented 4 years ago

Hi Marcin,

The underlying issue turned out to be that, although the relevant RFC implies that DEFLATE will use a 32KB block size for incompressible blocks, this is not a requirement, and zlib (which backs Java's gzip implementation) defaults to 16KB. A patch that corrects this worst-case compressed-size estimate accordingly has been committed, and the updated library (as v1.0.1) should be available in the central repository soon. Please let me know if this new version solves your problem.
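
To make the arithmetic concrete, here is a sketch under the assumption that incompressible data is emitted as DEFLATE "stored" blocks, each carrying a 5-byte header (1 byte of flags plus the 2-byte LEN and 2-byte NLEN fields):

    class WorstCaseEstimate {
        // Worst-case DEFLATE output size for incompressible input, assuming it
        // is emitted as "stored" blocks of storedBlockSize bytes, each adding
        // a 5-byte header.
        static long worstCaseDeflatedSize(long inputBytes, int storedBlockSize) {
            long blocks = Math.max(1, (inputBytes + storedBlockSize - 1) / storedBlockSize);
            return inputBytes + 5 * blocks;
        }

        public static void main(String[] args) {
            long n = 50 * 1024;  // one MiGz block, per the configuration in this thread
            // A bound computed with 32KB stored blocks undershoots what zlib
            // actually produces with its 16KB default, hence the failed check:
            System.out.println(worstCaseDeflatedSize(n, 32 * 1024));  // 51210
            System.out.println(worstCaseDeflatedSize(n, 16 * 1024));  // 51220
        }
    }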

Thanks, Jeff

angainor commented 4 years ago

Hi, Jeff! I tested the new release, and things work smoothly now :) Thanks a lot for the quick fix!

jeffpasternack commented 4 years ago

Thanks for confirming, Marcin, and thanks for raising the issue so we could fix this :)