Closed angainor closed 4 years ago
Hi Marcin,
Sorry you're having trouble, and thanks for the report. There shouldn't be any constraints on the block size used (except for extremely large sizes close to 2^31 which might run up against the JVM's array size limits); the most likely cause of this is a race condition causing a deadlock somewhere in the code. It could be exacerbated by a low block size (which would result in more frequent interthread communication/synchronization) and, possibly, your specific data set (by virtue of being larger or easier/harder to compress).
We have tests to guard against such concurrency bugs (and I've never previously observed a deadlock), but they are rather insidious and could be specific to your architecture, OS, JVM, or even specific hardware. Could you please let me know your OS and JVM version? If possible, could you also provide stack dumps for all threads while MiGz is deadlocked?
If we're able to replicate the hang we should be able to fix it.
Thank you, Jeff
Thanks for your answer, Jeff. The trick is that I run your library from inside MATLAB:
Operating System: Linux 5.3.0-45-generic #37-Ubuntu SMP Thu Mar 26 20:41:27 UTC 2020 x86_64
Java Version: Java 1.8.0_202-b08 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode
This is how I use MiGz:
os = java.io.ByteArrayOutputStream();
zos = com.linkedin.migz.MiGzOutputStream(os, 1, 50*1024);
zos.setCompressionLevel(3);
zos.write(bytes);
As you see, the problem also happens when using only 1 thread. It always happens for this data set, so it doesn't seem to be related to thread synchronization / a deadlock. Getting a stack dump might be tricky, though, because there are plenty of other threads started by MATLAB, but I will try to have a look tomorrow. I guess I can also add some printf / debug statements in the code to see exactly where it happens. If you have any suggestions, please let me know.
Thanks Marcin--this is very helpful.
When only a single compression thread is used, there is still the potential for deadlock, because this is a new, independent thread and the calling thread must communicate/synchronize with it. The fact that it consistently happens is very interesting, but since CPU usage is 0% while it hangs, a deadlock is still the most likely cause in my opinion (since it's clearly not trapped in a loop or similar).
There's a possibility that executing within MATLAB might have something to do with the issue (for all I know it hangs if Java throws an OOM exception :) ), but I doubt it. For now I can try to reproduce the issue using your configuration; if it's possible for you to share your data, that would also be helpful.
In terms of getting the stack traces, the easiest way to do this is to attach a Java debugger to the hung process; I wouldn't suggest printfs or similar as that would be far too time consuming/cumbersome.
Hi, Jeff! MATLAB starts ~100 threads on its own. Any idea how to get the TID of the affected Java thread?
In the meantime, the code seems to be hanging in _currentBuffer = _writeBufferPool.take(). These are the local variables at that point:
System.out.println(" scheduleCurrentBlock " + block + " " + buff + " " + length);
scheduleCurrentBlock 23 [B@59d2103b 51200
I get a printf after _threadPool.submit(), but I do not get one after _writeBufferPool.take(). Maybe that helps?
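Regarding finding the affected thread among MATLAB's ~100: one option that avoids attaching a debugger is to dump every JVM thread's stack programmatically. The sketch below (a generic JVM utility, not part of MiGz) prints each thread's name, state, and frames; a thread stuck in a blocking take() would typically show as WAITING with the relevant frames near the top.

```java
import java.util.Map;

public class ThreadDump {
    // Print the name, state, and stack frames of every live JVM thread.
    // A thread deadlocked inside MiGz would typically appear as
    // WAITING or BLOCKED, with migz-related frames at the top.
    public static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append(t.getName()).append(" [").append(t.getState()).append("]\n");
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Since MATLAB exposes Java directly, you can also call java.lang.Thread.getAllStackTraces() from the MATLAB command line and inspect the returned map while the process is hung.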
Thanks Marcin--System.out.println(...) is suspect because the output may not be flushed if, e.g., the program hangs or crashes (you may want to use System.err, which [by default] auto-flushes on each print). Still, this suggests that the hang occurs during compression rather than while "cleaning up" writer threads at the end, which is helpful.
What I've been able to do (using incompressible, random data and your MiGz configuration) is identify an error where the size of the DEFLATE-d incompressible block appears to be larger than it should be (per the relevant RFC), which causes an exception because MiGz (fortunately) checks for this problem "just in case".
The exception is then passed to the caller. If your data is, in part, incompressible, this may be the cause of the hang you're seeing (and would mesh with the issue being consistently reproducible, which a race condition typically wouldn't be). Since you're not using a try-with-resources or try-finally to close the stream when an exception occurs (closing shuts down all extant threads), the leftover writer thread could, depending on how MATLAB handles exceptions, result in a hang.
I'll update once we have a patch for the unexpectedly-large-block problem and hopefully that will prove to be a solution for your issue :)
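On the stream-closing point above: a minimal, runnable sketch of the try-with-resources pattern, using the JDK's GZIPOutputStream so the snippet is self-contained (MiGz itself is not on the classpath here). The same shape applies to MiGzOutputStream, whose close() shuts down its writer threads even when write() throws.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TryWithResourcesGzip {
    // Compress with try-with-resources so close() always runs,
    // even if write() throws an exception mid-stream.
    public static byte[] compress(byte[] bytes) throws IOException {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (GZIPOutputStream zos = new GZIPOutputStream(os)) {
            zos.write(bytes);
        } // stream closed here regardless of exceptions
        return os.toByteArray();
    }

    // Round-trip helper: decompress a gzip byte array.
    public static byte[] decompress(byte[] gz) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

In the MATLAB snippet earlier in the thread, the equivalent safeguard would be wrapping the write in try/catch and calling zos.close() in the cleanup path, so a compression exception can't leave writer threads alive.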
Thanks, Jeff! This is very likely: part of my data is already compressed, so your scenario seems very plausible. Please let me know when you have a patch; I'll gladly test it!
Hi Marcin,
The underlying issue turned out to be that, although the relevant RFC implies that DEFLATE will use a 32KB block size for incompressible blocks, this is not a requirement, and zlib (which backs Java's gzip implementation) defaults to 16KB. A patch that corrects this worst-case compressed size estimate accordingly has been committed, and the updated library (as v1.0.1) should be available in the central repository soon. Please let me know if this new version solves your problem.
Thanks, Jeff
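To illustrate why the worst-case size estimate matters: deflating already-compressed or random data does not shrink it; zlib falls back to "stored" blocks, each of which adds a few header bytes, so the output ends up slightly larger than the input. The sketch below uses the JDK's Deflater (not MiGz code) to show this expansion with the compression level from the report.

```java
import java.util.zip.Deflater;

public class IncompressibleGrowth {
    // Deflate a byte array and return the compressed size. For random
    // (incompressible) input, zlib emits stored blocks, each adding a
    // small header, so the result is slightly LARGER than the input --
    // exceeding a naive "compressed <= original" size assumption.
    public static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(3); // level 3, as in the report
        deflater.setInput(input);
        deflater.finish();
        byte[] out = new byte[input.length + 1024]; // room for overhead
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(out, 0, out.length);
        }
        deflater.end();
        return total;
    }
}
```

For random input the expansion is only a handful of bytes per stored block plus the stream wrapper, but an output buffer sized from a too-tight worst-case estimate can still overflow, which is the class of bug the v1.0.1 patch addresses.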
Hi, Jeff! I tested the new release, things work smoothly now :) Thanks a lot for a quick fix!
Thanks for confirming, Marcin, and thanks for raising the issue so we could fix this :)
I have a case where MiGz hangs when I set a block size below ~120*1024 bytes. This happens for a single data set I found, not always, so it is input-specific. The code hangs inside os.write(); CPU usage is 0, so it seems nothing is being done. I suspect it hangs inside scheduleCurrentBlock(false);.
Are there any constraints on the block size that can be used, or is this a bug?