Closed: LordOfDragons closed this issue 3 years ago.
The libjpeg API does not support multithreading unless you are using a separate `jpeg_compress_struct` or `jpeg_decompress_struct` instance for each thread. If you are not doing that, then your application is relying on unsupported and undefined behavior. If you are maintaining a separate instance for each thread, then I have no other ideas. It's impossible for me to debug an issue without being able to reproduce it. My own application software uses multithreading with libjpeg-turbo, so I know it works.
Note that the crash you describe, which is likely due to a double `free()`, is exactly the sort of issue that can occur if you have multiple threads trying to use the same `jpeg_compress_struct` or `jpeg_decompress_struct` instance.
Each loading task (processed by one thread only) uses its own set of libjpeg structures, so multiple threads accessing the same files does not happen. Valgrind and the GCC sanitizers also report no invalid access to the data structures.
For this reason, whatever happens here has to come from inside libjpeg.
Let's try approaching this analytically. The crash happens inside `jpeg_abort`, which is called from inside `jpeg_finish_decompress`. Can you check what kinds of situations can potentially call `jpeg_abort` from inside `jpeg_finish_decompress`? Maybe this helps narrow down what situation is present, looking at the data structure captures above.
EDIT: My idea is to locate which non-null data structure member libjpeg attempted to free.
If you can provide code and example data that demonstrates the problem, then I can investigate it. Feel free to contact me via e-mail if you don't want to share the code/data publicly.
Otherwise, this project is in a negative funding situation right now. I had to borrow against anticipated funding from next year in order to release 2.1 beta, so I am not in a position to spend hours of unpaid labor trying to blindly reproduce a bug caused by code and data whose behavior is unknowable to me. At this point, there is no compelling evidence that this issue is caused by a bug in libjpeg-turbo.
As you can see from the libjpeg-turbo source code, `jpeg_finish_decompress()` always calls `jpeg_abort()`.
One thing to be aware of is the fact that you cannot continue calling libjpeg API functions after a libjpeg error has occurred, so your code needs to handle those errors. If an uncaught error occurred and you continued to make libjpeg API calls, that might also lead to a problem such as this. I am also unsure whether `jpeg_finish_decompress()` is safely idempotent, so it would be a good idea to avoid multiple successive calls to it with the same structure instance.
The code itself is open ( https://github.com/LordOfDragons/dragengine/tree/master/src/modules/image/jpeg/src ) but the media files are not.
I didn't know about that error-handling caveat. Can I somehow detect that libjpeg has failed, so I can disable the cleanup calls?
Bearing in mind that I didn't develop the libjpeg API, the only way of which I am aware that an application can catch errors from it is by using `setjmp()`, as demonstrated in example.txt. The TurboJPEG API provides a more straightforward return-code-based error handling mechanism and may be a better choice for your program, assuming it doesn't need the more advanced libjpeg API features.
I don't think so. It's basically just reading/writing images en-block. All processing happens elsewhere. Are there any portability caveats with the turbo-jpeg API?
No portability caveats of which I'm aware, but it does bear mentioning that the TurboJPEG SDK is not distributed by very many OSes that distribute libjpeg-turbo (although you can always install one of our official packages in order to get the SDK). The main functional caveats with the TurboJPEG API are that it doesn't support:
For most common use cases, though, it should suffice.
Yeah, decompressing/compressing is done entirely from memory stream classes, which can be backed by a single file stream but are often streams from compressed archives. I'll stick with the original jpeg API then and see if I can figure out what happens here.
I've now used LLDB to reproduce the crash. The backtrace looks like this:
* thread #5, name = 'deigde', stop reason = signal SIGABRT
frame #0: 0x00007ffff70ef664 libc.so.6`raise + 164
libc.so.6`raise:
-> 0x7ffff70ef664 <+164>: movq 0x108(%rsp), %rax
0x7ffff70ef66c <+172>: xorq %fs:0x28, %rax
0x7ffff70ef675 <+181>: jne 0x7ffff70ef69c ; <+220>
0x7ffff70ef677 <+183>: movl %r8d, %eax
(lldb) bt
* thread #5, name = 'deigde', stop reason = signal SIGABRT
* frame #0: 0x00007ffff70ef664 libc.so.6`raise + 164
frame #1: 0x00007ffff70d9537 libc.so.6`abort + 274
frame #2: 0x00007ffff71306d0 libc.so.6`__libc_message + 592
frame #3: 0x00007ffff713824a libc.so.6`malloc_printerr + 26
frame #4: 0x00007ffff7138a6c libc.so.6`unlink_chunk.isra.0 + 156
frame #5: 0x00007ffff713a1cb libc.so.6`_int_free + 1659
frame #6: 0x00007ffff6d94865 libjpeg.so.62`free_pool + 101
frame #7: 0x00007ffff6d78401 libjpeg.so.62`jpeg_abort + 33
frame #8: 0x00007ffff6d7c491 libjpeg.so.62`jpeg_finish_decompress + 113
frame #9: 0x00007ffff0e81041 libimgjpeg.so`deJpegModule::LoadImage(decBaseFileReader&, deImage&, deBaseImageInfo&) at deJpegModule.cpp:224:25
This means the `jpeg_finish_decompress` causing the crash was called during a regular run, with no libjpeg errors occurring. I'll see if I can somehow get more information out of it.
Really weird. That looks like a double free or some other attempt to free memory that isn't allocated, but I don't know why valgrind isn't detecting it. Have you tried running Helgrind? That might help root out race conditions in your code that could somehow be causing memory corruption.
It's possible valgrind influences the execution timing in a way that makes the problem hard to get a hold of. I'll see if I can get something out of Helgrind, though.
(post content no longer valid)
If I could reproduce it, I'd be happy to fix it.
At this point, what is probably going to be necessary to get it fixed is for you to break out your JPEG module into a C/C++ example that I can easily build and run to repro the issue. Otherwise, I have no insight into what's going on and no idea whether it's a legitimate bug in libjpeg-turbo or a usage issue.
Eventually I managed to get an ASAN trace of the problem. It originated from a different module, where a typo caused GCC to silently use a copy constructor. Why no memory tool (including ASAN) caught this one earlier is beyond me. I need to do longer testing, but right now I would say this issue can really be closed (or deleted, whatever your policies are).
Glad you were able to figure it out. GitHub doesn't allow for deleting posts, and I prefer to keep them around anyhow. Someone else may experience an issue with similar symptoms, and there is value to being able to Google the symptoms, stumble upon this bug report, and figure out that the symptoms are likely due to a bug in the calling program rather than a bug in libjpeg-turbo.
Running libjpeg-turbo 2.0.5-r2 on Gentoo (64-bit). The library is used in multi-threaded image loading code. At the time of the crash only one image is loading, but I mention multithreading anyway since it is unclear what causes the issue.
This crash happens randomly at a low occurrence rate, but sometimes multiple times per day. Other image libraries like libpng do not show this behavior, so I'm rather sure this is a libjpeg-specific problem. Due to the random nature it could be a multithreading problem. The code surrounding the libjpeg usage is reviewed and sanitizer-checked, so it is unlikely to be the cause.
This is the stack trace of one of the crashes:
The basic code flow is like this (pseudo-code C++):
These are GDB printing of the jpeg structures at crash time:
The crash thus happens after loading the image has finished and jpeg is being torn down.
I know a bug like this is difficult, especially since I cannot provide the images potentially causing it. But maybe something like this has happened before, or maybe the stack trace allows checking what kind of code in libjpeg caused the problem. I assume figuring out what can cause `jpeg_abort()` to be called from inside `jpeg_finish_decompress()` might help figure out what happens here.