libjpeg-turbo / libjpeg-turbo

Main libjpeg-turbo repository
https://libjpeg-turbo.org

received signal SIGABRT in jpeg_finish_decompress (MT) #485

Closed LordOfDragons closed 3 years ago

LordOfDragons commented 3 years ago

Running libjpeg-turbo 2.0.5-r2 on Gentoo (64-bit). The library is used in multi-threaded image-loading code. At the time of the crash only one image was loading, but I mention MT anyway since it is unclear what causes the issue.

This crash happens randomly at a low rate, but sometimes multiple times per day. Other image libraries like libpng do not show this behavior, so I'm rather sure it is a libjpeg-specific problem. Due to its random nature it could be an MT problem, but the code surrounding the libjpeg usage has been reviewed and sanitizer-checked, so it is unlikely to be the cause.

This is the stack trace of one of the crashes:

Thread 5 "deigde" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff4914700 (LWP 26515)]
0x00007ffff70f14d4 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff70f14d4 in raise () from /lib64/libc.so.6
#1  0x00007ffff70db53c in abort () from /lib64/libc.so.6
#2  0x00007ffff7133670 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff713aa8a in malloc_printerr () from /lib64/libc.so.6
#4  0x00007ffff713c9ea in _int_free () from /lib64/libc.so.6
#5  0x00007ffff6d98865 in free_pool () from /usr/lib64/libjpeg.so.62
#6  0x00007ffff6d7c401 in jpeg_abort () from /usr/lib64/libjpeg.so.62
#7  0x00007ffff6d80491 in jpeg_finish_decompress () from /usr/lib64/libjpeg.so.62
...

The basic code flow is like this (pseudo-code C++):

jpeg_create_decompress(decompress)
jpeg_read_header(decompress, TRUE)
jpeg_start_decompress(decompress)
while decompress.output_scanline < height:
  row = imageData + rowLength * decompress.output_scanline
  jpeg_read_scanlines(decompress, row, 1)
jpeg_finish_decompress(decompress)
jpeg_destroy_decompress(decompress)

These are GDB printing of the jpeg structures at crash time:

pDecompress = {
    err = 0x7fff6c001658,
    mem = 0x7fff6c0250d0,
    progress = 0x0,
    client_data = 0x7fff6c0013c0,
    is_decompressor = 1,
    global_state = 210,
    src = 0x7fff6c001700,
    image_width = 1024,
    image_height = 1024,
    num_components = 3,
    jpeg_color_space = JCS_YCbCr,
    out_color_space = JCS_RGB,
    scale_num = 1,
    scale_denom = 1,
    output_gamma = 1,
    buffered_image = 0,
    raw_data_out = 0,
    dct_method = JDCT_ISLOW,
    do_fancy_upsampling = 1,
    do_block_smoothing = 1,
    quantize_colors = 0,
    dither_mode = JDITHER_FS,
    two_pass_quantize = 1,
    desired_number_of_colors = 256,
    enable_1pass_quant = 0,
    enable_external_quant = 0,
    enable_2pass_quant = 0,
    output_width = 1024,
    output_height = 1024,
    out_color_components = 3,
    output_components = 3,
    rec_outbuf_height = 1,
    actual_number_of_colors = 0,
    colormap = 0x0,
    output_scanline = 1024,
    input_scan_number = 10,
    input_iMCU_row = 128,
    output_scan_number = 10,
    output_iMCU_row = 128,
    coef_bits = 0x7fff6c008140,
    quant_tbl_ptrs = {0x7fff6c0242c0, 0x7fff6c024360, 0x0, 0x0},
    dc_huff_tbl_ptrs = {0x7fff6c024400, 0x7fff6c024520, 0x0, 0x0},
    ac_huff_tbl_ptrs = {0x7fff6c024640, 0x7fff6c10fea0, 0x0, 0x0},
    data_precision = 8,
    comp_info = 0x7fff6c0074e0,
    progressive_mode = 1,
    arith_code = 0,
    arith_dc_L = '\000' <repeats 15 times>,
    arith_dc_U = '\001' <repeats 16 times>,
    arith_ac_K = '\005' <repeats 16 times>,
    restart_interval = 0,
    saw_JFIF_marker = 1,
    JFIF_major_version = 1 '\001',
    JFIF_minor_version = 1 '\001',
    density_unit = 1 '\001',
    X_density = 72,
    Y_density = 72,
    saw_Adobe_marker = 0,
    Adobe_transform = 0 '\000',
    CCIR601_sampling = 0,
    marker_list = 0x0,
    max_h_samp_factor = 2,
    max_v_samp_factor = 1,
    min_DCT_scaled_size = 8,
    total_iMCU_rows = 128,
    sample_range_limit = 0x7fff6c007700 "",
    comps_in_scan = 1,
    cur_comp_info = {0x7fff6c0074e0, 0x0, 0x0, 0x0},
    MCUs_per_row = 128,
    MCU_rows_in_scan = 128,
    blocks_in_MCU = 1,
    MCU_membership = {0, 0, 1, 2, 0, 0, 0, 0, 0, 0},
    Ss = 1,
    Se = 63,
    Ah = 1,
    Al = 0,
    unread_marker = 0,
    master = 0x7fff6c024220,
    main = 0x7fff6c0087a0,
    coef = 0x7fff6c008440,
    post = 0x7fff6c007d00,
    inputctl = 0x7fff6c0241e0,
    marker = 0x7fff6c0240c0,
    entropy = 0x7fff6c0080c0,
    idct = 0x7fff6c007d40,
    upsample = 0x7fff6c007bc0,
    cconvert = 0x7fff6c007b80,
    cquantize = 0x0
},
pErrorMgr = {
    error_exit = 0x7ffff0ea1486 <dejpegErrorExit(jpeg_common_struct*)>, 
    emit_message = 0x7ffff0ea1d60 <dejpegEmitMessage(jpeg_common_struct*, int)>,
    output_message = 0x0,
    format_message = 0x0,
    reset_error_mgr = 0x7ffff0ea1d70 <dejpegResetErrorMgr(jpeg_common_struct*)>,
    msg_code = 85,
    msg_parm = {i = {1, 63, 1, 0, 0, 0, 0, 0},
    s = "\001\000\000\000?\000\000\000\001", '\000' <repeats 70 times>},
    trace_level = 0,
    num_warnings = 0,
    jpeg_message_table = 0x0,
    last_jpeg_message = 0,
    addon_message_table = 0x0,
    first_addon_message = 0,
    last_addon_message = 0
},
pSourceMgr = {
    next_input_byte = 0x7fff6c047651 "\301\031\202\223a\031@\377",
    bytes_in_buffer = 0,
    init_source = 0x7ffff0ea1d80 <dejpegInitSource(jpeg_decompress_struct*)>,
    fill_input_buffer = 0x7ffff0ea1d90 <dejpegFillInputBuffer(jpeg_decompress_struct*)>,
    skip_input_data = 0x7ffff0ea1db0 <dejpegSkipInputData(jpeg_decompress_struct*, long)>,
    resync_to_restart = 0x0,
    term_source = 0x7ffff0ea22a0 <dejpegTermSource(jpeg_decompress_struct*)>
}

The crash thus happens after the image has finished loading, while the jpeg state is being torn down.

I know a bug like this is difficult, especially since I can not provide the images potentially causing it. But maybe something like this has happened before, or maybe the stack trace makes it possible to check what kind of code in libjpeg caused the problem. I assume figuring out what can cause jpeg_abort() to be called from inside jpeg_finish_decompress() might help figure out what happens here.

dcommander commented 3 years ago

The libjpeg API does not support multithreading unless you are using a separate jpeg_compress_struct or jpeg_decompress_struct instance for each thread. If you are not doing that, then your application is relying on unsupported and undefined behavior. If you are maintaining a separate instance for each thread, then I have no other ideas. It's impossible for me to debug an issue without being able to reproduce it. My own application software uses multithreading with libjpeg-turbo, so I know it works.

Note that the crash you describe, which is likely due to a double free(), is exactly the sort of issue that can occur if you have multiple threads trying to use the same jpeg_compress_struct or jpeg_decompress_struct instance.

LordOfDragons commented 3 years ago

Each loading task (processed by one thread only) uses its own set of libjpeg structures, so multiple threads never access the same instance. Valgrind and the GCC sanitizers also report no invalid access to the data structures.

For this reason whatever happens here has to come from inside libjpeg.

Let's try doing this analytically. The crash happens inside jpeg_abort(), which is called from inside jpeg_finish_decompress(). Can you check what kind of situations can potentially call jpeg_abort() from inside jpeg_finish_decompress()? Maybe this helps narrow down what situation is present, looking at the data structure captures above.

EDIT: My idea is to locate which data structure member was not 0 when the attempt was made to free it.

dcommander commented 3 years ago

If you can provide code and example data that demonstrates the problem, then I can investigate it. Feel free to contact me via e-mail if you don't want to share the code/data publicly.

Otherwise, this project is in a negative funding situation right now. I had to borrow against anticipated funding from next year in order to release 2.1 beta, so I am not in a position to spend hours of unpaid labor trying to blindly reproduce a bug caused by code and data whose behavior is unknowable to me. At this point, there is no compelling evidence that this issue is caused by a bug in libjpeg-turbo.

As you can see from the libjpeg-turbo source code, jpeg_finish_decompress() always calls jpeg_abort().

One thing to be aware of is the fact that you cannot continue calling libjpeg API functions after a libjpeg error has occurred, so your code needs to handle those errors. If an uncaught error occurred and you continued to make libjpeg API calls, that might also lead to a problem such as this. I am also unsure whether jpeg_finish_decompress() is safely idempotent, so it would be a good idea to avoid multiple successive calls to it with the same structure instance.

LordOfDragons commented 3 years ago

The code itself is open ( https://github.com/LordOfDragons/dragengine/tree/master/src/modules/image/jpeg/src ) but the media files are not.

I didn't know about that caveat with the error handling. Can I somehow detect that libjpeg failed, so I can disable the cleanup calls?

dcommander commented 3 years ago

Bearing in mind that I didn't develop the libjpeg API, the only way of which I am aware that an application can catch errors from it is by using setjmp(), as demonstrated in example.txt. The TurboJPEG API provides a more straightforward return-code-based error handling mechanism and may be a better choice for your program, assuming it doesn't need the more advanced libjpeg API features.

LordOfDragons commented 3 years ago

I don't think so. It's basically just reading/writing images en bloc. All processing happens elsewhere. Are there any portability caveats with the TurboJPEG API?

dcommander commented 3 years ago

No portability caveats of which I'm aware, but it does bear mentioning that the TurboJPEG SDK is not distributed by very many O/S's that distribute libjpeg-turbo (although you can always install one of our official packages in order to get the SDK.) The main functional caveats with the TurboJPEG API are that it doesn't support:

For most common use cases, though, it should suffice.

LordOfDragons commented 3 years ago

Yeah, decompressing/compressing is done entirely from memory stream classes, which can be backed by a single file stream but are often streams from compressed archives. I'll stick with the original jpeg API then and see if I can figure out what happens here.

LordOfDragons commented 3 years ago

I've now used LLDB to reproduce the crash. The backtrace looks like this:

* thread #5, name = 'deigde', stop reason = signal SIGABRT
    frame #0: 0x00007ffff70ef664 libc.so.6`raise + 164
libc.so.6`raise:
->  0x7ffff70ef664 <+164>: movq   0x108(%rsp), %rax
    0x7ffff70ef66c <+172>: xorq   %fs:0x28, %rax
    0x7ffff70ef675 <+181>: jne    0x7ffff70ef69c            ; <+220>
    0x7ffff70ef677 <+183>: movl   %r8d, %eax
(lldb) bt
* thread #5, name = 'deigde', stop reason = signal SIGABRT
  * frame #0: 0x00007ffff70ef664 libc.so.6`raise + 164
    frame #1: 0x00007ffff70d9537 libc.so.6`abort + 274
    frame #2: 0x00007ffff71306d0 libc.so.6`__libc_message + 592
    frame #3: 0x00007ffff713824a libc.so.6`malloc_printerr + 26
    frame #4: 0x00007ffff7138a6c libc.so.6`unlink_chunk.isra.0 + 156
    frame #5: 0x00007ffff713a1cb libc.so.6`_int_free + 1659
    frame #6: 0x00007ffff6d94865 libjpeg.so.62`free_pool + 101
    frame #7: 0x00007ffff6d78401 libjpeg.so.62`jpeg_abort + 33
    frame #8: 0x00007ffff6d7c491 libjpeg.so.62`jpeg_finish_decompress + 113
    frame #9: 0x00007ffff0e81041 libimgjpeg.so`deJpegModule::LoadImage(decBaseFileReader&, deImage&, deBaseImageInfo&) at deJpegModule.cpp:224:25

This means the jpeg_finish_decompress() causing the crash was called from a regular run of libjpeg that raised no errors. I'll see if I can somehow get more information out of it.

dcommander commented 3 years ago

Really weird. That looks like a double free or some other attempt to free memory that isn't allocated, but I don't know why valgrind isn't detecting it. Have you tried running Helgrind? That might help root out race conditions in your code that could somehow be causing memory corruption.

LordOfDragons commented 3 years ago

It's possible valgrind influences the execution order in a way that makes the problem hard to get a hold of. I'll see if I get something out of Helgrind though.

LordOfDragons commented 3 years ago

(post content no more valid)

dcommander commented 3 years ago

If I could reproduce it, I'd be happy to fix it.

dcommander commented 3 years ago

At this point, what is probably going to be necessary to get it fixed is for you to break out your JPEG module into a C/C++ example that I can easily build and run to repro the issue. Otherwise, I have no insight into what's going on and no idea whether it's a legitimate bug in libjpeg-turbo or a usage issue.

LordOfDragons commented 3 years ago

Eventually I managed to get an ASAN trace of the problem. It originated from a different module, where a typo caused GCC to silently use a copy constructor. Why no memory tool (including ASAN) was triggered by this one is beyond me. I need to do longer testing, but right now I would say this issue can really be closed (or deleted, whatever your policies are).

dcommander commented 3 years ago

Glad you were able to figure it out. GitHub doesn't allow for deleting posts, and I prefer to keep them around anyhow. Someone else may experience an issue with similar symptoms, and there is value to being able to Google the symptoms, stumble upon this bug report, and figure out that the symptoms are likely due to a bug in the calling program rather than a bug in libjpeg-turbo.