Service crashes randomly, may be related to pthread_cond_wait

tuanm0 commented 2 years ago

Hello,

I'm currently running version 10.4.1.6-1~bpo11+1, installed from apt, on Debian with kernel 5.10.0-16-amd64.

The service has been crashing randomly lately. Unfortunately, I don't have a core dump, and the logs say very little about the reason for the crash.

On every crash, the messages are the same. In the kernel log:

Jul 23 09:54:50 debian kernel: [123607.273509] poller[23513]: segfault at 7f2b26911000 ip 00007f2b2fbb8e48 sp 00007f2b2690c3c8 error 4 in libc-2.31.so[7f2b2fa7b000+14b000]

In syslog:

Jul 23 09:54:50 debian systemd[1]: rtpengine-daemon.service: Main process exited, code=killed, status=11/SEGV
Jul 23 09:54:50 debian systemd[1]: rtpengine-daemon.service: Failed with result 'signal'.
Jul 23 09:54:50 debian systemd[1]: rtpengine-daemon.service: Consumed 1d 47min 42.989s CPU time.
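
For readers hitting a similar crash, the kernel segfault line already carries some usable information even without a core. The sketch below decodes it, assuming the usual x86-64 Linux log layout and page-fault error-code bits (bit 0: page was present, bit 1: write access, bit 2: user mode). The function name and regular expression are illustrative only and are not part of rtpengine or any tool mentioned in this thread.

```python
import re

# Kernel segfault line copied from the report above.
LINE = ("poller[23513]: segfault at 7f2b26911000 ip 00007f2b2fbb8e48 "
        "sp 00007f2b2690c3c8 error 4 in libc-2.31.so[7f2b2fa7b000+14b000]")

# Assumed layout of the x86-64 Linux segfault message.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Hypothetical helper: extract the library offset and fault type."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"])
    return {
        "library": m["lib"],
        # Offset of the faulting instruction inside the mapped library.
        "offset": ip - base,
        # x86 page-fault error-code bits.
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


info = decode_segfault(LINE)
print(f"{info['library']} + {info['offset']:#x}")
print("user-mode" if info["user_mode"] else "kernel-mode",
      "write" if info["write"] else "read",
      "of a present page" if info["page_present"] else "of an unmapped page")
```

For the line above, this gives an offset of 0x13de48 inside libc-2.31.so and a user-mode read of an unmapped page. With libc debug symbols installed, that offset can be passed to a tool such as addr2line to find the libc function that faulted. It does not identify the calling code in the service itself.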

After attaching gdb to the currently crashed service, I got:

(gdb) bt
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x55555565bd48) at ../sysdeps/nptl/futex-internal.h:186
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55555565bd60, cond=0x55555565bd20) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x55555565bd20, mutex=0x55555565bd60) at pthread_cond_wait.c:638
#3  0x000055555557de08 in threads_join_all ()
#4  0x0000555555573d30 in main ()

For now I can't afford to run more tests or cause more crashes, as these devices are in production.

I hope somebody has ideas about the cause.

I also hope this helps others avoid such crashes. Any help would be much appreciated!

tuanm0 commented 2 years ago

Here is the full backtrace:

(gdb) bt full
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x55555565bd48) at ../sysdeps/nptl/futex-internal.h:186
        __ret = -512
        oldtype = 0
        err = <optimized out>
        oldtype = <optimized out>
        err = <optimized out>
        __ret = <optimized out>
        resultvar = <optimized out>
        __arg4 = <optimized out>
        __arg3 = <optimized out>
        __arg2 = <optimized out>
        __arg1 = <optimized out>
        _a4 = <optimized out>
        _a3 = <optimized out>
        _a2 = <optimized out>
        _a1 = <optimized out>
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55555565bd60, cond=0x55555565bd20) at pthread_cond_wait.c:508
        spin = 0
        buffer = {__routine = 0x7ffff5a0d540 <__condvar_cleanup_waiting>, __arg = 0x7fffffffe930, __canceltype = 0, __prev = 0x0}
        cbuffer = {wseq = 0, cond = 0x55555565bd20, mutex = 0x55555565bd60, private = 0}
        err = <optimized out>
        g = 4294961424
        flags = <optimized out>
        g1_start = <optimized out>
        maxspin = 0
        signals = <optimized out>
        result = 0
        wseq = 0
        seq = 0
        private = 0
        maxspin = <optimized out>
        err = <optimized out>
        result = <optimized out>
        wseq = <optimized out>
        g = <optimized out>
        seq = <optimized out>
        flags = <optimized out>
        private = <optimized out>
        signals = <optimized out>
        done = <optimized out>
        g1_start = <optimized out>
        spin = <optimized out>
        buffer = {__routine = <optimized out>, __arg = <optimized out>, __canceltype = <optimized out>, __prev = <optimized out>}
        cbuffer = {wseq = <optimized out>, cond = <optimized out>, mutex = <optimized out>, private = <optimized out>}
        s = <optimized out>
#2  __pthread_cond_wait (cond=0x55555565bd20, mutex=0x55555565bd60) at pthread_cond_wait.c:638
No locals.
#3  0x000055555557de08 in threads_join_all ()
No symbol table info available.
#4  0x0000555555573d30 in main ()
No symbol table info available.

rfuchs commented 2 years ago

How do you attach gdb to a crashed service without a core?

That backtrace is looking at the wrong thread. This is just the main thread waiting for the other threads to shut down. It's unlikely that you're actually looking at a crashed process, but just in case: the command to pull up a backtrace of all threads is thread apply all bt
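
A minimal sketch of collecting those backtraces non-interactively, assuming gdb is on the PATH and the process is still alive. The helper names and the PID are hypothetical. The gdb options used are -p (attach to a PID), -batch (exit after running the commands) and -ex (run one command).

```python
import shlex
import subprocess


def gdb_all_threads_cmd(pid):
    """Build a gdb command line that dumps a full backtrace of every thread."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt full"]


def dump_all_threads(pid):
    """Attach to a live process, collect all thread backtraces, then detach.

    Returns gdb's standard output as text. Attaching usually requires
    root or the same user as the target process.
    """
    result = subprocess.run(
        gdb_all_threads_cmd(pid),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# Show the command that would be run for a hypothetical PID.
print(shlex.join(gdb_all_threads_cmd(12345)))
```

This only works on a process that is still running. Once the process has been killed by SIGSEGV, a core dump is needed to inspect the thread that actually crashed.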

tuanm0 commented 2 years ago

Thank you for your reply. I will try it the next time the issue reproduces.

tuanm0 commented 2 years ago

As it turned out, it was a transcoding error. It stopped showing up when we disabled transcoding, though.

sipwise / rtpengine

Service crashes randomly, may be related to pthread_cond_wait #1516