sipwise / rtpengine

The Sipwise media proxy for Kamailio
GNU General Public License v3.0
778 stars 368 forks source link

RTPEngine randomly crash #1395

Open cicahrad opened 2 years ago

cicahrad commented 2 years ago

Hello, after i did make upgrade Debian from 10 to 11 RTPEngine randomly crash. Some times its runing few days and sometimes few hours and then crash whit core dump. I did try different RTP Engine version 10. and 9. but whit seme result.

Currenty I'm on version: 9.5.2.1+0~mr9.5.2.1 git-mr9.5.2-a82ec5ba

In log i found this message:

Nov  8 12:48:11 voip-gw3 rtpengine[799]: corrupted size vs. prev_size

and then in systemd log:

Nov  8 12:48:12 voip-gw3 systemd[1]: ngcp-rtpengine-daemon.service: Main process exited, code=killed, status=6/ABRT
Nov  8 12:48:13 voip-gw3 systemd[1]: ngcp-rtpengine-daemon.service: Failed with result 'signal'.
Nov  8 12:48:13 voip-gw3 systemd[1]: ngcp-rtpengine-daemon.service: Consumed 3h 2min 21.645s CPU time.

Here is gdb ouput of bt full on core dump:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
        set = {__val = {18446744067266837211, 0 <repeats 15 times>}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x00007f04050d4537 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0xf4000000e6, sa_sigaction = 0xf4000000e6}, sa_mask = {__val = {139653881540688, 139653881537984, 48, 139653881540464, 16, 139653881537952, 139653881537952, 32, 139655277784253, 32, 48, 1013612281857, 12884901925, 
              176093659255, 94489280721, 876173328434}}, sa_flags = 208, sa_restorer = 0x8000000073}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007f040512d768 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f040523be2d "%s\n") at ../sysdeps/posix/libc_fatal.c:155
        ap = {{gp_offset = 24, fp_offset = 32515, overflow_arg_area = 0x7f03f839dd50, reg_save_area = 0x7f03f839dce0}}
        fd = <optimized out>
        list = <optimized out>
        nlist = <optimized out>
        cp = <optimized out>
#3  0x00007f0405134a5a in malloc_printerr (str=str@entry=0x7f040523a020 "corrupted size vs. prev_size") at malloc.c:5347
No locals.
#4  0x00007f04051357a6 in unlink_chunk (p=p@entry=0x7f03b4003180, av=0x7f03b4000020) at malloc.c:1454
        fd = <optimized out>
        bk = <optimized out>
#5  0x00007f0405138951 in _int_realloc (av=av@entry=0x7f03b4000020, oldp=oldp@entry=0x7f03b4003070, oldsize=oldsize@entry=272, nb=528) at malloc.c:4594
        newp = 0x7f03b4003070
        newsize = 14704
        newmem = <optimized out>
        next = 0x7f03b4003180
        remainder = <optimized out>
        remainder_size = <optimized out>
        __PRETTY_FUNCTION__ = "_int_realloc"
        nextsize = <optimized out>
#6  0x00007f0405139c56 in __GI___libc_realloc (oldmem=0x7f03b4003080, bytes=512) at malloc.c:3235
        ar_ptr = 0x7f03b4000020
        nb = 139655241444577
        newp = <optimized out>
        hook = <optimized out>
        oldp = 0x7f03b4003070
        oldsize = 272
        __PRETTY_FUNCTION__ = "__libc_realloc"
#7  0x00007f04075f7de8 in g_realloc () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#8  0x00007f04076149b2 in g_string_set_size () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#9  0x000055fca889b2c8 in rtcp_sender_report (srrs=0x7f03f839ded0, rrs=0x7f03f839de90, octets=<optimized out>, packets=<optimized out>, ts=<optimized out>, ssrc_out=19425566, ssrc=<optimized out>, ssr=0x7f03f839deb0) at rtcp.c:1524
        ret = 0x7f03dc1ed2e0
        i = <optimized out>
        sr = 0x7f03b4003080
        n = <optimized out>
        sdes = 0x7f03b4003174
        ret = <optimized out>
        sr = <optimized out>
        i = <optimized out>
        n = <optimized out>
        __PRETTY_FUNCTION__ = "rtcp_sender_report"
        sdes = <optimized out>
        s = <optimized out>
        rr = <optimized out>
        se = <optimized out>
        tv_diff = <optimized out>
        ntp_middle_bits = <optimized out>
        jitter = <optimized out>
        lost = <optimized out>
        tot = <optimized out>
        si = <optimized out>
--Type <RET> for more, q to quit, c to continue without paging--
        srr = <optimized out>
        __cc = <optimized out>
#10 rtcp_send_report (media=media@entry=0x7f03d8280b60, ssrc_out=ssrc_out@entry=0x7f03980b4488) at rtcp.c:1591
        ps = <optimized out>
        other_media = <optimized out>
        rrs = {head = 0x0, tail = 0x0, length = 0}
        ssr = {ssrc = 19425566, ntp_msw = 3845360891, ntp_lsw = 4289456851, timestamp = 2387200, packet_count = 368, octet_count = 63296}
        srrs = {head = 0x7f03e00f3020, tail = 0x7f03e02133a0, length = 9}
        sr = <optimized out>
        srr = <optimized out>
#11 0x000055fca88d45bf in __rtcp_timer_run (q=<optimized out>, p=0x7f03e005e830) at codec.c:1229
        rt = <optimized out>
        media = 0x7f03d8280b60
        rtcp_timer = {tv_sec = <optimized out>, tv_usec = <optimized out>}
        ssrc_out = 0x7f03980b4488
#12 0x000055fca88e1858 in timerthread_queue_run_one (ttq=ttq@entry=0x55fca9ca4210, ttqe=ttqe@entry=0x7f03e005e830, run_func=<optimized out>) at timerthread.c:112
No locals.
#13 0x000055fca88e1d8b in timerthread_queue_run (ptr=0x55fca9ca4210) at timerthread.c:133
        ttqe = 0x7f03e005e830
        ret = <optimized out>
        ttq = <optimized out>
        next_send = {tv_sec = 0, tv_usec = 0}
        __PRETTY_FUNCTION__ = "timerthread_queue_run"
#14 0x000055fca88e1acd in timerthread_run (p=0x55fca892ec20 <codec_timers_thread>) at timerthread.c:54
        tt_obj = 0x55fca9ca4210
        sleeptime = <optimized out>
        tv = {tv_sec = <optimized out>, tv_usec = <optimized out>}
        tt = <optimized out>
        sleep = <optimized out>
#15 0x000055fca886fc92 in thread_detach_func (d=0x55fca9cbb820) at aux.c:210
        dt = 0x55fca9cbb820
        t = 0x7f03c0001ba0
        scheduler = <optimized out>
#16 0x00007f040527cea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
        ret = <optimized out>
        pd = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139655026177792, -1489945453339652766, 140726718131662, 140726718131663, 139655026172288, 2105344, 1537230448814148962, 1538732225785318754}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {
              prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = 0
#17 0x00007f04051acdef in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
rfuchs commented 2 years ago

This backtrace is from git-mr9.5.2-a82ec5ba ?

Is the location of the crash always the same between different core dumps?

cicahrad commented 2 years ago

This backtrace is from git-mr9.5.2-a82ec5ba ? yes it is.

Is the location of the crash always the same between different core dumps?

I did delete older ones. So I can't confirm. But I can switch traffic over rtpengine again and i'll see in next crash. Do you want crash on same version (git-mr9.5.2-a82ec5ba)?

rfuchs commented 2 years ago

I did delete older ones. So I can't confirm. But I can switch traffic over rtpengine again and i'll see in next crash. Do you want crash on same version (git-mr9.5.2-a82ec5ba)?

Yes that would be helpful, thanks

cicahrad commented 2 years ago

new crash on same point

here is gdb bt full output:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7fe236106700 (LWP 1460369))]
(gdb) bt full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
        set = {__val = {18446744067266837211, 0 <repeats 15 times>}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x00007fe240a2c537 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007fe240a85768 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fe240b93e2d "%s\n") at ../sysdeps/posix/libc_fatal.c:155
        ap = {{gp_offset = 24, fp_offset = 0, overflow_arg_area = 0x7fe236104c80, reg_save_area = 0x7fe236104c10}}
        fd = <optimized out>
        list = <optimized out>
        nlist = <optimized out>
        cp = <optimized out>
#3  0x00007fe240a8ca5a in malloc_printerr (str=str@entry=0x7fe240b96280 "corrupted size vs. prev_size while consolidating") at malloc.c:5347
No locals.
#4  0x00007fe240a8e12e in _int_free (av=0x7fe1f0000020, p=0x7fe1f0015e10, have_lock=<optimized out>) at malloc.c:4332
        size = 1504
        fb = <optimized out>
        nextchunk = 0x7fe1f00163f0
        nextsize = 1024
        nextinuse = <optimized out>
        prevsize = <optimized out>
        bck = <optimized out>
        fwd = <optimized out>
        __PRETTY_FUNCTION__ = "_int_free"
#5  0x00007fe242f663eb in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#6  0x00007fe242f67689 in g_slice_free1 () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#7  0x00007fe242f5918f in g_queue_foreach () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#8  0x00007fe242f59218 in g_queue_clear_full () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#9  0x00005570b086d7c7 in __free_ssrc_entry_call (ep=0x7fe21017ea20) at ssrc.c:51
        e = 0x7fe21017ea20
#10 0x00005570b081af25 in __obj_put (o=0x7fe21017ea20) at ../include/obj.h:183
No locals.
#11 __call_free (p=0x7fe214147940) at call.c:2826
        __cc = 0x7fe21000fe78
        c = 0x7fe214147940
        m = <optimized out>
        md = 0x0
        ps = 0x7fe21000fc00
        em = <optimized out>
        __PRETTY_FUNCTION__ = "__call_free"
#12 0x00005570b08192f5 in __obj_put (o=0x7fe214147940) at ../include/obj.h:183
No locals.
#13 log_info_clear () at log_funcs.h:22
No locals.
#14 kill_calls_timer (list=0x0, url=url@entry=0x0) at call.c:474
        dup_tags = 0x0
        cb_addr = <optimized out>
        ca = <optimized out>
        csl = <optimized out>
        cm = <optimized out>
        cd = <optimized out>
        url_prefix = <optimized out>
        url_suffix = <optimized out>
        xh = <optimized out>
        url_buf = " \000\000 \342\177\000\000@", '\000' <repeats 16 times>, "\020\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000+\023\030\342\177\000\000\320)\034\034\342\177\000\000\350\355\364B\342\177\000\000\200\251\n\314\341\177\000\000\064K\363B\342\177\000\000\000\002\000\000\000\000\000\000\200\251\n\314\341\177\000\000\000\002\000\000\000\000\000\000\034N\363B\342\177\000\000\030\000\000\000\000\000\000\000\000P\321\374\207\266^\023"
--Type <RET> for more, q to quit, c to continue without paging-- 
#15 0x00005570b081a06d in call_timer (ptr=<optimized out>) at call.c:705
        hlp = {del_timeout = 0x0, del_scheduled = 0x7fe220001b20, addr_sfd = 0x7fe1cc0aa980, transcoded_media = 6}
        i = <optimized out>
        l = 0x7fe20808aae0
        calls = 0x0
        ke = <optimized out>
        ps = <optimized out>
        sink = <optimized out>
        tmpstats = {packets = {p = <optimized out>}, bytes = {p = <optimized out>}, errors = {p = <optimized out>}, delay_min = <optimized out>, delay_avg = <optimized out>, delay_max = <optimized out>, in_tos_tclass = <optimized out>, foreign_sessions = {
            p = <optimized out>}, offers = {p = <optimized out>}, answers = {p = <optimized out>}, deletes = {p = <optimized out>}, transcoded_media = {p = <optimized out>}, ipv4_sessions = {p = <optimized out>}, ipv6_sessions = {p = <optimized out>}, mixed_sessions = {
            p = <optimized out>}}
        j = <optimized out>
        update = <optimized out>
        sfd = <optimized out>
        rs = <optimized out>
        pt = <optimized out>
        ep = {address = {family = 0x5570b08cbac0 <__socket_families>, u = {ipv4 = {s_addr = 128008921}, ipv6 = {__in6_u = {__u6_addr8 = "\331B\241\a", '\000' <repeats 11 times>, __u6_addr16 = {17113, 1953, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {128008921, 0, 0, 0}}}}}, 
          port = 10306}
        offers = <optimized out>
        answers = <optimized out>
        deletes = <optimized out>
        tv_start = {tv_sec = 1637764967, tv_usec = 78}
        run_diff = <optimized out>
        last_run = {tv_sec = 1637764967, tv_usec = 78}
        interval = 1000000
        next = <optimized out>
        tv_stop = {tv_sec = 1637764966, tv_usec = 1195}
        duration = <optimized out>
#16 0x00005570b080fa1a in poller_timers_run (p=0x5570b0c99740) at poller.c:367
        l = 0x5570b0c2b590
        ti = <optimized out>
#17 poller_timer_loop (d=0x5570b0c99740) at poller.c:596
        p = <optimized out>
        tv = {tv_sec = 1637764967, tv_usec = 77}
        wt = <optimized out>
#18 0x00005570b080fc92 in thread_detach_func (d=0x5570b0c728c0) at aux.c:210
        dt = 0x5570b0c728c0
        t = 0x7fe22c001bb0
        scheduler = <optimized out>
#19 0x00007fe240bd4ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
        ret = <optimized out>
        pd = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140609546381056, 806353311393508708, 140737001982190, 140737001982191, 140609546375552, 93942490774208, -795780129991037596, -795941509142942364}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {
              prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = 0
#20 0x00007fe240b04def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
szcom commented 2 years ago

Also after going to Debian 11 experienced a random chrashes on 10.2 and 10.3 builds from https://dfx.at/rtpengine/ With the following output:

rtpengine: ./lib/core-net/pollfd.c:73: _lws_change_pollfd: Assertion `wsi->position_in_fds_table < (int)pt->fds_count' failed.

may be related to http interface, polling /metrics and /ping without traffic can make it crash. no repro yet.

rfuchs commented 2 years ago

@szcom this is unrelated to what was reported in this issue. Please open a new one.

cicahrad commented 2 years ago

We were playing whit this issue. We tried change debian version, RTPEngine verision nothing help. As last we try run RTPEngine whitout kenrel module for packet forwarding. Now We are using userspace and RTPEngine has uptime 82days whitout any issue.

Problem will probabaly some where in kernel module. If you need more information write what yoiu want know.

We use RPTEngine whit OpenSIPs on Proxmox cluster