sipwise / rtpengine

The Sipwise media proxy for Kamailio
GNU General Public License v3.0
784 stars 368 forks source link

Segmentation fault #8

Closed qmex-mp closed 9 years ago

qmex-mp commented 10 years ago

Since updating from mediaproxy-ng to rtpengine (we had to because of Chrome Update). We experience segmentation faults, normaly on in two weeks. Sometimes at night with zero load, sometimes, under load. This week after update to last commit from July 2nd, Thursday and Friday under Heavy Load: No Problem, today with a few clients: 4 segmention faults in one day. Log from Kamailio:

RTP/AVP", "call-id": "62c60772-ca0e-5a0e-2ade-75139d3d0391", "received-from": [ "IP4", "94.220.217.63" ], "from-tag": "uMNSTxXhA5xLpVFGCDVV", "command": "delete" } Jun 25 19:08:48 konfwebrtc1 rtpengine[25185]: [62c60772-ca0e-5a0e-2ade-75139d3d0391] Deleting full call Jun 25 19:08:48 konfwebrtc1 rtpengine[25185]: [62c60772-ca0e-5a0e-2ade-75139d3d0391] Final packet stats: Jun 25 19:08:48 konfwebrtc1 rtpengine[25185]: [62c60772-ca0e-5a0e-2ade-75139d3d0391] --- Tag 'as3a1098b8', created 122:10 ago, in dialogue with 'uMNSTxXhA5xLpVFGCDVV' Jun 25 19:08:48 konfwebrtc1 rtpengine[25185]: [62c60772-ca0e-5a0e-2ade-75139d3d0391] ------ Media #1, port 30334 <> 94.220.217.63:55463, 381275 p, 64611108 b, 0 e Jun 25 19:08:48 konfwebrtc1 rtpengine[25185]: [62c60772-ca0e-5a0e-2ade-75139d3d0391] ------ Media #1, port 30335 <> 94.220.217.63:55463 (RTCP), 16601 p, 1751640 b, 0 e Jun 25 19:08:48 konfwebrtc1 rtpengine[25185]: [62c60772-ca0e-5a0e-2ade-75139d3d0391] --- Tag 'uMNSTxXhA5xLpVFGCDVV', created 122:10 ago, in dialogue with 'as3a1098b8' Jun 25 19:08:48 konfwebrtc1 rtpengine[25185]: [62c60772-ca0e-5a0e-2ade-75139d3d0391] ------ Media #1, port 30332 <> 172.16.16.104:11116, 366484 p, 64500728 b, 0 e Jun 25 19:08:48 konfwebrtc1 rtpengine[25185]: [62c60772-ca0e-5a0e-2ade-75139d3d0391] ------ Media #1, port 30333 <> 172.16.16.104:11117 (RTCP), 1465 p, 114270 b, 0 e Jun 25 19:08:48 konfwebrtc1 kernel: [4922154.524970] rtpengine[25190]: segfault at 18c7d50 ip 00000000018c7d50 sp 00007fa78ffd7498 error 15

If you are interested in a core dump, i can deliver one.

rfuchs commented 10 years ago

If you can send the core dump AND the binary, then I can see if I can pull anything out of it.

edel commented 10 years ago

Hi,

seeing same random segfaults in our end too. i'm sending the binary & core files in the email. thanks.

rfuchs commented 10 years ago

Backtrace from edel, for reference:

#0  asn1_ex_i2c (pval=0x7f760c002a68, cout=0x0, putype=0x7f7623be8d7c, it=0x7f762aca2220) at tasn_enc.c:599
#1  0x00007f762a9e3e7f in asn1_i2d_ex_primitive (pval=0x7f760c002a68, out=0x0, it=0x7f762aca2220, tag=-1, 
    aclass=<value optimized out>) at tasn_enc.c:518
#2  0x00007f762a9e40f8 in ASN1_item_ex_i2d (pval=0x7f760c002a68, out=0x0, it=0x7f762aca2220, tag=-1, aclass=0)
    at tasn_enc.c:158
#3  0x00007f762a9e4676 in asn1_template_ex_i2d (pval=0x7f760c002a68, out=0x0, tt=0x7f762aca78e8, 
    tag=<value optimized out>, iclass=0) at tasn_enc.c:413
#4  0x00007f762a9e433c in ASN1_item_ex_i2d (pval=0x7f760c000148, out=0x0, it=0x7f762aca1540, tag=16, aclass=0)
    at tasn_enc.c:231
#5  0x00007f762a9e4676 in asn1_template_ex_i2d (pval=0x7f760c000148, out=0x0, tt=0x7f762aca7dd0, 
    tag=<value optimized out>, iclass=0) at tasn_enc.c:413
#6  0x00007f762a9e433c in ASN1_item_ex_i2d (pval=0x7f760c0037b0, out=0x0, it=0x7f762aca1a00, tag=16, aclass=0)
    at tasn_enc.c:231
#7  0x00007f762a9e4676 in asn1_template_ex_i2d (pval=0x7f760c0037b0, out=0x0, tt=0x7f762aca7f20, 
    tag=<value optimized out>, iclass=0) at tasn_enc.c:413
#8  0x00007f762a9e433c in ASN1_item_ex_i2d (pval=0x7f7623be91a8, out=0x0, it=0x7f762aca1a40, tag=16, aclass=0)
    at tasn_enc.c:231
#9  0x00007f762a9e49d7 in asn1_item_flags_i2d (val=0x7f760c0037b0, out=0x7f7623be9208, it=0x7f762aca1a40, 
    flags=0) at tasn_enc.c:110
#10 0x00007f762a9dbcd8 in ASN1_item_digest (it=<value optimized out>, type=0x7f762aca7340, 
    asn=<value optimized out>, md=0x7f7623be9260 "", len=0x7f7623be923c) at a_digest.c:105
#11 0x0000000000422e71 in generic_func (o=0x7f7623be9260 "", x=0x7f760c0037b0) at dtls.c:286
#12 sha_256_func (o=0x7f7623be9260 "", x=0x7f760c0037b0) at dtls.c:303
#13 0x00000000004222ee in __dtls_hash (ps=0x7f75f40107f0) at dtls.h:80
#14 dtls_verify_cert (ps=0x7f75f40107f0) at dtls.c:366
#15 0x000000000040e2ef in __init_stream (ps=0x7f75f40107f0) at call.c:1547
#16 0x0000000000410cf1 in __init_streams (A=0x7f75f400e150, B=0x7f75f400e000, sp=0x0) at call.c:1581
#17 0x0000000000411511 in monologue_offer_answer (monologue=0x7f75f4004540, streams=0x7f7600000002, flags=
    0x7f7623be9420) at call.c:1988
#18 0x00000000004201d4 in call_offer_answer_ng (input=<value optimized out>, m=0xd98c00, output=
    0x7f7604051f08, opmode=OP_ANSWER) at call_interfaces.c:602
#19 0x0000000000415e9c in control_ng_incoming (obj=0xd9c820, buf=<value optimized out>, sin=0x7f7623bf9700, 
    addr=0x7f7623bf96c0 "127.0.0.1:46699") at control_ng.c:115
#20 0x00000000004158de in udp_listener_incoming (fd=7, p=0xd9aea0, x=<value optimized out>)
    at udp_listener.c:52
---Type <return> to continue, or q <return> to quit---
#21 0x000000000040b4ce in poller_poll (p=0xd97540, timeout=<value optimized out>) at poller.c:354
#22 0x000000000040722d in poller_loop (d=0xd97540) at main.c:542
#23 0x000000000040bb5f in thread_detach_func (d=<value optimized out>) at aux.c:160
#24 0x00007f762a2929d1 in start_thread (arg=0x7f7623bfa700) at pthread_create.c:301
#25 0x00007f7629fdfb5d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

It's crashing when doing a delayed verification of the x.509 certificate. My guess is that the x.509 struct has already been freed when it's doing the verification, which would explain the crash.

https://github.com/sipwise/rtpengine/commit/16cf79fcade246475dcad40a04a1d607a744dd23

This might fix it, but consider it completely untested and experimental, because it is. I don't have a SIP client that I can test this with.

qmex-mp commented 10 years ago

Hi Richard,

attached the core dump and binary from yesterday.

rfuchs commented 10 years ago

https://github.com/sipwise/rtpengine/commit/062256baec8d84b9b4cb68e6598e8bd9e2b0c9dd

This should fix the segfault reported by maxknax.

edel commented 10 years ago

hi Richard,

thank you for your reply. im still getting segfaults still. i'll test the latest commit and see if this helps.

rfuchs commented 10 years ago

Did either of you experience any more crashes?

edel commented 10 years ago

hi Richard,

sorry for the delay. i still experience crashes with16cf79f, but 062256b seems to fix the issue. i did not see single segfault for the last 2 or 3 days.

thanks!

On Tue, Jul 15, 2014 at 12:48 AM, Richard Fuchs notifications@github.com wrote:

Did either of you experience any more crashes?

— Reply to this email directly or view it on GitHub https://github.com/sipwise/rtpengine/issues/8#issuecomment-48917129.

edel

qmex-mp commented 10 years ago

Hi Richard,

we had this crashes like one time in two weeks (except for this special Saturday . . .). Since the Update we had no crash, in 2 – 3 Weeks, I will give you a reply.

rfuchs commented 10 years ago

Seems to be fixed, therefore closing

qmex-mp commented 10 years ago

Im sorry that i have to reopen this, but, today:

[14149.655738] rtpengine[2896]: segfault at 7f1aace18e50 ip 00007f1aaa04907d sp 00007f1aa5802e20 error 4 in libc-2.13.so[7f1aa9fd2000+182000] [14149.715131] rtpengine (2896) used greatest stack depth: 3816 bytes left

rfuchs commented 10 years ago

Please post a full backtrace, "thread apply all bt" in gdb

rfuchs commented 10 years ago

Backtrace for reference:


Thread 7 (Thread 0x7f1aa6015700 (LWP 2893)):
#0  0x00007f1aaa0ae743 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x000000000040b93b in poller_poll ()
#2  0x00000000004080fd in _start ()

Thread 6 (Thread 0x7f1aa6816700 (LWP 2892)):
#0  0x00007f1aaa0ae743 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x000000000040b93b in poller_poll ()
#2  0x00000000004080fd in _start ()

Thread 5 (Thread 0x7f1aaca64720 (LWP 2889)):
#0  0x00007f1aaa07e60d in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f1aaa0a8304 in usleep () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x0000000000407ee2 in main ()

Thread 4 (Thread 0x7f1a9ffff700 (LWP 2894)):
#0  0x00007f1aaa0ae743 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x000000000040b93b in poller_poll ()
#2  0x00000000004080fd in _start ()

Thread 3 (Thread 0x7f1aa7818700 (LWP 2890)):
#0  0x00007f1aaa00519b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f1aaa005243 in sigtimedwait () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00000000004081bb in _start ()

Thread 2 (Thread 0x7f1aa7017700 (LWP 2891)):
#0  0x00007f1aaa07e60d in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f1aaa0a8304 in usleep () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x000000000040c2a4 in poller_timers_wait_run ()
#3  0x000000000040812d in _start ()

Thread 1 (Thread 0x7f1aa5814700 (LWP 2896)):
#0  0x00007f1aaa04907d in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f1aaa04b4d4 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f1aaa04d1a2 in calloc () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f1aaa03de7d in open_memstream () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007f1aaa0aa22b in __vsyslog_chk () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f1aaa0aa803 in __syslog_chk () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x000000000042121c in ilog ()
#7  0x00000000004210d7 in dtls_connection_init ()
#8  0x000000000040e22a in ?? ()
#9  0x000000000040e59d in ?? ()
#10 0x00000000004106d6 in monologue_offer_answer ()
#11 0x000000000041ddbc in ?? ()
#12 0x00000000004154b5 in ?? ()
#13 0x0000000000414bfd in ?? ()
#14 0x000000000040bac6 in poller_poll ()
#15 0x00000000004080fd in _start ()

Unfortunately, all this points to is possibly an unspecified memory corruption. For a better chance for catching these in the future, I would suggest installing the libc6-dbg package, as well as setting the environment variables export MALLOC_CHECK_=2 and export G_SLICE=always-malloc before starting rtpengine.

qmex-mp commented 10 years ago

Ive adopted the two ENV Vars into the init script. I´ll keep you posted

vinzens commented 10 years ago

Hi, We are working with same version and found an similar segfault. Maybe it helps having more debug data. We can reproduce this issue on every call we make. Looks like it's related to TOS rewriting but it's default in config.

Syslog looks like: Aug 5 12:11:00 kaltz rtpengine[6282]: Got valid command from udp:XX.XX.XX.XX:42053: 3991_9 U b3ade0538a7d-z3yy3bx56fks 192.168.178.24 58758 fadjo48mnf;1 Aug 5 12:11:00 kaltz rtpengine[6282]: [b3ade0538a7d-z3yy3bx56fks] Creating new call Aug 5 12:11:00 kaltz kernel: : [ 1679.147574] rtpengine[6288]: segfault at 0 ip 0000000000410230 sp 00007fcc0fb812a0 error 4 in rtpengine[400000+2a000]

gdb says:

(gdb) thread apply all bt

Thread 7 (Thread 0x7fcc10393700 (LWP 6287)):
#0  0x00007fcc14c2d863 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
#1  0x000000000040bacb in poller_poll (p=p@entry=0x9171a0, timeout=timeout@entry=100) at poller.c:308
#2  0x000000000040821d in poller_loop (d=0x9171a0) at main.c:548
#3  0x000000000040c53f in thread_detach_func (d=0x91c5e0) at aux.c:160
#4  0x00007fcc14ee2b50 in start_thread (arg=<optimized out>) at pthread_create.c:304
#5  0x00007fcc14c2d20d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 6 (Thread 0x7fcc11395700 (LWP 6285)):
#0  0x00007fcc14c2d863 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
#1  0x000000000040bacb in poller_poll (p=p@entry=0x9171a0, timeout=timeout@entry=100) at poller.c:308
#2  0x000000000040821d in poller_loop (d=0x9171a0) at main.c:548
#3  0x000000000040c53f in thread_detach_func (d=0x916db0) at aux.c:160
#4  0x00007fcc14ee2b50 in start_thread (arg=<optimized out>) at pthread_create.c:304
#5  0x00007fcc14c2d20d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7fcc11b96700 (LWP 6284)):
#0  0x00007fcc14bfd6dd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007fcc14c27424 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:33
#2  0x000000000040c434 in poller_timers_wait_run (p=p@entry=0x9171a0, max=100000, max@entry=100) at poller.c:500
#3  0x000000000040824d in timer_loop (d=0x9171a0) at main.c:541
#4  0x000000000040c53f in thread_detach_func (d=0x917490) at aux.c:160
#5  0x00007fcc14ee2b50 in start_thread (arg=<optimized out>) at pthread_create.c:304
#6  0x00007fcc14c2d20d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7  0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7fcc10b94700 (LWP 6286)):
#0  0x00007fcc14c2d863 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
#1  0x000000000040bacb in poller_poll (p=p@entry=0x9171a0, timeout=timeout@entry=100) at poller.c:308
#2  0x000000000040821d in poller_loop (d=0x9171a0) at main.c:548
#3  0x000000000040c53f in thread_detach_func (d=0x917260) at aux.c:160
#4  0x00007fcc14ee2b50 in start_thread (arg=<optimized out>) at pthread_create.c:304
#5  0x00007fcc14c2d20d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7fcc12397700 (LWP 6283)):
#0  0x00007fcc14b8426b in do_sigtimedwait (set=0x7fcc12396e20, info=0x0, timeout=0x7fcc12396e10)
    at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigtimedwait.c:56
#1  0x00007fcc14b84313 in *__GI___sigtimedwait (set=<optimized out>, set@entry=0x7fcc12396e20, info=<optimized out>, 
    info@entry=0x0, timeout=<optimized out>, timeout@entry=0x7fcc12396e10)
    at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigtimedwait.c:83
#2  0x00000000004082db in sighandler (x=<optimized out>) at main.c:114
#3  0x000000000040c53f in thread_detach_func (d=0x917790) at aux.c:160
#4  0x00007fcc14ee2b50 in start_thread (arg=<optimized out>) at pthread_create.c:304
#5  0x00007fcc14c2d20d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7fcc175df720 (LWP 6282)):
#0  0x00007fcc14bfd6dd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007fcc14c27424 in usleep (useconds=<optimized out>, useconds@entry=100000)
    at ../sysdeps/unix/sysv/linux/usleep.c:33
#2  0x0000000000408002 in main (argc=1, argv=0x7fff198abd78) at main.c:568

Thread 1 (Thread 0x7fcc0fb92700 (LWP 6288)):
#0  __tos_change (call=<optimized out>, flags=<optimized out>) at call.c:1877
#1  monologue_offer_answer (monologue=monologue@entry=0x9203b0, streams=streams@entry=0x7fcc0fb81350, 
    flags=flags@entry=0x0) at call.c:1910
---Type <return> to continue, or q <return> to quit--- 
#2  0x000000000041efd9 in call_update_lookup_udp (out=out@entry=0x91fea0, m=0x91a050, opmode=opmode@entry=OP_OFFER)
    at call_interfaces.c:151
#3  0x000000000041f247 in call_update_udp (out=out@entry=0x91fea0, m=<optimized out>) at call_interfaces.c:180
#4  0x000000000041352c in control_udp_incoming (obj=0x918410, buf=<optimized out>, sin=0x7fcc0fb817a0, 
    addr=<optimized out>) at control_udp.c:85
#5  0x0000000000414e9d in udp_listener_incoming (fd=5, p=0x91c4c0, x=<optimized out>) at udp_listener.c:52
#6  0x000000000040bc56 in poller_poll (p=p@entry=0x9171a0, timeout=timeout@entry=100) at poller.c:354
#7  0x000000000040821d in poller_loop (d=0x9171a0) at main.c:548
#8  0x000000000040c53f in thread_detach_func (d=0x91eca0) at aux.c:160
#9  0x00007fcc14ee2b50 in start_thread (arg=<optimized out>) at pthread_create.c:304
#10 0x00007fcc14c2d20d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#11 0x00000
rfuchs commented 10 years ago

Hi, this is an unrelated crashed and surfaces when the NG protocol is not used. Fixed by https://github.com/sipwise/rtpengine/commit/9d001e3452b7b7c117a896f9deb7ba3a02138358

rfuchs commented 9 years ago

Is this still an open issue?

vinzens commented 9 years ago

Hi, sorry for that late reply. No, it's working fine for us. Thanks a lot!

qmex-mp commented 9 years ago

Since the system is running in full debug modeIt didn’t happen for a few months now.

rfuchs commented 9 years ago

Good enough for me