meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0

Video MCU test segfault #27

Closed phsultan closed 10 years ago

phsultan commented 10 years ago

Hi, I'm getting a segfault running the video MCU test example. Here is my testing environment:

My NAT configuration section in janus.cfg:

[nat]
public_ip = 1.2.3.4
stun_server = stun.voip.eutelia.it
stun_port = 3478

The echo test works perfectly, but the video MCU test makes Janus crash during ICE negotiation, right after the second participant enters the room (line 536 in ice.c). Here is the backtrace:

(gdb) bt
#0  0x000000000041a74f in janus_ice_cb_nice_recv (agent=0x7fe3d0036620, stream_id=2, component_id=1, len=88, buf=0x7fe39a1ebbd0 "\001\001", ice=0x7fe3d007dde0) at ice.c:536
#1  0x00007fe3e81a499e in component_emit_io_callback (component=0x7fe3d00a2640, buf=0x7fe39a1ebbd0 "\001\001", buf_len=88) at component.c:813
#2  0x00007fe3e81a9ada in component_io_cb (gsocket=<value optimized out>, condition=<value optimized out>, user_data=0x7fe39c002f40) at agent.c:3923
#3  0x00007fe3e7aabb06 in socket_source_dispatch (source=0x7fe3d007def0, callback=<value optimized out>, user_data=<value optimized out>) at gsocket.c:3165
#4  0x00007fe3e750f1c3 in g_main_dispatch (context=0x7fe3d0002a60) at gmain.c:3054
#5  g_main_context_dispatch (context=0x7fe3d0002a60) at gmain.c:3630
#6  0x00007fe3e75110c8 in g_main_context_iterate (context=0x7fe3d0002a60, block=1, dispatch=1, self=<value optimized out>) at gmain.c:3701
#7  0x00007fe3e7512295 in g_main_loop_run (loop=0x7fe3d0002100) at gmain.c:3895
#8  0x000000000041b180 in janus_ice_thread (data=0x7fe3a8002550) at ice.c:662
#9  0x00007fe3e7534b55 in g_thread_proxy (data=0x7fe39c0046d0) at gthread.c:798
#10 0x00007fe3e5c919d1 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fe3e59deb5d in clone () from /lib64/libc.so.6
(gdb)

Let me know if you need more information, and congrats on the great work, Lorenzo!

Philippe

lminiero commented 10 years ago

Hi Philippe,

have you done a make clean when changing the nodatachans flag? If you didn't, you actually used the same code.

Not sure about what's causing the crash, since it seems to be failing when accessing stream->handle, right after it checked that stream is actually not null. I'm wondering whether there may be some stack corruption there, even if that shouldn't be the case.

I'm unable to replicate the issue so causes may be different. Is there any additional information you can get out of gdb, e.g., whether stream is actually a valid janus_ice_stream instance and not some junk pointer?

Thanks for your kind words, BTW :-)

phsultan commented 10 years ago

Yep, I had run make clean before reinstalling. I'm indeed facing a stack corruption here, as shown by gdb:

#0  0x000000000041a74f in janus_ice_cb_nice_recv (agent=0x7fe3d0036620, stream_id=2, component_id=1, len=88, buf=0x7fe39a1ebbd0 "\001\001", ice=0x7fe3d007dde0) at ice.c:536
        component = 0x7fe3d007dde0
        __FUNCTION__ = "janus_ice_cb_nice_recv"
        stream = 0x2020200a2c343332
        handle = 0x7fe3d0036620
....

The memory pointer for stream cannot be accessed:

(gdb) x/x 0x2020200a2c343332
0x2020200a2c343332: Cannot access memory at address 0x2020200a2c343332
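As a side note, every byte of that bogus value is either printable ASCII or a newline, which can be a hint that text was written over the pointer. Here is a minimal sketch for decoding it, assuming a little-endian x86-64 host; the helper name `decode_ptr_bytes` is made up for illustration and is not part of Janus or gdb:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the 8 bytes of a pointer-sized value into
 * out in little-endian order (the in-memory order on x86-64), replacing
 * non-printable bytes with '.', then NUL-terminate.
 * out must have room for at least 9 chars. */
static void decode_ptr_bytes(uint64_t value, char out[9])
{
    for (size_t i = 0; i < 8; i++) {
        /* Extract byte i, starting from the least significant one. */
        unsigned char b = (unsigned char)((value >> (8 * i)) & 0xff);
        out[i] = (b >= 0x20 && b < 0x7f) ? (char)b : '.';
    }
    out[8] = '\0';
}
```

Called on 0x2020200a2c343332, this gives "234," followed by a '.' (standing in for the 0x0a newline) and three spaces, which looks more like a fragment of text than a heap address.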

Further up the stack (in component_io_cb), stream has a valid address though, so its value is likely modified somewhere along the path. Here is what stream looks like in component_io_cb:

(gdb) print {janus_ice_stream}0x7fe3d007d4a0
$2 = {handle = 0x0, stream_id = 2, cdone = 0, audio_ssrc = 1, video_ssrc = 0, audio_ssrc_peer = 2617254752, video_ssrc_peer = 32739, payload_type = 0, 
  dtls_role = JANUS_DTLS_ROLE_SERVER, ruser = 0x6246396b <Address 0x6246396b out of bounds>, rpass = 0x0, components = 0x0, rtp_component = 0x0, rtcp_component = 0x0, noerrorlog = 0, 
  mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, 
    __align = 0}}
(gdb) 

Since those functions are exported by libnice and glib, I'd like to have my versions match yours. Can you tell me which versions of libnice and glib you are using? I'm running libnice-0.1.7 and glib-2.36.4.

Thanks!

Philippe

lminiero commented 10 years ago

libnice is 0.1.4 and glib2 is 2.34.2: my Fedora 18 doesn't have the latest stuff :-) What OS are you using? I'll try to replicate the issue with a VM.

phsultan commented 10 years ago

I'm on CentOS 6.5.

lminiero commented 10 years ago

Looking at the dump again, it may be some kind of race condition. In fact, I see that the stream id that is passed is 2, which is normally associated with the video stream. When Bundle is involved, though, both audio and video share the same ICE stream (1), and the second stream is removed. So this may be a scenario where the second stream is discarded too late, while already in use by libnice.
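As an illustration of that kind of race, here is a simplified sketch of one way to guard against it. This is not the actual Janus code, and every `demo_*` name is hypothetical. The idea is that the receive path looks the stream up by id and uses it while holding the same lock that removal takes, so a stream dropped by the bundle switch is either still fully valid or cleanly reported as gone:

```c
#include <pthread.h>
#include <stddef.h>

#define DEMO_MAX_STREAMS 4

/* Hypothetical stand-in for per-stream state; stream_id 0 marks a free slot. */
typedef struct {
    unsigned int stream_id;
    size_t bytes_received;
} demo_stream;

typedef struct {
    pthread_mutex_t lock;
    demo_stream slots[DEMO_MAX_STREAMS];
} demo_stream_table;

static void demo_table_init(demo_stream_table *t)
{
    pthread_mutex_init(&t->lock, NULL);
    for (size_t i = 0; i < DEMO_MAX_STREAMS; i++) {
        t->slots[i].stream_id = 0;
        t->slots[i].bytes_received = 0;
    }
}

/* Register a stream (id must be non-zero); 0 on success, -1 if full. */
static int demo_stream_add(demo_stream_table *t, unsigned int id)
{
    int rc = -1;
    pthread_mutex_lock(&t->lock);
    for (size_t i = 0; i < DEMO_MAX_STREAMS; i++) {
        if (t->slots[i].stream_id == 0) {
            t->slots[i].stream_id = id;
            t->slots[i].bytes_received = 0;
            rc = 0;
            break;
        }
    }
    pthread_mutex_unlock(&t->lock);
    return rc;
}

/* Drop a stream, e.g. stream 2 once bundling makes it redundant. */
static void demo_stream_remove(demo_stream_table *t, unsigned int id)
{
    pthread_mutex_lock(&t->lock);
    for (size_t i = 0; i < DEMO_MAX_STREAMS; i++) {
        if (t->slots[i].stream_id == id)
            t->slots[i].stream_id = 0;
    }
    pthread_mutex_unlock(&t->lock);
}

/* Receive path: find the stream and touch it under the lock, so a
 * concurrent remove cannot invalidate it mid-use. Returns 0 if the
 * packet was accounted for, -1 if the stream is gone (packet dropped). */
static int demo_on_recv(demo_stream_table *t, unsigned int id, size_t len)
{
    int rc = -1;
    pthread_mutex_lock(&t->lock);
    for (size_t i = 0; i < DEMO_MAX_STREAMS; i++) {
        if (id != 0 && t->slots[i].stream_id == id) {
            t->slots[i].bytes_received += len;
            rc = 0;
            break;
        }
    }
    pthread_mutex_unlock(&t->lock);
    return rc;
}
```

With this shape, a packet arriving for stream 2 after it has been removed is simply dropped instead of dereferencing stale state. Whether the real crash follows this pattern is still only a hypothesis at this point.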

Are you using Chrome or Firefox for your tests?

phsultan commented 10 years ago

I'm using Chrome. Switched to Firefox 30.0 and successfully connected 3 people to the same room without any crash, though my CPU went up to 199% :-D

lminiero commented 10 years ago

Were you testing both server and clients on the same machine? The clients' CPU usage can really grow as soon as you start involving more flows at the same time, especially when you have 3-4 clients all handling 3-4 streams! The server side itself shouldn't be affected much with just 3-4 users in the MCU, at least not according to the measurements we did some weeks ago.

If everything went fine with Firefox, I guess the issue is indeed a race condition somewhere in the process of handling the bundle switch. I'm already looking into it and hope to have something ready soon.

phsultan commented 10 years ago

I meant the CPU on the server that runs janus actually, which is a cloud instance. But that's another thing I believe. I did not check the CPU on my laptop, which I indeed connected 3 times to the server.

Thanks a lot Lorenzo !

lminiero commented 10 years ago

I just pushed a commit that should better handle the case when the gateway is offering (which is what happens when a second participant joins the room and you attach to its feed). Let me know if anything improves.

phsultan commented 10 years ago

Unfortunately no, it did not help.

lminiero commented 10 years ago

Can you launch Janus with a higher debugging level (-d 5) and pass me the log up to the point where it crashes? I don't know if files can be attached on issues so if not, since I'd rather avoid having long dumps of text here, please send me the log privately (lorenzo[at]meetecho.com).

lminiero commented 10 years ago

Closing as this should have been fixed, feel free to reopen otherwise.