meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0
8.15k stars 2.47k forks source link

Segmentation fault on webinar host leave #14

Closed francoisTemasys closed 10 years ago

francoisTemasys commented 10 years ago

I modified a little bit the code of the sharescreen example to have a webrtc webinar, no HTTPS required and it's showing the camera+microphone.

Here is how to reproduce:

Here is the output:

No WebRTC media anymore
[Thread 0x7fffccff9700 (LWP 15466) exited]
Detaching handle from JANUS VideoRoom plugin
No WebRTC media anymore
[Thread 0x7fff86ffd700 (LWP 15472) exited]
Destroying session 616705426
[Thread 0x7fff877fe700 (LWP 15436) exited]
[Thread 0x7fffceffd700 (LWP 15425) exited]
[2396587641] WebRTC resources freed
[2396587641] Handle and related resources freed
[Thread 0x7fff8f7fe700 (LWP 15423) exited]
[New Thread 0x7fffceffd700 (LWP 15501)]
[New Thread 0x7fffccff9700 (LWP 15502)]
No WebRTC media anymore
[New Thread 0x7fff86ffd700 (LWP 15503)]
[Thread 0x7fff86ffd700 (LWP 15503) exited]
[New Thread 0x7fff877fe700 (LWP 15504)]
Detaching handle from JANUS VideoRoom plugin
No WebRTC media anymore
[Thread 0x7fff877fe700 (LWP 15504) exited]
[New Thread 0x7fff86ffd700 (LWP 15505)]
Detaching handle from JANUS VideoRoom plugin
No WebRTC media anymore
[Thread 0x7fff86ffd700 (LWP 15505) exited]
[New Thread 0x7fff877fe700 (LWP 15506)]
Destroying session 3728690222
[Thread 0x7fff8effd700 (LWP 15467) exited]
[Thread 0x7fff8cff9700 (LWP 15435) exited]
[984385855] WebRTC resources freed
[984385855] Handle and related resources freed
[Thread 0x7fff8d7fa700 (LWP 15433) exited]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffcd7fa700 (LWP 15440)]
0x0000000000414699 in janus_ice_relay_rtcp ()
(gdb) backtrace
#0  0x0000000000414699 in janus_ice_relay_rtcp ()
#1  0x0000000000411f2a in janus_ice_cb_nice_recv ()
#2  0x00007ffff7bb5996 in ?? () from /usr/lib/x86_64-linux-gnu/libnice.so.10
#3  0x00007ffff597888b in socket_source_dispatch (source=source@entry=0x7fffdc0e70e0, callback=<optimized out>, user_data=<optimized out>)
    at /build/buildd/glib2.0-2.38.1/./gio/gsocket.c:3264
#4  0x00007ffff76a13b6 in g_main_dispatch (context=0x7fffdc0e7c30) at /build/buildd/glib2.0-2.38.1/./glib/gmain.c:3065
#5  g_main_context_dispatch (context=context@entry=0x7fffdc0e7c30) at /build/buildd/glib2.0-2.38.1/./glib/gmain.c:3641
#6  0x00007ffff76a1708 in g_main_context_iterate (context=0x7fffdc0e7c30, block=block@entry=1, dispatch=dispatch@entry=1, 
    self=<optimized out>) at /build/buildd/glib2.0-2.38.1/./glib/gmain.c:3712
#7  0x00007ffff76a1b0a in g_main_loop_run (loop=0x7fffdc0e5d20) at /build/buildd/glib2.0-2.38.1/./glib/gmain.c:3906
#8  0x00000000004130cc in janus_ice_thread ()
#9  0x00007ffff76c60f5 in g_thread_proxy (data=0x7fffb8000f20) at /build/buildd/glib2.0-2.38.1/./glib/gthread.c:798
#10 0x00007ffff52e4f6e in start_thread (arg=0x7fffcd7fa700) at pthread_create.c:311
#11 0x00007ffff5d679cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

This seg fault doesn't appear all the time actually. Another race problem I guess.

lminiero commented 10 years ago

Yes, that's consistent with the segfaults we experienced in similar tests. Specifically, a few days ago we managed to test an MCU room with about 20 participants, and when most of them started leaving the same crash when relaying RTCP occurred. Apparently there's still something to improve in terms of keeping all the involved components aware of the fact that a session is not available anymore: either that, or there's a leak that is corrupting some heap memory.

Thanks for testing!

francoisTemasys commented 10 years ago

Is it normal that in the videoroom plugin the function void janus_videoroom_hangup_media(janus_plugin_session *handle) [543] is not protected by any mutex and especially: g_hash_table_remove(participant->room->participants, GUINT_TO_POINTER(participant->user_id)); [566] . Why are the videorooms not protected by a mutex?

While on the audiobridge plugin that's one of the first thing done.

francoisTemasys commented 10 years ago

To be more precise on the line producing the segfault:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff727fc700 (LWP 22912)]
0x00000000004189cf in janus_ice_relay_rtcp (handle=0x7fff7c001960, video=1, buf=0x7fff727ebbf0 "\201", <incomplete sequence \311>, len=56)
    at /home/game/GitRepo/janus-gateway/ice.c:952
952     janus_ice_component *component = handle->rtcpmux ? stream->rtp_component : stream->rtcp_component;
francoisTemasys commented 10 years ago

I may have found the fix. When we receive a rtcp packet, if we are subscriber we can at least check if the publisher is sending something otherwise he may be leaving, else if we are publisher we don't need to relay if we don't send audio or video (and we are probably going to leave). janus_videoroom.c

@@ -515,26 +519,30 @@ void janus_videoroom_incoming_rtcp(janus_plugin_session *handle, int video, char
        if(l && l->feed) {
            janus_videoroom_participant *p = l->feed;
            if(p && p->session) {
-               if(p->bitrate > 0)
-                   janus_rtcp_cap_remb(buf, len, p->bitrate);
-               gateway->relay_rtcp(p->session->handle, video, buf, len);
+               if((!video && p->audio_active) || (video && p->video_active)) {
+                   if(p->bitrate > 0)
+                       janus_rtcp_cap_remb(buf, len, p->bitrate);
+                   gateway->relay_rtcp(p->session->handle, video, buf, len);
+               }
            }
        }
    } else if(session->participant_type == janus_videoroom_p_type_publisher) {
        /* FIXME Badly: we're just bouncing the incoming RTCP back with modified REMB, we need to improve this... */
        janus_videoroom_participant *participant = (janus_videoroom_participant *)session->participant;
-       if(participant->bitrate > 0)
-           janus_rtcp_cap_remb(buf, len, participant->bitrate);
-       gateway->relay_rtcp(handle, video, buf, len);
-       /* FIXME Badly: we're also blinding forwarding the publisher RTCP to all the listeners: this probably means confusing them... */
-       if(participant->listeners != NULL) {
-           GSList *ps = participant->listeners;
-           while(ps) {
-               janus_videoroom_listener *l = (janus_videoroom_listener *)ps->data;
-               if(l->session && l->session->handle) {
-                   gateway->relay_rtcp(l->session->handle, video, buf, len);
+       if((!video && participant->audio_active) || (video && participant->video_active)) {
+           if(participant->bitrate > 0)
+               janus_rtcp_cap_remb(buf, len, participant->bitrate);
+           gateway->relay_rtcp(handle, video, buf, len);
+           /* FIXME Badly: we're also blinding forwarding the publisher RTCP to all the listeners: this probably means confusing them... */
+           if(participant->listeners != NULL) {
+               GSList *ps = participant->listeners;
+               while(ps) {
+                   janus_videoroom_listener *l = (janus_videoroom_listener *)ps->data;
+                   if(l->session && l->session->handle) {
+                       gateway->relay_rtcp(l->session->handle, video, buf, len);
+                   }
+                   ps = ps->next;
                }
-               ps = ps->next;
            }
        }
    }
lminiero commented 10 years ago

Francois,

thanks for the patch! I thought we already checked for video_active in the code, but I guess this was only done for RTP and not RTCP as of yet (which might indeed explain why all the dumps I have are triggered by relay_rtcp). I'll try and integrate your changes in the code to see if this fixes it for our stress tests. Did it already fix the crashes in your case?

That said, I guess I'll also have to find a core fix for that as well, as otherwise an ill-beheaving plugin will still be able to crash the gateway. That's what I've tried doing so far, and I hope I'll get closer to a definitive solution soon enough.

francoisTemasys commented 10 years ago

Yes it fixed my crashes. I agree with the need to fix that in the janus core.

francoisTemasys commented 10 years ago

Solved in commit 35f2308ae10bb417e6e3de8a19d21f1766a3a086