meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0

Crash when starting RTP forward #1694

Closed vivaldi-va closed 5 years ago

vivaldi-va commented 5 years ago

Previous discussion and findings are at https://groups.google.com/forum/?utm_source=digest&utm_medium=email#!topic/meetecho-janus/foK7roAutdI

Issue seems to occur after https://github.com/meetecho/janus-gateway/commit/947bfb0ae5d5c1e29224c71d631607b6de1869c5

and possibly replicates the crash noted in https://github.com/meetecho/janus-gateway/issues/1605

The test environment was Ubuntu 18.04 running inside a Docker container.

lminiero commented 5 years ago

If libasan and valgrind don't help, you should at least try and grab a gdb dump. I can't replicate the issue.

lminiero commented 5 years ago

Scratch that, managed to grab a core dump:

#0  0x00007f18d81c8908 in g_source_unref_internal (source=0x7f1868005f70, context=0x26d6ac0, have_lock=1) at gmain.c:2106
#1  0x00007f18d81c8bbe in g_source_iter_next (iter=iter@entry=0x7f1866ffc930, source=source@entry=0x7f1866ffc928) at gmain.c:980
#2  0x00007f18d81cb0d3 in g_main_context_prepare (context=context@entry=0x26d6ac0, priority=priority@entry=0x7f1866ffc9b0) at gmain.c:944
#3  0x00007f18d81cbb8b in g_main_context_iterate (context=0x26d6ac0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3882
#4  0x00007f18d81cc012 in g_main_loop_run (loop=0x26d5be0) at gmain.c:4098
#5  0x00007f187c2cb855 in janus_videoroom_rtp_forwarder_rtcp_thread (data=<optimized out>) at plugins/janus_videoroom.c:6609
#6  0x00007f18d81f401a in g_thread_proxy (data=0x2614370) at gthread.c:784
#7  0x00007f18d6963594 in start_thread () at /lib64/libpthread.so.0
#8  0x00007f18d6697e5f in clone () at /lib64/libc.so.6

Line numbers may differ a bit, as initially it wouldn't crash for me; it would just deadlock in the VideoRoom plugin around the time stop_rtp_forward is called. Apparently it deadlocks when removing the forwarder from the hashtable (the unlock for the stuck lock comes immediately after that), which points somewhere in janus_videoroom_rtp_forwarder_destroy, the function invoked when an element is removed from that table. In that function we call g_source_destroy when RTCP is used, since we use GLib sockets to monitor incoming RTCP packets. I suspect something is going wrong there, either causing locks or just crashing when the event loop tries to "use" the source. I'll have to give this some more thought, but that's where I believe things are breaking, rather than because of the extra unref.

lminiero commented 5 years ago

If the extra unref is involved, though, I wonder if the cause is janus_videoroom_rtp_forwarder_rtcp_finalize. In theory, when we get rid of the source, that callback should be invoked, and it does contain the unref. It looks like that's never happening, though; or maybe it only happens sometimes, in which case we'd have two unrefs where there should be just one.

lminiero commented 5 years ago

The commit above fixes it for me. Closing, feel free to reopen if still an issue.

vivaldi-va commented 5 years ago

That seems to have done the trick. Nice one!