meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0
8.17k stars 2.47k forks source link

[1.x] Videoroom use-after-free when unsubscribing and closing a room #3242

Closed thatsmydoing closed 1 year ago

thatsmydoing commented 1 year ago

What version of Janus is this happening on? 6d2dd1b44d16d51c086f58f5ffc59c87a5e9b0c7

Have you tested a more recent version of Janus too? Yes. fe993140577ff91ec626eb7b5f03a2898809674f

Was this working before? I'm not sure, it also crashes on 1.0.4. In practice, it doesn't crash that often. Maybe once a day in our usage.

Is there a gdb or libasan trace of the issue? Reproduced locally against fe993140577ff91ec626eb7b5f03a2898809674f, https://gist.github.com/thatsmydoing/c8e82a531f3431cb8f066e4b54cb6dec

Additional context I can reproduce it pretty reliably with https://github.com/thatsmydoing/janus-gateway-2775/

lminiero commented 1 year ago

Ah, it looks like a cleanup code stepping on its own feet: we're looping on stream->publisher_streams, which is a GList, but while iterating there (list = list->next), we call janus_videoroom_subscriber_stream_remove which updates that same list. As such, the list pointer we're using may not be a valid pointer anymore, because destroyed in the process, hence the crash. I'll try to remember why exactly the code was written like that, and to come up with a fix.

lminiero commented 1 year ago

If only I could replicate the problem, though... I'm at attempt 900 with your tool and still nothing :rofl:

thatsmydoing commented 1 year ago

Oh no. On my machine I get the error on the 1st attempt. Let me check again.

lminiero commented 1 year ago

I stopped the tool at about 2000 attempts, as I don't think it's going to die on me at this point :slightly_smiling_face:

thatsmydoing commented 1 year ago

I'm not sure what to say, I tried in a different environment and it's still consistently crashing on the first attempt. If you've cloned the repository before, I've pushed some changes which you may not have pulled? It also only reliably crashes with asan, otherwise it just keeps going.

Otherwise, it might be the library versions being used? I'll see if I can make docker reproduction case.

lminiero commented 1 year ago

No, there's no further commits if I do a git pull, and my local Janus instance is indeed always built with libasan, since I use that to track issues I stumble upon while developing.

thatsmydoing commented 1 year ago

I think this depends on the environment. I can reliably reproduce on nix and Archlinux but not on Debian 12. I'll see if using the same lib versions as Debian avoid the issue.

thatsmydoing commented 1 year ago

I've yet to narrow it down to the exact library, but downgrading libraries does make it non-reproducible.

Bad (Reproduces Crash)
gcc-12.3.0
glib-2.76.3
glibc-2.37-8
jansson-2.14
libconfig-1.7.3
libnice-0.1.21
libsrtp-2.5.0
libwebsockets-4.1.6
openssl-3.0.9
zlib-1.2.13
Good (No Crash)
gcc-11.3.0
glib-2.74.3
glibc-2.35-224
jansson-2.14
libconfig-1.7.3
libnice-0.1.18
libsrtp-2.4.2
libwebsockets-4.1.6
openssl-3.0.9
zlib-1.2.13

My guess is it might be glib as most of the lib versions under "Bad" match that of Debian 12. I'll dig a bit more.

thatsmydoing commented 1 year ago

I can confirm that downgrading glib to 2.74 seems to fix the issue for us (at least with the provided test case). So I guess there's something in glib 2.76 that breaks things?

lminiero commented 1 year ago

I'm on 2.74 too (2.74.7 on Fedora). That said, my guess is that part I mentioned is where things break, so I'll come back to you with a quick patch you can try to see if it improves things.

lminiero commented 1 year ago

@thatsmydoing can you check if this tiny patch helps?

diff --git a/src/plugins/janus_videoroom.c b/src/plugins/janus_videoroom.c
index a7a7c2b5..9bf3a053 100644
--- a/src/plugins/janus_videoroom.c
+++ b/src/plugins/janus_videoroom.c
@@ -10432,10 +10432,10 @@ static void *janus_videoroom_handler(void *data) {
                                        list = list->next;
                                        continue;
                                    }
-                                   janus_videoroom_subscriber_stream_remove(stream, ps, TRUE);
                                    if(stream->type != JANUS_VIDEOROOM_MEDIA_DATA)
                                        changes++;
                                    list = list->next;
+                                   janus_videoroom_subscriber_stream_remove(stream, ps, TRUE);
                                }
                                temp = temp->next;
                            }

All it does is moving the janus_videoroom_subscriber_stream_remove to the end of the inner loop, so that updating the pointer while traversing is not impacted in case that pointer is the one that's destroyed. I have not tested this myself.

thatsmydoing commented 1 year ago

That might be it. With the patch, even on glib 2.76, I can't reproduce it with my test case anymore.

lminiero commented 1 year ago

That's good news! You may want to ensure you don't see any other regression though. If not, I can push the fix upstream.

thatsmydoing commented 1 year ago

I'll be trying this out on production today. I'll let you know around next week if we see any more crashes.

thatsmydoing commented 1 year ago

With the patch and glib 2.76, we unfortunately had a crash elsewhere this time

https://github.com/meetecho/janus-gateway/blob/e1c7704cfdef76cd5b177ba7fbb92c041bddd40e/src/plugins/janus_videoroom.c#L8582 from https://github.com/meetecho/janus-gateway/blob/e1c7704cfdef76cd5b177ba7fbb92c041bddd40e/src/plugins/janus_videoroom.c#L8445

lminiero commented 1 year ago

Please share dumps as you did last time.

thatsmydoing commented 1 year ago

I couldn't reproduce it locally, so I only have a gdb backtrace from coredumps.

The first logs are running on https://github.com/thatsmydoing/janus-gateway/commit/155c45638bb4b20d2e9c4799f5c706ee578f3f9e, this is branched off https://github.com/meetecho/janus-gateway/commit/6d2dd1b44d16d51c086f58f5ffc59c87a5e9b0c7 and includes your patch as well as an unrelated patch that shouldn't affect this.

Logs
[7825799502076340] Updating existing session
[7540571102874557] Updating existing session
[3620882643052495] WebRTC resources freed; 0x7f3ab82b2820 0x7f3ac805a410
[3620882643052495] Handle and related resources freed; 0x7f3ab82b2820 0x7f3ac805a410
[7825799502076340] Negotiation update, checking what changed...
[7540571102874557] Negotiation update, checking what changed...
[4877804757315329] Updating existing session
[WARN] [8270445183568114] Didn't receive video (substream #2) for more than 1 second(s)...
[4877804757315329] Negotiation update, checking what changed...
[WARN] [4287692750950437] Didn't receive video (substream #1) for more than 1 second(s)...
[WARN] [3252118253804503] Didn't receive video (substream #1) for more than 1 second(s)...
[libwebsockets][NOTICE] [2023/07/20 08:41:09:2406] N: __lws_lc_tag:  ++ [wsisrv|3613|adopted] (39)
[libwebsockets][NOTICE] [2023/07/20 08:41:09:2407] N: __lws_lc_untag:  -- [wsisrv|3613|adopted] (38) 105μs
[WARN] No packet received on substream 1 for a while, falling back to 0
[WARN] [8270445183568114] Didn't receive video (substream #2) for more than 1 second(s)...
[WARN] [8270445183568114] Didn't receive audio for more than 1 second(s)...
[WARN] No packet received on substream 1 for a while, falling back to 0
Detaching handle from JANUS VideoRoom plugin; 0x7f3ab8098760 0x7f3ab81293e0 0x7f3ab8098760 0x7f3ab831cdf0
[janus.plugin.videoroom-0x7f3ab81293e0] No WebRTC media anymore; 0x7f3ab8098760 0x7f3ab831cdf0
[6967737102373463] Updating existing session
[8270445183568114] WebRTC resources freed; 0x7f3ab8098760 0x7f3ab80a99e0
Detaching handle from JANUS VideoRoom plugin; 0x7f3ab81ca960 0x7f3ab8369a40 0x7f3ab81ca960 0x7f3ab81ef150
[janus.plugin.videoroom-0x7f3ab8003630] No WebRTC media anymore; 0x7f3ab82131e0 0x7f3ab8305800
[janus.plugin.videoroom-0x7f3ab8369a40] No WebRTC media anymore; 0x7f3ab81ca960 0x7f3ab81ef150
Detaching handle from JANUS VideoRoom plugin; 0x7f3ab82131e0 0x7f3ab8003630 0x7f3ab82131e0 0x7f3ab8305800
[8270445183568114] Handle and related resources freed; 0x7f3ab8098760 0x7f3ab80a99e0
GDB Backtrace
#0  janus_videoroom_hangup_media_internal (session_data=session_data@entry=0x7f3ab8305800) at plugins/janus_videoroom.c:8620
#1  0x00007f3acd0cf9a0 in janus_videoroom_hangup_media (handle=) at plugins/janus_videoroom.c:8483
#2  0x0000000000448790 in janus_ice_outgoing_traffic_handle ()
#3  0x000000000044b701 in janus_ice_outgoing_traffic_dispatch ()
#4  0x00007f3ad03d2d7d in g_main_context_dispatch () from /nix/store/fr53rm643lg0rp4qmgzzpwsz5xxdqwk0-glib-2.76.3/lib/libglib-2.0.so.0
#5  0x00007f3ad03d3028 in g_main_context_iterate.constprop () from /nix/store/fr53rm643lg0rp4qmgzzpwsz5xxdqwk0-glib-2.76.3/lib/libglib-2.0.so.0
#6  0x00007f3ad03d32df in g_main_loop_run () from /nix/store/fr53rm643lg0rp4qmgzzpwsz5xxdqwk0-glib-2.76.3/lib/libglib-2.0.so.0
#7  0x000000000043d81c in janus_ice_handle_thread ()
#8  0x00007f3ad03ff64d in g_thread_proxy () from /nix/store/fr53rm643lg0rp4qmgzzpwsz5xxdqwk0-glib-2.76.3/lib/libglib-2.0.so.0
#9  0x00007f3acfbb6dd4 in start_thread () from /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/lib/libc.so.6
#10 0x00007f3acfc389b0 in clone3 () from /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/lib/libc.so.6

While debugging, I saw a similar bug to the one fixed by your patch and fixed it. Unfortunately, it was unrelated to the crash and didn't fix it. Here are the logs running against https://github.com/thatsmydoing/janus-gateway/commit/2d1b88a19b70bd92a685a7bba91c9c382b172154.

Logs
[3530032410032467] Updating existing session
[WARN] [121364506572113] Didn't receive video (substream #1) for more than 1 second(s)...
[WARN] [121364506572113] Didn't receive video (substream #2) for more than 1 second(s)...
[3530032410032467] Negotiation update, checking what changed...
[2740559208127164] Negotiation update, checking what changed...
[WARN] [4890055471783842] Didn't receive video (substream #1) for more than 1 second(s)...
[WARN] [121364506572113] Didn't receive video (substream #1) for more than 1 second(s)...
[WARN] [121364506572113] Didn't receive video for more than 1 second(s)...
[WARN] [5426595611094146] Didn't receive video (substream #1) for more than 1 second(s)...
[libwebsockets][NOTICE] [2023/07/23 11:01:48:8014] N: __lws_lc_tag:  ++ [wsisrv|29cc|adopted] (38)
[libwebsockets][NOTICE] [2023/07/23 11:01:48:8015] N: __lws_lc_untag:  -- [wsisrv|29cc|adopted] (37) 91μs
[WARN] [121364506572113] Didn't receive video (substream #1) for more than 1 second(s)...
[libwebsockets][NOTICE] [2023/07/23 11:01:58:8017] N: __lws_lc_tag:  ++ [wsisrv|29cd|adopted] (38)
[libwebsockets][NOTICE] [2023/07/23 11:01:58:8018] N: __lws_lc_untag:  -- [wsisrv|29cd|adopted] (37) 68μs
[WARN] [6318269714873560] Didn't receive video for more than 1 second(s)...
[WARN] [6318269714873560] Didn't receive video (substream #1) for more than 1 second(s)...
[WARN] [6318269714873560] Unknown SSRC, dropping packet (SSRC 966292906)...
[WARN] [121364506572113] Didn't receive video (substream #1) for more than 1 second(s)...
[libwebsockets][NOTICE] [2023/07/23 11:02:08:8017] N: __lws_lc_tag:  ++ [wsisrv|29ce|adopted] (38)
[libwebsockets][NOTICE] [2023/07/23 11:02:08:8017] N: __lws_lc_untag:  -- [wsisrv|29ce|adopted] (37) 92μs
[WARN] [121364506572113] Unknown SSRC, dropping packet (SSRC 3329985358)...
[WARN] [121364506572113] Unknown SSRC, dropping packet (SSRC 3329985358)...
[WARN] [121364506572113] Unknown SSRC, dropping packet (SSRC 3329985358)...
[libwebsockets][NOTICE] [2023/07/23 11:02:18:8015] N: __lws_lc_tag:  ++ [wsisrv|29cf|adopted] (38)
[libwebsockets][NOTICE] [2023/07/23 11:02:18:8016] N: __lws_lc_untag:  -- [wsisrv|29cf|adopted] (37) 83μs
[WARN] [121364506572113] Didn't receive video (substream #1) for more than 1 second(s)...
[WARN] [6318269714873560] Didn't receive video for more than 1 second(s)...
[WARN] [6318269714873560] Didn't receive video (substream #1) for more than 1 second(s)...
[libwebsockets][NOTICE] [2023/07/23 11:02:28:8014] N: __lws_lc_tag:  ++ [wsisrv|29d0|adopted] (38)
[libwebsockets][NOTICE] [2023/07/23 11:02:28:8015] N: __lws_lc_untag:  -- [wsisrv|29d0|adopted] (37) 76μs
[WARN] [121364506572113] Unknown SSRC, dropping packet (SSRC 3329985358)...
[libwebsockets][NOTICE] [2023/07/23 11:02:38:8013] N: __lws_lc_tag:  ++ [wsisrv|29d1|adopted] (38)
[libwebsockets][NOTICE] [2023/07/23 11:02:38:8014] N: __lws_lc_untag:  -- [wsisrv|29d1|adopted] (37) 80μs
[WARN] [121364506572113] Unknown SSRC, dropping packet (SSRC 3329985358)...
[janus.plugin.videoroom-0x7f7590335020] No WebRTC media anymore; 0x7f7590052320 0x7f75901da380
[3658334612854888] Updating existing session
[4278699205407511] WebRTC resources freed; 0x7f7590052320 0x7f759040dd90
Detaching handle from JANUS VideoRoom plugin; 0x7f7590052320 0x7f7590335020 0x7f7590052320 0x7f75901da380
[4278699205407511] Handle and related resources freed; 0x7f7590052320 0x7f759040dd90
Detaching handle from JANUS VideoRoom plugin; 0x7f7590372010 0x7f7590002330 0x7f7590372010 0x7f759006a530
Detaching handle from JANUS VideoRoom plugin; 0x7f759002f890 0x7f759002dcb0 0x7f759002f890 0x7f759003fff0
[janus.plugin.videoroom-0x7f7590002330] No WebRTC media anymore; 0x7f7590372010 0x7f759006a530
[janus.plugin.videoroom-0x7f759002dcb0] No WebRTC media anymore; 0x7f759002f890 0x7f759003fff0
GDB Backtrace
#0  0x00007f75a4e9c538 in janus_videoroom_hangup_media_internal (session_data=session_data@entry=0x7f759003fff0) at plugins/janus_videoroom.c:8621
#1  0x00007f75a4e9d9a0 in janus_videoroom_hangup_media (handle=) at plugins/janus_videoroom.c:8483
#2  0x0000000000448790 in janus_ice_outgoing_traffic_handle ()
#3  0x000000000044b701 in janus_ice_outgoing_traffic_dispatch ()
#4  0x00007f75a81a0d7d in g_main_context_dispatch () from /nix/store/fr53rm643lg0rp4qmgzzpwsz5xxdqwk0-glib-2.76.3/lib/libglib-2.0.so.0
#5  0x00007f75a81a1028 in g_main_context_iterate.constprop () from /nix/store/fr53rm643lg0rp4qmgzzpwsz5xxdqwk0-glib-2.76.3/lib/libglib-2.0.so.0
#6  0x00007f75a81a12df in g_main_loop_run () from /nix/store/fr53rm643lg0rp4qmgzzpwsz5xxdqwk0-glib-2.76.3/lib/libglib-2.0.so.0
#7  0x000000000043d81c in janus_ice_handle_thread ()
#8  0x00007f75a81cd64d in g_thread_proxy () from /nix/store/fr53rm643lg0rp4qmgzzpwsz5xxdqwk0-glib-2.76.3/lib/libglib-2.0.so.0
#9  0x00007f75a7984dd4 in start_thread () from /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/lib/libc.so.6
#10 0x00007f75a7a069b0 in clone3 () from /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/lib/libc.so.6
lminiero commented 1 year ago

A libasan trace would be better.

thatsmydoing commented 1 year ago

Here's one https://gist.github.com/thatsmydoing/05750ff32b8e5b0f3d3fdbd6f813d1ad running against https://github.com/thatsmydoing/janus-gateway/commit/2d1b88a19b70bd92a685a7bba91c9c382b172154

I do have another one that has a slightly different error and line https://gist.github.com/thatsmydoing/8a577c6c26005cde8ef70b6e0e181c09

lminiero commented 1 year ago

Looks like a race condition, that I think we correctly identified now. Unfortunately it's a also a bit tricky to address and solve. I'll let you know when there's a patch you can test.

lminiero commented 1 year ago

@thatsmydoing please test the PR above. It already includes the other small change I pasted a few weeks ago, so you' won't need to reapply that yourself again.

lminiero commented 1 year ago

@thatsmydoing ping :slightly_smiling_face:

thatsmydoing commented 1 year ago

Ah sorry, I have it running now and it seems good so far. I'll update you next week after having let it run for a bit more.

I think the PR also needs

Patch
diff --git a/src/plugins/janus_videoroom.c b/src/plugins/janus_videoroom.c
index cdf42837..fb0ba279 100644
--- a/src/plugins/janus_videoroom.c
+++ b/src/plugins/janus_videoroom.c
@@ -8633,8 +8633,8 @@ static void janus_videoroom_hangup_media_internal(gpointer session_data) {
                                        }
                                        list = list->next;
                                }
+                               temp = temp->next;
                                janus_videoroom_subscriber_stream_remove(s, NULL, TRUE);
-                               temp = temp->next;
                        }
                        /* Free streams */
                        g_list_free(subscriber->streams);
lminiero commented 1 year ago

No it's there already.

lminiero commented 1 year ago

Ah you mean a different patch than mine I guess?

thatsmydoing commented 1 year ago

Yeah, it's a similar fix to the first patch, just in a different area.

lminiero commented 1 year ago

Found another one too, pushed.

thatsmydoing commented 1 year ago

I've not seen any crashes since I used the PR patchset so I think it's good.

lminiero commented 1 year ago

Ack, thanks for the feedback! I'll merge then.