This issue might be related to https://github.com/meetecho/janus-gateway/issues/3105
Looking into this issue (excellent analysis @tmatth).

> What we can't figure out is who is holding the lock for `session->mutex` in this context. The most likely candidate would be in the handling of the `watch` request, but from looking at the code paths it seems like wherever `session->mutex` is locked there is a path to it being unlocked.
There is no thread in the stack trace holding that mutex; AFAICT that might be due to:

1) A code path where the mutex is not released (unlikely; this should be troubleshot through mutex lock debugging, see the sketch below)
2) The `session` has been freed, and `session->mutex` is a garbage pointer (I guess libasan is not enabled here, it should report a use-after-free)
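For option 1, this is roughly what mutex lock debugging boils down to; a minimal owner-tracking sketch with hypothetical `tracked_mutex_*` names, not the actual `janus_mutex` implementation:

```c
/*
 * Illustrative only: an owner-tracking wrapper around pthread mutexes.
 * This is NOT the actual janus_mutex implementation, just the idea behind
 * "mutex lock debugging": record who locked a mutex last, so that a lock
 * which is never released can be attributed to a file/line when poking at
 * a hung process in gdb.
 */
#include <pthread.h>

typedef struct tracked_mutex {
	pthread_mutex_t mutex;
	const char *owner_file;	/* last successful locker, NULL when unlocked */
	int owner_line;
} tracked_mutex;

#define tracked_mutex_lock(m) do { \
		pthread_mutex_lock(&(m)->mutex); \
		(m)->owner_file = __FILE__; \
		(m)->owner_line = __LINE__; \
	} while(0)

#define tracked_mutex_unlock(m) do { \
		(m)->owner_file = NULL; \
		(m)->owner_line = 0; \
		pthread_mutex_unlock(&(m)->mutex); \
	} while(0)
```

With something like that in place, printing `owner_file`/`owner_line` of the stuck `session->mutex` from gdb would show which call site took it last and never released it.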
> N.B. In some of the error cases, the order of mutexes being unlocked and the mountpoint reference count being decreased is not consistent, see here vs. here. It seems like the mountpoint's reference count should not be decreased while holding `mp->mutex`, since `janus_streaming_mountpoint_free` will try to lock this mutex. This didn't show up in the backtraces, however, so it seems unlikely to be related; also, the reference count is incremented right before, so it seems unlikely that it would go to 0 here.
I agree, the order should be consistent and refcount decrement should happen after the mutex unlock.
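To make that concrete, here's a minimal sketch of the failure mode and the fix, using hypothetical names and plain pthread/GLib primitives rather than the actual janus_streaming code:

```c
/*
 * Sketch of the ordering concern above (hypothetical names, not the actual
 * janus_streaming code). If the reference count is dropped while mp->mutex
 * is still held and that drop happens to release the last reference, the
 * free routine relocks the same mutex and the thread deadlocks against
 * itself.
 */
#include <glib.h>
#include <pthread.h>

typedef struct mountpoint {
	pthread_mutex_t mutex;
	volatile gint refcount;
} mountpoint;

static void mountpoint_free(mountpoint *mp) {
	/* Like janus_streaming_mountpoint_free, the free path takes mp->mutex */
	pthread_mutex_lock(&mp->mutex);
	/* ... tear down viewers, sources, etc. ... */
	pthread_mutex_unlock(&mp->mutex);
	pthread_mutex_destroy(&mp->mutex);
	g_free(mp);
}

static void mountpoint_unref(mountpoint *mp) {
	if(g_atomic_int_dec_and_test(&mp->refcount))
		mountpoint_free(mp);
}

static void error_path_bad(mountpoint *mp) {
	pthread_mutex_lock(&mp->mutex);
	/* ... handle the error ... */
	mountpoint_unref(mp);	/* BAD: may free mp (relocking mp->mutex), and the
				 * unlock below would then be a use-after-free */
	pthread_mutex_unlock(&mp->mutex);
}

static void error_path_good(mountpoint *mp) {
	pthread_mutex_lock(&mp->mutex);
	/* ... handle the error ... */
	pthread_mutex_unlock(&mp->mutex);
	mountpoint_unref(mp);	/* OK: the free path can take mp->mutex safely */
}
```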
I've addressed this specific issue here https://github.com/meetecho/janus-gateway/pull/3145
Closing since #3246 has been merged.
Summary
We've observed a deadlock in `janus.plugin.streaming` in production several times since at least February and possibly earlier; however, it was catastrophic enough that we never had the chance to do any analysis on it. Thankfully, @lionelnicolas was able to cordon off the deadlocked Janus instances so that we could examine them with GDB, and thus we finally have some more detailed context. We're still missing some information, but in short:
- While holding the `mountpoints_mutex` as of here, `janus_streaming_mountpoint_destroy` (called from here) gets stuck waiting for the `mountpoint->thread` to join here: https://github.com/meetecho/janus-gateway/blob/4292ce5b33b1a2dd4ef1e653ea99f43857ab3fde/plugins/janus_streaming.c#L1309
- While holding the `mountpoint->mutex` as of here, the `mountpoint->thread` never joins because it is stuck waiting to lock `session->mutex` for one of its viewers here: https://github.com/meetecho/janus-gateway/blob/4292ce5b33b1a2dd4ef1e653ea99f43857ab3fde/plugins/janus_streaming.c#L8417

What we can't figure out is who is holding the lock for `session->mutex` in this context. The most likely candidate would be in the handling of the `watch` request, but from looking at the code paths it seems like wherever `session->mutex` is locked there is a path to it being unlocked.
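In simplified form, the cycle looks roughly like this (a sketch with hypothetical names, not the actual plugin code; running it intentionally hangs, with `main()` standing in for whoever is holding `session->mutex`):

```c
/*
 * Simplified repro-style sketch of the lock/join cycle described above
 * (hypothetical names, not the actual janus_streaming code).
 */
#include <pthread.h>

static pthread_mutex_t mountpoints_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t session_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for the mountpoint relay thread: it needs session_mutex to
 * finish, so it never exits while that lock is held elsewhere. */
static void *relay_thread(void *arg) {
	(void)arg;
	pthread_mutex_lock(&session_mutex);
	pthread_mutex_unlock(&session_mutex);
	return NULL;
}

/* Stands in for the mountpoint destroy path: it joins the relay thread
 * while still holding mountpoints_mutex, so every other caller that needs
 * mountpoints_mutex queues up behind this join. */
static void destroy_mountpoint(pthread_t relay) {
	pthread_mutex_lock(&mountpoints_mutex);
	pthread_join(relay, NULL);
	pthread_mutex_unlock(&mountpoints_mutex);
}

int main(void) {
	pthread_t relay;
	pthread_mutex_lock(&session_mutex);	/* the mystery holder */
	pthread_create(&relay, NULL, relay_thread, NULL);
	destroy_mountpoint(relay);		/* never returns */
	return 0;
}
```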
What version of Janus is this happening on? 34e3f4abc87ac546d22aeea3dc1bb1af0959cfe7 (0.x)

Have you tested a more recent version of Janus too? No, as we've only hit this in production and are only using 0.x there.
Was this working before? No, we've been trying to track down this deadlock for quite some time so it's not a recent regression.
Is there a gdb or libasan trace of the issue? We have stacktraces for 2 different workloads which exhibit this deadlock (albeit with different impacts on the surrounding threads):
This workload was with `janus.plugin.videoroom` + `janus.plugin.streaming`:

```
Thread 24 (Thread 0x7ffa16ffd700 (LWP 123)):
#0  __pthread_clockjoin_ex (threadid=140712087631616, thread_return=0x0, clockid=
```

(trace truncated)

This workload was with just the `janus.plugin.streaming` plugin:

```
Thread 43 (Thread 0x7f34deffd700 (LWP 507)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f3596512a5c in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2  0x00007f3594553d5f in janus_streaming_process_synchronous_request (session=0x7f355c020770, message=0x7f3544024310) at plugins/janus_streaming.c:3838
#3  0x00007f35945608ae in janus_streaming_handle_message (handle=0x7f355c001960, transaction=0x7f35440248f0 "zlnuvkaw13tq", message=0x7f3544024310, jsep=0x0) at plugins/janus_streaming.c:4364
#4  0x00005579f7c7b60f in janus_process_incoming_request (request=0x7f3544024520) at janus.c:1717
#5  0x00005579f7c89edf in janus_transport_task (data=0x7f3544024520, user_data=0x0) at janus.c:3432
#6  0x00007f35964f0374 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#7  0x00007f35964efad1 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#8  0x00007f3595ec1609 in start_thread (arg=
```

(trace truncated)

Additional context
We are using the WS transport and our mountpoints are all live RTP sources. We are not using SVC or helper threads.
N.B. In some of the error cases, the order of mutexes being unlocked and the mountpoint reference count being decreased is not consistent, see here vs. here. It seems like the mountpoint's reference count should not be decreased while holding `mp->mutex`, since `janus_streaming_mountpoint_free` will try to lock this mutex. This didn't show up in the backtraces, however, so it seems unlikely to be related; also, the reference count is incremented right before, so it seems unlikely that it would go to 0 here.