Closed atoppi closed 1 year ago
pinging @tmatth @lionelnicolas since they opened the mentioned issues
@atoppi I'm not able to test currently but concurrent watch + destroy definitely sounds like the type of condition we would have hit when we saw this.
@atoppi Thank you for this ! `janus_streaming_hangup_media_internal` definitely rings a bell when remembering investigations we did on that issue.
I just did a quick load test using this PR and it seems to work. We wrote a few scripts in the past to try to reproduce the issue addressed by this PR, but we were never able to reproduce it systematically. It was (almost) only happening in production.
I'm going to deploy this PR on all our dev/test/qa environments to see if we catch any issues in the next few days.
Sounds good 👍 Do your scripts check the current number of viewers in order to assert a condition (e.g. it should be zero)? The bare insert of unwanted sessions into the list is enough to trigger undefined behavior.
No, it's not checking that number. It's just doing a sequence of (create mp / add viewer / destroy mp) with random timings, and it only exits if mp creation fails (e.g. mountpoint already existing, or API deadlock/timeout).

If I can find some time today I'll try to refine my script to add the viewer-count check, and maybe also put mp creation/destruction in a different thread than viewer join/leave (to try to hit that watch/destroy race condition).
> 3) meanwhile `janus_streaming_hangup_media_internal` (called by session destroy) is handled before the `watch` request completes

On a side note, when we encounter this issue, it seems to happen a lot more frequently when a peer with a very bad network is involved (which could cause the `watch` request to take longer to complete?)
I'm not sure that mp create/destroy is strictly needed to hit the issue. A mp destroy will of course trigger the destruction of existing sessions, and those session teardowns, along with a concurrent `watch` request, might lead to unwanted inserts in the viewers list.

As for bad network peers, maybe a poor connection is closing the transport (e.g. WebSocket), leading to session teardown, while a `watch` request is being handled? In general the time needed to handle a `watch` request is basically equal to mp/session/configuration checks + JSEP offer generation.
Maybe I should try to:

- `watch`
- `watch`

(and try to execute 4. and 5. at the same time?)

Or, instead of destroying the mp, just check the viewer count of the mp via the admin API?
I'd say just create the mp once (or reuse a static one), then in parallel threads:

1) attach
2) `watch` (don't wait for the response)
3) quickly send a random command among: close the WebSocket, detach the handle, destroy the session

After N iterations, check that the mp viewers count is 0. Of course the test might stop early if libasan detects a corruption.
Ok thanks I'll try to check that today
I may have good news :smile:

Compiled with `libasan` and `libubsan`. Running with `mimalloc` disabled.

I've added this to my janus CLI client, just after sending the `watch` command:
```python
# (...)
rand_action = random.choice([1, 2, 3, 4])
if rand_action == 1:
    logger.warning("STRESS : killing websocket")
    await self.ws.close()
elif rand_action == 2:
    logger.warning("STRESS : detaching plugin")
    await self.send(
        {
            "janus": "detach",
            "session_id": self.data.get("session_id"),
            "handle_id": self.data.get("handle_id"),
            "transaction": self.transaction_create("instantiate-listener"),
        }
    )
    await asyncio.sleep(0.2)
    await self.ws.close()
elif rand_action == 3:
    logger.warning("STRESS : destroying session")
    await self.send(
        {
            "janus": "destroy",
            "session_id": self.data.get("session_id"),
            "transaction": self.transaction_create("instantiate-listener"),
        }
    )
    await asyncio.sleep(0.2)
    await self.ws.close()
else:
    logger.warning("STRESS : running normal behaviour (watch stream for 2 seconds)")
# (...)
```
Then run 15 parallel jobs with 12000 iterations:

```shell
parallel -n0 --line-buffer --jobs 15 "./stress STREAM-TOKEN" ::: $(seq 12000)
```
Without this patch, after 12000 iterations `mountpoint_viewers` was 6 after all viewers were torn down. With this patch, after 12000 iterations `mountpoint_viewers` was 0.

I've not seen any `asan`/`ubsan` warnings in the logs when `mountpoint_viewers` was non-zero.
To be sure, I wanted to try with 25 parallel jobs:

```shell
parallel -n0 --line-buffer --jobs 25 "./stress STREAM-TOKEN" ::: $(seq 8000)
```

Without this patch, after 8000 iterations `mountpoint_viewers` was 3 after all viewers were torn down. With this patch, after 8000 iterations `mountpoint_viewers` was 0.
I can't approve the PR, but LGTM :+1:
Notes

> I've not seen any `asan`/`ubsan` warnings in the logs when `mountpoint_viewers` was non-zero.

I imagine that it's because, although you have dangling pointers in your test, if you're not actually dereferencing them at some point after they've been freed, asan won't complain.
@lionelnicolas excellent work! That proves the patch is working as intended and will fix at least the "already watching" issue.
@tmatth you're right, the bare insert of a dangling pointer is not enough to trigger the sanitizers; you need to dereference it.
Thanks for all the feedback! I'll merge then, and work on a port of the same fix to `1.x` too.

Done for `1.x` in the PR above, please double check I didn't miss anything.
This patch tries to address #3105 and #3108. We think that both issues are caused by an unwanted insert in the mountpoint viewers list, following this flow:

1) concurrent requests of plugin session destroy and `watch` (handled in two different threads) are received
2) the check for a valid session is satisfied at the beginning of `janus_streaming_handler`
3) meanwhile `janus_streaming_hangup_media_internal` (called by session destroy) is handled before the `watch` request completes
4) the `watch` handler completes and ends up adding a dangling session pointer to the viewers list
5) from now on, bad things might happen

The patch basically extends the scope of `sessions_mutex` in order to avoid the concurrent handling of a `watch` request and a `janus_streaming_hangup_media_internal`, thus restoring the correct order of execution.