meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0

Extend sessions_mutex scope to avoid rogue inserts in mp viewers list #3246

Closed · atoppi closed this 1 year ago

atoppi commented 1 year ago

This patch tries to address #3105 and #3108. We think that both issues are caused by an unwanted insert in the mountpoint viewers list, following this flow:

1) concurrent requests of plugin session destroy and watch (handled in two different threads) are received
2) the check for a valid session is satisfied at the beginning of janus_streaming_handler:

        janus_mutex_lock(&sessions_mutex);
        janus_streaming_session *session = janus_streaming_lookup_session(msg->handle);
        if(!session) {
            janus_mutex_unlock(&sessions_mutex);
            JANUS_LOG(LOG_ERR, "No session associated with this handle...\n");
            janus_streaming_message_free(msg);
            continue;
        }

3) meanwhile janus_streaming_hangup_media_internal (called by session destroy) is handled before the watch request completes
4) the watch handler completes and ends up adding a dangling session pointer to the viewers list
5) from now on, bad things might happen

The patch basically extends the scope of sessions_mutex, in order to avoid the concurrent handling of a watch request and of janus_streaming_hangup_media_internal, thus restoring the correct order of execution.
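
To make the change easier to picture, here is a minimal sketch of the idea (not the actual diff; types and helpers are simplified and made standalone): before, sessions_mutex was released right after the session lookup, leaving a window where the hangup/destroy path could tear the session down before the watch handler inserted it into the viewers list; after, the mutex stays held across the insert.

    /*
     * Simplified standalone sketch of the locking change (not Janus code).
     * Build: gcc sketch.c $(pkg-config --cflags --libs glib-2.0)
     */
    #include <glib.h>
    #include <stdio.h>

    typedef struct session {
        gboolean destroyed;
    } session;

    typedef struct mountpoint {
        GList *viewers;
    } mountpoint;

    static GMutex sessions_mutex;

    /* BEFORE (simplified): the mutex is released right after the lookup,
     * so a concurrent hangup/destroy can run before the insert */
    static void watch_before(mountpoint *mp, session *s) {
        g_mutex_lock(&sessions_mutex);
        if(s == NULL) {
            g_mutex_unlock(&sessions_mutex);
            return;
        }
        g_mutex_unlock(&sessions_mutex);
        /* <-- the session may be torn down right here */
        mp->viewers = g_list_append(mp->viewers, s);    /* possible dangling insert */
    }

    /* AFTER (simplified): the mutex stays held until the viewer is added,
     * so hangup/destroy cannot interleave with the insert */
    static void watch_after(mountpoint *mp, session *s) {
        g_mutex_lock(&sessions_mutex);
        if(s == NULL || s->destroyed) {
            g_mutex_unlock(&sessions_mutex);
            return;
        }
        mp->viewers = g_list_append(mp->viewers, s);
        g_mutex_unlock(&sessions_mutex);
    }

    int main(void) {
        mountpoint mp = { .viewers = NULL };
        session s = { .destroyed = FALSE };
        watch_after(&mp, &s);
        printf("viewers: %u\n", g_list_length(mp.viewers));
        g_list_free(mp.viewers);
        (void)watch_before;    /* silence the unused-function warning */
        return 0;
    }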

atoppi commented 1 year ago

pinging @tmatth @lionelnicolas since they opened the mentioned issues

tmatth commented 1 year ago

> pinging @tmatth @lionelnicolas since they opened the mentioned issues

@atoppi I'm not able to test currently, but concurrent watch + destroy definitely sounds like the type of condition we would have hit when we saw this.

lionelnicolas commented 1 year ago

@atoppi Thank you for this!

janus_streaming_hangup_media_internal definitely rings a bell from the investigations we did on that issue.

I just did a quick load test using this PR and it seems to work. We wrote a few scripts in the past to try to reproduce the issue addressed by this PR, but we were never able to reproduce it systematically. It was (almost) only happening in production.

I'm going to deploy this PR on all our dev/test/qa environments to see if we catch any issues in the next few days.

atoppi commented 1 year ago

> @atoppi Thank you for this!
>
> janus_streaming_hangup_media_internal definitely rings a bell from the investigations we did on that issue.
>
> I just did a quick load test using this PR and it seems to work. We wrote a few scripts in the past to try to reproduce the issue addressed by this PR, but we were never able to reproduce it systematically. It was (almost) only happening in production.
>
> I'm going to deploy this PR on all our dev/test/qa environments to see if we catch any issues in the next few days.

Sounds good 👍 Do your scripts check the current number of viewers in order to assert a condition (e.g. that it should be zero)? The bare insert of unwanted sessions into the list is enough to trigger undefined behavior.

lionelnicolas commented 1 year ago

No, it's not checking that number. It just runs a sequence of (create mp / add viewer / destroy mp) with random timings, and it only exits if mp creation fails (e.g. the mountpoint already exists, or an API deadlock/timeout).

If I can find some time today I'll try to refine my script to add the viewer-count check, and maybe also put mp creation/destruction in a different thread than viewer join/leave (to try to hit that watch/destroy race condition).

lionelnicolas commented 1 year ago

> 3. janus_streaming_hangup_media_internal (called by session destroy) is handled before the watch request completes

On a side note, when we encountered this issue it seemed to happen a lot more frequently when a peer with a very bad network was involved (which could cause the watch request to take longer to complete?).

atoppi commented 1 year ago

I'm not sure that mp create/destroy is strictly needed to hit the issue. A mp destroy will of course trigger the destruction of the existing sessions, and those session teardowns, along with a concurrent watch request, might lead to unwanted inserts in the viewers list.

As for bad network peers, maybe a poor connection is closing the transport (e.g. the WebSocket), leading to the session being torn down while a watch request is being handled? In general, the time needed to handle a watch request is basically mp/session/configuration checks + JSEP offer generation.

lionelnicolas commented 1 year ago

Maybe I should try to:

  1. [mp] create mp
  2. [viewer] attach viewer
  3. [viewer] send watch
  4. [viewer] close WS immediately after sending watch
  5. [mp] immediately destroy mp

(and try to execute 4. and 5. at the same time?)

Or, instead of destroying the mp, just check the viewer count of the mp via the admin API?

atoppi commented 1 year ago

I'd say just create mp once or reuse a static one, then in parallel threads:

1) attach
2) watch (don't wait for the response)
3) quickly send a random command among: close the WebSocket, detach the handle, destroy the session

After N iterations, check that the mp viewers count is 0. Of course the test might stop early if libasan detects a corruption.

lionelnicolas commented 1 year ago

Ok, thanks, I'll try to check that today.

lionelnicolas commented 1 year ago

I may have good news :smile:

Janus setup

Compiled with libasan and libubsan. Running with mimalloc disabled.

Stress test

I've added this to my janus CLI client, just after sending the watch command:

(...)

            rand_action = random.choice([1, 2, 3, 4])

            if rand_action == 1:
                logger.warning("STRESS : killing websocket")
                await self.ws.close()

            elif rand_action == 2:
                logger.warning("STRESS : detaching plugin")
                await self.send(
                    {
                        "janus": "detach",
                        "session_id": self.data.get("session_id"),
                        "handle_id": self.data.get("handle_id"),
                        "transaction": self.transaction_create("instantiate-listener"),
                    }
                )
                await asyncio.sleep(0.2)
                await self.ws.close()

            elif rand_action == 3:
                logger.warning("STRESS : destroying session")
                await self.send(
                    {
                        "janus": "destroy",
                        "session_id": self.data.get("session_id"),
                        "transaction": self.transaction_create("instantiate-listener"),
                    }
                )
                await asyncio.sleep(0.2)
                await self.ws.close()

            else:
                logger.warning("STRESS : running normal behaviour (watch stream for 2 seconds)")

(...)

Then run 15 parallel jobs with 12000 iterations:

parallel -n0 --line-buffer --jobs 15 "./stress STREAM-TOKEN" ::: $(seq 12000)

Without applying this PR

After 12000 iterations, mountpoint_viewers was 6 after all viewers were torn down.

Using this PR

After 12000 iterations, mountpoint_viewers was 0.

Notes

I have not seen any asan / ubsan warnings in the logs, even when mountpoint_viewers was non-zero.

lionelnicolas commented 1 year ago

To be sure, I wanted to try with 25 parallel jobs:

parallel -n0 --line-buffer --jobs 25 "./stress STREAM-TOKEN" ::: $(seq 8000)

Without applying this PR

After 8000 iterations, mountpoint_viewers was 3 after all viewers were torn down.

Using this PR

After 8000 iterations, mountpoint_viewers was 0.

lionelnicolas commented 1 year ago

I can't approve the PR, but LGTM :+1:

tmatth commented 1 year ago

> Notes
>
> I have not seen any asan / ubsan warnings in the logs, even when mountpoint_viewers was non-zero.

I imagine that's because, although you have dangling pointers in your test, if you're not actually dereferencing them at some point after they've been freed, asan won't complain.

atoppi commented 1 year ago

@lionelnicolas excellent work! That proves the patch is working as intended and will fix at least the "already watching" issue.

@tmatth you're right, the bare insert of a dangling pointer is not enough to trigger the sanitizers, you need to dereference it.
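
As a quick standalone illustration of that point (plain C, unrelated to the Janus code itself): AddressSanitizer only reports a heap-use-after-free when the freed memory is actually read or written, so merely storing the dangling pointer in a list goes unnoticed.

    /* Standalone ASan demo (not Janus code):
     * storing a freed pointer is silent, dereferencing it is reported.
     * Build: gcc -g -fsanitize=address demo.c && ./a.out
     */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct viewer {
        int id;
    } viewer;

    int main(void) {
        viewer *v = malloc(sizeof(viewer));
        v->id = 42;
        free(v);                      /* "session" torn down */

        viewer *viewers[1];
        viewers[0] = v;               /* bare insert of a dangling pointer: no ASan report */
        printf("stored %p without any report\n", (void *)viewers[0]);

        printf("id = %d\n", viewers[0]->id);   /* dereference: heap-use-after-free reported here */
        return 0;
    }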

lminiero commented 1 year ago

Thanks for all the feedback! I'll merge then, and work on a port of the same fix to 1.x too.

lminiero commented 1 year ago

Done for 1.x in the PR above, please double-check that I didn't miss anything.