meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0
8.25k stars 2.48k forks source link

Fix for VideoRoom race condition (see #3124 and 3154) #3167

Closed lminiero closed 1 year ago

lminiero commented 1 year ago

This is a first attempt to fix a nasty (and rare) race condition that can occur in the VideoRoom, and that has been discovered thanks to logs provided in #3154: there's very good chances that #3124 is a duplicate of the same issue, which is why I'm grouping them together here.

The issue that happens is basically the following:

  1. a new handle is attached to the VideoRoom;
  2. a "join" as a subscriber is sent on that handle, which is queued for processing;
  3. in the meanwhile, the handle is detached, so destroy_session and hangup_media are called, which do nothing since at that point the handle is still neither a publisher nor a subscriber;
  4. in parallel, the "join" request is processed and goes through, adding the subscriber as a recipient for the specified publisher(s), but the session is now freed;
  5. some time later, the publisher goes away, and when accessing the session for that subscriber, there's a crash since session has been freed already.

The problem is clearly the race condition between the "join" processing and the detach, which causes a janus_videoroom_subscriber instance to keep on living, even though its session has gone, and refering a session pointer that's now garbage. It's not something that can be easily fixed with just an additional reference, since that may solve the crash, but would leave a leak: in fact, the problem is that the subscriber instance is still in the publisher's list of recipients, and that subscriber can only be removed by a destroy_session and/or hangup_media, but those happened already and are not coming back (the session is over).

As such, the fix this patch attempts is using session_mutex on a "join" (for both publishers and subscribers) as a way to ensure the race condition doesn't happen. In fact, session_mutex is what we use to protect the sessions hashtable, which means it's used when a session is marked as destroyed: we already had checks in "join" for whether the session was destroyed or not, which could be bypassed in case of race conditions like the one above, and now should instead do their job properly.

Marking this PR as a draft now, so that those who reported the problem can test it, and ensure there's no regression. From a quick check I don't think this new mutex usage should cause any instance of inverse lock order problems, but some more testing should help verify no deadlock will occur now. The PR is also a draft just in case it turns out a different mutex may be a better option (e.g., session->mutex). As usual, feedback is more than welcome: I hope that, considering many people use the VideoRoom, we'll get more feedback than usual.

lminiero commented 1 year ago

While there still are some race conditions to address, they seem to be happening in different places for now. As such, I'll start merging these fixes as they help greatly already. We'll address the other problems #3154 spotted in a new PR.