Closed lminiero closed 1 year ago
While there still are some race conditions to address, they seem to be happening in different places for now. As such, I'll start merging these fixes as they help greatly already. We'll address the other problems #3154 spotted in a new PR.
This is a first attempt to fix a nasty (and rare) race condition that can occur in the VideoRoom, and that has been discovered thanks to logs provided in #3154: there's very good chances that #3124 is a duplicate of the same issue, which is why I'm grouping them together here.
The issue that happens is basically the following:
destroy_session
andhangup_media
are called, which do nothing since at that point the handle is still neither a publisher nor a subscriber;The problem is clearly the race condition between the "join" processing and the detach, which causes a
janus_videoroom_subscriber
instance to keep on living, even though its session has gone, and refering a session pointer that's now garbage. It's not something that can be easily fixed with just an additional reference, since that may solve the crash, but would leave a leak: in fact, the problem is that the subscriber instance is still in the publisher's list of recipients, and that subscriber can only be removed by adestroy_session
and/orhangup_media
, but those happened already and are not coming back (the session is over).As such, the fix this patch attempts is using
session_mutex
on a "join" (for both publishers and subscribers) as a way to ensure the race condition doesn't happen. In fact,session_mutex
is what we use to protect the sessions hashtable, which means it's used when a session is marked as destroyed: we already had checks in "join" for whether the session was destroyed or not, which could be bypassed in case of race conditions like the one above, and now should instead do their job properly.Marking this PR as a draft now, so that those who reported the problem can test it, and ensure there's no regression. From a quick check I don't think this new mutex usage should cause any instance of inverse lock order problems, but some more testing should help verify no deadlock will occur now. The PR is also a draft just in case it turns out a different mutex may be a better option (e.g.,
session->mutex
). As usual, feedback is more than welcome: I hope that, considering many people use the VideoRoom, we'll get more feedback than usual.