meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0
8.25k stars 2.48k forks source link

Fix simulated leave message for longer string ID rooms #3243

Closed adnanel closed 1 year ago

adnanel commented 1 year ago

Currently, on abrupt disconnections, participants get stuck in text rooms and aren't able to re-join immediately.

This makes auto reconnect mechanisms needlessly complicated.

In this PR I added automatic leaving of all rooms when the session is being destroyed.

The bug can be reproduced in the official janus textroom demo:

lminiero commented 1 year ago

Thanks for the patch, but I think you're doing those things in the wrong place. You've added some logic to the destroy_session callack, but it's actually the hangup_media_internal one where those things should be done, since once the PeerConnection is gone (whether the session exists or not) then all relationships to room resources are gone. As a matter of fact, we have code for that already:

https://github.com/meetecho/janus-gateway/blob/master/src/plugins/janus_textroom.c#L3009-L3043

which is supposed to do exactly that, that is leaving rooms you're in and destroying the participants resources. If you think that's not working as expected for some reason, please check what needs to be fixed there instead.

lminiero commented 1 year ago

I just tried to replicate your steps, and it seems to work just fine in my case. Obviously the user only leaves once the session timeout kicks in, so ~60s later by default, but the same thing would happen with your patch too, since you're intercepting destroy_session which is triggered by a handle detach, so only once the session timeout gets rid of all related handles.

adnanel commented 1 year ago

If what you're saying is correct (I'm already finishing for the day so can't confirm), the bug I'm trying to fix isn't the same issue I found on the demo, so I should clarify our original problem, then:

We are seeing participants in text rooms well over 60 seconds (instead, they are permanently stuck). The handle and session are both destroyed, but the other participants still see them and they are no longer able to join.

It only works if they properly perform "leave" before. Also the version from this branch seems to also fix our problem.

I will try to analyze it further and see what actually happens.

lminiero commented 1 year ago

Then yes, the problem may be in the code I linked above, which is where the logic for "let's remove the user" is, so that's where I'd dig: I couldn't replicate the problem, as the session timeout cleaned the user correctly for me. It may have something to do with the "fake" message we inject to simulate a "leave" message from the user: my guess is that you do end up calling that code, and the user is removed, but the event is not propagated as it should because of something that goes wrong with that "fake" message.

adnanel commented 1 year ago

Hi @lminiero

I had time to work on this again, turned out your leave command failed for us due to being an invalid JSON, which in turn is invalid due to our (string) room ID being too long (it's an UUID with a suffix).

I changed the PR to a much simpler version which only expands the request string buffer to 200 characters. Will test today and confirm whether this fixes everything.

Edit: Yep, seems to work now...

lminiero commented 1 year ago

@adnanel apologies for the very late response, I've been abroad for the IETF meeting and I'm catching up with all the pending activities. So it was just a matter of the buffer for the "fake" message being too short? If that's the case, it's indeed a very easy fix and I'll merge!