[1.x] Sip session get stuck - Githubissues

meetecho / janus-gateway

Janus WebRTC Server

https://janus.conf.meetecho.com

GNU General Public License v3.0

[1.x] Sip session get stuck #3332

Closed Jonbeckas closed 5 months ago

Jonbeckas commented 6 months ago

What version of Janus is this happening on? 1.2.2; b98e3bb91bd728ce21f6fd56519a303f2775f755

Have you tested a more recent version of Janus too? Yes, on the master branch

Was this working before? Not sure, behaviour of the telephone system we use changed (3cx)

Additional context If a session has been hung up, the JANUS_ICE_HANDLE_WEBRTC_ALERT flag is set in janus_ice_webrtc_hangup and cleared on the next call in janus_ice_setup_local, which is called by janus_plugin_handle_sdp. For a declined incoming SIP call with no SDP body, the flag is never cleared, so during the following hangup janus_ice_webrtc_hangup aborts before the plugin is notified and before the establishing attribute is reset to 0. The session then gets stuck and denies all incoming calls.

lminiero commented 6 months ago

If you can provide replication steps (including maybe a sipp script), I'll try to have a look at what the issue might be. As I wrote in reply to the PR, I don't think the patch is a proper fix, since it would introduce different problems, and so I'd like to investigate a different solution.

Jonbeckas commented 6 months ago

The replication steps are:

Get INVITE without sdp offer
Decline it with Janus
Get another INVITE without an offer
Decline it with Janus
The Janus SIP plugin will keep the session as establishing and deny further calls

I am not really familiar with sipp, but I will try to add a sipp script later.

lminiero commented 6 months ago

If these invites are offerless, then I don't think the core or alert states have anything to do with it: there wouldn't be any SDP to trigger a new PeerConnection establishment. It's much more likely an inconsistent state within the SIP plugin itself. I'll try to replicate and let you know.

lminiero commented 6 months ago

I think I have a better understanding now, and why you were trying to tinker with the alert flag. It's true, as I said, that there's no PeerConnection establishment involved, but that's apparently the very root of the issue, rather than the reason why it shouldn't happen.

Basically, an offerless INVITE means no SDP and so, again, no PC. At the same time, though, when you decline the call, we invoke the close_pc() function in the core from the plugin, to clean up any WebRTC resource that may have been allocated; this results in the alert flag being set to true, and the hangup_media() callback being called on the plugin, which resets the plugin flags (establishing, established). So the first time it happens, it works fine: the problem, though, is that there's no actual WebRTC cleanup happening (we never initialized a PC) and so alert stays true. At the second offerless INVITE, the same thing happens, but this time the call to close_pc() finds alert already true, which means hangup_media() is not called again on the plugin (we do that to avoid duplicates from the same event). As a result, the plugin establishing flag remains set, and further calls are automatically rejected, due to a broken state in the plugin itself.
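The sequence described above can be modeled with a small toy program. This is only a sketch and not Janus code: the `toy_*` types and functions are hypothetical stand-ins for the core handle, the SIP plugin session, close_pc() and hangup_media(), and they assume the guard works exactly as described in this comment.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the core handle: only the alert flag is modeled. */
typedef struct {
    bool alert;
} toy_handle;

/* Hypothetical stand-in for the SIP plugin session state. */
typedef struct {
    bool establishing;
    bool established;
} toy_session;

/* Models hangup_media(): the plugin resets its call-state flags. */
void toy_hangup_media(toy_session *s) {
    s->establishing = false;
    s->established = false;
}

/* Models close_pc(): the plugin is notified only if alert is not set yet. */
void toy_close_pc(toy_handle *h, toy_session *s) {
    if (h->alert)
        return;             /* duplicate guard: no hangup_media() this time */
    h->alert = true;
    toy_hangup_media(s);
    /* No PeerConnection was ever created, so nothing clears alert later. */
}

/* Models an offerless INVITE that gets declined.
 * Returns false if the plugin rejects it up front as "already in a call". */
bool toy_offerless_invite_declined(toy_handle *h, toy_session *s) {
    if (s->establishing || s->established)
        return false;
    s->establishing = true;
    toy_close_pc(h, s);     /* declining triggers the cleanup request */
    return true;
}
```

Starting from an all-false handle and session, the first call to `toy_offerless_invite_declined()` leaves `establishing` false and `alert` true, the second leaves `establishing` stuck at true, and the third returns false, which mirrors the stuck session reported here.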

I'm wondering now what the right approach would be to address this. The "easy" fix would be to handle this directly in the SIP plugin, but in practice other plugins could end up in the same situation in some cases (although that depends on how they handle signaling, and it would take two consecutive close_pc calls for two consecutive "no PC" situations, so it's much less likely). I'm still not convinced your PR addresses it properly, since it could break some core states. I'll think about it some more and let you know when I come up with a potential fix.

lminiero commented 6 months ago

@Jonbeckas can you try this diff?

diff --git a/src/ice.c b/src/ice.c
index da8ffd10..dc5ef226 100644
--- a/src/ice.c
+++ b/src/ice.c
@@ -1685,6 +1685,7 @@ static void janus_ice_webrtc_free(janus_ice_handle *handle) {
        return;
    janus_mutex_lock(&handle->mutex);
    if(!handle->agent_created) {
+       janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_ALERT);
        janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_NEW_DATACHAN_SDP);
        janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_READY);
        janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_CLEANING);

This immediately resets alert back to false if, by the time we get to freeing the resources (regardless of how we got there), there's actually nothing to clean up. In my local SIPp tests I can't replicate the issue anymore, but it would be better to test this in other SIP scenarios too. As soon as you can confirm it doesn't break anything for you, I'll push the fix upstream.

Jonbeckas commented 6 months ago

The patch seems to change the bug a bit for me. The scenario I described above works now, but after I accept a call and later hang up, the next offerless call that is denied leaves the session in an establishing=1 state again.

lminiero commented 6 months ago

Mh, thinking about it, that was to be expected, and would have happened even before this patch. When a regular call is closed, the same will happen (close_pc → hangup_media) but in this case alert will remain true: normally it's unset only when a new call starts, in fact. This means that after that successful call, a new offerless invite being declined will find alert set to true and not trigger the hangup_media call, thus causing the same problem as before.
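A toy model of this second scenario, under the same caveats: the `toy_*` names are hypothetical, `toy_webrtc_free()` only imitates the first diff (alert is cleared solely when no PeerConnection existed), and the assumption that starting a regular call clears alert is taken from the description in this thread.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the core handle. */
typedef struct {
    bool alert;
    bool agent_created;   /* true only while a PeerConnection exists */
} toy_handle;

/* Hypothetical stand-in for the SIP plugin session state. */
typedef struct {
    bool establishing;
    bool established;
} toy_session;

/* Imitates the first diff: clear alert only if there was nothing to free. */
void toy_webrtc_free(toy_handle *h) {
    if (!h->agent_created) {
        h->alert = false;
        return;
    }
    h->agent_created = false;   /* real teardown: alert is left set */
}

/* Models close_pc() followed by hangup_media() and the resource cleanup. */
void toy_close_pc(toy_handle *h, toy_session *s) {
    if (h->alert)
        return;                 /* plugin is not notified again */
    h->alert = true;
    s->establishing = false;    /* what hangup_media() does in the plugin */
    s->established = false;
    toy_webrtc_free(h);
}

/* Models a regular call with SDP that is accepted and later hung up. */
void toy_accepted_call_then_hangup(toy_handle *h, toy_session *s) {
    h->alert = false;           /* assumed: cleared when a new call starts */
    h->agent_created = true;
    s->established = true;
    toy_close_pc(h, s);
}

/* Models a declined offerless INVITE; false means rejected up front. */
bool toy_offerless_invite_declined(toy_handle *h, toy_session *s) {
    if (s->establishing || s->established)
        return false;
    s->establishing = true;
    toy_close_pc(h, s);
    return true;
}
```

In this model two declined offerless INVITEs in a row both leave `establishing` false, but one accepted call followed by a hangup leaves `alert` set, so the next declined offerless INVITE leaves `establishing` stuck at true again.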

In theory, the most obvious fix would be to ensure we reset the alert flag when we've cleaned up resources, but I'm wondering if that may cause issues in some cases. As I mentioned, we use that flag to also prevent multiple hangup_media occurrences (e.g., when different things cause a PC to close), and having it reset right away instead of right before the next call may cause that to break. It may even cause a loop, if the plugin is wrongly wired (e.g., close_pc and hangup_media triggering each other). I'll think about this some more.
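The duplicate-prevention concern can be illustrated with a minimal counter. This is a hypothetical sketch, not the core's logic: `toy_close_pc()` and its `clear_immediately` switch are invented here to contrast "alert stays set until the next call" with "alert is reset right after cleanup".

```c
#include <stdbool.h>

/* Hypothetical handle: the alert flag plus a counter of plugin notifications. */
typedef struct {
    bool alert;
    int hangup_media_calls;
} toy_handle;

/* Models close_pc(); clear_immediately imitates resetting alert right away. */
void toy_close_pc(toy_handle *h, bool clear_immediately) {
    if (h->alert)
        return;                     /* duplicate suppressed */
    h->alert = true;
    h->hangup_media_calls++;        /* plugin would get hangup_media() here */
    if (clear_immediately)
        h->alert = false;
}

/* Two different causes request the same teardown; count the notifications. */
int toy_count_notifications(bool clear_immediately) {
    toy_handle h = { false, 0 };
    toy_close_pc(&h, clear_immediately);   /* e.g. the plugin asks to close */
    toy_close_pc(&h, clear_immediately);   /* e.g. a second trigger for the same PC */
    return h.hangup_media_calls;
}
```

With the flag left set, `toy_count_notifications(false)` yields a single notification. With an immediate reset, `toy_count_notifications(true)` yields two, which is the kind of duplicate the flag is meant to prevent.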

lminiero commented 6 months ago

While I think of the implications, you can give the following patch a try, which always resets the alert flag when cleaning WebRTC resources:

diff --git a/src/ice.c b/src/ice.c
index da8ffd10..96b149d1 100644
--- a/src/ice.c
+++ b/src/ice.c
@@ -1685,6 +1685,7 @@ static void janus_ice_webrtc_free(janus_ice_handle *handle) {
        return;
    janus_mutex_lock(&handle->mutex);
    if(!handle->agent_created) {
+       janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_ALERT);
        janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_NEW_DATACHAN_SDP);
        janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_READY);
        janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_CLEANING);
@@ -1755,6 +1756,7 @@ static void janus_ice_webrtc_free(janus_ice_handle *handle) {
        janus_ice_notify_hangup(handle, handle->hangup_reason);
    }
    handle->hangup_reason = NULL;
+   janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_ALERT);
    janus_mutex_unlock(&handle->mutex);
    JANUS_LOG(LOG_INFO, "[%"SCNu64"] WebRTC resources freed; %p %p\n", handle->handle_id, handle, handle->session);
 }

Please let me know if you notice any regression.

Jonbeckas commented 6 months ago

The patch works like a charm for me.

lminiero commented 6 months ago

FYI, after careful consideration I've decided this will not be the patch I'll commit, due to the considerations I've made before. I'll instead ensure that alert is set to true as a default, since the anomaly was that a hangup_media was following a close_pc the very first time you sent an offerless INVITE, and that's wrong. This means I'll work on a fix in the SIP plugin itself.

I'll let you know when a patch is ready. I'll probably prepare a PR, so that more people can test the effect on other plugins as well.

lminiero commented 6 months ago

@Jonbeckas please test the PR above, which attempts the fix in a different way. It should address both scenarios you had problems with. You may want to test more, though, just to ensure nothing else breaks. Notice I also fixed the error code we send back by default when declining: for some reason it was 486 instead of 603.

Jonbeckas commented 6 months ago

The PR works for me.