meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0
8.17k stars 2.47k forks source link

Freeze in infinite loop when push event to core with jsep #1735

Closed lqp276 closed 5 years ago

lqp276 commented 5 years ago

Precondition: limit the ice port range, say, 20000 - 20002(just for reproduce problem) The dead loop is formed in below code:

janus.c: janus_plugin_handle_sdp

if(!updating && !janus_ice_is_full_trickle_enabled()) {
    /* Wait for candidates-done callback */
    while(ice_handle->cdone < 1) {
        if(ice_handle == NULL || janus_flags_is_set(&ice_handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_STOP)
                || janus_flags_is_set(&ice_handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_ALERT)) {
            JANUS_LOG(LOG_WARN, "[%"SCNu64"] Handle detached or PC closed, giving up...!\n", ice_handle ? ice_handle->handle_id : 0);
            janus_sdp_destroy(parsed_sdp);
            return NULL;
        }
        JANUS_LOG(LOG_VERB, "[%"SCNu64"] Waiting for candidates-done callback...\n", ice_handle->handle_id);
        g_usleep(10000);
        if(ice_handle->cdone < 0) {
            JANUS_LOG(LOG_ERR, "[%"SCNu64"] Error gathering candidates!\n", ice_handle->handle_id);
            janus_sdp_destroy(parsed_sdp);
            return NULL;
        }
    }
}

As per the code, to break out the while loop, there are two case:

  1. _icehandle->cdone >=1, this will happen when janus_ice_cb_candidate_gathering_done called
  2. _icehandle->cdone < 0, no code will set to this condition.

So if janus_ice_cb_candidate_gathering_done will not called(which will happen when all port ehausted and no more port for new candidate), the loop will wait forever.

I run into this problem when test the case with limited ports, and bypass this by add a loop counter, and check it alongside the ice_handle->cdone < 0:

    wait_count++;
    if(ice_handle->cdone < 0 || wait_count >= 10) {
        JANUS_LOG(LOG_ERR, "[%"SCNu64"] Error gathering candidates!\n", ice_handle->handle_id);
        janus_sdp_destroy(parsed_sdp);
        return NULL;
    }

Then check this error in plugin and handle the error accordingly,

lminiero commented 5 years ago

Thanks for reporting this. I'm currently on vacation, I'll fix this when I'm back.

atoppi commented 5 years ago

I think the best solution here would be to insert a candidate gathering timeout in ice.c and let the loop quit by using a termination flag like JANUS_ICE_HANDLE_WEBRTC_STOP or JANUS_ICE_HANDLE_WEBRTC_ALERT.

lminiero commented 5 years ago

IIRC we did fix this issue already. When ports are exhausted, nice_agent_gather_candidates fails, which we now intercept here:

https://github.com/meetecho/janus-gateway/blob/master/ice.c#L3384

Are you not getting that error message? What libnice version are you using?

lminiero commented 5 years ago

Just tried with two different scenarios:

EchoTest (Janus receives offer and sends answer)

[ERR] [ice.c:janus_ice_setup_local:3385] [8642029753275873] Error gathering candidates...
[ERR] [janus.c:janus_process_incoming_request:1259] Error setting ICE locally

Streaming (Janus generates offer and expects answer)

[ERR] [ice.c:janus_ice_setup_local:3385] [8014781341914643] Error gathering candidates...
[ERR] [janus.c:janus_plugin_handle_sdp:3197] [8014781341914643] Error setting ICE locally
[ERR] [janus.c:janus_plugin_push_event:3073] [8014781341914643] Cannot push event (JSON error: problem with the SDP)

In both cases I see what I expect to see: janus_ice_setup_local failing because there are no free ports (nice_agent_gather_candidates returns FALSE), and so no infinite loop. @lqp276 can you confirm you're on master, and let us know the version of libnice you're using?

As a side note, I do see some segfaults, though, probably caused by a missing reference increase somewhere, that causes the handle to be freed before it should. I'll investigate this.

lminiero commented 5 years ago

The above two commits fix it for me, and I can't replicate the issue. Closing, feel free to reopen if you can confirm it still happens for you on master, and if you can provide steps to replicate.

lqp276 commented 5 years ago

The bug was found on my local branch. The version I used is checked out on March 18,2019, and thereafter development was made base on it, and the code has diverged since then.

In the code I used, the check wasn't there, so the freeze happens. ice.c

3365     stream->component = component;
3366 #ifdef HAVE_PORTRANGE
3367     /* FIXME: libnice supports this since 0.1.0, but the 0.1.3 on Fedora fails with an undefined reference! */
3368     nice_agent_set_port_range(handle->agent, handle->stream_id, 1, rtp_range_min, rtp_range_max);
3369 #endif
3370     nice_agent_gather_candidates(handle->agent, handle->stream_id);

I then pull the code from github and test the master version, no freeze happens, so the check on value returned by nice_agent_gather_candidates fixed this bug.

I'm sorry for that I didn't noticed this has been fix on newest branch, next time I will test the master branch before open new issue.