meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0
8.11k stars 2.46k forks source link

[1.x] videoroom: remote publisher doesn't release RTP/RTCP ports #3345

Closed boddob closed 5 months ago

boddob commented 5 months ago

What version of Janus is this happening on? 1202 (commit: 3773de68a973020cc4670acf84be2f061104e5f8)

Have you tested a more recent version of Janus too? This is reproducible on the latest commit, we use 1200 (commit: e3a2aea1b1f7f2788c70b39ca8cb0642cbd1e7fa) for production on which we first saw it.

Was this working before? This issue has been there since we've started using it.

Is there a gdb or libasan trace of the issue? Don't have a trace, but have logs with refcount debug enabled, which seems to point to a refcount issue. Shared the gist below. https://gist.github.com/boddob/e5f82c19e664840083fc8f8f0f0979b7

The remote publisher refcount object to look at is "0xffff50001680". A cat of the gist content shows how the refcount seems to have some issue.

cat janus_refcount_gist.txt | grep 0xffff50001680 [plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7202:init] 0xffff50001680 (1) [plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7221:increase] 0xffff50001680 (2) [plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7221:increase] 0xffff50001680 (3) [plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7313:increase] 0xffff50001680 (4) [plugins/janus_videoroom.c:janus_videoroom_remote_publisher_thread:12835:increase] 0xffff50001680 (5) [plugins/janus_videoroom.c:janus_videoroom_publisher_stream_destroy:2344:decrease] 0xffff50001680 (2117) [plugins/janus_videoroom.c:janus_videoroom_publisher_stream_destroy:2344:decrease] 0xffff50001680 (2116) [plugins/janus_videoroom.c:janus_videoroom_publisher_dereference:2387:decrease] 0xffff50001680 (2115) [plugins/janus_videoroom.c:janus_videoroom_publisher_destroy:2429:decrease] 0xffff50001680 (2114) [plugins/janus_videoroom.c:janus_videoroom_remote_publisher_thread:13131:decrease] 0xffff50001680 (2113)

Additional context

Background: We use the remote publishing/SFU cascading feature to support video rooms on multiple Janus servers. We have a separate orchestrator service on each server, that does the job of adding and removing remote publishers. The design was inspired from the SFU cascading blog post. These are EC2 servers running on aarch64 CPU(s).

Problem: Over time, the servers run out of free RTP/RTCP ports, and the request "add_remote_publisher" fails with the error code: 499, "Could not open UDP socket for RTP stream". This happens even though we ensure "remove_remote_publisher" is called for the same publisher once we're done.

Observations:

When testing this in an isolated environment, it was observed that the refcount for the remote publisher (initalized in line no. 7202 of the current src/plugin/janus_videoroom.c), doesn't behave as expected. At the point we try to remove the remote publisher, the refcount is a very high number, and the function "janus_videoroom_publisher_free" is never called. Interestingly, the refcount value seems to grow with time, i.e for how long the "remote publisher" was active and receiving data from another instance. I have shared the logs of this experiment, with refcounting enabled.

Our orchestrator that talks to Janus has multiple threads, and each one creates a different session, and each thread took up on requests/events. So, the session that adds the remote publisher could be different from the session that removes the remote publisher. Our initial suspicion was that, but the issue persists even if the orchestrator uses a single session/thread.

atoppi commented 5 months ago

Hi @boddob thanks for reporting. Could you try the attached patch? It's on top of current master but it's just the removal of one line, it should not be difficult to adapt it to your current commit.

remove_extra_ref_vr_remote_pub.txt

boddob commented 5 months ago

@atoppi thanks for sharing this. I'll give it a try and get back!

boddob commented 5 months ago

@atoppi I gave this a try and it works as expected now. I'm seeing that the sockets are released now. Thanks! Is there something more to test here to validate the change?

Mar 19 10:46:48.477174 : [plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7202:init] 0xffff6000f840 (1) Mar 19 10:46:48.477174 : [plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7221:increase] 0xffff6000f840 (2) Mar 19 10:46:48.477174 : [plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7221:increase] 0xffff6000f840 (3) Mar 19 10:46:48.480002 : [plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7313:increase] 0xffff6000f840 (4) Mar 19 10:46:48.480002 : [plugins/janus_videoroom.c:janus_videoroom_remote_publisher_thread:12835:increase] 0xffff6000f840 (5) Mar 19 10:46:48.982088 : [plugins/janus_videoroom.c:janus_videoroom_handler:9473:increase] 0xffff6000f840 (6) Mar 19 10:46:49.004447 : [plugins/janus_videoroom.c:janus_videoroom_handler:9809:decrease] 0xffff6000f840 (5) Mar 19 10:48:52.771590 : [plugins/janus_videoroom.c:janus_videoroom_publisher_stream_destroy:2344:decrease] 0xffff6000f840 (4) Mar 19 10:48:52.771590 : [plugins/janus_videoroom.c:janus_videoroom_publisher_stream_destroy:2344:decrease] 0xffff6000f840 (3) Mar 19 10:48:52.771590 : [plugins/janus_videoroom.c:janus_videoroom_publisher_dereference:2387:decrease] 0xffff6000f840 (2) Mar 19 10:48:52.771590 : [plugins/janus_videoroom.c:janus_videoroom_publisher_destroy:2429:decrease] 0xffff6000f840 (1) Mar 19 10:48:52.771590 : [plugins/janus_videoroom.c:janus_videoroom_remote_publisher_thread:13130:decrease] 0xffff6000f840 (0)

atoppi commented 5 months ago

I guess due to the nature of the fix a quick test is enough but I'll let @lminiero reply in case he feels more testing is needed.

lminiero commented 5 months ago

If you can test it a few more days it will help, just to be sure. But I agree this is 99.99999% likely the cause of the issue.

boddob commented 5 months ago

@atoppi @lminiero thanks, I'll test for some more time and get back.

atoppi commented 5 months ago

@boddob the patch has been merged, thanks for testing.

boddob commented 5 months ago

@atoppi sorry for not getting back in time. I've been running this under heavy loads, and I've not faced any issues.

zevarito commented 4 months ago

I've deployed 2 servers with this patch, one about a week ago, and seems running fine, metrics doesn't show up improvements or regressions compared with the rest of the servers so far. Yesterday I've deployed a second one, that unfortunately crashed about 10hs later. Core dump core.hloop 2.1 was somehow truncated and I am not able to debug and post results. Just wanted to share this in case someone else is also facing a similar issue, for now I would prefer to no update the reset of the servers until have a better insight about the issue.

boddob commented 3 months ago

@zevarito I tested this patch for a week with the latest janus-gateway code at the time and hadn't seen any crash. For production, we have this running for over a month, but patched on an older janus-gateway version (1200, commit : e3a2aea1b1f7f2788c70b39ca8cb0642cbd1e7fa). Will update here if I see any issues.