meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0

[1.x] Performance issue/improvement audiobridge #3394

Closed robby2016 closed 1 day ago

robby2016 commented 3 months ago

We recently ran into an issue where Janus consumes a lot of resources even when the instance isn't in use, including right after a fresh restart. After some investigation, we figured out that the problem is probably related to having a lot of permanent audiobridge rooms (saved in the config file). As far as I understand it, the issue is that on startup Janus starts one "janus_audiobridge_mixer_thread" per created audiobridge room. Sadly I can't really provide a usable PR, but maybe I can suggest moving the creation of "audiobridge->thread" into "janus_audiobridge_handler" and creating the mixer only when a user joins. Ideally you could also stop the mixer once the last participant leaves, but for now that can be worked around by simply restarting Janus.

For testing, I moved the creation of the mixer here:

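if (audiobridge->thread == NULL) {
        char myname[16];
        GError *myerror = NULL;
        g_snprintf(myname, sizeof(myname), "mixer %s", audiobridge->room_id_str); /* myname was uninitialized before */
        audiobridge->thread = g_thread_try_new(myname, &janus_audiobridge_mixer_thread, audiobridge, &myerror);
        if (myerror != NULL) { /* thread creation failed: log the reason and free the error */
                JANUS_LOG(LOG_ERR, "Error launching the mixer thread: %s\n", myerror->message ? myerror->message : "??");
                g_error_free(myerror);
        }
}

lminiero commented 3 months ago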

I doubt the problem is the thread. The problem is likely that you have something like "always on" set on the mixer, meaning it will work even when there's no one in. That's something you can disable, and basically already what you were planning to add.

robby2016 commented 3 months ago

Wow, thanks for the fast reply. I did see the "always_on" variable, but thought it only applied if you use rtp_forwarders. In our case the rooms don't use RTP forwarding and have a pretty minimal setup. Thank you, I'll take a closer look at the "always_on" property.

e.g.


room-name :
{
  description = "Room desc";
  is_private = "yes";
  sampling_rate = "16000";
  secret = "123456";
  pin = "123456";
  audiolevel_ext = "yes";
  audiolevel_event = "yes";
  audio_active_packets = "50";
  audio_level_average = "50";
  record = "false";
};

atoppi commented 3 months ago

Could you please quantify both "a lot of resources" and "a lot of permanent audiobridge rooms"?

The only thing an idle mixer thread does is wake up every 5ms and check every 15ms whether there is some work to do:

https://github.com/meetecho/janus-gateway/blob/2a1db57ef737a272faa8fc433fce5266603a4737/src/plugins/janus_audiobridge.c#L8185-L8217

I'd not expect a high usage of resources for this bunch of operations, even though this is not the best design in terms of efficiency when dealing with your scenario (many idle static rooms).

robby2016 commented 3 months ago

Sorry, you're right, I should have given more information about the setup and performance. On a small testing VPC (2 cores) with 400 rooms configured, the load is around 5, basically making the system unusable. On a much larger dedicated system with 64 cores, I tested around 5000 rooms, which caused a load of around 10.

atoppi commented 3 months ago

Confirmed the behavior. On my 16-core machine, 10k idle rooms basically melt the CPU.

Prepared a flame graph with 2500 bridges: perf-1718206307

The issue seems to be the scheduling of the tasks due to the huge amount of sleep calls.
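
As a rough back-of-the-envelope check of why the sleep calls add up, the sketch below multiplies the number of idle rooms by the wakeup rate of a single mixer thread. It assumes every idle mixer wakes up once per 5ms sleep, as described above. The function name idle_wakeups_per_second is made up for illustration and is not part of Janus.

```c
#include <assert.h>

/* Hypothetical helper (not part of Janus): estimated thread wakeups per
 * second caused by idle mixers, assuming each one sleeps sleep_ms per loop */
static long idle_wakeups_per_second(long rooms, long sleep_ms) {
	assert(rooms >= 0 && sleep_ms > 0);
	return rooms * 1000L / sleep_ms;
}
```

Under that assumption, 400 idle rooms come to about 80,000 wakeups per second, 2500 rooms to 500,000, and 10k rooms to 2,000,000, all before any audio is mixed.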

lminiero commented 3 months ago

I'd argue that this is not an issue. The AudioBridge implements a mixer: hundreds or thousands of idle ones make little sense to me, and load is to be expected. At any rate, the fix would not IMHO be moving where you create the thread: the moment someone gets in and leaves a second later, you're back to square one. Maybe this could be solved using conditions and signals, but I'm not sure, as I haven't given this enough thought. That said, to be honest this has a low priority for me: if you're willing to prepare a PR that doesn't involve moving the thread creation, I'd be happy to review it :v:

robby2016 commented 3 months ago

First, let me say thanks for the great help. I do understand that this is maybe more of an "edge" case and doesn't have a high priority. Maybe we could mitigate the problem instead: leave the threads alone and simply adjust the sleeping time of the mixer based on usage?

Something like this maybe?

lminiero commented 3 weeks ago

@robby2016 apologies for the late answer. I don't think adjusting the sleep time would help, as it would make rooms poorly responsive in those transitions. I personally still think conditions/signals would be the way to go. If you're willing to explore that road in a PR, please do let us know and we'll review it.

robby2016 commented 1 day ago

@lminiero Agreed, a better solution would be something that would use conditions or signals somehow. I'll look into that. Thanks again for the great help :)
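
To make the conditions/signals idea concrete, here is a minimal sketch of a mixer thread that blocks on a condition variable while its room is empty, instead of sleeping and polling. It is not the Janus implementation: the idle_room struct and all idle_room_* functions are hypothetical, and it uses POSIX threads only to stay self-contained (in a GLib codebase, GMutex and GCond with g_cond_wait and g_cond_signal would be the analogous primitives). For simplicity the placeholder "mixing" runs with the mutex held, which a real mixer would likely avoid.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical, simplified room: NOT the actual Janus audiobridge struct */
typedef struct idle_room {
	pthread_mutex_t mutex;
	pthread_cond_t cond;
	pthread_t thread;
	int participants;	/* users currently in the room */
	bool destroyed;		/* set when the room is being torn down */
	long rounds;		/* mixing rounds run so far (stand-in for real work) */
} idle_room;

static void *idle_room_mixer(void *data) {
	idle_room *room = data;
	pthread_mutex_lock(&room->mutex);
	while(!room->destroyed) {
		if(room->participants == 0) {
			/* Empty room: block with no timeout, so no periodic wakeups */
			pthread_cond_wait(&room->cond, &room->mutex);
			continue;
		}
		room->rounds++;	/* a real mixer would mix audio here */
		/* Pace the active loop: wait up to 5ms, or less if signalled */
		struct timespec until;
		timespec_get(&until, TIME_UTC);
		until.tv_nsec += 5000000L;
		if(until.tv_nsec >= 1000000000L) {
			until.tv_sec++;
			until.tv_nsec -= 1000000000L;
		}
		pthread_cond_timedwait(&room->cond, &room->mutex, &until);
	}
	pthread_mutex_unlock(&room->mutex);
	return NULL;
}

/* Initializes the room and starts its mixer; returns 0 on success */
static int idle_room_start(idle_room *room) {
	pthread_mutex_init(&room->mutex, NULL);
	pthread_cond_init(&room->cond, NULL);
	room->participants = 0;
	room->destroyed = false;
	room->rounds = 0;
	return pthread_create(&room->thread, NULL, idle_room_mixer, room);
}

/* A participant joins: signal the condition to wake the idle mixer */
static void idle_room_join(idle_room *room) {
	pthread_mutex_lock(&room->mutex);
	room->participants++;
	pthread_cond_signal(&room->cond);
	pthread_mutex_unlock(&room->mutex);
}

/* A participant leaves: the mixer notices on its next pass and blocks again */
static void idle_room_leave(idle_room *room) {
	pthread_mutex_lock(&room->mutex);
	assert(room->participants > 0);
	room->participants--;
	pthread_mutex_unlock(&room->mutex);
}

static long idle_room_rounds(idle_room *room) {
	pthread_mutex_lock(&room->mutex);
	long rounds = room->rounds;
	pthread_mutex_unlock(&room->mutex);
	return rounds;
}

/* Wakes the mixer so it can exit, then waits for it and cleans up */
static void idle_room_stop(idle_room *room) {
	pthread_mutex_lock(&room->mutex);
	room->destroyed = true;
	pthread_cond_signal(&room->cond);
	pthread_mutex_unlock(&room->mutex);
	pthread_join(room->thread, NULL);
	pthread_cond_destroy(&room->cond);
	pthread_mutex_destroy(&room->mutex);
}
```

In this sketch the thread still exists for every room, so thread creation does not move, but an empty room costs no scheduling work until idle_room_join signals it. Because the join wakes the mixer immediately, the room does not become less responsive the way a longer sleep would make it.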