meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0

[1.x] Janus process killed after 15 participants, seems hloop problems #3200

azimot closed this issue 1 year ago

azimot commented 1 year ago

OS: Ubuntu 20.04 (updated)
CPU: 10 cores, Xeon(R) CPU E7-4850 v4 @ 2.10GHz
RAM: 16GB

Janus version: 1.1.1
libnice version: 0.1.17
libsrtp version: 2.2.0
usrsctp version: 0.9.5.0
libwebsockets version: 4.3.2
Transport in use: WebSocket

plugins: { disable = "libjanus_voicemail.so,libjanus_recordplay.so,libjanus_echotest.so,libjanus_nosip.so,libjanus_streaming.so,libjanus_videocall.so,libjanus_textroom.so" }

transports: { disable = "libjanus_rabbitmq.so,libjanus_pfunix.so" }

I'm using the VideoRoom plugin with 1 publisher and 15 subscribers. Everything was fine while I had just 15 subscribers. After the 16th joined, Janus crashed and was killed by the OS. I checked the processes and there are many hloop threads: more than 325.

Logs and a crash report are attached:

top with apport

As you can see, there are more than 320 hloop threads before the apport process eats up the CPU. At first I thought apport was buggy and that this was why Linux killed low-priority processes, Janus included, so I disabled apport and ran the test again.

process log without apport process

I checked the thread limits on Linux, but they are fine and high enough.

_opt_janus_bin_janus.0.crash

Is hloop related to libnice? Do I need to use a newer libnice version?

As far as I can see there are no zombie processes. Linux killed the Janus process during normal usage. Any ideas on how to resolve this, or similar experiences?

atoppi commented 1 year ago

The cause of the crash is not clear, since you did not share a backtrace from gdb or a report from ASan.

hloop threads are what Janus uses internally to handle media packets and PeerConnection-related events. They are not created by libnice, but they are needed to handle the event loop in libnice.

For such a small number of participants (16), the number of hloop threads should be way lower. A living hloop might indicate one of the following:

We definitely need further data:
1) Your current janus.jcfg (to rule out a misconfiguration in event loops)
2) A gdb backtrace or ASan report (that will clarify the reason of the crash)
3) A Janus Admin API dump, in particular a list_sessions request and, for each session returned, a list_handles request, fetched before the crash (you can poll the API every X seconds; this is to understand whether you are leaking resources on the server)
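The polling suggested in point 3 could be sketched roughly as follows. This is a minimal illustration, not an official client: it assumes the Admin API is reachable over HTTP at `http://localhost:7088/admin` and is protected by an `admin_secret`, and the `ADMIN_URL` and `ADMIN_SECRET` values, as well as the helper names, are placeholders to adapt to your own janus.jcfg.

```python
import json
import time
import urllib.request
import uuid

# Assumptions: adjust both values to match your own Janus configuration.
ADMIN_URL = "http://localhost:7088/admin"
ADMIN_SECRET = "changeme"  # placeholder, not a real secret


def http_post(url, body):
    """POST a JSON body and return the decoded JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def admin_request(command, path="", post=None):
    """Send one Admin API command; `post` can be swapped out for testing."""
    body = {
        "janus": command,
        "transaction": uuid.uuid4().hex,
        "admin_secret": ADMIN_SECRET,
    }
    send = post or http_post
    return send(ADMIN_URL + path, body)


def snapshot(post=None):
    """Return {session_id: [handle_id, ...]} for every active session."""
    sessions = admin_request("list_sessions", post=post).get("sessions", [])
    return {
        sid: admin_request("list_handles", f"/{sid}", post=post).get("handles", [])
        for sid in sessions
    }


def count_handles(snap):
    """Total number of handles across all sessions in a snapshot."""
    return sum(len(handles) for handles in snap.values())


def poll(interval_seconds, rounds):
    """Print session and handle counts every `interval_seconds`."""
    for _ in range(rounds):
        snap = snapshot()
        print(time.strftime("%H:%M:%S"), len(snap), "sessions,", count_handles(snap), "handles")
        time.sleep(interval_seconds)
```

Calling `poll(5, 120)` while participants join would log the counts every 5 seconds for 10 minutes. A handle count that keeps growing after participants leave would point to a resource leak.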

lminiero commented 1 year ago

Also make sure you didn't forget to increase the ulimit: https://janus.conf.meetecho.com/docs/FAQ.html#ulimit

atoppi commented 1 year ago

I was assuming you are using the VideoRoom in multistream mode. If that is not the case (e.g. you are using Janus 1.x with 0.x syntax), then that number of hloop threads might not be significant, since in a VR with 16 participants you might end up having up to 16^2 = 256 active handles and hloop threads. In that case, as Lorenzo mentioned, you are probably hitting a kernel limit.
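The 16^2 estimate can be reproduced with a small sketch. It assumes the worst case where every participant publishes and also subscribes to every other participant, with one handle per publisher and one handle per subscription in the legacy (0.x-style) approach. The multistream figure assumes one publisher handle plus a single subscriber handle per participant. Both function names are illustrative only.

```python
def legacy_handles(participants):
    """One handle per publisher plus one handle per individual subscription."""
    publishers = participants
    subscriptions = participants * (participants - 1)
    return publishers + subscriptions


def multistream_handles(participants):
    """One publisher handle plus one subscriber handle per participant (assumption)."""
    return 2 * participants


print(legacy_handles(16))       # 256
print(multistream_handles(16))  # 32
```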

azimot commented 1 year ago

@atoppi @lminiero Exactly, Lorenzo: the cause is related to ulimit!

How I solved the problem:
ulimit -a shows almost all the user limits defined by default in Linux
cat /proc/$(pidof janus)/limits shows the limits that apply to the Janus process

Temporarily increase the limits to check whether the problem is solved:
ulimit -n 60000
ulimit -Hn 600000
ulimit -Sn 60000
Note: ulimit -n alone is not enough; both the ulimit -Sn and ulimit -Hn switches must be used.
I assume that to set them permanently you need to edit the limits.conf file:
nano /etc/security/limits.conf

#ulimit -Hn 1048576
root hard nofile 1048576
#ulimit -Sn 60000
root soft nofile 60000

root is the user that Janus runs as. Most tutorials use * (star) to cover all users, but that did not work here: you must name the exact user the new settings should apply to.

After a reboot, we can check that cat /proc/$(pidof janus)/limits shows the correct values.
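The same check can be scripted. The sketch below is Linux-only and makes a few assumptions: it parses the "Max open files" row of /proc/<pid>/limits, counts the entries under /proc/<pid>/fd (which usually requires running as the same user or as root), and uses helper names that are purely illustrative.

```python
import os
import resource


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            # Values are integers or the literal string "unlimited".
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


def process_limits(pid):
    """Read the open-files limits that apply to a running process."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_max_open_files(f.read())


def open_fd_count(pid):
    """Count the file descriptors a process currently holds."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def own_limits():
    """Soft and hard RLIMIT_NOFILE of the current Python process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

For example, `process_limits(pid)` and `open_fd_count(pid)`, with `pid` taken from `pidof janus`, show how close the server is to its soft limit.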

Lorenzo, isn't this a common problem? 15-16 participants is regular usage for the VideoRoom plugin, so these settings should be mentioned in the main Janus installation tutorial on the website and in README.md too.

The problem is solved, but I'm pretty confused! It was solved by ulimit -Sn, which relates to the open files limit (disk I/O, read/write). As far as I know the Janus architecture, as long as we don't record streams to disk we should not have a huge number of open files.

Why did that happen, and why does Janus (without stream recording) need so many open files?

Thank You

atoppi commented 1 year ago

In Unix/Linux operating systems, file descriptors are used for any kind of I/O resource:

If you need more details about the fds in use by Janus, check the lsof command.
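To illustrate that network sockets count against the open files limit even when nothing is written to disk, here is a minimal sketch unrelated to the Janus code itself. It temporarily lowers the soft RLIMIT_NOFILE of the current Python process and opens UDP sockets until the kernel refuses; the function name and the limit of 64 are arbitrary choices for the demonstration.

```python
import errno
import resource
import socket


def open_sockets_until_limit(soft_limit):
    """Lower the soft open-files limit, then open UDP sockets until one fails.

    Returns (number_of_sockets_opened, errno_of_the_failure).
    """
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if old_hard != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, old_hard)  # soft may never exceed hard
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, old_hard))
    sockets = []
    failure = None
    try:
        while True:
            # Each socket takes one file descriptor, exactly like an open file.
            sockets.append(socket.socket(socket.AF_INET, socket.SOCK_DGRAM))
    except OSError as exc:
        failure = exc.errno
    finally:
        for s in sockets:
            s.close()
        # Restore the original limit so the rest of the process is unaffected.
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))
    return len(sockets), failure


opened, failure = open_sockets_until_limit(64)
print(opened, "sockets opened before failing with", errno.errorcode.get(failure))
```

The failure is reported as EMFILE ("Too many open files") although no regular file was opened, which is the kind of limit a media server with many sockets can hit.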

Closing the issue now.