igniterealtime / openfire-pade-plugin

A plugin for Openfire that offers web-based unified communications - chat, groupchat, telephone, audio and video conferencing.
Apache License 2.0

pade audio/video conference call issue #366

Closed paull25 closed 2 years ago

paull25 commented 3 years ago

Hey, hope you guys are doin' fine. I have a single OF 4.6.4 set-up with the latest pade installed. Audio/video conferencing works great with up to 10 participants. When I add more participants, the conference call crashes and the participants auto-rejoin with poor audio quality. Maybe I am missing some configuration here to fine-tune the call experience?

Thanks!

gjaekel commented 3 years ago

Dear @paull25, thank you for using Openfire and the Pàdé plugin. This looks like a resource problem. Could you please provide a "clean" log of the startup (stop Openfire, rotate all.log and start again) and of the participants joining the conference, if possible right up to the "crash"?

Maybe the Java heap of the JVMs is too small, or you're running out of files/sockets on your platform. Do you use Linux?

paull25 commented 3 years ago

Hi @gjaekel ,

I appreciate that you look into my concern. I did configure OF 4.6.4 in a Linux 21.04 box with Java 11.0.11 and mysql 8.0.24.

Btw, is there a known limitation with the number of participants w/ both audio/video turned-on in a pade plugin conference call?

gjaekel commented 3 years ago

Thank you, I will have a look at the log later.

Btw, is there a known limitation ...

There's a lot of rumor and false information about this. In principle, there's no hard limit at all. The usual limiting factors are bandwidth and CPU power, not only on the server side but also on the participants' clients, which have to decode and display all the video streams. Audio transmission is usually no problem at all. To limit the video load, there is the "Last-N" feature, which "pauses" all video streams except the N most recently active ones. On the server side, you may also tune the OS settings for the network stack and the JVM settings.

At our institute, we average more than 200 conferences a day, most with fewer than 5 participants, but also a bunch of bigger ones with 20+. I also take part in events with about 70 participants. My OpenFire instance runs on 20 cores, but up to now the load hasn't even reached 10.

gjaekel commented 3 years ago

Sorry, but this all.log does not cover the startup of Openfire and the Pàdé plugin. From a startup log we would be able to see a bunch of the settings you chose.

In the log you sent, there are a lot of warnings about very high processing delays. Something™ seems to be overloaded, maybe the JVM of the JVB. How many cores and how much memory does the box provide, and how much do you assign to the JVMs? Could you share the output of jps -lv (while Openfire is running)?

paull25 commented 3 years ago

Hi @gjaekel ,

Thanks for your patience. Glad to hear OF with Pàdé plugin works great in action.

My apology since I am new to OF. I collected the logs based on below steps:

gjaekel commented 3 years ago

According to the output of jps, your JVMs for JiCoFo and JVB get 1GB of heap (-Xmx), and the JVM running OpenFire itself is unconfigured and should therefore take the default of 25% of the box's 16GB, i.e. 4GB.
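To make the heap arithmetic above reproducible, here is a minimal Python sketch that extracts -Xmx from a JVM command line and otherwise falls back to a quarter of the physical RAM. The two sample lines are invented for illustration and are not taken from the reporter's jps output; the 25% fallback is the rule of thumb used in this thread, and "the last -Xmx wins" is an assumption about how duplicated options are resolved.

```python
import re

# Size suffixes accepted by -Xmx (case-insensitive); no suffix means bytes.
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}
GIB = 1024**3


def max_heap_bytes(jvm_args: str, physical_ram_bytes: int) -> int:
    """Return the -Xmx value found in jvm_args, or 25% of RAM if none is set."""
    matches = re.findall(r"-Xmx(\d+)([kKmMgGtT]?)(?=\s|$)", jvm_args)
    if matches:
        number, unit = matches[-1]  # assumption: the last -Xmx on the line wins
        return int(number) * _UNITS[unit.lower()]
    return physical_ram_bytes // 4


# Hypothetical `jps -lv` lines, invented for illustration only.
jvb_line = "12345 org.jitsi.videobridge.Main -Xms64m -Xmx1024m"
openfire_line = "12300 org.jivesoftware.openfire.starter.ServerStarter -DopenfireHome=/usr/share/openfire"

print(max_heap_bytes(jvb_line, 16 * GIB) // 1024**2, "MiB")  # explicit -Xmx
print(max_heap_bytes(openfire_line, 16 * GIB) // GIB, "GiB")  # 25% fallback
```

Feeding it the real lines printed by jps -lv shows at a glance which JVMs run with an explicit heap limit and which rely on the default.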

Maybe the log had already been rolled over because the current default of Openfire is still to roll by size? Then this size might be too small for such heavy logging. I don't remember, because I re-configured the log4j2.xml ages ago along these lines ...

<Configuration monitorInterval="10">
    <Appenders>
        <!-- RollingFile name="debug-out" fileName="${sys:openfireHome}/logs/debug.log" filePattern="${sys:openfireHome}/logs/debug.log-%i" -->
        <RollingFile name="debug-out" fileName="${sys:openfireHome}/logs/debug.log" filePattern="${sys:openfireHome}/logs/debug.log.%d{yyyyMMdd}">
            <Policies>
                <TimeBasedTriggeringPolicy interval="1"/>
            </Policies>
            <PatternLayout>
                <Pattern>%d{yyyyMMdd-HHmmss.SSS} %-5p [%.32t] [%c{1.}] %msg%n</Pattern>
            </PatternLayout>
            <Filters>
                <ThresholdFilter level="DEBUG"/>
                <ThresholdFilter level="INFO" onMatch="DENY" onMismatch="NEUTRAL"/>
            </Filters>
        </RollingFile>
[...]

... to provide daily rotation. Note that you may also let Log4J2 compress the archived logfiles if the file pattern ends with .bz2 or similar (refer to the docs). I do this externally for special reasons (i.e. a common log file archive platform).

Here are my option settings for the JVBs (using Java-8 at this time). I break the line at the point where the options for remote JMX access for our monitoring begin; you may leave those out.

* JiCoFo:

-XX:+UseG1GC -Xms64m -Xmx256m -XX:MaxMetaspaceSize=128M -XX:MaxDirectMemorySize=64M -XX:MaxGCPauseMillis=50 -XX:ConcGCThreads=5 -XX:+ParallelRefProcEnabled -XX:ParallelGCThreads=5 -XX:ActiveProcessorCount=20 -XX:+UseStringDeduplication -Djava.io.tmpdir=/var/tmp -Djava.net.preferIPv4Stack=true -Dsun.net.inetaddr.ttl=60 -Dcom.sun.management.jmxremote.port=32199 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=true -Dcom.sun.management.jmxremote.access.file=/opt/jmxremote/jmxremote.access -Dcom.sun.management.jmxremote.password.file=/opt/jmxremote/jmxremote.password.openfireP



Maybe 1GB for JVB is a bit too small. On the other hand, 4GB for _OpenFire_ is a lot if you use it for _Pàdé_ only. I'm using just 512MB (`-XX:+UseG1GC -Xms256m -Xmx512m -XX:MaxMetaspaceSize=128M -XX:MaxDirectMemorySize=512M ...`) here.

gjaekel commented 3 years ago

I studied the log again. It seems that you're losing your participants because of timeouts in the "connectivity tests" done by the JVB. The JVB then shuts down the connections one by one.

The A/V streams use UDP. Maybe the packets are dropped somewhere because of bottlenecks. Please tell me something about your network: Is this all local traffic? What's the bandwidth of the interfaces and the backbone? What's the output of netstat -su, and what are the IP stack settings for the core buffers? I'm using

net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.core.netdev_max_backlog = 30000   
net.core.netdev_budget = 50000
net.core.netdev_budget_usecs = 5000    
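
A quick way to compare a box against the values listed above is to read the same keys from procfs. The following Python sketch assumes a Linux host where sysctls are exposed under /proc/sys; the helper names are made up for this example, and the target numbers are simply the ones quoted above, not a general recommendation.

```python
from pathlib import Path

# Target values copied from the list above (one admin's choice).
WANTED = {
    "net.core.rmem_max": 33554432,
    "net.core.wmem_max": 33554432,
    "net.core.netdev_max_backlog": 30000,
    "net.core.netdev_budget": 50000,
    "net.core.netdev_budget_usecs": 5000,
}


def read_sysctl(name, root="/proc/sys"):
    """Read an integer sysctl from procfs; return None if it is unavailable."""
    path = Path(root, *name.split("."))
    try:
        return int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def below_target(current, wanted=WANTED):
    """Return {name: (current, wanted)} for settings that are missing or lower."""
    return {
        name: (current.get(name), want)
        for name, want in wanted.items()
        if current.get(name) is None or current[name] < want
    }


if __name__ == "__main__":
    current = {name: read_sysctl(name) for name in WANTED}
    for name, (have, want) in below_target(current).items():
        print(f"{name}: current={have} wanted={want}")
```

The script only reports differences; changing the values still has to be done with sysctl or a file under /etc/sysctl.d.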

Maybe you're out of sockets? What's the file descriptor limit for your technical openfire user (ulimit -a for that user)?

paull25 commented 3 years ago

Hi @gjaekel ,

Woahh, a mouthful of info, I need a breather! But hey, this is great and I really appreciate you giving deep technical info which you are always doing in this community.

Going back, you mentioned that "JVM running Openfire is unconfigured". How do I configure the JVM for Openfire? Reviewing these links to configure JVMs: https://discourse.igniterealtime.org/t/jvm-settings-and-debugging/49010 https://discourse.igniterealtime.org/t/increasing-java-memory/49582/6

For logs, I did not include all the logs found in ${sys:openfireHome}, which points to /var/log/openfire and /usr/share/openfire/logs. The log files might have been renamed to xx.log-1 and so on each time I cleared them from the Openfire GUI. Btw, thanks for sharing your log rotation technique, I am taking note of that.

I did not mention that I logged-in 10 test accounts to join the conference call in a single computer to replicate my issue. That served to be the timeout/disconnection/bottleneck you saw in the log that I provided.

Also, the monitoring plugin reported a peak in server traffic every time I encountered the conference call issue. This points me towards tuning my OS network stack as well, as it is all pretty much at the defaults.

gjaekel commented 3 years ago

The first link about configuring the JVM offers a lot of suggestions and information, but you have to be familiar with Java and the JVM to understand it in depth, I think :smile:

Where to add these options depends on your Openfire starter. Because we do a lot with Java and because I'm using Gentoo with OpenRC as init system, I do this on my own. The JVM options are roughly sorted by "importance". The first one chooses a modern garbage collector, which wasn't the default on Java-8. The next ones explicitly configure the heap sizes, and -XX:MaxGCPauseMillis advises the GC to aim for a certain maximum "stop-the-world" time span.

During certain phases of a (full) garbage collection, a JVM has to suspend all threads and therefore will not process any data. For applications like the JVB, this must be avoided as much as possible, because the network buffers fill up in the meantime. With G1, a full GC is needed very seldom under normal circumstances. But if the JVM runs close to "out of heap", the GC will try to free long-unused objects on the main heap, and just before an overflow there is a forced full GC as the last resort.

It's a bit advanced, but you get a very good live impression of the ongoing GC if you use the (J)VisualVM tool and its Visual GC plugin.

gjaekel commented 3 years ago

What's your country / timezone? In my experience, such an issue is hard to catch from post-mortem information. As an option, we may meet for a "debugging session" via your or my OpenMeeting. In that case, you may contact me via the email mentioned in my Github profile. We may even use a shared tmux session provided through ssh or the smart ttyd (https://github.com/tsl0922/ttyd).

paull25 commented 3 years ago

Hi @gjaekel ,

Sorry for getting back this late, I was out for 4 days. Anyway, I was able to configure my OF box based on the settings you provided, but apparently this has little to no effect, at least in my environment. The changes that I have made are as follows:

image

The network traffic in OF gui stats peaks high when I encounter the issue. image

This is my current OF environment. I have a single OF instance serving IM (spark client) and A/V conferencing (Pàdé) expecting 1000 users. Most users are accessing via VPN.

Are you running an OF cluster? To further troubleshoot, I am thinking to add more OF instances to split the network traffic/compute resource and check the perf from there.

Thanks again @gjaekel !

gjaekel commented 3 years ago

Dear @paull25, serving this number of users is a respectable use case. No, I don't run a cluster and I don't know anybody who does (@deleolajide : Do you?). To my knowledge, clustering of OpenFire itself (i.e. the XMPP component) is well established.

But the Jitsi components (Conference Focus and Video Bridge) use their own protocols. For a really high number/size of conferences, you will first need a cluster of JVBs if you are not able to scale up the number of cores and the network bandwidth of your current box.

All JVBs have to register at the JiCoFo, and this component will control how the traffic is spread. At the moment, there's no support in the Pàdé plugin for running more than one JVB on a local JVM. But it's no rocket science, because the plugin "just" spawns an external JVM, collects its console output and passes it to the OpenFire log.

On Unix, instead of a direct invocation, a shell script may spawn more than one JVB -- also on further boxes, via ssh.

Again, please provide more information about the sizing of your platform. All we know so far is that it provides 4 cores and 16GB RAM. And you probably run the database on the same box. I would estimate that you need about one core per 10 participants in a video conference, but much less for audio. For this reason, you have to ask yourself how many of your 1000 users will use A/V conferences at the same time at peak.

What's the network bandwidth of your box, of the internet connection, and of your VPN facility and firewall?
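
The rule of thumb of roughly one core per 10 video participants can be turned into a small back-of-the-envelope calculation. This Python sketch is only an illustration of that estimate: the function name is made up, and the 10% peak concurrency used in the example is a purely hypothetical figure, since the thread does not say how many of the 1000 users are in video calls at the same time.

```python
import math


def estimate_video_cores(video_participants: int, participants_per_core: int = 10) -> int:
    """Rough core count for bridging video, using the ~10 participants/core estimate."""
    return math.ceil(video_participants / participants_per_core)


total_users = 1000
peak_share = 0.10  # hypothetical: 10% of all users in video conferences at peak
peak_participants = round(total_users * peak_share)

print(peak_participants, "participants ->", estimate_video_cores(peak_participants), "cores")
```

Plugging in a measured peak instead of the assumed 10% gives a first idea of whether a 4-core box can be expected to cope.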

gjaekel commented 3 years ago

The file handle limit for your OpenFire user (at least, but probably for all users) is far too low! Add something like

*       hard    nofile      65536
*       soft    nofile      65000

to /etc/security/limits.conf on servers for technical users. You may replace * (defining the default) by account names to assign individual limits.
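
To verify that a new limit has actually taken effect, the limits can be read from inside a process started as the Openfire user. This is a minimal Python sketch using the standard resource module (Unix only); the threshold of 65000 mirrors the soft limit suggested above, and the helper names are invented for this example.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def is_too_low(soft, wanted_soft=65000):
    """True if the soft limit is finite and below the wanted value."""
    return soft != resource.RLIM_INFINITY and soft < wanted_soft


if __name__ == "__main__":
    soft, hard = nofile_limits()
    status = "too low" if is_too_low(soft) else "ok"
    print(f"nofile soft={soft} hard={hard} -> {status}")
```

Run it as the same account that starts Openfire, because the limits are per user and per session; a login shell of another user may show different numbers.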

gjaekel commented 3 years ago

Here are some recent statistics of a day with two "really big" meetings. I myself took part in the event at 13 o'clock; this was a lecture with 3 speakers and about 80 listeners. Most of the listeners had their video muted, therefore the task was to bridge about 20 to 80 video streams. image

To my knowledge, the "unit" of "JVB stress" is said to be calibrated on a typical "8-core cloud server". The peak at about 1.5 corresponds to a simultaneously monitored CPU load of 600% on my box.

paull25 commented 3 years ago

Hi @gjaekel ,

Again, thank you for your support. I have sent you an email to discuss further my current OF setup.

Thanks again!

deleolajide commented 2 years ago

I am closing this issue. My advice is to force VP9 codec in Pade and reduce bandwidth. Otherwise, try clustering with multiple Openfire servers with multiple Jitsi Video-bridges and a single focus user.