jitsi / jicofo

JItsi COnference FOcus is a server side focus component used in Jitsi Meet conferences.
Apache License 2.0

Temporarily failing bridge healthcheck permanently leaves Jitsi without any operational bridges #1143

Open pbirkants opened 4 months ago

pbirkants commented 4 months ago

Description

The bridge link between Jicofo and JVB is sometimes terminated when the host is under heavy load from other processes. It never recovers afterwards, which prevents any Jitsi calls from working until JVB is manually restarted.

Current behavior

If the healthcheck takes too long, the JVB node is dropped and never resumed, even though it's running fine.

Here are the relevant Jicofo and JVB logs from that time period. Nothing else was recorded before or after this (until JVB was restarted manually).

Jicofo 2024-03-06 03:50:59.501 WARNING: [14] JvbDoctor$HealthCheckTask.doHealthCheck#189: Bridge[jid=jvbbrewery@internal.auth.**REDACTED**/**REDACTED**, version=2.3.67-gb2d4229f, relayId=null, region=null, stress=0.00] health-check timed out, but will give it another try after: 5000
Jicofo 2024-03-06 04:24:29.799 WARNING: [14] JvbDoctor$HealthCheckTask.doHealthCheck#240: Health check failed for: Bridge[jid=jvbbrewery@internal.auth.**REDACTED**/**REDACTED**, version=2.3.67-gb2d4229f, relayId=null, region=null, stress=0.00]: <error xmlns='jabber:client' type='cancel'><internal-server-error xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'/><text xml:lang='en'>Performing a health check took too long: PT3.512705S</text></error>
Jicofo 2024-03-06 04:24:29.836 INFO: [39] JvbDoctor.bridgeRemoved#105: Stopping health-check task for: Bridge[jid=jvbbrewery@internal.auth.**REDACTED**/**REDACTED**, version=2.3.67-gb2d4229f, relayId=null, region=null, stress=0.00]
JVB 2024-03-06 04:24:24.985 SEVERE: [25] HealthChecker.run#181: Health check failed in PT3.512705S: Result(success=false, hardFailure=true, responseCode=null, sticky=false, message=Performing a health check took too long: PT3.512705S)
JVB 2024-03-06 04:24:29.633 WARNING: [6243] XmppConnection.measureDelay#244: Took 171 ms to handle IQ: <iq xmlns='jabber:client' to='jvb@auth.**REDACTED**/-BUTIsDF' from='jvbbrewery@internal.auth.**REDACTED**/focus' id='**REDACTED**' type='get'><healthcheck xmlns='http://jitsi.org/protocol/healthcheck'/></iq>
JVB 2024-03-06 04:25:21.295 INFO: [25] HealthChecker.run#179: Performed a successful health check in PT0.000029S. Sticky failure: false
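To make the timeline in these logs explicit, the gaps between the key events can be computed from the timestamps. This is only a small illustrative sketch: the variable names are made up here, and it assumes both services log from the same clock, which should hold because everything runs on one server.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the log lines above.
jvb_check_failed = datetime.strptime("2024-03-06 04:24:24.985", FMT)
jicofo_removed_bridge = datetime.strptime("2024-03-06 04:24:29.836", FMT)
jvb_healthy_again = datetime.strptime("2024-03-06 04:25:21.295", FMT)

# Time between the failed JVB self-check and the next successful one.
unhealthy_for = (jvb_healthy_again - jvb_check_failed).total_seconds()
# Time between Jicofo stopping its health-check task and JVB being healthy.
recovered_after_removal = (jvb_healthy_again - jicofo_removed_bridge).total_seconds()

print(f"Between failed and successful JVB self-check: {unhealthy_for:.3f} s")
print(f"JVB healthy again after Jicofo stopped checking: {recovered_after_removal:.3f} s")
```

By these timestamps, JVB reported a successful health check roughly 51 seconds after Jicofo stopped its health-check task for the bridge, yet the bridge was not used again until the manual restart.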

Expected Behavior

The bridge connection should be recovered automatically.

Steps to reproduce

I'm not sure how to reproduce this reliably. It has happened two or three times over several months, during the night, when Jitsi is completely idle but other processes running on the host are causing significant system load.

Environment details

All Jitsi components installed locally on a single server, with a single bridge used.

APT package versions (but this has happened with earlier versions, too):

jitsi-meet            2.0.9220-1
jitsi-videobridge2    2.3-67-gb2d4229f-1

damencho commented 4 months ago

Please use the community forum for questions or problems before opening new issues. Thank you.

damencho commented 4 months ago

You can disable the health checks to avoid the bridge being removed if you do not have multiple bridges and autoscaling.

https://github.com/jitsi/jicofo/blob/fb29dc88dd787353c1d994f66c8aea85caa12960/jicofo-selector/src/main/resources/reference.conf#L65

bgrozev commented 4 months ago

Jicofo fails to resume JVB health checks once they fail. This is fine in most environments, where we use sticky-failures=true, which is why we haven't noticed it before.
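For illustration only, the recovery behavior requested in this issue could look roughly like the sketch below. It is not Jicofo's code: `BridgeMonitor`, `probe`, `add_bridge` and `run_checks` are hypothetical names, and the probe results are simulated. The point is that a failed bridge stays on the list of bridges to probe, so a later successful check makes it operational again.

```python
from typing import Callable, Set


class BridgeMonitor:
    """Hypothetical monitor; names are illustrative, not Jicofo's API."""

    def __init__(self, probe: Callable[[str], bool]) -> None:
        self.probe = probe                  # returns True when the bridge is healthy
        self.known: Set[str] = set()        # every bridge that has registered
        self.operational: Set[str] = set()  # bridges currently usable for calls

    def add_bridge(self, jid: str) -> None:
        self.known.add(jid)
        self.operational.add(jid)

    def run_checks(self) -> None:
        # Probe every known bridge, including ones that failed earlier,
        # so a bridge that recovers becomes operational again.
        for jid in sorted(self.known):
            if self.probe(jid):
                self.operational.add(jid)
            else:
                self.operational.discard(jid)


# Simulated probe results for one bridge: healthy, overloaded, healthy again.
results = iter([True, False, True])
monitor = BridgeMonitor(probe=lambda jid: next(results))
monitor.add_bridge("jvb1")

history = []
for _ in range(3):
    monitor.run_checks()
    history.append("jvb1" in monitor.operational)

print(history)  # [True, False, True]
```

The behavior described in this issue corresponds to dropping the bridge from `known` on the first failure, after which no later check can bring it back.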

pbirkants commented 4 months ago

Thank you for reopening this issue.

I'd like to add that I'm using the defaults for any related settings for both Jicofo and JVB, which, I believe, are sticky-failures=false.

Disabling health checks does not seem like a good solution, as that could make it difficult to detect when the bridge is actually down.

0ki commented 2 months ago

This bug affects me too. Is there currently a planned timeline for a fix?

doerofthedo commented 2 weeks ago

It would be wrong to handle this bug by turning off health checks. I see that nobody is assigned to solve this. I'd like to know whether there will be any movement on this in the near future.