mautrix / whatsapp

A Matrix-WhatsApp puppeting bridge
https://maunium.net/go/mautrix-whatsapp
GNU Affero General Public License v3.0

Ping check can execute too quickly and cause the bridge to fail to start #621

Closed Half-Shot closed 1 year ago

Half-Shot commented 1 year ago

We had an outage after updating the bridge today. The new ping check was being executed immediately after the bridge started its HTTP listener, and our Kubernetes stack was not reacting fast enough. Synapse tried to ping the bridge, but since the service hadn't yet reacted to the pod starting, the ping failed immediately, and the bridge then hit a fatal error and exited.

We managed to fix this by bodging in a delay between the HTTP listener starting and the ping check, but that doesn't feel ideal.

Additionally, we had a health check on the service, which meant that the pod would not see traffic until /health responded with a 200 status. Since the ping check executes so quickly after the HTTP listener starts, there is virtually no chance for kube to check the health and enable routing of traffic.

I feel like there is a better answer than delaying the ping, but ultimately I can't see kube managing to detect the service being up within milliseconds of it starting. It goes without saying that the present behaviour is a severe breaking change, as versions before v0.8.0 didn't even require the bridge to be routable on startup.

Half-Shot commented 1 year ago

The bodge eventually stopped working and we ditched the ping check in our fork. Not ideal, but basically the thing was crashlooping.

tulir commented 1 year ago

> Additionally we had a health check on the service which meant that the pod would not see traffic until /health responded with a 200 value

That sounds like the real problem, there's no reason to have a health check like that 🤔

tamcore commented 1 year ago

Faced the same issue. My workaround was to alter the container's start command to

  - command:
    - /bin/sh
    - -c
    - sleep 10 && /docker-run.sh

to allow Kubernetes networking to catch up, paired with entirely removing the readiness probe, which obviously isn't ideal. But at least the livenessProbe against /_matrix/mau/live still works.
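
Put together, the container spec for this workaround might look roughly like the fragment below. It is only a sketch, not the exact manifest in use: the container name, port 29318, and the probe timings are assumptions, and the other fields of the pod template are omitted.

```yaml
# Fragment of a pod template (spec.template.spec.containers); other fields omitted.
- name: mautrix-whatsapp          # hypothetical container name
  command:
    - /bin/sh
    - -c
    - sleep 10 && /docker-run.sh  # delay startup so networking can catch up
  # No readinessProbe, so the pod is routable as soon as it is running.
  livenessProbe:
    httpGet:
      path: /_matrix/mau/live
      port: 29318                 # assumed bridge port
    initialDelaySeconds: 15       # assumed; chosen to outlast the 10 s sleep
    periodSeconds: 10             # assumed
```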

tulir commented 1 year ago

publishNotReadyAddresses seems to be the way to tell kubernetes to not be slow
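
For reference, publishNotReadyAddresses is a boolean field on the Service spec. A minimal sketch, assuming a Service named mautrix-whatsapp, a matching app label, and the bridge listening on port 29318 (all three are placeholders to adjust for your deployment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mautrix-whatsapp   # hypothetical name
spec:
  # Publish the pod's address even before its readiness probe passes,
  # so Synapse can reach the bridge as soon as the listener is up.
  publishNotReadyAddresses: true
  selector:
    app: mautrix-whatsapp  # hypothetical label
  ports:
    - name: http
      port: 29318          # assumed appservice port
      targetPort: 29318
```

With this set, any readiness probe on the pod no longer gates traffic through this Service, which is what lets the startup ping from Synapse get through.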

gcarrarom commented 1 year ago

> publishNotReadyAddresses seems to be the way to tell kubernetes to not be slow

I can confirm this works on my deployment in k8s.

tamcore commented 1 year ago

Can confirm as well. But IMHO it's not a clean solution and just a workaround.

Ideally, the existing endpoint /_matrix/mau/ready (or whatever) could be used to detect when the container is ready to accept traffic, before any traffic is sent to it. That's the whole point of having a readiness endpoint, no?

tulir commented 1 year ago

The whole readiness concept doesn't really apply to services that can only have one instance at a time. It's not like that traffic could go anywhere else while it's not ready.

tulir commented 1 year ago

It has a few retries now, although it probably still won't work with the readiness check.