Closed Half-Shot closed 1 year ago
The bodge eventually stopped working, so we ditched the ping check in our fork. Not ideal, but the bridge was effectively crashlooping.
Additionally we had a health check on the service which meant that the pod would not see traffic until /health responded with a 200 status.
That sounds like the real problem, there's no reason to have a health check like that 🤔
Faced the same issue. My workaround was to alter the container's start command to

```yaml
- command:
    - /bin/sh
    - -c
    - sleep 10 && /docker-run.sh
```

to allow Kubernetes networking to catch up, paired with removing the readiness probe entirely. Which obviously isn't ideal, but at least the livenessProbe against `/_matrix/mau/live` still works.
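For reference, a probe like the one described might look roughly like this (the port and timings here are illustrative, not taken from the bridge's docs):

```yaml
livenessProbe:
  httpGet:
    path: /_matrix/mau/live
    port: 9993            # illustrative; use the bridge's actual listener port
  initialDelaySeconds: 5
  periodSeconds: 10
```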
`publishNotReadyAddresses` seems to be the way to tell Kubernetes to publish the pod's address before it passes readiness, so routing isn't delayed.
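A minimal sketch of where that flag goes, assuming a Service fronting the bridge (the name, selector, and port here are made up for illustration):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: matrix-bridge            # hypothetical name
spec:
  publishNotReadyAddresses: true # endpoints are published even while the pod is not Ready
  selector:
    app: matrix-bridge
  ports:
    - port: 9993
      targetPort: 9993
```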
I can confirm this works on my deployment in k8s.
Can confirm as well. But IMHO it's not a clean solution and just a workaround.
Ideally, the existing endpoint `/_matrix/mau/ready` (or whatever) could be used to detect when the container is ready to accept traffic, before any traffic is sent to it. That's the whole point of having a readiness endpoint, no?
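A sketch of what that would look like as a probe, assuming such an endpoint were exposed on the bridge's listener port (the endpoint, port, and timings are assumptions, not confirmed by the bridge's docs):

```yaml
readinessProbe:
  httpGet:
    path: /_matrix/mau/ready   # hypothetical readiness endpoint
    port: 9993                 # illustrative port
  initialDelaySeconds: 1
  periodSeconds: 2
  failureThreshold: 15
```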
The whole readiness concept doesn't really apply to services that can only have one instance at a time. It's not as if the traffic could go anywhere else while it's not ready.
It has a few retries now, although it probably still won't work with the readiness check.
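The retry behaviour described above could be sketched like this (a generic retry-with-delay helper, not the bridge's actual code; the function name, parameters, and example URL are all made up):

```shell
#!/bin/sh
# retry ATTEMPTS DELAY CMD... : run CMD up to ATTEMPTS times,
# sleeping DELAY seconds between failed attempts.
retry() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0          # command succeeded
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1              # all attempts failed
}

# Example (illustrative URL; run against a live bridge):
#   retry 5 2 curl -fsS http://localhost:9993/_matrix/mau/live
```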
We had an outage after updating the bridge today. The new ping check was being executed immediately after the bridge's HTTP listener started, which our Kubernetes stack was not reacting to fast enough. Synapse tried to ping the bridge, but since the Service hadn't yet reacted to the pod starting, the ping immediately failed and the bridge then fataled and exited.
We managed to fix this by bodging in a delay between the HTTP listener starting and the ping check, but that doesn't feel ideal.
Additionally we had a health check on the service which meant that the pod would not see traffic until /health responded with a 200 status. Since the ping check executes so quickly after the HTTP listener starts, there is virtually no chance for kube to check the health and enable routing of traffic.
I feel like there is a better answer than delaying the ping, but ultimately I can't see kube managing to detect the service being up within milliseconds of it starting. It goes without saying that the present behaviour is extremely breaking, as versions prior to v0.8.0 didn't even require the bridge to be routable on startup.