Liveness check may need to be desensitized a bit

typokign / matrix-chart

Helm chart for deploying a Matrix homeserver stack

MIT License

Liveness check may need to be desensitized a bit #30

Closed by Routhinator 4 years ago

Routhinator commented 4 years ago

Now that my federation is working thanks to the ingress improvements, my container is crashlooping. I couldn't figure out why at first, but I realized that with all the work it's doing chatting with new federation partners, the container is being killed by Kubernetes because it's not responsive enough:

  Warning  Unhealthy               22m (x4 over 23m)      kubelet, routhio-production-toronto-pool1-3u43e  Liveness probe failed: Get http://10.244.0.15:8008/_matrix/static/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

It might need to be given a failure allowance for these circumstances.

typokign commented 4 years ago

Done

Oof, didn't realize the default timeout was 1 second. That's a bit too low for Synapse. I've upped it to 5 seconds, but you may want to raise or lower it further.
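
For illustration, a minimal sketch of a liveness probe with the longer timeout. The path and port are taken from the failing probe in the log above; the surrounding layout is an assumption and may not match the chart's actual template or values keys.

```yaml
# Illustrative sketch only; the chart's real template may be structured differently.
livenessProbe:
  httpGet:
    path: /_matrix/static/   # endpoint the kubelet was probing in the log above
    port: 8008               # Synapse HTTP port seen in the log above
  timeoutSeconds: 5          # raised from the Kubernetes default of 1 second
```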

Routhinator commented 4 years ago

Yeah, I have been watching this more and you may also want to add

failureThreshold: 3

I think when the server is in a burst of http traffic from federating it may be throttling incoming connections, which causes the connect timeout. I'll try the update you made first though.

typokign commented 4 years ago

The default failureThreshold is already 3 (at least on minikube 1.18, where I'm testing right now). If your cloud provider has overridden it, feel free to change it back.

typokign commented 4 years ago

(Looks like it's also the official default in the API spec)
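
As a reference sketch, here are the probe fields with the Kubernetes API defaults written out explicitly. The path and port come from the log earlier in this thread; everything else is simply the documented default, not a copy of the chart's template.

```yaml
# Probe fields set to the Kubernetes API defaults (shown explicitly for reference).
livenessProbe:
  httpGet:
    path: /_matrix/static/
    port: 8008
  initialDelaySeconds: 0   # start probing as soon as the container starts
  periodSeconds: 10        # probe every 10 seconds
  timeoutSeconds: 1        # each probe must answer within 1 second
  successThreshold: 1      # one success marks the container healthy again
  failureThreshold: 3      # three consecutive failures trigger a restart
```

With these defaults, the kubelet restarts the container after three consecutive failed probes, so roughly 30 seconds of unresponsiveness is enough to cause a restart.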

Routhinator commented 4 years ago

Oh, derp. Sorry. That was right in front of my face on the docs too. :P

Routhinator commented 4 years ago

Stable without a restart for 12 minutes so far. I think that's got it - thanks.