corneliusroemer opened 2 months ago
This happened again just now in production when scaling down from 4 to 3 EC2 instances, causing 2 minutes of production downtime (sorry!)
The instances were at ~10% CPU, yet the cluster still locked up. This is likely due to the very high CPU cost of pod startup. We need to see how we can work around this, possibly with CPU limits.
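As a starting point, here is a minimal sketch of what CPU requests and limits could look like on one of our deployments; the `backend` name, image, and all numbers are placeholders, not measured values:

```yaml
# Hypothetical Deployment; the name, image and numbers are illustrative, not tuned.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
        - name: backend
          image: ghcr.io/example/backend:latest   # placeholder image
          resources:
            requests:
              cpu: 250m        # reserve capacity so the scheduler doesn't overpack nodes
              memory: 512Mi
            limits:
              cpu: "1"         # cap the startup CPU spike so one pod can't starve the node
              memory: 1Gi
```

The trade-off is that a hard CPU limit also throttles startup, so restarted pods become ready more slowly; we'd want to measure that before rolling it out.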
For now, whenever we plan to reduce the number of nodes, we must have plenty of spare nodes available. E.g. if we update the Kubernetes version, we should first increase the node count to something like 6, so that we have 5 rather than 3 nodes to soak up the rescheduled pods.
(We now reserve resources for prod, which hopefully largely mitigates this.)
For prod, yes, we're fine as is (at the expense of needing more resources), but not necessarily for the Hetzner CI/dev cluster, where we are more resource-constrained.
We still have some bad patterns to fix: we restart pods too quickly when they get slow, things fail when a dependency is too slow, etc.
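One concrete instance of this is probably probe tuning: if a liveness probe is tight, a pod that is merely slow gets killed, and its expensive restart adds more load. A hedged sketch of looser settings (the health endpoint, port, and thresholds are made-up examples, not our current values):

```yaml
# Illustrative probe settings; endpoint, port and thresholds are assumptions.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 20
  failureThreshold: 6      # tolerate ~2 minutes of slowness before restarting
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30     # allow up to ~5 minutes for the expensive startup phase
```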
As long as we have headroom it's not an issue, but the root cause of the positive-feedback instability is only being worked around, not fixed.
Our current deployment appears to be bistable: once CPU demand rises above a surprisingly low threshold, maybe 30-40%, the cluster enters a locked state where CPU usage goes to 100% and doesn't easily recover.
The reason, I think, involves ArgoCD restarting failed containers: things like prepro/ingest fail when Keycloak is down (we should probably not fail hard but log/error/notify, so as not to make the situation worse; see the sketch below). Restarting a container requires more CPU than keeping it running.
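For the prepro/ingest case specifically, one stopgap short of changing application behaviour could be to wait for Keycloak in an init container, so the pod idles cheaply instead of crash-looping at full startup cost; a rough sketch (the image, service URL, and health path are placeholders for whatever we actually run):

```yaml
# Hypothetical init container; image, Keycloak URL and health path are assumptions.
initContainers:
  - name: wait-for-keycloak
    image: curlimages/curl:8.8.0   # placeholder image that ships curl
    command:
      - sh
      - -c
      - |
        # Poll Keycloak's readiness endpoint instead of letting the main
        # container start, fail, and be restarted at full startup cost.
        until curl -sf http://keycloak:8080/health/ready; do
          echo "keycloak not ready, sleeping 10s"
          sleep 10
        done
```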
We should investigate what the major contributors to the runaway are. Possible mitigations: