corneliusroemer opened 2 months ago
This happened again just now in production when scaling down from 4 to 3 EC2 instances, causing 2 minutes of production downtime (sorry!)
The instances were at ~10% CPU, yet the cluster still locked up. This is likely due to the very high CPU cost of pod startup. We need to see how we can work around this, possibly with CPU limits.
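As a starting point, here is a minimal sketch of what CPU requests and limits could look like on one of our deployments; the `backend` name, image, and all numbers are placeholders, not measured values:

```yaml
# Hypothetical Deployment; the name, image and numbers are illustrative, not tuned.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
        - name: backend
          image: ghcr.io/example/backend:latest   # placeholder image
          resources:
            requests:
              cpu: 250m        # reserve capacity so the scheduler doesn't overpack nodes
              memory: 512Mi
            limits:
              cpu: "1"         # cap the startup CPU spike so one pod can't starve the node
              memory: 1Gi
```

The trade-off is that a hard CPU limit also throttles startup, so restarted pods become ready more slowly; we'd want to measure that before rolling it out.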
For now, whenever we plan to reduce the number of nodes, we must have plenty of spare nodes available. E.g. if we update the Kubernetes version, we should first increase the node count to something like 6, so that we have 5 rather than 3 nodes to soak up the rescheduled pods.
(We now reserve resources for prod, which hopefully largely mitigates this.)
For prod, yes, we're fine as is (at the expense of needing more resources), but not necessarily for the Hetzner CI/dev cluster, where we are more resource-constrained.
We still have some bad patterns to fix: we restart pods too quickly when they get slow, things fail when a dependency is too slow, etc.
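One concrete instance of this is probably probe tuning: if a liveness probe is tight, a pod that is merely slow gets killed, and its expensive restart adds more load. A hedged sketch of looser settings (the health endpoint, port, and thresholds are made-up examples, not our current values):

```yaml
# Illustrative probe settings; endpoint, port and thresholds are assumptions.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 20
  failureThreshold: 6      # tolerate ~2 minutes of slowness before restarting
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30     # allow up to ~5 minutes for the expensive startup phase
```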
As long as we have headroom it's not an issue, but the root cause of the positive-feedback instability is only being worked around, not fixed.
Our current deployment appears to be bistable: once CPU demand rises above a surprisingly low threshold, maybe 30-40%, the cluster enters a locked state where CPU usage goes to 100% and doesn't easily recover.
The reason, I think, involves ArgoCD restarting failed containers: things like prepro/ingest fail when Keycloak is down (we should probably not fail hard but log/error/notify, so as not to make the situation worse; see the sketch below). Restarting a container requires more CPU than keeping it running.
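For the prepro/ingest case specifically, one stopgap short of changing application behaviour could be to wait for Keycloak in an init container, so the pod idles cheaply instead of crash-looping at full startup cost; a rough sketch (the image, service URL, and health path are placeholders for whatever we actually run):

```yaml
# Hypothetical init container; image, Keycloak URL and health path are assumptions.
initContainers:
  - name: wait-for-keycloak
    image: curlimages/curl:8.8.0   # placeholder image that ships curl
    command:
      - sh
      - -c
      - |
        # Poll Keycloak's readiness endpoint instead of letting the main
        # container start, fail, and be restarted at full startup cost.
        until curl -sf http://keycloak:8080/health/ready; do
          echo "keycloak not ready, sleeping 10s"
          sleep 10
        done
```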
We should investigate what the major contributors to the runaway are. Possible mitigations: