Operations MVP: ability to pause reconciliation to allow SRE/dev debug

relyt0925 commented 2 years ago

Hypershift operators must allow isolated control plane restarts and scaling. I've been battling the operators to be able to restart and scale select control plane components. Being able to do this is an important operations tool for debugging problems and fixing clusters. When operating a large-scale offering, there will be various cases over time where we need to pause reconciliation for a period of time so that we have control over debugging actions. For example:

Scaling down components in an outage scenario
Triggering a rolling restart of a specific component as part of a debug operation with a client
Scaling down operators when too much load is present on the cluster and we need to temporarily scale things down to recover from an "overloaded" situation. "Overloaded" in this context means the aggregate load that all the operators put on the Kube-APIServer. I will provide an example of that from our perf tests here.

Concretely, I believe this can be implemented as the ability to pause reconciliation on resources (both the hostedcluster and the hostedcontrolplane). The exact implementation can be discussed.
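To make the request concrete, here is a minimal sketch of what a pause check in a reconcile loop could look like. It is a toy model in Python, not Hypershift code: the `paused_until` field, the `Component` class, and both function names are hypothetical and used only for illustration. The assumption is that a resource carries an optional value that is either empty (not paused), `"true"` (paused indefinitely), or an ISO 8601 timestamp (paused until that time).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Component:
    """Toy stand-in for a control plane deployment (hypothetical)."""
    name: str
    desired_replicas: int  # what the operator wants to run
    actual_replicas: int   # what is currently running


def is_paused(paused_until: Optional[str], now: datetime) -> bool:
    """Return True while reconciliation should be skipped.

    paused_until is a hypothetical field: None or "" means not paused,
    "true" means paused indefinitely, and an ISO 8601 timestamp means
    paused until that time. `now` must be timezone-aware.
    """
    if not paused_until:
        return False
    if paused_until.lower() == "true":
        return True
    # Raises ValueError on a malformed timestamp.
    deadline = datetime.fromisoformat(paused_until)
    if deadline.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        deadline = deadline.replace(tzinfo=timezone.utc)
    return now < deadline


def reconcile(component: Component, paused_until: Optional[str], now: datetime) -> bool:
    """Revert drift unless paused. Returns True if a change was made."""
    if is_paused(paused_until, now):
        # Leave manual changes (scale-downs, restarts) alone.
        return False
    if component.actual_replicas != component.desired_replicas:
        component.actual_replicas = component.desired_replicas
        return True
    return False
```

With this shape, an SRE who scales a component to zero during an incident keeps that state for as long as the pause is in effect, and the operator restores the desired replica count on the first reconcile after the pause expires or is cleared.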

relyt0925 commented 2 years ago

An example from a recent perf test: say we reach a point where the control plane is overloaded. In a recent perf test, we tried to scale to 440 clusters with the maximum number of node pools, and it ended up overloading the API server. Results are shown below.

[Two images of the perf test results]

In this situation, the management cluster's API servers have been repeatedly OOMing due to the load, and have been doing so for multiple hours. If this happened in production, we could not leave things as they are: we would need to be able to pause reconciliation and start scaling down specific components to alleviate the load (potentially internal test clusters, abandoned clusters with no workers, etc.). Currently those changes would just get stomped, and we would be back in the same situation.

We can provide more examples from an operations perspective as well. cc @rtheis

rtheis commented 2 years ago

Hi folks,

There are times when we need to do an etcd recovery for a cluster that requires the entire control plane to be shut down during the recovery. Shutting down the control plane is an easy way to ensure that we stop all requests to etcd during the recovery process.

We've also encountered times when we need to throttle Kubernetes API server requests to protect the cluster from a rogue application so we can recover the control plane. Throttling requires us to reconfigure the API server for a period of time while we recover the cluster.

And finally, this issue initially surfaced when I was trying to fix the control-plane-operator deployment because it was causing the pods to crash. I was only able to do this by stopping the Hypershift operator, which is not an option for us in production because it would cause an outage for our service.

I can provide more details if needed. Please let me know. Thanks.

relyt0925 commented 2 years ago

/assign @relyt0925

relyt0925 commented 2 years ago

finished!

openshift/hypershift #1046