kubernetes / autoscaler

Autoscaling components for Kubernetes

Client-side rate limiting in admission-controller can lead to endless eviction loops #6884

Open voelzmo opened 3 months ago

voelzmo commented 3 months ago

Which component are you using?: vertical-pod-autoscaler

What version of the component are you using?:

Component version: v1.1.2

What k8s version are you using (kubectl version)?:

v1.29

What happened?:

There is a health checking mechanism in the vpa-updater which prevents evictions when the admission-controller is no longer healthy: the admission-controller runs a StatusUpdater and is expected to renew a lease every 10s. The updater checks that the lease is recent enough (not older than 60s) and only then evicts Pods.
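For illustration, a minimal sketch of that freshness check, assuming a coordination/v1 Lease named vpa-admission-controller in kube-system (the lease name shows up in the logs later in this thread); this is not the actual updater code:

```go
// Illustrative sketch only: checks whether the admission-controller's lease
// was renewed recently enough for evictions to be considered safe.
package updater

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const staleLeaseThreshold = 60 * time.Second // a lease older than this blocks evictions

func admissionControllerAlive(ctx context.Context, client kubernetes.Interface) (bool, error) {
	lease, err := client.CoordinationV1().Leases("kube-system").
		Get(ctx, "vpa-admission-controller", metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	if lease.Spec.RenewTime == nil {
		return false, nil
	}
	return time.Since(lease.Spec.RenewTime.Time) < staleLeaseThreshold, nil
}
```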

We just saw an incident where the admission-controller was configured with the default values for client-side rate limiting and ended up being rate limited on retrieving the /scale subresource, with consistent client-side throttling delays far exceeding the configured webhook timeout in the kube-apiserver. During that time the lease was still renewed often enough that the vpa-updater kept evicting Pods. Because the webhook timed out, the replacement Pods came up with their default resource settings, so we ended up in an endless eviction loop.

The fix was to increase the client-side rate limits and keep a keen eye on the metrics for Pod admission times.
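For context on where those limits live: the clients are built from a rest.Config whose QPS and Burst fields feed the shared client-side token bucket. A rough sketch of raising them (the values are illustrative, not the VPA defaults or the exact settings we used):

```go
// Minimal sketch of where client-side limits live in client-go. Every request
// beyond the burst waits in the shared token bucket, which is what produces
// the "Waited for ... due to client-side throttling" log lines.
package clientconfig

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func buildClient() (kubernetes.Interface, error) {
	config, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	// Raising these reduces client-side throttling at the cost of more load
	// on the kube-apiserver (which still applies priority & fairness).
	config.QPS = 50
	config.Burst = 100
	return kubernetes.NewForConfig(config)
}
```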

What did you expect to happen instead?:

We should have a health check mechanism in place which prevents vpa-updater from evicting Pods if the admission-controller cannot do its job correctly, such that we avoid endless eviction loops that require manual intervention.

More context: potentially related is this discussion: Do we even want/need client-side throttling in the VPA? https://github.com/kubernetes/kubernetes/issues/111880

voelzmo commented 3 months ago

/area vertical-pod-autoscaler

raywainman commented 3 months ago

(I'll create a new issue for the problem below if needed but thought I'd add it here as mentioned in SIG Autoscaling today because I think the solution is likely similar)

Another area vulnerable to bad things happening due to the health check + API rate limits is the updater's main loop here:

https://github.com/kubernetes/autoscaler/blob/e08681bd6f1367a6d7fc9131edead31a73dd7e4d/vertical-pod-autoscaler/pkg/updater/main.go#L125-L130

If the RunOnce() function takes too long fetching VPAs, Pods to be evicted, etc., and is unable to make progress with actually evicting any Pods, the health check will happily kill the updater here, and the cluster will get stuck and won't actually evict any Pods.
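To illustrate the failure mode, here is a rough sketch of a timestamp-based liveness check of the kind described (illustrative only, not the actual updater code); the probe only knows when the last RunOnce finished, so a loop that is slow but still progressing is indistinguishable from one that is stuck:

```go
// Illustrative sketch of a timestamp-based liveness check, not the actual
// updater code.
package updater

import (
	"net/http"
	"sync"
	"time"
)

type healthCheck struct {
	mu           sync.Mutex
	lastActivity time.Time
	timeout      time.Duration
}

// UpdateLastActivity is called after each completed RunOnce.
func (h *healthCheck) UpdateLastActivity() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.lastActivity = time.Now()
}

// ServeHTTP fails the liveness probe if RunOnce hasn't completed recently,
// which makes the kubelet restart the updater even when RunOnce is merely
// slow due to API throttling rather than actually stuck.
func (h *healthCheck) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if time.Since(h.lastActivity) > h.timeout {
		http.Error(w, "no recent activity", http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```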

A few thoughts:

adrianmoisey commented 3 months ago

Do we even want/need client-side throttling in the VPA?

Client-side throttling caused us issues in the past. The defaults were set to a very low value in the VPA, and when a large code deployment happened (spanning multiple Deployments), the admission-controller was very slow to respond, causing failed deploys in our CD system.

I watched the sig-autoscaling recording discussing this, and I do agree that there are other potential failure cases to consider, and removing client-side throttling doesn't solve all of them.

But overall, I'm +1 on removing client-side throttling.

vlerenc commented 3 months ago

Client-side throttling may be one cause, and if it is removed/curtailed, that may certainly help. But it's just one cause; tomorrow something else may fail.

The health check that didn't/doesn't do its job right is more important, because it covers more cases, but even that is probably not a truly safe solution. It isn't if it's just an indirect proxy for applying the right recommendations. What I mean: ideally, the eviction part, because it relies on Kubernetes and webhooks afterwards and isn't in full control, should have some form of melt-down protection/circuit breaker that terminates the eviction process. Sure, Pods do not have a trackable identity, but evictions should only continue if Pods come up with their recommended values. If either listing/watching the new Pods fails or they come up with their initial/default requests instead of the recommended values, evictions should cease immediately, as obviously something is off with the webhook. Relying on indirect indicators like the current health check (which is only a proxy and not the real deal here; the real deal is whether the new Pods get their recommended values or not) is problematic/dangerous.
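To make the circuit-breaker idea concrete, a rough sketch (this is not existing VPA code, and the annotation key used as the "webhook touched this Pod" signal is an assumption for illustration):

```go
// Sketch of the circuit-breaker idea described above: after evictions, check
// whether the replacement Pods were actually mutated by the admission
// webhook; if none were, stop evicting until it recovers.
package updater

import (
	corev1 "k8s.io/api/core/v1"
)

// Assumed marker set by the webhook on Pods it mutated; illustrative only.
const vpaUpdatesAnnotation = "vpaUpdates"

// shouldContinueEvicting inspects Pods recreated after recent evictions and
// trips the breaker if none of them show signs of having been mutated.
func shouldContinueEvicting(recreatedPods []*corev1.Pod) bool {
	if len(recreatedPods) == 0 {
		return true // nothing to judge yet
	}
	mutated := 0
	for _, pod := range recreatedPods {
		if _, ok := pod.Annotations[vpaUpdatesAnnotation]; ok {
			mutated++
		}
	}
	// If no replacement Pod carries the webhook's marker, the webhook is
	// likely failing or timing out: cease evictions.
	return mutated > 0
}
```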

What we experienced was a catastrophic failure: not only did the Pods come up with super low initial requests and fail to do their job, they were also evicted continuously (an infinite loop) without any backoff or detection in the VPA. The VPA was running amok until human operators intervened. That's pretty concerning for a component that is used so widely and is practically the de-facto standard for vertical Pod scaling.

ialidzhikov commented 1 month ago

/assign

ialidzhikov commented 1 month ago

Today, I confirmed that the underlying rateLimiter used by the scaleClient and by the statusUpdater is the same. In a local setup I deployed VPA with the following diff under the vendor dir:

diff --git a/vertical-pod-autoscaler/vendor/k8s.io/client-go/rest/request.go b/vertical-pod-autoscaler/vendor/k8s.io/client-go/rest/request.go
index 850e57dae..7c2740a99 100644
--- a/vertical-pod-autoscaler/vendor/k8s.io/client-go/rest/request.go
+++ b/vertical-pod-autoscaler/vendor/k8s.io/client-go/rest/request.go
@@ -622,7 +622,7 @@ func (r *Request) tryThrottleWithInfo(ctx context.Context, retryInfo string) err
        case len(retryInfo) > 0:
                message = fmt.Sprintf("Waited for %v, %s - request: %s:%s", latency, retryInfo, r.verb, r.URL().String())
        default:
-               message = fmt.Sprintf("Waited for %v due to client-side throttling, not priority and fairness, request: %s:%s", latency, r.verb, r.URL().String())
+               message = fmt.Sprintf("Waited for %v due to client-side throttling, not priority and fairness, request: %s:%s, rateLimiter = %p", latency, r.verb, r.URL().String(), r.rateLimiter)
        }

        if latency > longThrottleLatency {

I reproduced the issue again, and the logs reveal the same rateLimiter (0x400011b830):

I0711 12:59:39.180582       1 request.go:629] Waited for 998.800458ms due to client-side throttling, not priority and fairness, request: PUT:https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/vpa-admission-controller, rateLimiter = 0x400011b830
I0711 12:59:53.623347       1 request.go:697] Waited for 6.98511392s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/monitoring.coreos.com/v1/namespaces/shoot--foo--bar/prometheuses/shoot/scale, rateLimiter = 0x400011b830

I looked into this area again because, during the incident on our side, we had only 12 client-side throttling logs for leases over a 12-hour period, while there were 1000 client-side throttling logs for the scale subresource within a 15-minute interval. It is still not clear to me why the difference in the amount of client-side throttling is that big when both clients use the same underlying rateLimiter.


Another thing I tried today was to see whether the underlying informers go out of sync when client-side throttling occurs. If that were the case, my idea was to couple the informers' HasSynced property with the health endpoint (the liveness && readiness probes) of the vpa-admission-controller. I reproduced client-side throttling for 4m+, but the informers' HasSynced property was still true. It would be interesting to see what happens with client-side throttling that waits 10m+, because right now the informers' resync period is 10min. I suspect that then we could have the informers out of sync.

Long story short, this idea was based on:

  1. Informer gets out of sync (eventually)
  2. Probes for the vpa-admission-controller fail
  3. vpa-admission-controller goes into CrashloopBackoff
  4. => the vpa-admission-controller's lease is not updated
  5. => vpa-updater stops evicting

It is also a common pattern in K8s controllers/webhooks to couple the has-synced property of the informers with the health endpoint of the component: for example, see https://github.com/gardener/gardener/blob/10d75787a132bf39c408b74372f5d8c045fa8f4b/cmd/gardener-admission-controller/app/app.go#L108-L110
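A minimal sketch of what that coupling could look like (handler and wiring are illustrative, not actual vpa-admission-controller code):

```go
// Sketch of coupling informer sync state with a health endpoint: the probe
// fails while any informer has not (re)synced, so prolonged API throttling
// would eventually surface as a failing liveness/readiness probe.
package admission

import (
	"net/http"

	"k8s.io/client-go/tools/cache"
)

func newHealthHandler(informersSynced ...cache.InformerSynced) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		for _, synced := range informersSynced {
			if !synced() {
				http.Error(w, "informers not synced", http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	})
}
```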


In an internal discussion with @voelzmo and @plkokanov, we discussed the option of introducing a ratio of failed vs. successful requests in the vpa-admission-controller. The vpa-admission-controller knows and can count when a request succeeds or fails. When the failure rate crosses a configured/configurable threshold, the vpa-admission-controller can stop updating its lease. The doubt I have about this option is what amount of requests to consider as the minimum baseline, because with 2 requests (1 failure, 1 success) it wouldn't be ok to stop updating the lease.
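A rough sketch of that ratio idea, including a minimum-baseline guard (type names and thresholds are illustrative):

```go
// Sketch of the failed-vs-successful request ratio: only stop renewing the
// lease once enough samples have been observed for the ratio to be meaningful.
package admission

import "sync"

type admissionHealth struct {
	mu        sync.Mutex
	successes int
	failures  int

	minSamples   int     // baseline below which we never trip, e.g. 20
	maxFailRatio float64 // e.g. 0.5
}

// Observe is called once per handled admission request.
func (h *admissionHealth) Observe(success bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if success {
		h.successes++
	} else {
		h.failures++
	}
}

// ShouldRenewLease returns false once the failure ratio crosses the
// threshold, but only after the minimum number of requests has been seen,
// so "1 failure out of 2 requests" does not stop lease renewal.
func (h *admissionHealth) ShouldRenewLease() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	total := h.successes + h.failures
	if total < h.minSamples {
		return true
	}
	return float64(h.failures)/float64(total) < h.maxFailRatio
}
```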

ialidzhikov commented 1 month ago

Another way of fixing/mitigating this issue would be to contextify the statusUpdater runs: https://github.com/kubernetes/autoscaler/blob/8bc327e86899cae17b3d7e1170fc10d7be1221a0/vertical-pod-autoscaler/pkg/utils/status/status_updater.go#L48-L61

Right now, there is no timeout/context passed to the statusClient. It runs at every update interval (10s), but a single run can take arbitrarily long until it succeeds/fails. With client-side throttling it succeeds only after minutes, and the lease still gets renewed in the end, just with a huge delay.

The potential fix I am thinking of is to limit the statusUpdater's Run to 10s. With this, on such client-side throttling it won't be able to renew its lease within 10s => the vpa-updater will stop evicting.
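Roughly the shape of the fix I have in mind (an illustrative sketch, not the actual implementation; updateStatus stands in for the real lease renewal call):

```go
// Sketch: bound each status update attempt by the update interval so a
// throttled request cannot keep renewing the lease long after it should
// have been considered failed.
package status

import (
	"context"
	"time"

	"k8s.io/klog/v2"
)

const updateInterval = 10 * time.Second

func runStatusUpdater(ctx context.Context, updateStatus func(context.Context) error) {
	ticker := time.NewTicker(updateInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Each attempt gets at most one interval; a throttled attempt
			// times out instead of renewing the lease minutes later.
			attemptCtx, cancel := context.WithTimeout(ctx, updateInterval)
			if err := updateStatus(attemptCtx); err != nil {
				klog.Errorf("status update failed: %v", err)
			}
			cancel()
		}
	}
}
```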

WDYT?

ialidzhikov commented 1 month ago

I created https://github.com/kubernetes/autoscaler/pull/7036, let me know what you think.

adrianmoisey commented 1 month ago

The potential fix I am thinking of is to limit the statusUpdater's Run to 10s. With this, on such client-side throttling it won't be able to renew its lease within 10s => the vpa-updater will stop evicting.

This seems sane

voelzmo commented 1 month ago

/triage accepted