kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

vpa-admission-controller: Wire contexts #6891

Closed: ialidzhikov closed this issue 1 month ago

ialidzhikov commented 4 months ago

Which component are you using?:

vertical-pod-autoscaler

What version of the component are you using?:

Component version: 1.1.2

What k8s version are you using (kubectl version)?:

v1.29

What did you expect to happen?:

Right now, the requests the vpa-admission-controller handles are not contextified. For example, the handler for Pods uses context.TODO() in a few places where the admission-controller makes requests along the way:

Due to these usages of context.TODO(), when the caller (kube-apiserver) cancels the request (because of a client-side timeout), the admission-controller's Pod handler is not notified and continues to process the request even though it has already been cancelled client-side.
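To illustrate the difference, here is a minimal Go sketch, not the actual vpa-admission-controller code: the podHandler type, the namespace, and the Pods().List() call are illustrative assumptions. It contrasts a handler that uses context.TODO() with one that wires the admission request's context through to client-go:

```go
package handler

import (
	"context"
	"net/http"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podHandler is an illustrative stand-in for the webhook's Pod handler.
type podHandler struct {
	client kubernetes.Interface
}

// admitWithTODO mirrors the current behaviour: context.TODO() detaches the
// downstream call from the incoming admission request, so in-flight calls and
// client-side throttling waits are never interrupted, even after the
// kube-apiserver has given up on the webhook call.
func (h *podHandler) admitWithTODO(w http.ResponseWriter, _ *http.Request) {
	_, err := h.client.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

// admitWithRequestContext wires the request's context through instead.
// r.Context() is cancelled when the kube-apiserver closes the connection
// (e.g. after its 10s webhook timeout), so client-go aborts in-flight calls
// and rate-limiter waits with a context error instead of blocking for minutes.
func (h *podHandler) admitWithRequestContext(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 10*time.Second)
	defer cancel()

	_, err := h.client.CoreV1().Pods("default").List(ctx, metav1.ListOptions{})
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}
```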

We recently faced a VPA-related outage (described in https://github.com/kubernetes/autoscaler/issues/6884) where the vpa-admission-controller was client-side throttled due to the low default kube-api-qps/burst settings.

From the logs we can see that it was throttled for more than 50 minutes:

{"log":"Waited for 51m21.05416376s due to client-side throttling, not priority and fairness, request: GET:https://kube-apiserver/apis/monitoring.coreos.com/v1/namespaces/foo/prometheuses/bar/scale","pid":"1","severity":"INFO","source":"request.go:697"}
{"log":"Waited for 51m21.024486679s due to client-side throttling, not priority and fairness, request: GET:https://kube-apiserver/apis/monitoring.coreos.com/v1/namespaces/foo/prometheuses/bar/scale","pid":"1","severity":"INFO","source":"request.go:697"}
{"log":"Waited for 51m20.527328217s due to client-side throttling, not priority and fairness, request: GET:https://kube-apiserver/apis/monitoring.coreos.com/v1/namespaces/foo/prometheuses/bar/scale","pid":"1","severity":"INFO","source":"request.go:697"}
{"log":"Waited for 51m19.975656855s due to client-side throttling, not priority and fairness, request: GET:https://kube-apiserver/apis/monitoring.coreos.com/v1/namespaces/foo/prometheuses/bar/scale","pid":"1","severity":"INFO","source":"request.go:697"}
{"log":"Waited for 51m19.466347921s due to client-side throttling, not priority and fairness, request: GET:https://kube-apiserver/apis/monitoring.coreos.com/v1/namespaces/foo/prometheuses/bar/scale","pid":"1","severity":"INFO","source":"request.go:697"}
{"log":"Waited for 51m18.572692764s due to client-side throttling, not priority and fairness, request: GET:https://kube-apiserver/apis/monitoring.coreos.com/v1/namespaces/foo/prometheuses/bar/scale","pid":"1","severity":"INFO","source":"request.go:697"}

Hence, the vpa-admission-controller currently waits out the client-side throttling (more than 50 minutes) instead of cancelling the request.

Meanwhile, the kube-apiserver cancelled the request after the webhook's configured timeout (10s in our case):

E0604 13:21:25.379831       1 dispatcher.go:214] failed calling webhook "vpa.k8s.io": failed to call webhook: Post "https://vpa-webhook:443/?timeout=10s": context deadline exceeded

What happened instead?:

See above.

How to reproduce it (as minimally and precisely as possible):

Add a long sleep (longer than the kube-apiserver's webhook timeout) to the Pod handler and verify that the admission-controller keeps processing the request after the kube-apiserver has cancelled it client-side, as sketched below.
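A rough sketch of that reproduction idea, assuming a plain net/http handler; the slowAdmit function and the 30s sleep are illustrative, not the project's actual code:

```go
package handler

import (
	"log"
	"net/http"
	"time"
)

// slowAdmit simulates a Pod handler that blocks for longer than the
// kube-apiserver's webhook timeout (10s in the scenario above). After the
// sleep, r.Context() should already be cancelled, which is exactly the
// signal the current handler ignores by using context.TODO().
func slowAdmit(w http.ResponseWriter, r *http.Request) {
	time.Sleep(30 * time.Second)

	if err := r.Context().Err(); err != nil {
		// The kube-apiserver has cancelled the request client-side; a
		// context-aware handler would stop here instead of continuing to
		// issue API calls and sit in client-side throttling queues.
		log.Printf("request already cancelled: %v", err)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```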

Anything else we need to know?:

N/A

Shubham82 commented 4 months ago

/area vertical-pod-autoscaler

voelzmo commented 2 months ago

/triage accepted