kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

vpa: configurable MutatingWebhookConfig `failurePolicy` #7180

Closed. samcday closed this issue 2 months ago.

samcday commented 2 months ago

Which component are you using?: vertical-pod-autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

I am operating an autoscaling cloud cluster where the k8s control-plane and VPA components run in a parent cluster. I have configured the kube-scheduler for this cloud cluster with a MostAllocated scoring strategy, in order to densely pack my workloads into as few cloud nodes as possible.
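For reference, a minimal sketch of such a kube-scheduler configuration, assuming the `kubescheduler.config.k8s.io/v1` config API; the resource weights are illustrative:

```yaml
# Hedged sketch: KubeSchedulerConfiguration with the MostAllocated scoring
# strategy, so nodes that are already heavily utilised score higher.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```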

~~Because I'm lazy~~ Because memory is typically the resource under pressure for my use case, and because the memory usage of many workloads can often change depending on time of day and/or over the course of many days, I don't bother setting any memory resource requests. Instead I rely on VPA to determine and set this.
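As a sketch of what that looks like per workload (the workload name is illustrative; `updateMode: Auto` lets the VPA apply the requests it recommends):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-workload            # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-workload
  updatePolicy:
    updateMode: Auto           # recommender computes, updater/admission-controller applies
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"]   # only memory is managed here
```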

The issue is that if the VPA admission controller is unable to handle the admission request when a Pod is created, that Pod is scheduled with no (or incorrect) resource requests. In extreme cases, for example if I've just done a cold-start of the cloud cluster (from 0 nodes to >0), something of a "thundering herd" of a dozen workloads that typically need 4GB of RAM gets scheduled onto a single node (MostAllocated scheduling strategy, remember) that has 8GB of RAM. Predictably tragic results follow.

Describe the solution you'd like.:

I would like to be able to run vpa-admission-controller with --webhook-failure-policy=Fail, since I know that I have scheduled 2 vpa-a-c Pods with a PDB. So I would prefer that in the unlikely event no vpa-a-c is available, no pods are permitted to be scheduled in the dependent cloud cluster.
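For context, a hedged sketch of the PDB mentioned above; the namespace and label selector are assumptions and need to match the actual vpa-admission-controller Deployment:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vpa-admission-controller
  namespace: kube-system          # assumed install namespace
spec:
  minAvailable: 1                 # keep at least one of the two replicas serving the webhook
  selector:
    matchLabels:
      app.kubernetes.io/name: vpa-admission-controller   # assumed label
```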

Describe any alternative solutions you've considered.:

I could start managing the MutatingWebhookConfig myself and configure vpa-admission-controller with --register-webhook=false. (I was actually slightly surprised that the cowboysysop VPA chart was not already doing this... but I digress.)
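A rough sketch of what that self-managed MutatingWebhookConfiguration could look like, assuming the usual `vpa-webhook` Service in `kube-system`; the names, rules, and CA bundle are placeholders:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: vpa-webhook-config
webhooks:
  - name: vpa.k8s.io
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail            # reject Pod creation if no vpa-a-c replica can answer
    clientConfig:
      service:
        name: vpa-webhook          # assumed Service name
        namespace: kube-system     # assumed namespace
        port: 443
      caBundle: "<base64-encoded CA bundle>"   # placeholder
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
```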

adrianmoisey commented 2 months ago

/area vertical-pod-autoscaler

adrianmoisey commented 2 months ago

I'm fine with this idea. It does allow for a footgun, though: the webhook may fail on a VPA-related pod. But that can be mitigated by using the feature that allows ignoring certain namespaces.

If others agree, I'm happy to build this feature

adrianmoisey commented 2 months ago

cc @voelzmo @kwiesmueller @raywainman

samcday commented 2 months ago

> I'm fine with this idea. It does allow for a footgun, though: the webhook may fail on a VPA-related pod. But that can be mitigated by using the feature that allows ignoring certain namespaces.

Ah, hmm, good point :sweat_smile:

MutatingWebhookConfigs also support an objectSelector which could be used to filter out the vpa-admission-controller pods. That would mean vpa-a-c pods need to know about their own labels though...
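As a hedged excerpt of that idea (the label key/value is an assumption and would have to match whatever the vpa-a-c pods actually carry):

```yaml
# Excerpt of a webhook entry in the MutatingWebhookConfiguration:
# skip the admission-controller's own pods even with failurePolicy: Fail.
# Pods without this label at all are still sent to the webhook.
webhooks:
  - name: vpa.k8s.io
    failurePolicy: Fail
    objectSelector:
      matchExpressions:
        - key: app.kubernetes.io/name
          operator: NotIn
          values: ["vpa-admission-controller"]   # assumed label on vpa-a-c pods
```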

FWIW, this specific issue wouldn't affect my particular use case, because I'm deploying VPA components in cluster A, but they're configured with a --kubeconfig pointing at cluster B. So I think it might also be reasonable for an initial implementation to just ignore the problem and add a bit of documentation that warns of this pitfall.

adrianmoisey commented 2 months ago

> MutatingWebhookConfigs also support an objectSelector which could be used to filter out the vpa-admission-controller pods. That would mean vpa-a-c pods need to know about their own labels though...

I recently added a way to ignore certain namespaces, see https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.2.0/vertical-pod-autoscaler/pkg/admission-controller/main.go#L80-L81

I was thinking of adding documentation warning that anyone who uses --webhook-failure-policy=Fail needs to be careful and possibly ignore the VPA namespace too, using --ignored-vpa-object-namespaces
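Roughly, on the admission-controller container that could end up looking like the excerpt below. The `--webhook-failure-policy` flag is the one proposed in this issue, so its exact name and shape aren't final; the image tag and namespace are illustrative:

```yaml
# Excerpt of the vpa-admission-controller Deployment's pod template.
containers:
  - name: admission-controller
    image: registry.k8s.io/autoscaling/vpa-admission-controller:1.2.0   # assumed image/tag
    args:
      - --webhook-failure-policy=Fail                  # proposed in this issue
      - --ignored-vpa-object-namespaces=vpa-system     # existing flag; namespace illustrative
```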

samcday commented 2 months ago

> I recently added a way to ignore certain namespaces, see https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.2.0/vertical-pod-autoscaler/pkg/admission-controller/main.go#L80-L81

Ah-ha. I didn't realize this was already added. An unvoiced concern from my previous message was that I didn't want to impose more work by adding such a flag. But it already exists. And so ...

> I was thinking of adding documentation warning that anyone who uses --webhook-failure-policy=Fail needs to be careful and possibly ignore the VPA namespace too, using --ignored-vpa-object-namespaces

... Perfect :+1:

adrianmoisey commented 2 months ago

/assign

raywainman commented 2 months ago

There are a few things I'd like to clarify:

  1. Here we are concerned about the case when no vpa-admission-controller replicas are available, correct? In this case, we want all non-system pods (like vpa-admission-controller) to fail the webhook but still allow system pods to come up (effectively by bypassing the hook).

  2. What is the expected behavior after the pods fail the webhook? Would they retry at some point or do they just sit in a bad state?

  3. As a follow up to 2 - there are a variety of reasons the webhook can fail, for example when API Server is overloaded, do we really want to stop a pod from getting scheduled in that case?

samcday commented 2 months ago

> Here we are concerned about the case when no vpa-admission-controller replicas are available, correct?

Right.

> In this case, we want all non-system pods (like vpa-admission-controller) to fail the webhook but still allow system pods to come up (effectively by bypassing the hook).

Potentially, no. Since vpa-a-c itself only really needs an API Server connection to read out VPA .status.recommendations, you can make the case that you only need to exclude these particular vpa-a-c pod(s) from the mutating webhook config. It's just that this is slightly tricky to do, so the easiest way is to filter out a whole namespace (and just put the VPA components in there).

> What is the expected behavior after the pods fail the webhook? Would they retry at some point or do they just sit in a bad state?

Depends on who's scheduling the Pod. In the case of ReplicaSet-controlled Pods (managed by kube-controller-manager), there's a retry with backoff, I believe? (Plus the failures are noted as Events on the ReplicaSet.)

> As a follow up to 2 - there are a variety of reasons the webhook can fail, for example when API Server is overloaded, do we really want to stop a pod from getting scheduled in that case?

By default, absolutely not. So the current behaviour of Ignore should be the default with this new flag. Only when operators are sure of their advanced use case should they reach for this flag and set it to Fail. One such use case is the example I outlined in the original issue description.

raywainman commented 2 months ago

Got it - thanks! Seems reasonable to introduce this as a "power" feature with a bunch of warnings.

@voelzmo - curious what you think?

samcday commented 2 months ago

Thanks @adrianmoisey :cookie: :cookie: :cookie: I'll keep an eye out for the next VPA release and look forward to making use of this.

voelzmo commented 1 month ago

Seems like I missed this interesting option during my vacation. Just following up on this now. @samcday out of curiosity: what's your approach to the question from @raywainman above?

> What is the expected behavior after the pods fail the webhook? Would they retry at some point or do they just sit in a bad state?

So Pods couldn't get scheduled because the webhook failed – what happens next in your case where the cluster wakes up from hibernation?

samcday commented 1 month ago

@voelzmo in my specific case, this cannot happen. I have two distinct Kubernetes clusters, let's call them cluster Boop and cluster Meep. The k8s control plane for Meep is running on Boop. The only visible k8s Node objects in Meep are the cloud instances that are autoscaled (by cluster-autoscaler running on Boop). Similarly to the k8s-cp + cluster-autoscaler, I am running the VPA components for Meep on Boop. Thus, there's no "bootstrapping" or "chicken/egg" problem here.

That said, this issue can also be worked around in a single cluster by using the recently-introduced --ignored-vpa-object-namespaces option. In this setup, the VPA components live in their own namespace, let's call it vpa-system, and the admission webhook is configured with --ignored-vpa-object-namespaces=vpa-system. This guarantees that the webhook does not run for the pod(s) that power said webhook. That makes it feasible to use the Fail webhook policy to guarantee that pods cannot schedule without VPA recommendations applied, with the caveat/tradeoff that the VPA components themselves cannot have VPA recommendations at all.
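Conceptually, the guarantee boils down to the webhook never being asked about its own namespace, along the lines of the excerpt below. Whether --ignored-vpa-object-namespaces wires this up on the registered webhook or the selector has to be managed by hand depends on the VPA version, so treat this as a sketch:

```yaml
# Excerpt of a webhook entry: fail closed everywhere except the namespace
# that hosts the VPA components themselves.
webhooks:
  - name: vpa.k8s.io
    failurePolicy: Fail
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name    # set automatically on every Namespace
          operator: NotIn
          values: ["vpa-system"]
```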

samcday commented 1 month ago

I can't help but wonder if maybe this is something that could be improved in the apiserver. Specifically, it'd be cool if admission webhooks could be configured with a Fail policy, but the apiserver knew to relax it back to Ignore specifically for the pods that match the service selector the admission webhook was configured with. That would resolve the (admittedly somewhat niche) bootstrapping issue we've been discussing here.

If I get some agreement from folks in this thread that such a thing sounds useful, I'll probably be motivated enough to open an issue upstream ;)