kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Missing status subresource when using custom VPA recommender with GKE native VPA setup #6828

Closed FrancoisPoinsot closed 5 months ago

FrancoisPoinsot commented 5 months ago

Which component are you using?:

vertical-pod-autoscaler

What version of the component are you using?:

Component version: 1.1.1

What k8s version are you using (kubectl version)?:

kubectl version Output
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.12", GitCommit:"12031002905c0410706974560cbdf2dad9278919", GitTreeState:"clean", BuildDate:"2024-03-15T02:15:31Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.12-gke.1115000", GitCommit:"885327b7f1bebce409c843425b4688e3eeed33f4", GitTreeState:"clean", BuildDate:"2024-03-28T09:16:53Z", GoVersion:"go1.21.8 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}  

What environment is this in?:

GKE

What did you expect to happen?:

I expected I could deploy an instance of a custom recommender and that it would interface nicely with everything else that GKE deploys natively for the VPA.

What happened instead?:

When deploying a custom recommender in GKE with GKE's VPA enabled, the custom recommender failed when attempting to update the status subresource:

recommender.go:128] Cannot update VPA test-francois/test-francois object. Reason: verticalpodautoscalers.autoscaling.k8s.io "test-francois" not found

How to reproduce it (as minimally and precisely as possible):

  1. Have a GKE cluster with GKE's VPA enabled.
  2. Deploy only a VPA recommender (so without the updater or admissionController) with --recommender-name, along with its service account and permissions. I used cowboysysops's helm chart as a base.
  3. Deploy only the CRD that GKE has not deployed: vpaCheckpoints

Anything else we need to know?:

I found a very small difference between the VerticalPodAutoscaler CRD that GKE deploys and the one available in this repo. GKE's has spec.subresources: {}, whereas this repo's has spec.subresources.status: {}

And indeed, editing the CRD to add the status field under subresources solves the problem.
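To make the CRD comparison above concrete, here is a minimal sketch of a check for the status subresource. It assumes the CRD has already been parsed into a Python dict (for example from the JSON output of `kubectl get crd verticalpodautoscalers.autoscaling.k8s.io -o json`). The function name `declares_status_subresource` is hypothetical. It looks in both the top-level `spec.subresources` location quoted in this issue and the per-version `spec.versions[].subresources` location used by `apiextensions.k8s.io/v1` CRDs.

```python
from typing import Any, Dict


def declares_status_subresource(crd: Dict[str, Any]) -> bool:
    """Return True if the CRD dict declares the status subresource anywhere.

    Hypothetical helper for illustration; not part of the VPA project.
    """
    spec = crd.get("spec") or {}

    # Top-level form, as quoted in this issue: spec.subresources.status: {}
    if "status" in (spec.get("subresources") or {}):
        return True

    # Per-version form used by apiextensions.k8s.io/v1 CRDs:
    # spec.versions[].subresources.status: {}
    for version in spec.get("versions") or []:
        if "status" in (version.get("subresources") or {}):
            return True

    return False


# Shapes mirroring the two variants described above (trimmed to the relevant keys).
gke_like = {"spec": {"subresources": {}}}
repo_like = {"spec": {"subresources": {"status": {}}}}

print(declares_status_subresource(gke_like))   # False
print(declares_status_subresource(repo_like))  # True
```

Running a check like this after each cluster upgrade would show whether a manual edit to the CRD has been reverted.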


But here is the issue. I wanted to:

Editing the CRD deployed by GKE sounds unreliable to me, as there is a risk it will be reverted later.

Am I missing some simpler way to deploy a custom recommender in GKE? Or is there a more reliable way to update the CRD that would carry no risk of being reverted?

It isn't obvious to me why this CRD difference causes the issue, though. With GKE's VPA, a status is eventually set on each VPA object. So it looks like declaring status in the subresources shouldn't be mandatory.

FrancoisPoinsot commented 5 months ago

I understand well that the community here is not responsible for GKE's implementation. I am not asking for adding status in GKE's version of the CRD.

I am asking for either guidance, or maybe a fix in-code, if my assumption about "status field shouldn't be mandatory" is true.

FrancoisPoinsot commented 5 months ago

I think the VPA CRD deployed by GKE is just an older version. Probably this one: https://github.com/kubernetes/autoscaler/blame/b7d68c05248fed09bd0758759f70293b104f43ca/vertical-pod-autoscaler/deploy/vpa-v1-crd-gen.yaml

voelzmo commented 5 months ago

Right, the /status subresource was introduced with vpa-1.0. If GKE's CRD doesn't have this, it seems it is based on an older version. For your own deployment, you could switch back to vpa 0.14 then, which is the version right before 1.0.
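As a rough illustration of why the missing subresource surfaces as a "not found" error, here is a toy model of the request path involved. It assumes a VPA 1.0+ recommender writes recommendations through the `/status` endpoint of the object, and that the API server only serves that endpoint when the CRD declares the status subresource. The helper names `vpa_path` and `expected_status_code` are hypothetical, and the second function is a simplification of API server behaviour, not a client implementation.

```python
from typing import Optional

# API group and version of the VerticalPodAutoscaler resource.
VPA_GROUP_VERSION = "autoscaling.k8s.io/v1"


def vpa_path(namespace: str, name: str, subresource: Optional[str] = None) -> str:
    """Build the REST path of a namespaced VPA object, optionally with a subresource."""
    base = (
        f"/apis/{VPA_GROUP_VERSION}/namespaces/{namespace}"
        f"/verticalpodautoscalers/{name}"
    )
    return f"{base}/{subresource}" if subresource else base


def expected_status_code(path: str, crd_has_status_subresource: bool) -> int:
    """Toy model: /status is only routable when the CRD declares the subresource."""
    if path.endswith("/status") and not crd_has_status_subresource:
        return 404  # reported by the client as "... not found"
    return 200


status_path = vpa_path("test-francois", "test-francois", "status")
print(status_path)
print(expected_status_code(status_path, crd_has_status_subresource=False))  # 404
print(expected_status_code(status_path, crd_has_status_subresource=True))   # 200
```

Under this model, a component that writes status through the main resource path is unaffected by the missing subresource, which would be consistent with 0.14 working against the older CRD.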

FrancoisPoinsot commented 5 months ago

For posterity, I have confirmed that any upgrade of the GKE cluster reverts the CRD to its original form. So editing the CRD is definitely a bad idea.

FrancoisPoinsot commented 5 months ago

I confirm that I could get a working custom recommender using GKE's VPA, doing the following:

I am going to close this issue as I don't see anything that can be done on the VPA project side. The only sane thing to do would be to update the CRD deployed by GCP, but I don't know of a good channel for submitting such a request.

marevers commented 5 months ago

@FrancoisPoinsot I know this is closed already, but just as an FYI: I experienced this same issue on AKS (both 1.27 and 1.29). The fix you proposed to use the 0.14.0 image rather than 1.0.0 fixed it for me too.

FrancoisPoinsot commented 5 months ago

I am currently talking with GCP to see if the CRD deployed there can be upgraded. I am fairly sure that, if it ends up happening, it will apply to everyone's clusters and not only mine.

I hadn't faced that issue in Azure, because I am not relying on the native VPA feature there. Thanks for the added info that the same is going on there.

Also: the credit for the fix goes to @voelzmo

FrancoisPoinsot commented 4 months ago

For GKE, here is the public issue tracker that got created as a result: https://issuetracker.google.com/issues/345166946