Closed: liugb1029 closed this issue 1 year ago.
Can you give me some more information?
I can see that pods are called hamster. Does it mean you're running vpa-full tests and you're seeing the problem in those tests?
Same observation here. I have a dozen VPAs in my cluster, some of them in Auto update mode. On one VPA (targeting a DaemonSet for a Datadog agent with 4 containers), pods are being evicted by the VPA updater, but the mutating webhook does not patch anything.
There was no such behavior with v0.9.2.
I had some time to gather some data. Let me know if you need me to investigate further.
VPA object:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: datadog
  namespace: monitoring
spec:
  resourcePolicy:
    containerPolicies:
    - containerName: trace-agent
      maxAllowed:
        cpu: 700m
        memory: 700Mi
      minAllowed:
        cpu: 100m
        memory: 100Mi
    - containerName: process-agent
      maxAllowed:
        cpu: 700m
        memory: 700Mi
      minAllowed:
        cpu: 100m
        memory: 100Mi
    - containerName: agent
      maxAllowed:
        cpu: 700m
        memory: 700Mi
      minAllowed:
        cpu: 100m
        memory: 100Mi
  targetRef:
    apiVersion: apps/v1
    kind: DaemonSet
    name: datadog
  updatePolicy:
    updateMode: Auto
Logs (formatted) from the VPA admission controller (verbosity level -v=20):
Sending patches: [
{add /spec/containers/0/resources/requests/cpu 224m}
{add /spec/containers/0/resources/requests/memory 203699302}
{add /spec/containers/0/resources/limits/cpu 1120m}
{add /spec/containers/0/resources/limits/memory 1018496510}
{add /spec/containers/1/resources/requests/cpu 100m}
{add /spec/containers/1/resources/requests/memory 100Mi}
{add /spec/containers/1/resources/limits/cpu 500m}
{add /spec/containers/1/resources/limits/memory 500Mi}
{add /spec/containers/2/resources/requests/cpu 100m}
{add /spec/containers/2/resources/requests/memory 100Mi}
{add /spec/containers/2/resources/limits/cpu 500m}
{add /spec/containers/2/resources/limits/memory 500Mi}
{add /metadata/annotations/vpaUpdates Pod resources updated by datadog: container 0: cpu request, memory request, cpu limit, memory limit; container 1: cpu request, memory request, cpu limit, memory limit; container 2: cpu request, memory request, cpu limit, memory limit}
{add /metadata/annotations/vpaObservedContainers agent, process-agent, trace-agent}]
Screenshot from the API-server audit-log, stored in AWS Cloudwatch:
Exactly the same setup, but just changing the tag of the image used in the "vpa-admission-controller" deployment from 0.10.0 to 0.9.2:
I have enabled the API server logs to see if it would complain about something not expected in the response from the mutating webhook, but nothing showed up as far as I can tell. I searched for "vpa-webhook-config" in those logs.
We observed this issue while testing gardener with VPA 0.10.0 as well. We saw the following error in the kube-apiserver logs:
Failed calling webhook, failing open vpa.k8s.io: failed calling webhook "vpa.k8s.io": converting (v1.AdmissionReview) to (v1beta1.AdmissionReview): unknown conversion
We realized that our vpa-webhook-config MutatingWebhookConfiguration only listed v1beta1 in admissionReviewVersions, but VPA 0.10.0 is returning v1:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: vpa-webhook-config-seed
  ...
webhooks:
- admissionReviewVersions:
  - v1beta1
Once we changed the above version to v1, the errors in the kube-apiserver logs were gone, and the issue itself disappeared.
It seems that 0.10.0 requires changing to (or adding) the v1 admission review version in the MutatingWebhookConfiguration. Perhaps this should be considered a breaking change and documented properly?
/cc @voelzmo
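For readers managing the webhook registration themselves, a minimal sketch of the change described above (the configuration and webhook names follow the ones mentioned in this thread; all other fields are abbreviated and illustrative):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: vpa-webhook-config
webhooks:
- name: vpa.k8s.io
  # VPA 0.10.0 responds with v1 AdmissionReview objects, so v1 must be
  # allowed here; listing it first makes the API server prefer it.
  admissionReviewVersions:
  - v1
  - v1beta1
  # clientConfig, rules, failurePolicy, etc. stay as in the existing
  # registration and are omitted here.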
I guess this is the commit where the relevant change was introduced.
Thank you for investigating this. I'm trying to reproduce the problem.
I tried to reproduce the situation @stoyanr describes in this comment:
And VPA works correctly. The webhook in the cluster is defined with v1, because when the VPA Admission Controller starts it deletes the existing webhook (if there is one) and then creates a new one.
It looks like something else is modifying the webhook after the VPA Admission Controller starts (the Admission Controller would die if it couldn't delete the old webhook or create the new one).
I'll try modifying the webhook after Admission Controller 0.10.0 starts.
I modified the webhook to use the v1beta1 API after the Admission Controller started, and now the e2e tests I'm running on the cluster are failing. I can see that the Updater is evicting pods, recommendations are changing, and the Admission Controller is processing pods, but requests are not changing.
So it looks like something other than the VPA Admission Controller is changing the webhook, and as a result the VPA Admission Controller doesn't work properly.
Another possibility is that you're running the VPA Admission Controller with registerWebhook set to false.
Another possibility is that you're running VPA Admission controller with registerWebhook set to false.
Yes, we are doing that, since we need better control over the webhook registration. The point is that it is currently not obvious from the release notes of 0.10.0 that in this case we must also update the webhook registration to v1 when upgrading to 0.10.0, so we had to find that out the hard way.
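For context, a rough sketch of how self-registration is disabled when the webhook is managed externally, using the admission controller's --register-webhook flag; the names, namespace, and image path in this Deployment excerpt are illustrative rather than taken from this thread:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vpa-admission-controller
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vpa-admission-controller
  template:
    metadata:
      labels:
        app: vpa-admission-controller
    spec:
      containers:
      - name: admission-controller
        image: k8s.gcr.io/autoscaling/vpa-admission-controller:0.10.0
        args:
        # Skip self-registration of the mutating webhook. The
        # MutatingWebhookConfiguration is then owned by an external tool
        # (e.g. a GitOps controller) and must itself list v1 in
        # admissionReviewVersions when running 0.10.0.
        - --register-webhook=false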
I'm using a gitops workflow to deploy and reconcile technical tools on my clusters, so yes, FluxCD is overwriting the mutating admission webhook configuration that VPA has updated on start-up.
I'll change the webhook configuration in the git repository that FluxCD is using. Thanks for the analysis.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Which component are you using?: vertical-pod-autoscaler
What version of the component are you using?:
Component version: 0.10.0
What k8s version are you using (kubectl version)?:
What environment is this in?:
What did you expect to happen?:
What happened instead?:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?: