[VPA] Weird restarts ("multiple configs")

R-Studio commented 1 year ago

Which component are you using?: vertical-pod-autoscaler (recommender, updater & admission-controller)

What version of the component are you using?: v0.13 (Image-Tag: 0.13.0, Fairwirds-Helm Chart: v1.7.2)

Component version: What k8s version are you using (kubectl version)?: OpenShift 4.11, Kubernetes v1.24.12+ceaf338

What environment is this in?: OnPrem VMs

What behaviour did you expect to see?: No unreasonable pod evictions/restarts

What happened instead?: Unseasonable pod evictions/restarts In the following screenshot we can see that VPA set the CPU requests from 0.055 to 0.043 for 2 minutes and then back to 0.055 again: I have noticed the following log, why 5 configs? (I only have one VerticalPodAutoscaler for this deployment deployed) matcher.go:68] Let's choose from 5 configs for pod NAMESPACE/xxx-85b97d6dfc-%

How to reproduce it (as minimally and precisely as possible): I am not sure if it is reproducible

I use the following arguments: Recommender:

v: "4"
target-cpu-percentile: 0.50
pod-recommendation-min-cpu-millicores: 10
pod-recommendation-min-memory-mb: 10
recommendation-margin-fraction: 0.0
memory-saver: true

Updater:

v: "4"
min-replicas: 1

AdmissionController: no arguments than the defaults

R-Studio commented 1 year ago

I noticed that vpa-updater protects the newly created pod only for 1min, why?

voelzmo commented 1 year ago

Hey @R-Studio, thanks for providing some insight into your investigations! I'm not able to answer what's going on completely, so here's just a few pointers to clear the fog step-by-step:

the log statement Let's choose from X configs is coming from the admission-webhook that reacts on CREATE Pod. For each Pod, it is trying to find a matching VPA, and you currently have 5 VPAs in the Pod's namespace (kubectl -n NAMESPACE get vpa should show this)
Looking at the single graph you showed makes it hard to follow what happened. My recommendation is to print additional information into the same diagram, as the updater evicts a Pod in the following scenarios:
- the request is outside the recommended range for at least one container
- the pod lives for at least 12h and the resource diff is >= MinChangePriority
- a vpa scaled container OOMed in less than evictAfterOOMThreshold

In you case, we can probably rule out cases 2 and 3, given that you're only scaling on CPU (you didn't explicitly mention this, but I was assuming it from the way you investigated this) and that the Pod was evicted just minutes before. So most likely the current requests are outside the recommended range for one of the containers in your Pod. To analyze this, it helps to also draw the upper and lower bounds and see when/how a Container's current requests end up being outside the bounds of the new recommendation (which is case 1 above). This would look like this

In this example graph, we see the "current requests" between the "lower bound", and "upper bound", so we're not in case 1. The new recommendation is lower than the current requests, though, and after 12 hours the Pod would be evicted to apply the new recommendation, as there is more than 10% difference.

A graph like this should show you, why your Pod is getting evicted and hopefully explain what happens.

R-Studio commented 1 year ago

@voelzmo thanks for your reply! 👍🏽 First you are right we are only scaling on CPU requests (sorry I forgot to mention this).
Here an example of our VPA resources:

---
apiVersion: "autoscaling.k8s.io/v1"
kind: VerticalPodAutoscaler
metadata:
  name: vpa-argocd-applicationset-controller
  namespace: argocd
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: argocd-applicationset-controller
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          cpu: 10m
        maxAllowed:
          cpu: 3
        controlledResources: ["cpu"]
        controlledValues: "RequestsOnly"

Regarding to Let's choose from X configs: you are absolutely right. I thought vpa finds multiple configurations for the same pod 😄😅
Regarding to more information's in the graph: Unfortunately we are using OpenShift and there is a bug that does not allow us to collect the VPA metrics: https://access.redhat.com/solutions/6301491

Anyway thanks for all your inputs! 👍🏽

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

voelzmo commented 8 months ago

/close /kind support

k8s-ci-robot commented 8 months ago

@voelzmo: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/6010#issuecomment-1926483207): >/close >/kind support Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

voelzmo commented 8 months ago

/remove-kind bug

kubernetes / autoscaler

[VPA] Weird restarts ("multiple configs") #6010