Closed (emla9 closed this issue 5 months ago)
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
This is interesting and possibly related to #6705.
@emla9: Are you observing the same messages in the recommender logs as described in that issue? We might close this one and move the conversation to #6705 instead, as it has a bit more context already.
Thanks for taking a look at this, @voelzmo. I reran the stress test with VPA 1.0.0 to be sure: there are no `KeyError`s in the recommender logs about `vpa-stress` pods. When we initially noticed this issue with the datadog-agent, we did see some `KeyError`s, but only in 6 out of 62 total OOMs observed over the course of an hour. #6705 possibly contributes to the issue here when OOMs happen in quick succession, but the problem exists even without any `KeyError`s.
Hey @emla9, thanks for checking for those `KeyError`s. As mentioned in #6660, I think I understand how this would solve the issue you're describing here.
Hi guys!
I saw that this was merged into main, but I think it didn't make it into `vpa-recommender:1.1.2`, since it's throwing an error saying `unknown flag: --target-memory-percentile`.
Is there any build/image with these changes yet? Or would I need to build a custom image from main?
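For anyone picking this up later: a hypothetical sketch of how the flag would be wired into the recommender container once you run an image that includes it. The image tag is a placeholder, and the percentile value is illustrative, chosen to be lower than the 90% default discussed below:

```yaml
# Excerpt of a vpa-recommender Deployment container spec (hypothetical).
containers:
- name: recommender
  image: <image-built-from-main>  # placeholder; no published tag with the flag is implied
  args:
  - --target-memory-percentile=0.5  # illustrative value below the 90% default
```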
**Which component are you using?:**

vertical-pod-autoscaler

**What version of the component are you using?:**

Component version: v1.0.0

**What k8s version are you using (`kubectl version`)?:**

**What environment is this in?:**

EKS 1.25
**What did you expect to happen?:**

Expected that VPA would respond to OOM by adjusting the memory recommendation according to the documented formula:
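For reference, the OOM handling in the recommender (going by the upstream VPA source; the exact constants are an assumption from memory, not quoted from this issue) raises an OOMed container's memory sample to roughly `max(memory-used * 1.2, memory-used + 100MB)`, i.e. the observed usage is bumped by at least 20% or 100MB, whichever is larger, before the sample enters the usage histogram.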
**What happened instead?:**
Containers OOM continuously; VPA's recommendation is never adjusted.
**How to reproduce it (as minimally and precisely as possible):**
The problem was originally observed with the `datadog-agent` DaemonSet. Its resource needs can vary by node depending on how many pods are running there and the amount of metrics emitted. Sometimes the gap is quite significant, by a factor of 6 or so. The issue is reproducible with a stress test.

Create a Deployment whose pods allocate a random amount of memory such that ~10% of them should OOM:
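A minimal sketch of such a Deployment. The image, replica count, and resource values here are illustrative assumptions; any memory limit below 650Mi will produce the OOM behavior described below.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vpa-stress
spec:
  replicas: 20
  selector:
    matchLabels:
      app: vpa-stress
  template:
    metadata:
      labels:
        app: vpa-stress
    spec:
      containers:
      - name: stress
        image: polinux/stress  # assumption: any image shipping the stress tool works
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Draw one random byte (0-255) per pod from /dev/urandom.
          R=$(head -c1 /dev/urandom | od -An -tu1 | tr -d ' ')
          # ~10% of pods (R < 26) allocate 650M, above the limit, and get OOMKilled;
          # the rest allocate a random 100-160M and hold it.
          if [ "$R" -lt 26 ]; then MEM=650M; else MEM=$((100 + R % 60))M; fi
          exec stress --vm 1 --vm-bytes "$MEM" --vm-hang 0
        resources:
          requests:
            memory: 200Mi
          limits:
            memory: 512Mi
```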
Most `vpa-stress` pods will allocate between 100-160Mi of memory. The remaining 10% will allocate 650Mi, causing an OOM. If no OOM occurs after a couple of minutes, replace some pods to roll the dice again.

Turn up verbosity on the VPA updater:
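One way to do this, assuming a standard install where the updater runs as the `vpa-updater` Deployment in `kube-system`; `--v=4` is the usual klog verbosity flag:

```sh
kubectl -n kube-system edit deployment vpa-updater
# then add the flag to the updater container's args:
#   args:
#   - --v=4
```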
Create a VPA for the Deployment:
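A minimal VPA object matching the Deployment above (a sketch; `updateMode: "Auto"` lets the updater act on the recommendations):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-stress
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vpa-stress
  updatePolicy:
    updateMode: "Auto"
```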
Expect logs like this from the VPA updater:
**Anything else we need to know?:**
VPA responds to OOM as expected when resource utilization across pods is uniform. As far as I can tell, the issue arises from the fact that only a small percentage of pods actually experience OOM. The target memory percentile of 90% is not affected by relatively infrequent OOM samples: with only ~10% of samples carrying the OOM bump, they all sit above the 90th percentile, so the target estimate barely moves. This means that VPA never recommends a memory increase.
In applying VPA to a DaemonSet such as the `datadog-agent` that may have non-uniform memory usage, I do not expect to reduce resource waste (all pods are subject to the same recommendation), but rather to reduce the toil of manually adjusting the DaemonSet's memory requests and limits as its requirements change with the workloads running on the cluster.