kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

VPA doesn't provide any recommendations when Pod is in OOMKill CrashLoopBackoff right after start #4981

Closed · voelzmo closed this issue 1 year ago

voelzmo commented 2 years ago

Which component are you using?: vertical-pod-autoscaler

What version of the component are you using?:

Component version: 0.10.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version --short
Client Version: v1.24.2
Kustomize Version: v4.5.4
Server Version: v1.23.4

What environment is this in?:

What did you expect to happen?: VPA should be able to help with Pods which are in an OOMKill CrashLoopBackOff and raise Limits/Requests until the workload is running.

What happened instead?: VPA did not give a single Recommendation for a Pod that goes into an OOMKill CrashLoopBackOff right from the start

How to reproduce it (as minimally and precisely as possible): Create a deployment that will be OOMKilled right after starting (the stress container tries to allocate ~100MiB (104858000 bytes) in 10MB chunks against a 20Mi memory limit, so it is killed within a second or two)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: oomkilled
spec:
  replicas: 1
  selector:
    matchLabels:
      app: oomkilled
  template:
    metadata:
      labels:
        app: oomkilled
    spec:
      containers:
      - image: gcr.io/google-containers/stress:v1
        name: stress
        command: [ "/stress"]
        args: 
          - "--mem-total"
          - "104858000"
          - "--logtostderr"
          - "--mem-alloc-size"
          - "10000000"
        resources:
          requests:
            memory: 1Mi
            cpu: 5m
          limits:
            memory: 20Mi

Look at the container status (kubectl describe pod)

(...)
    State:          Waiting                                                                                                                                                                                                                                                                    
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 20 Jun 2022 16:56:47 +0200
      Finished:     Mon, 20 Jun 2022 16:56:48 +0200
    Ready:          False
    Restart Count:  5
(...)

Create a VPA object for this deployment

apiVersion: "autoscaling.k8s.io/v1"
kind: VerticalPodAutoscaler
metadata:
  name: oomkilled-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: oomkilled
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          cpu: 5m
          memory: 10Mi
        maxAllowed:
          cpu: 1
          memory: 5Gi
        controlledResources: ["cpu", "memory"]

VPA does observe the corresponding OOMKill events in the Recommender logs

I0620 14:55:04.340502       1 cluster_feeder.go:465] OOM detected {Timestamp:2022-06-20 14:53:52 +0000 UTC Memory:1048576 ContainerID:{PodID:{Namespace:default PodName:oomkilled-6868f896d6-6vfqm} ContainerName:stress}}
I0620 14:55:04.340545       1 cluster_feeder.go:465] OOM detected {Timestamp:2022-06-20 14:54:08 +0000 UTC Memory:1048576 ContainerID:{PodID:{Namespace:default PodName:oomkilled-6868f896d6-6vfqm} ContainerName:stress}}

The VPA status contains no recommendation (RecommendationProvided is False, recommendation stays empty)

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"autoscaling.k8s.io/v1","kind":"VerticalPodAutoscaler","metadata":{"annotations":{},"name":"oomkilled-vpa","namespace":"default"},"spec":{"resourcePolicy":{"containerPolicies":[{"containerName":"*","controlledResources":["cpu","memory"],"maxAllowed":{"cpu":1,
"memory":"5Gi"},"minAllowed":{"cpu":"5m","memory":"10Mi"}}]},"targetRef":{"apiVersion":"apps/v1","kind":"Deployment","name":"oomkilled"}}}
  creationTimestamp: "2022-06-20T14:54:16Z"
  generation: 2
  name: oomkilled-vpa
  namespace: default
  resourceVersion: "299374"
  uid: f47d84a8-aa6e-4042-b0a4-723888720a9d
spec:
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      controlledResources:
      - cpu
      - memory
      maxAllowed:
        cpu: 1
        memory: 5Gi
      minAllowed:
        cpu: 5m
        memory: 10Mi
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: oomkilled
  updatePolicy:
    updateMode: Auto
status:
  conditions:
  - lastTransitionTime: "2022-06-20T14:55:04Z"
    status: "False"
    type: RecommendationProvided
  recommendation: {}

The VerticalPodAutoscalerCheckpoint doesn't record any measurements (firstSampleStart and lastSampleStart remain null)

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscalerCheckpoint
metadata:
  creationTimestamp: "2022-06-20T14:55:04Z"
  generation: 24
  name: oomkilled-vpa-stress
  namespace: default
  resourceVersion: "304997"
  uid: 127a6331-7d1d-4ea6-b56a-63db3ee07a51
spec:
  containerName: stress
  vpaObjectName: oomkilled-vpa
status:
  cpuHistogram:
    referenceTimestamp: null
  firstSampleStart: null
  lastSampleStart: null
  lastUpdateTime: "2022-06-20T15:18:04Z"
  memoryHistogram:
    referenceTimestamp: "2022-06-22T00:00:00Z"
  version: v3

The Pod in CrashLoopBackOff doesn't have any PodMetrics, presumably because each container instance only lives for about a second before being OOMKilled, so metrics-server never reports usage for it; other Pods do have metrics

k get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/" | jq
{
  "kind": "PodMetricsList",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "metadata": {
        "name": "hamster-96d4585b7-b9tl9",
        "namespace": "default",
        "creationTimestamp": "2022-06-20T15:24:37Z",
        "labels": {
          "app": "hamster",
          "pod-template-hash": "96d4585b7"
        }
      },
      "timestamp": "2022-06-20T15:24:01Z",
      "window": "56s",
      "containers": [
        {
          "name": "hamster",
          "usage": {
            "cpu": "498501465n",
            "memory": "512Ki"
          }
        }
      ]
    },
    {
      "metadata": {
        "name": "hamster-96d4585b7-c44j7",
        "namespace": "default",
        "creationTimestamp": "2022-06-20T15:24:37Z",
        "labels": {
          "app": "hamster",
          "pod-template-hash": "96d4585b7"
        }
      },
      "timestamp": "2022-06-20T15:24:04Z",
      "window": "57s",
      "containers": [
        {
          "name": "hamster",
          "usage": {
            "cpu": "501837091n",
            "memory": "656Ki"
          }
        }
      ]
    }
  ]
}
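
For completeness, the same check can also be done programmatically with the Kubernetes metrics API clientset instead of kubectl get --raw. This is only a minimal sketch, assuming a kubeconfig at the default ~/.kube/config location and the k8s.io/metrics and k8s.io/client-go modules; the crash-looping Pod is expected to be missing from the returned list entirely:

package main

import (
    "context"
    "fmt"
    "path/filepath"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/tools/clientcmd"
    "k8s.io/client-go/util/homedir"
    metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
    // Build a client config from the local kubeconfig (assumed default location).
    kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
    config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
    if err != nil {
        panic(err)
    }
    client, err := metricsclient.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    // Equivalent of `kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods/`:
    // list the PodMetrics currently reported for the default namespace.
    podMetrics, err := client.MetricsV1beta1().PodMetricses("default").List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, pm := range podMetrics.Items {
        for _, c := range pm.Containers {
            fmt.Printf("%s/%s cpu=%s memory=%s\n", pm.Name, c.Name, c.Usage.Cpu().String(), c.Usage.Memory().String())
        }
    }
}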

Anything else we need to know?: On the same cluster, the hamster example works perfectly fine and gets recommendations as expected, so this is not a general issue with the VPA.

Just for fun, I applied this patch, which increases TotalSamplesCount whenever a memory sample (i.e. also an OOMKill sample) is added; afterwards the above Pod gets a recommendation and can run normally, as expected. I understand that the fix cannot be as simple as that, otherwise we would add two samples for every regular PodMetric (which contains both CPU and memory), and existing code presumably assumes otherwise, but this is just to show that TotalSamplesCount seems to be the blocker in this situation.

diff --git a/vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go b/vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go
index 3facbe37e..7accd072e 100644
--- a/vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go
+++ b/vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go
@@ -184,6 +184,7 @@ func (a *AggregateContainerState) AddSample(sample *ContainerUsageSample) {
        case ResourceCPU:
                a.addCPUSample(sample)
        case ResourceMemory:
+               a.TotalSamplesCount++
                a.AggregateMemoryPeaks.AddSample(BytesFromMemoryAmount(sample.Usage), 1.0, sample.MeasureStart)
        default:
                panic(fmt.Sprintf("AddSample doesn't support resource '%s'", sample.Resource))
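
To make the suspected mechanism easier to follow, here is a small self-contained toy model. It is not the actual recommender code; the names only loosely mirror aggregate_container_state.go, and it assumes (as the patch above suggests) that a recommendation is only produced once TotalSamplesCount is greater than zero and that only CPU samples increment that counter. A container that dies before metrics-server can scrape it only ever contributes OOM-derived memory samples and therefore never crosses that threshold:

package main

import "fmt"

// toyAggregateState loosely mirrors the bookkeeping in
// aggregate_container_state.go; it is a simplified illustration only.
type toyAggregateState struct {
    TotalSamplesCount int       // incremented for CPU samples only (current behavior)
    CPUSamples        []float64 // stands in for the CPU histogram
    MemoryPeaks       []int     // stands in for AggregateMemoryPeaks
}

// addCPUSample mimics the CPU path: every sample is counted.
func (a *toyAggregateState) addCPUSample(cores float64) {
    a.CPUSamples = append(a.CPUSamples, cores)
    a.TotalSamplesCount++
}

// addMemorySample mimics the memory path: the peak is recorded,
// but TotalSamplesCount is not touched.
func (a *toyAggregateState) addMemorySample(bytes int) {
    a.MemoryPeaks = append(a.MemoryPeaks, bytes)
}

// hasRecommendation stands in for the assumed gating: an aggregate with
// zero counted samples is treated as empty and produces no recommendation.
func (a *toyAggregateState) hasRecommendation() bool {
    return a.TotalSamplesCount > 0
}

func main() {
    // Pod from this issue: it is OOMKilled about a second after starting,
    // so metrics-server never reports usage and the recommender only sees
    // OOM events, which are fed into the model as memory samples.
    crashLooping := &toyAggregateState{}
    for i := 0; i < 5; i++ {
        crashLooping.addMemorySample(1048576) // "OOM detected ... Memory:1048576"
    }

    // Healthy pod (e.g. the hamster example): regular PodMetrics arrive,
    // so CPU samples are added as well.
    healthy := &toyAggregateState{}
    healthy.addCPUSample(0.5)
    healthy.addMemorySample(512 * 1024)

    fmt.Println("crash-looping pod gets a recommendation:", crashLooping.hasRecommendation()) // false
    fmt.Println("healthy pod gets a recommendation:", healthy.hasRecommendation())            // true
}

Running this prints false for the crash-looping pod and true for the healthy one, which matches the observed behavior above: no recommendation despite the logged OOM events.
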
voelzmo commented 2 years ago

ping @jbartosik, that's what I was mentioning in today's SIG call

mikelo commented 2 years ago

Maybe a.TotalSamplesCount++ should run when an OOM is detected...
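
A rough sketch of that idea, again as a toy model rather than the real VPA code (the recordOOM method name here is made up for illustration; the actual OOM handling lives in the container/cluster state): bump the counter only on the OOM path, so a regular PodMetric, which carries both a CPU and a memory sample, is still counted exactly once.

package main

import "fmt"

// Toy illustration of the suggestion above, not the actual VPA code.
type toyState struct {
    TotalSamplesCount int
    memoryPeaks       []int
}

// addMemorySample: regular memory usage from PodMetrics. Not counted here,
// because the matching CPU sample of the same PodMetric already increments
// the counter.
func (s *toyState) addMemorySample(bytes int) {
    s.memoryPeaks = append(s.memoryPeaks, bytes)
}

// recordOOM: an OOM event has no matching CPU sample, so count it here.
func (s *toyState) recordOOM(bytes int) {
    s.memoryPeaks = append(s.memoryPeaks, bytes)
    s.TotalSamplesCount++
}

func main() {
    s := &toyState{}
    s.recordOOM(1048576)             // OOM-only pod now contributes a counted sample
    fmt.Println(s.TotalSamplesCount) // 1
    s.addMemorySample(512 * 1024)    // regular memory sample still uncounted
    fmt.Println(s.TotalSamplesCount) // still 1
}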

jbartosik commented 2 years ago

I think I saw this problem some time ago, when I was implementing OOM tests for VPA.

The test didn't work if memory usage grew too quickly: pods were OOMing, but VPA wasn't increasing its recommendation.

My plan is:

- Locally modify the e2e test to grow memory usage very quickly and verify that VPA doesn't grow the recommendation.
- Add logging to the VPA recommender to see if it's getting information about OOMs (I think here).
- If we get the information but it doesn't affect the recommendation, debug why (I think this is the most likely case).
- If we don't get the information, read up / ask about how we could get it.
- If the test passes even when it grows memory usage very quickly, figure out how it's different from your situation.

I'll be away for the next 2 weeks. I'll only be able to start doing this when I'm back.

voelzmo commented 2 years ago

Ah, it's good to hear you already saw something similar!

> My plan is:
>
> - Locally modify the e2e test to grow memory usage very quickly and verify that VPA doesn't grow the recommendation.
> - Add logging to the VPA recommender to see if it's getting information about OOMs (I think here).
> - If we get the information but it doesn't affect the recommendation, debug why (I think this is the most likely case).
> - If we don't get the information, read up / ask about how we could get it.
> - If the test passes even when it grows memory usage very quickly, figure out how it's different from your situation.
>
> I'll be away for the next 2 weeks. I'll only be able to start doing this when I'm back.

I can also take some time to do this – I don't think the scenario should be too far away from my repro case above. The modifications to the existing OOMObserver make sense to verify that the correct information is really there. In my repro case above, I thought that seeing the logs here was sufficient evidence that the VPA sees the OOM events with the right amount of memory, and that the fact that adding TotalSamplesCount++ led to the correct recommendation showed that the information in the OOM events was as expected.

voelzmo commented 2 years ago

I adapted the existing OOMKill test so that the Pods run into OOMKills more quickly and eventually end up in a CrashLoopBackOff: https://github.com/kubernetes/autoscaler/pull/5028

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jbartosik commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

voelzmo commented 1 year ago

/remove-lifecycle stale

runningman84 commented 1 year ago

/remove-lifecycle stale