voelzmo closed this issue 1 year ago.
ping @jbartosik that's what I was mentioning in today's SIG call
maybe a.TotalSamplesCount++ should run when OOM is detected...
I think I saw this problem some time ago, when I was implementing OOM tests for VPA. The test didn't work if memory usage grew too quickly – pods were OOMing but VPA wasn't increasing its recommendation.
My plan is:
1. Locally modify the e2e to grow memory usage very quickly and verify that VPA doesn't grow the recommendation.
2. Add logging to the VPA recommender to see if it's getting information about OOMs (I think here).
3. If we get the information but it doesn't affect the recommendation, debug why (I think this is the most likely case).
4. If we don't get the information, read up / ask about how we could get it.
5. If the test passes even when it grows memory usage very quickly, figure out how it's different from your situation.

I'll be away for the next 2 weeks. I'll only be able to start doing this when I'm back.

Ah, it's good to hear you already saw something similar!
I can also take some time to do this – I don't think the scenario should be too far away from my repro case above. The modifications to the existing OOMObserver make sense to verify that the correct information is really there. In my repro case above, I thought seeing the logs here was sufficient evidence that VPA sees the OOM events with the right amount of memory, and that adding a TotalSamplesCount++ led to getting the correct recommendation showed that the information in the OOM events was as expected.
I adapted the existing OOMKill test so that the Pods run into OOMKills more quickly and eventually end in a CrashLoopBackOff: https://github.com/kubernetes/autoscaler/pull/5028
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Mark this issue as rotten with /lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
Which component are you using?: vertical-pod-autoscaler
What version of the component are you using?:
Component version: 0.10.0
What k8s version are you using (kubectl version)?: (kubectl version output omitted)
What environment is this in?:
What did you expect to happen?: VPA should be able to help with Pods which are in an OOMKill CrashLoopBackOff and raise Limits/Requests until the workload is running.
What happened instead?: VPA did not give a single Recommendation for a Pod that right from the start goes into an OOMKill CrashLoopBackOff
How to reproduce it (as minimally and precisely as possible):
1. Create a deployment that will be OOMKilled right after starting
2. Look at the container
3. Create a VPA object for this deployment
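For step 1, one possible shape of such a deployment (an illustrative sketch, not the original manifest from the issue – the polinux/stress image and the sizes here are assumptions) is a container whose memory limit is well below what its process tries to allocate, so it is OOMKilled right after starting:

```yaml
# Illustrative repro manifest: the container allocates ~100M while the
# limit is 50Mi, so the kernel OOMKills it immediately after start and
# the Pod ends up in a CrashLoopBackOff.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: oom-repro
spec:
  replicas: 1
  selector:
    matchLabels:
      app: oom-repro
  template:
    metadata:
      labels:
        app: oom-repro
    spec:
      containers:
      - name: stress
        image: polinux/stress
        command: ["stress"]
        args: ["--vm", "1", "--vm-bytes", "100M", "--vm-hang", "1"]
        resources:
          requests:
            memory: "20Mi"
          limits:
            memory: "50Mi"
```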
- VPA does observe the corresponding OOMKill events in the Recommender logs
- VPA Status is empty
- VPACheckpoint doesn't record any measurements
- The Pod in CrashLoopBackOff doesn't have any PodMetrics, whereas other Pods do have metrics
The above List call is what the VPA Recommender uses to get metrics for all the Pods; it then increases the TotalSamplesCount for the individual Containers for every CPUSample in that List of PodMetrics. OOMKill events are recorded as MemorySamples, therefore they also don't increase the TotalSamplesCount. This container most likely doesn't get any recommendation, because its TotalSamplesCount is 0.
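The mechanism described above can be sketched as a toy model (assumed names like containerStats, addCPUSample, and addMemorySample – this is not the actual VPA code): only CPU samples bump the counter, so a container that only ever produces OOM-derived memory samples never reaches a countable sample.

```go
// Toy model of the sample counting described above (illustrative names,
// not the real VPA types).
package main

import "fmt"

type containerStats struct {
	TotalSamplesCount int       // incremented only for CPU samples
	MemoryPeaks       []float64 // memory samples, including OOMKill events
}

// addCPUSample models a CPU usage sample arriving via the metrics List call.
func (c *containerStats) addCPUSample(cores float64) {
	c.TotalSamplesCount++
}

// addMemorySample models a memory sample; OOMKill events arrive this way
// and, matching the behavior described above, do NOT bump TotalSamplesCount.
func (c *containerStats) addMemorySample(bytes float64) {
	c.MemoryPeaks = append(c.MemoryPeaks, bytes)
}

// hasRecommendation models the "no samples, no recommendation" gate.
func (c *containerStats) hasRecommendation() bool {
	return c.TotalSamplesCount > 0
}

func main() {
	crashLooping := &containerStats{}
	// The Pod never appears in PodMetrics, so no CPU samples arrive;
	// only OOMKill-derived memory samples do.
	crashLooping.addMemorySample(256e6)
	crashLooping.addMemorySample(512e6)
	fmt.Println(crashLooping.hasRecommendation()) // prints "false"
}
```

This is why the Recommender logs can show OOM events while the Status stays empty: the memory information is there, but nothing ever makes the container look "sampled".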
Seems like others have seen this as well (and tried to resolve this by switching to a different metrics source): https://github.com/kubernetes-sigs/metrics-server/issues/976#issuecomment-1076102124
People really don't want metrics for terminated containers – these things were added intentionally:
- terminated containers are excluded – this resulted in Pods with init-containers not having any metrics when a cAdvisor refactoring included init containers in the kubelet summary API again [1, 2]
- only running Containers are reported

Anything else we need to know?: On the same cluster, the hamster example works perfectly fine and gets recommendations as expected, so this is not a general issue with the VPA. Just for fun, I applied this patch which increases the TotalSamplesCount when a memory sample (i.e. also an OOMKill sample) is added, and afterwards the above Pod gets a recommendation and can run normally – as expected. I understand that the fix cannot be as simple as that – otherwise we would add two samples for every regular PodMetric (which contains both CPU and memory), and existing implementations presumably assume otherwise – but this is just to show that TotalSamplesCount seems to be the blocker in this situation.
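In terms of the toy model above, the experiment amounts to this one-line change (again with assumed names – this is a sketch of the idea, not the actual patch from the issue):

```go
// Sketch of the experimental change: memory samples (and therefore
// OOMKill events) also bump TotalSamplesCount, so an OOM-only container
// passes the "has samples" gate. Illustrative names, not real VPA code.
package main

import "fmt"

type containerStats struct {
	TotalSamplesCount int
	MemoryPeaks       []float64
}

func (c *containerStats) addMemorySample(bytes float64) {
	c.MemoryPeaks = append(c.MemoryPeaks, bytes)
	c.TotalSamplesCount++ // experimental: OOM samples now count too
}

func main() {
	crashLooping := &containerStats{}
	crashLooping.addMemorySample(512e6) // OOMKill-derived sample
	fmt.Println(crashLooping.TotalSamplesCount > 0) // prints "true"
	// Caveat from the text: a regular PodMetric carries both a CPU and a
	// memory sample, so this naive version would count each scrape twice.
}
```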