Closed (emla9 closed this issue 5 months ago)
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
This is interesting and possibly related to #6705.
@emla9: Are you observing the same messages in the recommender logs as described in that issue? We might close this one and move the conversation to #6705 instead, as it has a bit more context already.
Thanks for taking a look at this, @voelzmo. I reran the stress test with VPA 1.0.0 to be sure: there are no `KeyError`s in the recommender logs about `vpa-stress` pods. When we initially noticed this issue with the datadog-agent, we did see some `KeyError`s, but only in 6 out of 62 total OOMs observed over the course of an hour. #6705 possibly contributes to the issue here when OOMs happen in quick succession, but the problem exists even without any `KeyError`s.
Hey @emla9, thanks for checking for those `KeyError`s. As mentioned in #6660, I think I understand how this would solve the issue you're describing here.
Hi guys!
I saw that this was merged into main, but I think it didn't make it into `vpa-recommender:1.1.2`, since it's throwing an error saying `unknown flag: --target-memory-percentile`.
Is there any build/image with these changes yet? Or would I need to build a custom image from main?
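For anyone picking this up later: a hypothetical sketch of how the flag would be wired into the recommender container once you run an image that includes it. The image tag is a placeholder, and the percentile value is illustrative, chosen to be lower than the 90% default discussed below:

```yaml
# Excerpt of a vpa-recommender Deployment container spec (hypothetical).
containers:
- name: recommender
  image: <image-built-from-main>  # placeholder; no published tag with the flag is implied
  args:
  - --target-memory-percentile=0.5  # illustrative value below the 90% default
```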
**Which component are you using?:**

vertical-pod-autoscaler

**What version of the component are you using?:**

Component version: v1.0.0

**What k8s version are you using (`kubectl version`)?:**

**What environment is this in?:**

EKS 1.25
**What did you expect to happen?:**

Expected that VPA would respond to OOM by adjusting the memory recommendation according to the documented formula:
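For reference, the OOM handling in the recommender (going by the upstream VPA source; the exact constants are an assumption from memory, not quoted from this issue) raises an OOMed container's memory sample to roughly `max(memory-used * 1.2, memory-used + 100MB)`, i.e. the observed usage is bumped by at least 20% or 100MB, whichever is larger, before the sample enters the usage histogram.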
**What happened instead?:**
Containers OOM continuously; VPA's recommendation is never adjusted.
**How to reproduce it (as minimally and precisely as possible):**
The problem was originally observed with the `datadog-agent` DaemonSet. Its resource needs can vary by node depending on how many pods are running there and the amount of metrics emitted. Sometimes the gap is quite significant, by a factor of 6 or so. The issue is reproducible with a stress test.

Create a Deployment whose pods allocate a random amount of memory such that ~10% of them should OOM:
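A minimal sketch of such a Deployment. The image, replica count, and resource values here are illustrative assumptions; any memory limit below 650Mi will produce the OOM behavior described below.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vpa-stress
spec:
  replicas: 20
  selector:
    matchLabels:
      app: vpa-stress
  template:
    metadata:
      labels:
        app: vpa-stress
    spec:
      containers:
      - name: stress
        image: polinux/stress  # assumption: any image shipping the stress tool works
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Draw one random byte (0-255) per pod from /dev/urandom.
          R=$(head -c1 /dev/urandom | od -An -tu1 | tr -d ' ')
          # ~10% of pods (R < 26) allocate 650M, above the limit, and get OOMKilled;
          # the rest allocate a random 100-160M and hold it.
          if [ "$R" -lt 26 ]; then MEM=650M; else MEM=$((100 + R % 60))M; fi
          exec stress --vm 1 --vm-bytes "$MEM" --vm-hang 0
        resources:
          requests:
            memory: 200Mi
          limits:
            memory: 512Mi
```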
Most `vpa-stress` pods will allocate between 100-160Mi of memory. The remaining 10% will allocate 650Mi, causing an OOM. If no OOM occurs after a couple of minutes, replace some pods to roll the dice again.

Turn up verbosity on the VPA updater:
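One way to do this, assuming a standard install where the updater runs as the `vpa-updater` Deployment in `kube-system`; `--v=4` is the usual klog verbosity flag:

```sh
kubectl -n kube-system edit deployment vpa-updater
# then add the flag to the updater container's args:
#   args:
#   - --v=4
```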
Create a VPA for the Deployment:
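A minimal VPA object matching the Deployment above (a sketch; `updateMode: "Auto"` lets the updater act on the recommendations):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-stress
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vpa-stress
  updatePolicy:
    updateMode: "Auto"
```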
Expect logs like this from the VPA updater:
**Anything else we need to know?:**
VPA responds to OOM as expected when resource utilization across pods is uniform. As far as I can tell, the issue arises from the fact that only a small percentage of pods actually experience OOM. The target memory percentile of 90% is not affected by relatively infrequent OOM samples: with only ~10% of samples carrying the OOM bump, they all sit above the 90th percentile, so the target estimate barely moves. This means that VPA never recommends a memory increase.
In applying VPA to a DaemonSet such as the `datadog-agent` that may have non-uniform memory usage, I do not expect to reduce resource waste (all pods are subject to the same recommendation), but rather to reduce the toil of manually adjusting the DaemonSet's memory requests and limits as its requirements change with the workloads running on the cluster.