bskiba opened this issue 4 years ago
@bskiba any updates on this?
This would be awesome!! How can I help?
I've been digging around the code for a bit; this is what I understand so far. Please correct me where I'm wrong 😃
To answer this:
how recommendations are calculated out of raw samples for CPU and Memory.
Recommendations are calculated using a decaying histogram of weighted samples from the metrics server, where newer samples are assigned higher weights; older samples decay and therefore affect the recommendation less and less. CPU is calculated using the 90th percentile of all CPU samples, and memory is calculated using the 90th percentile of memory peaks over the 8-day window.
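For intuition, here is a rough sketch (my own simplification, not the actual recommender code) of how a decayed sample weight behaves, assuming a 24h half-life relative to a reference timestamp; the function and variable names are made up for illustration:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// decayedWeight is a simplified illustration (not the VPA implementation):
// a sample's base weight is multiplied by 2^((sampleTime-referenceTime)/halfLife),
// so newer samples count exponentially more than older ones.
func decayedWeight(baseWeight float64, sampleTime, referenceTime time.Time, halfLife time.Duration) float64 {
	exponent := sampleTime.Sub(referenceTime).Hours() / halfLife.Hours()
	return baseWeight * math.Pow(2, exponent)
}

func main() {
	ref := time.Now()
	halfLife := 24 * time.Hour // assumed decay half-life for this sketch

	fmt.Printf("fresh:      %.4f\n", decayedWeight(1.0, ref, ref, halfLife))
	fmt.Printf("1 day old:  %.4f\n", decayedWeight(1.0, ref.Add(-24*time.Hour), ref, halfLife))
	fmt.Printf("7 days old: %.4f\n", decayedWeight(1.0, ref.Add(-7*24*time.Hour), ref, halfLife))
	// A week-old sample contributes roughly 1/128 of a fresh one,
	// which is why the recommendation tracks recent usage.
}
```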
when it is reasonable to expect a stable recommendation for a new workload
8 days of history is used for the recommendation (1 memory usage sample per day). Prometheus can be used as a history provider in this calculation. By default, the VPA recommender collects data about all controllers, so when new VPA objects are created they can already provide stable recommendations (unless you specify memory-save=true). All active VPA recommendations are checkpointed.
Please note that VPA recommendation algorithm is not part of the API and is subject to change without notice
saw this in the code here 🙃
...I'm not sure if it's possible to get a "stable" recommendation before 8 days...
@yashbhutwala Great Summary!!
I am getting a huge upper bound for my recommendation at the startup and trying to understand the behavior.
Below is the VPA object.
```
Name:         xxx-vpa
Namespace:    xxxx
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"autoscaling.k8s.io/v1beta2","kind":"VerticalPodAutoscaler","metadata":{"annotations":{},"name":"kube-apiserver-vpa","namesp...
API Version:  autoscaling.k8s.io/v1
Kind:         VerticalPodAutoscaler
Metadata:
  Creation Timestamp:  2020-04-30T20:39:43Z
  Generation:          165
  Resource Version:    1683082
  Self Link:           xxxx
  UID:                 <some_number>
Spec:
  Resource Policy:
    Container Policies:
      Container Name:        c_name
      Controlled Resources:  cpu memory
      Min Allowed:
        Cpu:           100m
        Memory:        400Mi
      Container Name:  c_name2
      Mode:            Off
  Target Ref:
    API Version:  apps/v1beta2
    Kind:         StatefulSet
    Name:         name
  Update Policy:
    Update Mode:  Auto
Status:
  Conditions:
    Last Transition Time:  2020-04-30T20:40:07Z
    Status:                True
    Type:                  RecommendationProvided
  Recommendation:
    Container Recommendations:
      Container Name:  c_name
      Lower Bound:
        Cpu:     100m
        Memory:  400Mi
      Target:
        Cpu:     125m
        Memory:  903203073
      Uncapped Target:
        Cpu:     125m
        Memory:  903203073
      Upper Bound:
        Cpu:     2967m
        Memory:  16855113438
Events:  <none>
```
I don't want to set an upper limit in the VPA object. I don't have checkpoints, since history is loaded from the Prometheus server. But I noticed this huge upper bound (the numbers may differ slightly) irrespective of whether I load from a checkpoint or from Prometheus. Can you tell me why the algorithm gives such a high upper bound?
Also, there are no OOM events in the VPA recommender logs.
I did the same experiment without the Prometheus server and got similar numbers. I checked the VPA checkpoint:
"status": { "cpuHistogram": { "bucketWeights": { "1": 10000, "10": 728, "11": 1088, "12": 121, "2": 2891, "3": 1009, "4": 686, "5": 240, "8": 41, "9": 5436 }, "referenceTimestamp": "2020-05-01T00:00:00Z", "totalWeight": 51.24105164098685 }, "firstSampleStart": "2020-04-30T20:39:30Z", "lastSampleStart": "2020-04-30T22:11:02Z", "lastUpdateTime": null, "memoryHistogram": { "referenceTimestamp": "2020-05-02T00:00:00Z" }, "totalSamplesCount": 552, "version": "v3" }
The surprising thing is that there is no memory histogram. Is this because it will only appear after 24 hours?
I deleted the VPA object and the checkpoint and then restarted the VPA, but I am still getting huge upper bounds 2 hours after startup. How is it recommending memory without any histogram?
Can you please answer this?
@djjayeeta good questions!! I'm not an expert here, but as far as I understand it, the most important value for you to look at is Target. This is the recommendation for what to set the requests to. VPA currently does not give a limit recommendation.
Lower and upper bound are only meant to be used by the VPA updater: a pod is allowed to keep running and is not evicted as long as its requests fall within that range. For the upper bound, I suspect this defaults to the node's capacity (in your case 16Gi). Just FYI, Uncapped Target gives the recommendation before applying the constraints specified in the VPA spec, such as min or max.
With this in mind, in your case, the target of 125m CPU and 0.85Gi (903203073 bytes) memory seems reasonable.
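As a rough illustration of how I understand the bounds are used (a sketch with made-up names, not the updater's actual code): the updater leaves a pod alone while its request lies between the lower and upper bound, and treats it as an eviction candidate otherwise so it can come back with the Target applied:

```go
package main

import "fmt"

// Recommendation mirrors the fields discussed above (memory values in bytes).
type Recommendation struct {
	LowerBound, Target, UpperBound int64
}

// needsUpdate is a simplified sketch of the updater's decision: a pod whose
// current request falls outside [LowerBound, UpperBound] is a candidate for
// eviction so it can be recreated with the Target request.
func needsUpdate(currentRequest int64, rec Recommendation) bool {
	return currentRequest < rec.LowerBound || currentRequest > rec.UpperBound
}

func main() {
	memRec := Recommendation{LowerBound: 419430400, Target: 903203073, UpperBound: 16855113438}

	fmt.Println(needsUpdate(536870912, memRec)) // 512Mi is inside the bounds -> false, pod keeps running
	fmt.Println(needsUpdate(268435456, memRec)) // 256Mi is below the lower bound -> true, eviction candidate
}
```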
The surprising thing is that there is no memory histogram. Is this because it will only appear after 24 hours?
Yes, it samples the peak per day
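For illustration only (a toy sketch, not the recommender's aggregation code), collapsing raw memory samples into one peak per 24h window looks roughly like this, which is why the memory histogram stays empty until the first full interval has passed:

```go
package main

import (
	"fmt"
	"time"
)

// dailyPeaks collapses raw memory usage samples into one peak per 24h window
// (keyed by the window start). A simplified sketch of the "one memory peak
// sample per day" idea described above.
func dailyPeaks(samples map[time.Time]float64) map[time.Time]float64 {
	peaks := make(map[time.Time]float64)
	for ts, usage := range samples {
		window := ts.Truncate(24 * time.Hour)
		if usage > peaks[window] {
			peaks[window] = usage
		}
	}
	return peaks
}

func main() {
	base := time.Date(2020, 4, 30, 0, 0, 0, 0, time.UTC)
	samples := map[time.Time]float64{
		base.Add(2 * time.Hour):  512e6,
		base.Add(8 * time.Hour):  903e6, // peak of day 1
		base.Add(30 * time.Hour): 700e6, // day 2 has only one sample so far
	}
	fmt.Println(dailyPeaks(samples)) // two entries: one peak per 24h window
}
```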
@yashbhutwala, thanks for taking the time to answer here, your answer is very precise. 👍 If you would like, could you add this answer to the FAQ? I think it would prove very useful to other users.
@djjayeeta the high upper bound at startup is due to confidence factor scaling. With more data, it will move closer to the 95th percentile.
https://github.com/kubernetes/autoscaler/blob/87eae1d207742bef168bf40e842b5a78b0600a26/vertical-pod-autoscaler/pkg/recommender/logic/recommender.go#L129
https://github.com/kubernetes/autoscaler/blob/87eae1d207742bef168bf40e842b5a78b0600a26/vertical-pod-autoscaler/pkg/recommender/logic/estimator.go#L114
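To make the linked confidence scaling a bit more concrete, here is my simplified reading of it (a sketch, not a verbatim copy of estimator.go; the constants and helper names here are assumptions): confidence grows with the observed history, and the upper bound is the percentile estimate multiplied by (1 + 1/confidence), so with only a couple of hours of data it is huge, and it converges toward the percentile as history accumulates:

```go
package main

import (
	"fmt"
	"math"
)

// confidence is a simplified stand-in for the recommender's confidence measure:
// roughly the number of days of history, capped by the sample count expressed
// as days of 1-sample-per-minute data (my reading of the linked estimator.go).
func confidence(historyDays, totalSamples float64) float64 {
	samplesAsDays := totalSamples / (60 * 24)
	return math.Min(historyDays, samplesAsDays)
}

// scaledUpperBound sketches the confidence multiplier on the upper bound:
// estimate * (1 + multiplier/confidence)^exponent, with multiplier=1 and
// exponent=1 assumed here, so very little history => a very large bound.
func scaledUpperBound(percentileEstimate, conf float64) float64 {
	return percentileEstimate * (1 + 1/conf)
}

func main() {
	estimate := 903e6 // e.g. a ~0.85Gi memory percentile estimate

	afterTwoHours := scaledUpperBound(estimate, confidence(2.0/24, 120)) // ~2h of history
	afterEightDays := scaledUpperBound(estimate, confidence(8, 8*60*24)) // ~8d of history

	fmt.Printf("upper bound after ~2h: %.0f bytes\n", afterTwoHours)  // ~13x the estimate
	fmt.Printf("upper bound after ~8d: %.0f bytes\n", afterEightDays) // ~1.1x the estimate
}
```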
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Hi, can I work on this? I think there is enough information in these comments: https://github.com/kubernetes/autoscaler/issues/2747#issuecomment-616037197 and https://github.com/kubernetes/autoscaler/issues/2747#issuecomment-622252843.
I can rephrase it and add it to the FAQ: https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/FAQ.md
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Hi @yashbhutwala, is it still the case that the VPA does not recommend the resource limits and only the requests?
@Duske yes.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
I cannot find any documentation of the recommender component's algorithm in the project. It is still a little complicated to learn it from the code. Is there really no documentation of the algorithm in the community?
E.g.: how the histograms are initialized and why they are designed that way, and how the recommended values are calculated.
I agree. I would like to know what heuristics are applied in the algorithm to ensure that correct target values are sent to different services with different usage patterns.
/reopen
I don't think this is solved, and it is very hard to find information about this.
@alvaroaleman: Reopened this issue.
/lifecycle frozen
I am wondering why the VPA recommender uses targetMemoryPeaksPercentile := 0.9. Shouldn't it use the maximum observed memory usage to avoid OOM kills?
@ManuelMueller1st it is a request, not a limit.
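To illustrate why a percentile of peaks behaves better for a request than the absolute maximum (a toy, unweighted example; the real recommender uses the decaying weighted histogram described earlier in this thread):

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the value at the given fraction (0..1) of the sorted
// slice. A naive, unweighted percentile, unlike the recommender's decaying
// weighted histogram, but enough to show the effect of 0.9 vs. the maximum.
func percentile(values []float64, p float64) float64 {
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	idx := int(p * float64(len(sorted)-1))
	return sorted[idx]
}

func main() {
	// Daily memory peaks in bytes; one day had an unusual spike.
	peaks := []float64{850e6, 870e6, 900e6, 880e6, 860e6, 890e6, 910e6, 3000e6}

	fmt.Printf("90th percentile of peaks: %.0f\n", percentile(peaks, 0.9)) // ~910e6 drives the request
	fmt.Printf("maximum peak:             %.0f\n", percentile(peaks, 1.0)) // 3e9 would massively over-provision
}
```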
Document
Please note that VPA recommendation algorithm is not part of the API and is subject to change without notice