kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

VPA - Document the current recommendation algorithm #2747

Open bskiba opened 4 years ago

bskiba commented 4 years ago

Document:

- how recommendations are calculated out of raw samples for CPU and Memory
- when it is reasonable to expect a stable recommendation for a new workload

Please note that VPA recommendation algorithm is not part of the API and is subject to change without notice

hochuenw-dd commented 4 years ago

@bskiba any updates on this?

yashbhutwala commented 4 years ago

This would be awesome!! How can I help?

yashbhutwala commented 4 years ago

I've been digging around the code for a bit; this is what I understand so far. Please correct me where I'm wrong 😃

To answer this:

how recommendations are calculated out of raw samples for CPU and Memory.

Recommendations are calculated using a decaying histogram of weighted samples from the metrics server, where newer samples are assigned higher weights; older samples decay and therefore have less and less influence on the recommendation. CPU is calculated as the 90th percentile of all CPU samples, and memory is calculated as the 90th percentile of daily peak memory usage over an 8-day window.
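Here is a rough, self-contained Go sketch of that idea (the type names, the 24-hour half-life, and the uniform per-sample weighting are illustrative simplifications, not the recommender's actual code): each sample is stored with a weight that grows exponentially with how recent it is relative to a reference time, which is equivalent to older samples decaying, and the recommendation is read off as a weighted percentile.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// weightedSample is one usage measurement together with its decayed weight.
type weightedSample struct {
	value  float64 // e.g. CPU cores used
	weight float64
}

// decayingHistogram keeps weighted samples; newer samples carry exponentially
// larger weights, so older samples matter less and less over time.
type decayingHistogram struct {
	halfLife time.Duration
	refTime  time.Time
	samples  []weightedSample
}

// add records a usage sample observed at time t. The weight doubles every
// halfLife after refTime, which is equivalent to halving the relative weight
// of samples that are one halfLife older.
func (h *decayingHistogram) add(value float64, t time.Time) {
	age := t.Sub(h.refTime).Hours() / h.halfLife.Hours()
	h.samples = append(h.samples, weightedSample{value: value, weight: math.Pow(2, age)})
}

// percentile returns the smallest observed value v such that the given
// fraction of the total weight lies at or below v.
func (h *decayingHistogram) percentile(p float64) float64 {
	if len(h.samples) == 0 {
		return 0
	}
	sorted := append([]weightedSample(nil), h.samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].value < sorted[j].value })
	total := 0.0
	for _, s := range sorted {
		total += s.weight
	}
	cum := 0.0
	for _, s := range sorted {
		cum += s.weight
		if cum >= p*total {
			return s.value
		}
	}
	return sorted[len(sorted)-1].value
}

func main() {
	// One CPU sample per minute over 8 days; the recommendation is the
	// weighted 90th percentile, dominated by the most recent usage.
	start := time.Now().Add(-8 * 24 * time.Hour)
	h := decayingHistogram{halfLife: 24 * time.Hour, refTime: start}
	for i := 0; i < 8*24*60; i++ {
		h.add(0.1+0.001*float64(i%100), start.Add(time.Duration(i)*time.Minute))
	}
	fmt.Printf("CPU target (p90) = %.3f cores\n", h.percentile(0.9))
}
```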

when it is reasonable to expect a stable recommendation for a new workload

8 days of history are used for the recommendation (1 memory usage sample per day). Prometheus can be used as a history provider in this calculation. By default, the VPA recommender collects data about all controllers, so when new VPA objects are created they can already provide stable recommendations (unless you specify memory-saver=true). All active VPA recommendations are checkpointed.
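And a minimal sketch of the "1 memory usage sample per day" part (the names and windowing below are illustrative, not the recommender's actual code): raw memory usage is collapsed into one value per 24-hour window, the peak, and only those peaks feed the memory histogram. This is also why a brand-new workload needs roughly a day before its memory histogram has anything in it.

```go
package main

import (
	"fmt"
	"time"
)

// usagePoint is a raw memory measurement, in bytes, at a point in time.
type usagePoint struct {
	t     time.Time
	bytes float64
}

// dailyPeaks collapses raw memory usage into one sample per 24h window:
// the peak observed in that window. Only these peaks would be added to
// the memory histogram sketched above.
func dailyPeaks(points []usagePoint, windowStart time.Time) map[int]float64 {
	peaks := map[int]float64{}
	for _, p := range points {
		day := int(p.t.Sub(windowStart).Hours() / 24)
		if p.bytes > peaks[day] {
			peaks[day] = p.bytes
		}
	}
	return peaks
}

func main() {
	start := time.Now().Add(-2 * 24 * time.Hour)
	points := []usagePoint{
		{t: start.Add(1 * time.Hour), bytes: 300e6},
		{t: start.Add(5 * time.Hour), bytes: 450e6},  // day 0 peak
		{t: start.Add(30 * time.Hour), bytes: 380e6}, // day 1 peak
	}
	for day, peak := range dailyPeaks(points, start) {
		fmt.Printf("day %d peak: %.0f bytes\n", day, peak)
	}
}
```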

Please note that VPA recommendation algorithm is not part of the API and is subject to change without notice

saw this in the code here 🙃

...I'm not sure if it's possible to get a "stable" recommendation before 8 days...

djjayeeta commented 4 years ago

@yashbhutwala Great Summary!!

I am getting a huge upper bound for my recommendation at startup and I am trying to understand the behavior.

Below is the VPA object.

```
Name:         xxx-vpa
Namespace:    xxxx
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"autoscaling.k8s.io/v1beta2","kind":"VerticalPodAutoscaler","metadata":{"annotations":{},"name":"kube-apiserver-vpa","namesp...
API Version:  autoscaling.k8s.io/v1
Kind:         VerticalPodAutoscaler
Metadata:
  Creation Timestamp:  2020-04-30T20:39:43Z
  Generation:          165
  Resource Version:    1683082
  Self Link:           xxxx
  UID:                 <some_number>
Spec:
  Resource Policy:
    Container Policies:
      Container Name:        c_name
      Controlled Resources:  cpu memory
      Min Allowed:
        Cpu:           100m
        Memory:        400Mi
      Container Name:  c_name2
      Mode:            Off
  Target Ref:
    API Version:  apps/v1beta2
    Kind:         StatefulSet
    Name:         name
  Update Policy:
    Update Mode:  Auto
Status:
  Conditions:
    Last Transition Time:  2020-04-30T20:40:07Z
    Status:                True
    Type:                  RecommendationProvided
  Recommendation:
    Container Recommendations:
      Container Name:  c_name
      Lower Bound:
        Cpu:     100m
        Memory:  400Mi
      Target:
        Cpu:     125m
        Memory:  903203073
      Uncapped Target:
        Cpu:     125m
        Memory:  903203073
      Upper Bound:
        Cpu:     2967m
        Memory:  16855113438
Events:  <none>
```

I don't want to set an upper limit in the VPA object. I don't have checkpoints since history is loaded from the Prometheus server, but I noticed this huge upper bound (the numbers may be slightly different) irrespective of whether I load from a checkpoint or from Prometheus. Can you tell me why the algorithm gives such a high upper bound?

Also, there are no OOM events in VPA recommender logs.

I did the same experiment without the Prometheus server and got similar numbers. I checked the VPA checkpoint:

```json
"status": {
  "cpuHistogram": {
    "bucketWeights": {
      "1": 10000,
      "10": 728,
      "11": 1088,
      "12": 121,
      "2": 2891,
      "3": 1009,
      "4": 686,
      "5": 240,
      "8": 41,
      "9": 5436
    },
    "referenceTimestamp": "2020-05-01T00:00:00Z",
    "totalWeight": 51.24105164098685
  },
  "firstSampleStart": "2020-04-30T20:39:30Z",
  "lastSampleStart": "2020-04-30T22:11:02Z",
  "lastUpdateTime": null,
  "memoryHistogram": {
    "referenceTimestamp": "2020-05-02T00:00:00Z"
  },
  "totalSamplesCount": 552,
  "version": "v3"
}
```

What is surprising is that there is no memory histogram. Is this because it will only appear after 24 hours?

I deleted the VPA object and the checkpoint and then recreated the VPA object, but I am still getting huge upper bounds 2 hours after startup. How is it recommending memory without any histogram?

Can you please answer this?

yashbhutwala commented 4 years ago

@djjayeeta Good questions!! I'm not an expert here, but as far as I understand it, the most important value for you to look at is Target. This is the recommendation for what to set the requests to. There is currently no limit recommendation given by VPA.

Lower Bound and Upper Bound are only meant to be used by the VPA updater: as long as a pod's requests fall within that range, the updater lets it keep running and does not evict it. For the upper bound, I suspect it is capped by default at the node's capacity (in your case 16Gi). Just FYI, Uncapped Target gives the recommendation before applying the constraints specified in the VPA spec, such as min or max.

With this in mind, in your case the target of 125m CPU and roughly 0.85Gi (903203073 bytes) of memory seems reasonable.
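As a small illustration of the capping step mentioned above (illustrative types, not the real API structs): the uncapped target is the raw percentile-based recommendation, and the published target is that value clamped into the [minAllowed, maxAllowed] range from the container resource policy.

```go
package main

import "fmt"

// containerPolicy mirrors the minAllowed/maxAllowed idea from the VPA spec's
// container resource policy (an illustrative struct, not the real API type).
type containerPolicy struct {
	minAllowed, maxAllowed float64
}

// capTarget clamps the uncapped recommendation into [minAllowed, maxAllowed].
// "Uncapped Target" in the status is the value before this step,
// "Target" is the value after it. A zero bound means "not set".
func capTarget(uncapped float64, p containerPolicy) float64 {
	if p.minAllowed > 0 && uncapped < p.minAllowed {
		return p.minAllowed
	}
	if p.maxAllowed > 0 && uncapped > p.maxAllowed {
		return p.maxAllowed
	}
	return uncapped
}

func main() {
	// e.g. CPU in milli-cores with minAllowed of 100m and no max set.
	policy := containerPolicy{minAllowed: 100}
	fmt.Println(capTarget(125, policy)) // 125 (unchanged, as in the example above)
	fmt.Println(capTarget(60, policy))  // 100 (raised to minAllowed)
}
```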

What is surprising is that there is no memory histogram. Is this because it will only appear after 24 hours?

Yes, it samples the peak per day

bskiba commented 4 years ago

@yashbhutwala, thanks for taking the time to answer here; your answer is very precise. 👍 If you would like, could you add this answer to the FAQ? I think it would prove very useful to other users.

avmohan commented 4 years ago

@djjayeeta The high upper bound at startup would be due to confidence factor scaling. With more data, it converges towards the 95th percentile.

https://github.com/kubernetes/autoscaler/blob/87eae1d207742bef168bf40e842b5a78b0600a26/vertical-pod-autoscaler/pkg/recommender/logic/recommender.go#L129
https://github.com/kubernetes/autoscaler/blob/87eae1d207742bef168bf40e842b5a78b0600a26/vertical-pod-autoscaler/pkg/recommender/logic/estimator.go#L114
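For intuition, a simplified sketch of that confidence scaling, loosely following the linked code (the parameter values below are illustrative): confidence grows with the amount of history behind the estimate, and the percentile estimate is multiplied by (1 + multiplier/confidence)^exponent, which is very large right after startup and approaches 1 as history accumulates; hence the huge upper bound in the first hours.

```go
package main

import (
	"fmt"
	"math"
)

// confidence grows with the amount of history: roughly the minimum of the
// observed lifespan in days and the sample count expressed in "days worth"
// of per-minute samples (simplified from the linked estimator code).
func confidence(lifespanDays, totalSamples float64) float64 {
	return math.Min(lifespanDays, totalSamples/(60*24))
}

// scaled inflates (or slightly shrinks) a base percentile estimate depending
// on how much history backs it: scale = (1 + multiplier/confidence)^exponent.
// With little history the factor is large; with more history it approaches 1.
func scaled(base, multiplier, exponent, conf float64) float64 {
	return base * math.Pow(1+multiplier/conf, exponent)
}

func main() {
	base := 1.0 // e.g. 1 GiB memory estimate at the chosen percentile
	for _, days := range []float64{0.1, 1, 8, 30} {
		conf := confidence(days, days*60*24)
		// Illustrative parameters: a large positive factor for the upper
		// bound, a mildly shrinking one for the lower bound.
		upper := scaled(base, 1.0, 1.0, conf)
		lower := scaled(base, 0.001, -2.0, conf)
		fmt.Printf("history %.1fd: lower %.3f  upper %.3f\n", days, lower, upper)
	}
}
```

Running this shows the upper-bound factor at roughly 11x with only a couple of hours of history, dropping to about 1.1x after 8 days, which matches the "huge upper bound at startup" behavior described above.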

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

yashbhutwala commented 4 years ago

/remove-lifecycle stale

shekhar-rajak commented 4 years ago

Hi, can I work on this? I think there is enough information in these comments: https://github.com/kubernetes/autoscaler/issues/2747#issuecomment-616037197 and https://github.com/kubernetes/autoscaler/issues/2747#issuecomment-622252843.

I can rephrase it and add in the FAQ: https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/FAQ.md

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

bskiba commented 3 years ago

/remove-lifecycle stale

Duske commented 3 years ago

Hi @yashbhutwala, is it still the case that the VPA does not recommend the resource limits and only the requests?

yashbhutwala commented 3 years ago

@Duske yes.

k8s-triage-robot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue or PR with `/reopen`
- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 3 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/2747#issuecomment-927060549):

>The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
>This bot triages issues and PRs according to the following rules:
>- After 90d of inactivity, `lifecycle/stale` is applied
>- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
>- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
>You can:
>- Reopen this issue or PR with `/reopen`
>- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
>- Offer to help out with [Issue Triage][1]
>
>Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
>/close
>
>[1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
gosoon commented 2 years ago

I cannot find any documentation of the Recommender component's algorithm in the project, and it is still a little complicated to learn from the code. Is there no documentation of the algorithm in the community?

For example: how the histogram is initialized and why it is designed that way, and how the recommended value is calculated.

ashvinsharma commented 1 year ago

I agree. I would like to know what heuristics are applied in the algorithm to ensure that correct target values are sent to different services with different usage patterns.

alvaroaleman commented 1 year ago

/reopen

I don't think this is solved and it is very hard to find information about this

k8s-ci-robot commented 1 year ago

@alvaroaleman: Reopened this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/2747#issuecomment-1546076131):

>/reopen
>
>I don't think this is solved and it is very hard to find information about this

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
alvaroaleman commented 1 year ago

/lifecycle frozen

ManuelMueller1st commented 10 months ago

I am wondering why the VPA recommender uses `targetMemoryPeaksPercentile := 0.9`. Shouldn't it use the maximum observed memory usage to avoid OOM kills?

https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/pkg/recommender/logic/recommender.go#L108

pierreozoux commented 2 months ago

@ManuelMueller1st It is a request, not a limit.
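To make that concrete, here is a sketch using the standard Kubernetes API types (the 2Gi limit is hypothetical; as noted above, VPA does not recommend limits): the target lands in the container's `resources.requests`, while the threshold at which the kernel OOM-kills the container is its memory limit (or node-level memory pressure), so a 90th-percentile request does not by itself create a kill threshold below the observed maximum.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// The VPA target is applied to the memory *request*; the *limit*, if the
	// pod owner sets one, is what the kernel enforces with an OOM kill.
	container := corev1.Container{
		Name: "c_name",
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{
				// Illustrative: the target from the example earlier in the thread.
				corev1.ResourceMemory: resource.MustParse("903203073"),
			},
			Limits: corev1.ResourceList{
				// Hypothetical user-chosen limit; VPA does not recommend this value.
				corev1.ResourceMemory: resource.MustParse("2Gi"),
			},
		},
	}
	fmt.Println("memory request:", container.Resources.Requests.Memory().String())
	fmt.Println("memory limit:  ", container.Resources.Limits.Memory().String())
}
```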