What would we like from a VPA Benchmark?

lallydd commented 1 year ago

To understand the capacity of a VPA deployment, in particular the recommender, what kind of performance measurements would we like from a benchmark?

Which component are you using?: This will likely end up as an end-to-end test running off of kind on a single box.

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

Reduce worry about the VPA's capacity and performance for stakeholders and users
Provide a baseline for performance engineering.

lallydd commented 1 year ago

First thoughts:

A full-throughput case for determining how many VerticalPodAutoscaler objects a single recommender can handle in a cluster.
- Especially in the External Metrics case, where it has to be queried repeatedly.
Similar for latency for adding a new pod in the admission-controller.

jbartosik commented 1 year ago

Sounds good to me.

I think that first thing likely to slow down is processing VPA objects in recommender.

As we get more and more costly to process VPA objects in a cluster (see VPA recommenders RunOnce):

UpdateVPAs and MaintainCheckpoints will take more time,
Eventually the function will hit the timeout (default 1 minute),
Then we will start limiting how many VPA objects we checkpoint in each Run.
- This will be visible in the recommenders execution_latency_seconds metric (time spent on checkpoints will start going down) and in age of checkpoints (they will be getting older)
- I expect age of checkpoints will grow faster than linearly with number of VPAs (number of VPAs to checkpoint is growing and time for doing that shrinks) here
Eventually we hit minCheckpointsPerRun (default 10) and time we spend per loop starts increasing (visible in execution_latency_seconds metric), checkpoint age will keep increasing (I'd guess age slows down growth to linear here).

For latency of admission controller I'm less worried but it'd be good to check it too.

voelzmo commented 1 year ago

There's also recommendation_latency_seconds which is important to understand how well the recommender can deal with many objects being created at the same time.

@lallydd is this also about adjusting the history storage of the recommender to something external? Or do your external metric sources currently only provide real-time data? If we're also taking history storage into account, it would be interesting how long it takes to backfill the historic data, but AFAIK we don't have a metric for this yet.

One thing to keep in mind for these tests is that they rely on a big-enough sizing for the client-side rate limits towards kube-apiserver (see discussions in https://github.com/kubernetes/autoscaler/issues/4498 where I tried to re-configure our recommender to fit our scale and discussed with @jbartosik)

lallydd commented 1 year ago

@voelzmo The external metric sources only provide real-time data. The one we intend to use for our use case will need some work to handle high query rates without overwhelming any upstream services. Another team member here is working on a different recommender that can take percentiles directly from an external data source.

The local-testing configuration I've put together for #5153 can let us run the benchmark to exhaust the recommender safely.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Shubham82 commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Shubham82 commented 6 months ago

/remove-lifecycle stale

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/autoscaler/issues/5493#issuecomment-2182120339): >The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. > >This bot triages issues according to the following rules: >- After 90d of inactivity, `lifecycle/stale` is applied >- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied >- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed > >You can: >- Reopen this issue with `/reopen` >- Mark this issue as fresh with `/remove-lifecycle rotten` >- Offer to help out with [Issue Triage][1] > >Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community). > >/close not-planned > >[1]: https://www.kubernetes.dev/docs/guide/issue-triage/ Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

kubernetes / autoscaler

What would we like from a VPA Benchmark? #5493