kubernetes / kubernetes

Allow custom tolerance levels for horizontal pod autoscalers #116984

Open GautamSinghania opened 1 year ago

GautamSinghania commented 1 year ago

What would you like to be added?

There is a configuration flag for the HPA in the kube-controller-manager, horizontal-pod-autoscaler-tolerance, which defaults to 0.1.
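
For reference, today this can only be tuned cluster-wide via the kube-controller-manager flag, e.g. in a static pod manifest (the file location, image tag, and other details here are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: registry.k8s.io/kube-controller-manager:v1.27.0  # illustrative tag
    command:
    - kube-controller-manager
    - --horizontal-pod-autoscaler-tolerance=0.1  # cluster-wide; applies to every HPA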

An individual HPA should be allowed to set a custom value for this, overriding the default. Sample YAML:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-hpa
  namespace: holmes-seldon
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-deployment
  minReplicas: 20
  maxReplicas: 450
  metrics:
  - resource:
      name: cpu
      target:
        averageUtilization: 60
        type: Utilization
    type: Resource
  behavior:
    scaleDown:
      tolerance: 0.05  # custom tolerance value for scale down
    scaleUp:
      tolerance: 0.02  # custom tolerance value for scale up

Why is this needed?

Kubernetes clusters in organizations are often shared by multiple applications with different needs. Allowing each HPA to set its own tolerance limits (and possibly other HPA-related config) would help handle these different use cases smoothly.

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

GautamSinghania commented 1 year ago

/sig autoscaling

pbetkier commented 1 year ago

An alternative we were thinking of is removing tolerance altogether. I'm curious: is tolerance beneficial for you at all?

GautamSinghania commented 1 year ago

Removing tolerance might be useful to us (honestly, we will have to test that), but I imagine having tolerance is good in general.

du2016 commented 1 year ago

I also need this feature. Some services are very sensitive to delay, and using this feature can reduce the number of scaling events. For example, we need a way to scale up when CPU is above 60%, scale down when it is below 40%, and do neither between 40% and 60%.
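
Under the per-HPA tolerance proposed above, this dead band could be expressed directly (a sketch; the tolerance field does not exist yet, and the resource names are made up): with a 50% CPU target, a 0.2 tolerance on each side yields scale up above 60% and scale down below 40%.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: latency-sensitive-hpa  # hypothetical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: latency-sensitive-app  # hypothetical
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleUp:
      tolerance: 0.2   # proposed field: scale up only above 50% * 1.2 = 60%
    scaleDown:
      tolerance: 0.2   # proposed field: scale down only below 50% * 0.8 = 40%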

chenshiwei-io commented 1 year ago

> I also need this feature. Some services are very sensitive to delay, and using this feature can reduce the number of scaling events. For example, we need a way to scale up when CPU is above 60%, scale down when it is below 40%, and do neither between 40% and 60%.

+1

GautamSinghania commented 1 year ago

For anyone following this thread from before, I have updated the requirement to have separate custom tolerance levels for scale up and scale down. I believe this will be a minimal and impactful change.

pbetkier commented 1 year ago

@GautamSinghania could you explain the rationale behind having a different tolerance for scaling up and scaling down? Is your use case for differing tolerance levels solvable by tuning behavior controls?

GautamSinghania commented 1 year ago

@pbetkier I feel that different tolerances are a general ask and should be doable. In my case, the need comes from the fact that our pods take a long time to come up, so I want a higher tolerance for scale down and a lower tolerance for scale up. This lets me control the target values and the scale-up/scale-down tolerances separately.

Tuning behavior controls would be a roundabout way to do this, but it can be done.
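
For completeness, the behavior-controls workaround would look something like this (these are real autoscaling/v2 fields today, but they rate-limit and damp scaling rather than change the tolerance dead band; values are illustrative):

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0    # react to scale-up signals immediately
    policies:
    - type: Percent
      value: 100                     # allow up to doubling per period
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 600  # damp scale down for 10 minutes
    policies:
    - type: Pods
      value: 5                       # remove at most 5 pods per period
      periodSeconds: 60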

I feel that if the ask is too great or convoluted on the back-end, we could reduce it to a single custom tolerance control. However, if different tolerances are doable, it would be a great offering.

BojanZelic commented 1 year ago

In our use case we use external scalers, and the granularity of control isn't there without being able to set this tolerance value on a case-by-case basis.

A simple example, a cron-based scaler:

- 100 replicas between 1-2 PM
- 108 replicas between 2-3 PM

The external metric returns either 100 or 108 depending on the time of day, but the scale up never happens because of the default 10% tolerance setting.

Here's an example:

currentReplicas = 100

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
...
  metrics:
  - external:
      metric:
        name: my-cron
        selector:
          matchLabels:
            cron-scale: "true"
      target:
        averageValue: "1"
        type: AverageValue

During the first cron schedule, the pods scale up to 100. The usage ratio is 1/1 (a metric total of 100 across 100 pods against a target average value of 1), so desiredReplicas = 100:

status:
  currentMetrics:
  - external:
      current:
        averageValue: "1"
        value: "0"
      metric:
        name: my-cron
        selector:
          matchLabels:
            cron-scale: "true"

Then, during the second cron schedule, desiredReplicas is still 100: the usage ratio is 1080m/1 = 1.08, which deviates from 1.0 by less than the 10% tolerance, so the scale up is never triggered:

status:
  currentMetrics:
  - external:
      current:
        averageValue: "1080m"
        value: "0"
      metric:
        name: my-cron
        selector:
          matchLabels:
            cron-scale: "true"
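
Spelling out the arithmetic per the documented HPA algorithm, where desiredReplicas = ceil(currentReplicas × currentValue / targetValue) and scaling is skipped when the usage ratio is within tolerance of 1.0:

$$\left\lceil 100 \times \tfrac{1.08}{1} \right\rceil = 108, \qquad |1.08 - 1.0| = 0.08 \le 0.1 \Rightarrow \text{no scale up}$$
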
k8s-triage-robot commented 10 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

BojanZelic commented 10 months ago

/remove-lifecycle stale

kevin-bates commented 9 months ago

Our use case ties the HPA of a StatefulSet to its PVs: we want to scale up when the aggregate consumption of the PVs reaches 70%, and we use a custom metric to track that. However, the tolerance value "adjusts" the scale-up to occur at 77%, so we have to factor in the tolerance and set our criterion to 63% (knowing it won't scale up until the 10% tolerance has been exceeded). As a result, we'd love to disable tolerance on this HPA config so that the value we configure is the value that actually applies.
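
The arithmetic behind those numbers, assuming the default 0.1 tolerance:

$$70\% \times (1 + 0.1) = 77\%, \qquad 63\% \times 1.1 \approx 69.3\% \approx 70\%$$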

It seems like adding these tolerances (preferably separate ones for scale up and scale down) would be preferable to removing tolerance, if only for backward compatibility.

RRethy commented 8 months ago

This would be useful for us as well. We have clusters which have workloads ranging in size from 10 pods to 100s of pods, so tolerance would need to be different per workload.

vignesh-gm commented 5 months ago

This would be super helpful for speeding up our scale ups without affecting our scale-down rates. It belongs naturally with having different policies for scale up and scale down; not being able to set a different tolerance value for each limits the effectiveness of those scaling policies.

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

BojanZelic commented 1 month ago

/remove-lifecycle rotten

rashmichandrashekar commented 1 month ago

We have the same problem and the tolerance seems to be affecting scale up even when containers are getting OOMKilled. Are there any plans to address this?

alexpotv commented 1 month ago

We are also having the same problem here!

We have a deployment which takes long-running tasks (connecting to a camera and streaming from it until stopped) from a pool and processes them. Each pod, with the resources it has assigned, can process up to 4 tasks at a time. We have set up an HPA on this deployment, based on an external metric from our app (the total number of tasks to be processed at a given time).

The HPA considers that if there are, on average, more than 4 tasks per processing pod, more pods are needed. For example, if there are currently 36 tasks and 9 replicas (an average of 4 tasks per replica, so each replica is fully loaded) and a new task is added (37 tasks for 9 replicas, an average of about 4.11 tasks per replica), the HPA will create a new replica.

This mechanism breaks at 40 tasks because of the globally-set tolerance value. When 40 tasks are running on 10 replicas, the average is 4 tasks per replica, so the HPA will not create more replicas. When a new task is added, for a total of 41 tasks on 10 replicas, the average becomes 4.1 tasks per replica, which is within the tolerance, so the new task is never picked up by a processing pod.
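
Making the tolerance check explicit for this case (documented HPA formula, default 0.1 tolerance):

$$\left\lceil 10 \times \tfrac{4.1}{4} \right\rceil = \lceil 10.25 \rceil = 11, \qquad \left| \tfrac{4.1}{4} - 1 \right| = 0.025 \le 0.1 \Rightarrow \text{replicas stay at } 10$$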

Because we deploy our cluster on GKE, we cannot (I believe) configure the global tolerance value in the kube-controller-manager ourselves. This feature would fix our problem.

unityabir commented 1 week ago

We have a similar need for a configurable tolerance per HPA. We have multiple applications with over 1000 pods and a CPU threshold of 70%; the scale-up action only occurs once CPU utilization reaches 77% (0.1 tolerance), at which point more than 100 pods are added to the application at once. Ideally, we would want the scale up to start earlier and be more gradual, which a reduced tolerance would make possible.

We don't want to reduce the tolerance for the entire cluster, to avoid flapping behaviour in smaller applications for which a reduced tolerance won't be a good fit.
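
To see why roughly 100 pods arrive in a single step (illustrative, assuming exactly 1000 pods when the 77% trigger fires):

$$\left\lceil 1000 \times \tfrac{77}{70} \right\rceil = 1100$$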

In case you are working on this issue: I am not even sure that tolerance is the best approach here. Instead of having a threshold and tolerance like this:

threshold: 60
tolerance: 0.1
# which will translate to
upscaleAt: 66%
downscaleAt: 54%

Maybe it could be better to directly select the thresholds for upscale and downscale:

threshold: 50
upscaleAt: 52%
downscaleAt: 40%

This way, HPA behaviour will be easier to understand and configure.
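
For what it's worth, the per-direction tolerance proposed earlier in this thread could express the same dead band (a sketch; the tolerance field does not exist yet): with a 50% target, a scale-up tolerance of 0.04 triggers above 52% and a scale-down tolerance of 0.2 triggers below 40%.

metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 50
behavior:
  scaleUp:
    tolerance: 0.04  # proposed field: scale up above 50% * 1.04 = 52%
  scaleDown:
    tolerance: 0.2   # proposed field: scale down below 50% * 0.8 = 40%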