kedacore / keda

KEDA is a Kubernetes-based Event Driven Autoscaling component. It provides event-driven scale for any container running in Kubernetes.
https://keda.sh
Apache License 2.0

Introduce tolerance setting in ScaleObject #5486

SpiritZhou commented 10 months ago

Proposal

Introduce a horizontal-pod-autoscaler-tolerance setting in ScaledObject, such as:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: testso
  namespace: test
spec:
  scaleTargetRef:
    name: sut
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        tolerance: 0.5
  triggers:
  - type: cpu
    metadata:
      type: Utilization 
      value: "60"

We created a ScaledObject with a CPU trigger and set the value to 80%, but the pod did not scale out even though the CPU metric reached the 80% target. The root cause is the globally configured horizontal-pod-autoscaler-tolerance, which defaults to 0.1, so the pod is only scaled out once the metric exceeds 88%. We believe this behavior is not very reasonable, for the following reasons:

  1. If we set the CPU trigger value to 90%, the effective scale-out threshold becomes 99%, which means the pod will practically never scale out even when the CPU is fully utilized.
  2. This global setting, horizontal-pod-autoscaler-tolerance, can only be configured on the kube-controller-manager, which is difficult to change in some environments.
  3. The tolerance value is not obvious, and there is no hint of it in the ScaledObject, which can be misleading for users.

We hope that this configuration can be added to the ScaledObject so that users can configure and inspect it more easily. Additionally, we have found it difficult to change the controller manager's configuration from the KEDA side. Would it be possible to ask the Kubernetes community to improve this behavior in the HPA based on these KEDA use cases?
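For context, here is a minimal sketch in Go of how the tolerance gates the scaling decision. It mirrors, but is not copied from, the upstream HPA replica calculation; the function name is ours, and the numbers follow the 80% CPU example above.

package main

import (
    "fmt"
    "math"
)

// desiredReplicas mimics the core of the HPA replica calculation:
// while the usage ratio stays within the tolerance band, no scaling happens.
func desiredReplicas(currentReplicas int32, currentMetric, targetMetric, tolerance float64) int32 {
    usageRatio := currentMetric / targetMetric
    if math.Abs(1.0-usageRatio) <= tolerance {
        // Within tolerance: the controller leaves the replica count unchanged.
        return currentReplicas
    }
    return int32(math.Ceil(usageRatio * float64(currentReplicas)))
}

func main() {
    // Target 80% CPU with the default tolerance of 0.1:
    // nothing happens until utilization leaves the 72%-88% dead band.
    fmt.Println(desiredReplicas(4, 85, 80, 0.1)) // 4: 85/80 = 1.0625 is inside the band
    fmt.Println(desiredReplicas(4, 89, 80, 0.1)) // 5: 89/80 = 1.1125 > 1.1, so scaling resumes
}

With the tolerance exposed per ScaledObject, as in the YAML above, users could shrink this dead band for workloads that need tighter tracking.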

Use-Case

No response

Is this a feature you are interested in implementing yourself?

Yes

Anything else?

No response

JorTurFer commented 10 months ago

This is an awesome idea, but it requires support from the HPA controller, and sadly that discussion has been open since March 2023: https://github.com/kubernetes/kubernetes/issues/116984

If the HPA Controller supports it, I fully agree with adding support here 😄

ykyr commented 10 months ago

@SpiritZhou Thank you for opening this. Today we hit a very similar use case. We use KEDA to scale based on many different external metrics; in this example we used the Datadog scaler. The idea was simple: if the Datadog metric reports 15, we expect 15 pods; if it reports 14, we expect 14 pods. They should match, basically.

Getting the exact number of pods turned out to be non-trivial, and it's all due to the HPA's 10% tolerance.

Example of ScaledObject:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example
spec:
  scaleTargetRef:
    name: example-deployment
  minReplicaCount: 1
  maxReplicaCount: 16
  cooldownPeriod: 30
  fallback:
    failureThreshold: 60
    replicas: 16
  triggers:
    - type: datadog
      authenticationRef:
        name: keda-trigger-datadog-auth
        kind: ClusterTriggerAuthentication
      metadata:
        query: "avg:custom.node_scale_cnt{env:prod} by {dc}"
        queryValue: "1"

It would be amazing to have tolerance configurable per ScaledObject.
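To make the mismatch concrete, here is a hedged sketch of the same tolerance rule applied to this ScaledObject, assuming the default AverageValue metric type for the external metric; the helper name and numbers are illustrative, not taken from KEDA.

package main

import (
    "fmt"
    "math"
)

// desiredReplicasAverageValue applies the same tolerance rule, but for an
// AverageValue target: the metric total is divided across the current
// replicas before comparing against the per-pod target.
func desiredReplicasAverageValue(currentReplicas int32, metricTotal, targetPerPod, tolerance float64) int32 {
    usageRatio := (metricTotal / float64(currentReplicas)) / targetPerPod
    if math.Abs(1.0-usageRatio) <= tolerance {
        return currentReplicas
    }
    return int32(math.Ceil(metricTotal / targetPerPod))
}

func main() {
    // Datadog reports 15, queryValue is "1", 14 replicas are running:
    // 15/14 ≈ 1.07 is inside the 10% tolerance, so the HPA stays at 14 pods.
    fmt.Println(desiredReplicasAverageValue(14, 15, 1, 0.1)) // 14, not the expected 15
    // Only once the metric exceeds 14 * 1.1 = 15.4 does scaling resume,
    // and by then the computed count has already jumped past 15.
    fmt.Println(desiredReplicasAverageValue(14, 15.5, 1, 0.1)) // 16
}

A per-ScaledObject tolerance of 0 would let the replica count track the metric exactly, which is what this use case needs.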

stale[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

tomkerkhove commented 7 months ago

@SpiritZhou Are you willing to contribute https://github.com/kubernetes/kubernetes/issues/116984?

SpiritZhou commented 7 months ago

I can have a try.

stale[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 2 months ago

This issue has been automatically closed due to inactivity.