kedacore / keda

KEDA is a Kubernetes-based Event Driven Autoscaling component. It provides event driven scale for any container running in Kubernetes
https://keda.sh
Apache License 2.0

Performance Degradation while scaling out large number of Deployments, 700<N<1250 #6063

Open jeevantpant opened 2 months ago

jeevantpant commented 2 months ago

Report

We observe performance degradation while scaling out a large number of deployments (N) together via KEDA. We tested scaling behavior with N = 100, 200, 500, 1000, 1500, and 2000 ScaledObjects. We expect KEDA to scale each deployment's replicas from 0 to 2 during the activation window.


Expected Behavior

- Every HPA object should call the KEDA metrics API server every 15s (the default) to fetch metrics, starting from the cron start window time.
- The KEDA metrics API server logs the request made by the HPA and internally calls the KEDA operator to compute the actual external metric, which is visible in the KEDA operator gRPC logs.
- Finally, the KEDA metrics API server also logs when the metrics are successfully calculated and exposed by the KEDA operator.
- Every ScaledObject should be reconciled every 30s by the KEDA operator (this is the ScaledObject polling interval; see the sketch after this list).
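A minimal sketch of where the 30s comes from, assuming the default `pollingInterval` on a ScaledObject (names reused from the reproduction manifests further below):

```yaml
# Sketch: the operator-side poll/reconcile cadence is the ScaledObject's
# pollingInterval, which defaults to 30 seconds; shown explicitly here.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: app-scaledobjecttxy-10
  namespace: test-ns
spec:
  pollingInterval: 30          # seconds; default is 30
  scaleTargetRef:
    name: app-deployedtxy-10
```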

Actual Behavior

- A few of the HPAs only call the KEDA metrics API server to fetch metrics 2h 30m after the cron start window time.
- We see a latency of around 1 minute in generating and exposing the external metric during the handshake between the KEDA operator and the KEDA metrics API server.
- We observe pressure on the KEDA operator, where each reconciliation/polling pass takes >30s.

Steps to Reproduce the Problem

  1. Create the ScaledObject below, targeting a simple deployment with one container.

```yaml
# scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: app-scaledobjecttxy-10
  namespace: test-ns
spec:
  scaleTargetRef:
    name: app-deployedtxy-10
  minReplicaCount: 0
  advanced:
    restoreToOriginalReplicaCount: true
  triggers:
    - type: cron
      metadata:
        timezone: Asia/Kolkata
        start: 00 14 * * * # every day at 2pm IST
        end: 00 19 * * * # every day at 7pm IST
        desiredReplicas: "2"
      name: "cron-sample"
```

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-deployedtxy-10
  namespace: test-ns
  labels:
    app: app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      name: app
      labels:
        app: app
    spec:
      securityContext:
        runAsUser: 1001
        runAsGroup: 1001
      imagePullSecrets:
        - name: test
      serviceAccount: test
      containers:
        - name: app-cont-tx
          image: test-image
          command: ["/bin/sh"]
          args: ["-c", "while true; do echo $(date -u); sleep 30; done"]
          resources:
            requests:
              memory: "700Mi"
              cpu: "30m"
            limits:
              memory: "700Mi"
              cpu: "30m"
```

  2. Create N ScaledObjects/Deployments like this; in our case N = 1050. (Any value of N between 700 and 1250 showed this behavior and can be used to reproduce the bug.)
  3. Make sure there is no resource crunch while scaling: there must be enough compute for all 1050 deployments to scale to 2 replicas each (sufficient worker nodes and a surplus namespace ResourceQuota).

Logs from KEDA operator

Cron window timing: start 2024-08-05T14:00:00.000+05:30, end 2024-08-05T19:00:00.000+05:30

The first request for one of the affected ScaledObjects, app-scaledobjecttxy-10, is logged only at 2024-08-05T16:33:19.216+05:30.

[keda-operator-reconcile-logs.json](https://github.com/user-attachments/files/16579557/keda-operator-reconcile-logs.json)
[keda-operator-logs.csv](https://github.com/user-attachments/files/16579559/keda-operator-logs.csv)
[keda-metricsapi-server-logs.csv](https://github.com/user-attachments/files/16579560/keda-metricsapi-server-logs.csv)

KEDA Version

2.13.1

Kubernetes Version

1.28

Platform

Amazon Web Services

Scaler Details

CRON

Anything else?

No response

deefreak commented 2 months ago

@jeevantpant check if this helps; we were having a similar issue at scale as well.

https://github.com/kedacore/keda/issues/5624

JorTurFer commented 2 months ago

Hello! At scale, there are two configurations that could be affecting you and creating the bottleneck:

For the parallelism side, I'd suggest increasing KEDA_SCALEDOBJECT_CTRL_MAX_RECONCILES from its current value of 5 to, say, 20 (check whether that improves or fully solves the issue; if it only improves it, increase it further) -> https://keda.sh/docs/2.15/operate/cluster/#configure-maxconcurrentreconciles-for-controllers. This allows more ScaledObjects to be reconciled in parallel (if that is the bottleneck).
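A minimal sketch of how that could look, assuming KEDA runs as a Deployment with a container named keda-operator (the exact manifest layout depends on how KEDA was installed, e.g. Helm, YAML, or OLM):

```yaml
# Sketch: raising ScaledObject controller parallelism on the keda-operator
# container via the env var mentioned above (Deployment excerpt, value illustrative).
spec:
  template:
    spec:
      containers:
        - name: keda-operator
          env:
            - name: KEDA_SCALEDOBJECT_CTRL_MAX_RECONCILES
              value: "20"   # default is 5; raise further if it helps but does not fully solve it
```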

For Kubernetes client throttling, you can increase these other parameters -> https://keda.sh/docs/2.15/operate/cluster/#kubernetes-client-parameters. If you are affected by this, you should see messages announcing the rate limit and the waiting time it causes. In that case, I'd recommend doubling the values and monitoring how it performs; if that's not enough, double them again, and so on.
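As a rough illustration of the doubling suggestion, assuming the client flags are passed directly to the keda-operator container (flag names follow the cluster-operation docs linked above; values are just the first doubling step):

```yaml
# Sketch: relaxing client-side throttling of the KEDA operator's Kubernetes
# client (container excerpt; start by doubling the defaults and monitor).
containers:
  - name: keda-operator
    args:
      - --kube-api-qps=40     # default 20
      - --kube-api-burst=60   # default 30
```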

JorTurFer commented 2 months ago

There have also been some improvements related to status handling, so upgrading to v2.15 could improve performance, as it significantly reduces the calls to the API server in some cases (if that is the root cause in your case).

jeevantpant commented 4 days ago

Thanks so much @JorTurFer for the insightful suggestions and options to try out. They seem to have solved the issue we were facing while scaling out the deployments.

I wanted to post our observations and findings from trying each of the suggestions above.

1) After upgrading KEDA to v2.15, the total time to scale out all the deployment replicas dropped to 50 minutes (previously 2h 30m).

2) Along with the v2.15 upgrade, we also set KEDA_SCALEDOBJECT_CTRL_MAX_RECONCILES to 20 (more parallel reconciliations) as per your recommendation. We observed no change in the scale-out time during the cron schedule, which suggests that parallel reconciliation of ScaledObjects was not the bottleneck.

3) Finally, along with the v2.15 upgrade, the next thing I tried was updating the Kubernetes client parameters of the operator. Below are the parameter values and observations. The defaults were qps = 20 and burst = 30, and I kept the same ratio when increasing them. The behavior below was consistent (multiple fresh installs over a few weeks for each case):
   a) kube-api-qps: 20 -> 40 / kube-api-burst: 30 -> 60. This setting significantly impacted the scale-out time, where we noticed that

One final question on the above configuration, @JorTurFer, if you could please help us with it:

a) Do you think the following kube client parameter values [kube-api-qps: 60 / kube-api-burst: 90] would pose any risk or issue on a busier cluster, where there is significantly more traffic to the kube-apiserver?

b) Have you ever used values this high in a live setup, or seen values this high for these parameters cause any issues/stress in your experience?

JorTurFer commented 4 days ago

Hello! The right values for kube-api-* are a bit above the minimum that removes the local throttling messages (the KEDA operator logs report when there has been throttling). About usage: I know there are companies running over 3K ScaledObjects in near real time, so I think your scenario can still be improved.

About using values that high: I know of clusters configured with 600/900 (and even more in one case). The right values depend on the scaler topology, the amount of failures, etc. I think that in a cluster that already has 1k ScaledObjects, the control plane should be big enough to handle those requests (but monitoring is always a good idea).