jeevantpant opened this issue 2 months ago
@jeevantpant check if this helps; we were having a similar issue with scale as well.
Hello! At scale, there are two configurations that could be affecting you and creating the bottleneck:
For the parallelism topic, I'd suggest increasing the current value of KEDA_SCALEDOBJECT_CTRL_MAX_RECONCILES from 5 to, say, 20 (and check if it improves or solves the issue; if it only improves, increase it further) -> https://keda.sh/docs/2.15/operate/cluster/#configure-maxconcurrentreconciles-for-controllers. This allows more ScaledObjects to be reconciled in parallel (if this is the bottleneck).
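For reference, a minimal sketch of how that env var could be set on the operator Deployment (the `keda` namespace and `keda-operator` names assume a default install; with a Helm-based install you would pass it through the chart's env settings instead):

```yaml
# Sketch (default install assumed): patch merged into the keda-operator Deployment
# to raise the number of ScaledObjects reconciled in parallel.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keda-operator
  namespace: keda
spec:
  template:
    spec:
      containers:
        - name: keda-operator
          env:
            - name: KEDA_SCALEDOBJECT_CTRL_MAX_RECONCILES
              value: "20"   # default is 5; raise further if reconciliation is still the bottleneck
```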
For Kubernetes client throttling, you can increase these other parameters -> https://keda.sh/docs/2.15/operate/cluster/#kubernetes-client-parameters. If you are affected by this, you should see log messages announcing the rate limit and the waiting time it causes. In that case, I'd recommend doubling them and monitoring how it performs; if that's not enough, double them again, check again, and so on.
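As a rough sketch of what that tuning looks like (the `--kube-api-qps` / `--kube-api-burst` flags come from the linked KEDA docs; the container excerpt itself is only illustrative):

```yaml
# Sketch: raising the client-side rate limits on the keda-operator container.
# Defaults are qps=20 / burst=30; start by doubling them and monitor.
containers:
  - name: keda-operator
    args:
      - --kube-api-qps=40
      - --kube-api-burst=60
```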
There have also been some improvements related to status handling, so upgrading to v2.15 could improve performance, as it significantly reduces the calls to the API server in some cases (if that is the root cause in your case).
Thanks so much @JorTurFer for the insightful suggestions and options to try out. They seem to have solved the scale-out issue we were facing with our deployments.
I wanted to post our observations and findings from trying each of the suggestions above.
1) After upgrading KEDA to v2.15, the total time to scale out all deployment replicas dropped to about 50 minutes (it was previously taking 2h 30m).
2) Along with the v2.15 upgrade, we also updated the parallel reconciliation setting, raising KEDA_SCALEDOBJECT_CTRL_MAX_RECONCILES to 20 as per your recommendation. We observed no change in the scale-out time during the cron schedule, which suggests that parallel reconciliation of ScaledObjects is not our bottleneck.
3) Finally, on top of the v2.15 upgrade, I updated the Kubernetes client parameters on the operator. Below are the parameter values and observations. The defaults were kube-api-qps=20 and kube-api-burst=30, and I kept roughly the same ratio when increasing them. The behavior below was consistent (I tried multiple fresh installs over a few weeks to confirm each case); the settings we ended up with are sketched after this list.
a) kube-api-qps: 20 -> 40 / kube-api-burst: 30 -> 60. This setting significantly improved the scale-out time: for all subsequent scaling windows, the time taken to fully trigger all HPAs to scale out to the desired replicas was consistently reduced to approximately 2 minutes.
b) kube-api-qps: 40 -> 60 / kube-api-burst: 60 -> 90
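Putting the pieces together, a hedged sketch of the operator configuration we ended up with (the container excerpt is illustrative; flag and env var names are as in the KEDA docs linked above):

```yaml
# Illustrative excerpt of the keda-operator container after tuning:
# parallel reconciles raised to 20, client rate limits raised to 60/90.
containers:
  - name: keda-operator
    args:
      - --kube-api-qps=60
      - --kube-api-burst=90
    env:
      - name: KEDA_SCALEDOBJECT_CTRL_MAX_RECONCILES
        value: "20"
```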
One final question on the above configuration, @JorTurFer, if you could please help us with it:
a) Do you think that setting the kube client parameters to [ kube-api-qps: 60 / kube-api-burst: 90 ] poses any risk or issue on a busier cluster, where there is significantly more traffic to the kube-api-server?
b) Have you ever used values this high in a live setup, or seen such high values for these parameters cause any issues/stress in your experience?
Hello! The right values for kube-api-* are a bit above the minimum values that make the local throttling messages disappear (the KEDA operator logs say when there has been throttling). About the usage, I know there are companies running over 3K ScaledObjects in almost real time, so I think your scenario could still be improved.
About using such high values: I know of clusters configured with 600/900 (and even higher in one case). The right values depend on the scaler topology, the amount of failures, etc. I think that in a cluster that already has 1k ScaledObjects, the control plane should be big enough to handle those requests (but monitoring is always a good idea).
Report
We observe performance degradation when scaling out a large number of deployments, say N, together via KEDA. We tested scaling behavior with N = 100, 200, 500, 1000, 1500, and 2000 ScaledObjects. We expect KEDA to scale each deployment's replicas from 0 to 2 during the activation window.
NOTE:
Expected Behavior
Actual Behavior
Steps to Reproduce the Problem
```
#Scaleobject.yaml
#Deployment.yaml
```
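For illustration, a minimal sketch of a cron-based ScaledObject matching the behavior described in this report (the deployment name is hypothetical; the schedule and timezone are assumed from the CRON window timing noted below):

```yaml
# Illustrative only: one of the N ScaledObjects used in the test.
# The target Deployment name is hypothetical; start/end match the reported
# 14:00-19:00 (+05:30) activation window; desiredReplicas matches the expected 0 -> 2 scale-out.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sample-app-scaledobject
spec:
  scaleTargetRef:
    name: sample-app          # hypothetical Deployment name
  minReplicaCount: 0
  maxReplicaCount: 2
  triggers:
    - type: cron
      metadata:
        timezone: Asia/Kolkata   # assumed from the +05:30 offset in the logs
        start: 0 14 * * *
        end: 0 19 * * *
        desiredReplicas: "2"
```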
Logs from KEDA operator
CRON window timing - Start: 2024-08-05T14:00:00.000+05:30, End: 2024-08-05T19:00:00.000+05:30
KEDA Version
2.13.1
Kubernetes Version
1.28
Platform
Amazon Web Services
Scaler Details
CRON
Anything else?
No response