jashan05 opened 8 months ago
Hello, currently KEDA doesn't check the license, and I'm not sure whether it should. How should KEDA handle the overcommitting? I mean, imagine that you have 100 slots and you deploy 4 ScaledJobs, each with max 40, for example: which one gets preference if all of them need more than 25 replicas?
@JorTurFer Yes, you are right if we have a single ScaledJob spec. But let's consider the following scenario:
No of ScaledJob Specs | License Count
---|---
8 with different flavours of images and different demands | 100
That means I have to set maxReplicaCount = 100 for each ScaledJob spec, as users can use any one of them and it is hard to predict which. But this means KEDA is still querying and queuing 800 pods, and if the license count is 100, it is blocking 800 IPs.
Best Regards Jashan Sidhu
Yeah, I get your point, but I still don't see how to solve the overcommitting. Let's say that you have 5 ScaledJobs, each with max 100 because that's the license count, but all of them require 100 because you are at a peak. How should KEDA balance the requirements between them? I mean, you need 500, but you can have just 100, which means KEDA has to decide the priorities and weights of each ScaledJob. That's not just an autoscaling decision but a management decision.
Although we could measure the number of pods across all the ScaledJobs, now imagine that one ScaledJob is locking all the licenses while other jobs are queued for other agents. What should KEDA do here? Kill some jobs to make space for the others? Lock them until the others finish? I mean, there are several decisions here unrelated to the autoscaling itself.
WDYT @tomkerkhove @zroubalik ?
Hello @JorTurFer, I think it should lock them until the others are finished, or until available license count minus used license count is > 0. Otherwise it will always commit more resources than it has.
Best Regards Jashan Sidhu
> I think it should lock them until others are finished
Even if all the licenses are locked by a single pool? It could be risky IMHO, but I'd like to hear other folks' thoughts. @zroubalik @tomkerkhove @Eldarrin ?
The only option I'd see is that KEDA reports the maximum allowed number of licenses to Kubernetes to prevent it from adding more jobs; if we can even do that
The only possible scenario I can see is that the scalers use a shared-state model, but the problem here is that KEDA is just queuing what ADO Pipelines says to queue. ADO says 10 agents are required, KEDA queues 10 agents. If ADO has a license issue saying you can't queue 10 agents, then why is ADO stating that 10 agents are required?
So KEDA is just doing what it's told, and any solution we provide is actually just fixing ADO.
@JorTurFer No, licenses are not locked by a single agent pool. With a single API call we can check the used and free license counts at the org level.
@tomkerkhove Yes I agree.
@Eldarrin I think KEDA does not have the same behaviour as Azure DevOps at the moment. There can be jobs in the queue, but Azure DevOps always checks licenses before assigning jobs to an agent. If licenses are not sufficient, the jobs sit in the queue. IMHO, if KEDA also did that, it would solve the problem. I think to achieve this, KEDA needs to check the license count along with the queue and then decide whether to add a job or not.
The problem is with state. KEDA scalers are stateless: a scaler just checks the length of the queue and creates enough agents for it; it is not for KEDA to check whether items should be in the queue. Also, being stateless, even if we checked the licence count, each agent pool would spin up to a max of the license count; this is the same behaviour you get by just setting maxReplicaCount = licence count.
HTH
The problem with setting maxReplicaCount = licence count is that if you have multiple (n) ScaledJobs with different demands, then KEDA is making n × maxReplicaCount calls to Azure DevOps to check the queue.
What I think KEDA should do is make 2 API calls every time: check the queued jobs and the license details, and start the pods accordingly.

For example, to get the license details, below is the API call (in Python):
```python
import requests

def get_license_count(org):
    # oauth_token must hold a valid Azure DevOps bearer token
    headers = {'Authorization': f'Bearer {oauth_token}',
               'Accept': 'application/json;api-version=7.0-preview',
               'Content-Type': 'application/json'}
    org_license = requests.get(
        url=f'https://dev.azure.com/{org}/_apis/distributedtask/resourceusage?parallelismTag=Private&poolIsHosted=false&includeRunningRequests=true',
        headers=headers).json()
    return {'used_count': org_license['usedCount'],
            'total_license_count': org_license['resourceLimit']['totalCount']}
```
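Assuming a call like the one above returns the used and total counts, the scaler-side logic could then clamp the replica target to the free licenses. This is a minimal illustrative sketch, not KEDA's actual scaler code; the function name and parameters are hypothetical:

```python
def desired_replicas(queued_jobs: int, used_count: int,
                     total_license_count: int, max_replicas: int) -> int:
    """Clamp the replica target to free licenses (illustrative only)."""
    # Licenses still free at the org level; never negative.
    free_licenses = max(total_license_count - used_count, 0)
    # Never scale beyond the queue, the free licenses, or maxReplicaCount.
    return min(queued_jobs, free_licenses, max_replicas)
```

With 100 jobs queued, 99 of 100 licenses used, and maxReplicaCount = 40, this would start only 1 pod instead of 40, so no IPs are reserved for jobs that Azure DevOps cannot run anyway.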
I think that makes sense, though. Any concerns about adding this call?
Won't we hit the rate limiting? Maybe making it an optional parameter could be a good idea.
Optional would be good, but it will double the API calls, so rate limits are a concern if you have many scaler variants running.
Hello everyone,
Could you please let me know how we can proceed on this? Are there any plans to add this functionality?
Best Regards Jashan
> Could you please let me know how we can proceed on this. Are there any plans to add this functionality.
We agreed on the approach, but for the implementation we probably need someone willing to contribute it :)
Report
We are a centralised team providing KEDA agents for the whole organisation within a single cluster. This means we are scaling a lot of KEDA jobs. Issue: we have encountered an issue where, if an organisation has a parallel license count of 1 and queues 100 jobs, KEDA will already start reserving IPs for all the queued jobs. This causes a resource lock and also causes our subnets to run out of IP addresses.
Expected Behavior
While polling and queuing pods, KEDA should respect the license count available at the organization level in Azure DevOps.
Actual Behavior
KEDA doesn't respect the license count and already assigns IPs to the created pods, although Azure Pipelines cannot process the jobs because the available license count is not enough to handle all of them.
Steps to Reproduce the Problem
Logs from KEDA operator
KEDA Version
2.12.0
Kubernetes Version
< 1.26
Platform
Amazon Web Services
Scaler Details
Azure Pipelines
Anything else?
No response