kedacore / keda

KEDA is a Kubernetes-based Event Driven Autoscaling component. It provides event driven scale for any container running in Kubernetes
https://keda.sh
Apache License 2.0

Keda polling doesn't respect license count while queuing azure pipelines #5573

Open jashan05 opened 3 months ago

jashan05 commented 3 months ago

Report

We are a centralised team providing KEDA agents for the whole organisation within a single cluster, which means we scale a lot of KEDA jobs. Issue: if an organisation has a parallel license count of 1 and queues 100 jobs, KEDA already starts reserving IPs for all the queued jobs. This locks up resources and causes our subnets to run out of IP addresses.

Expected Behavior

While polling and queuing pods, KEDA should respect the license count available at the organization level in Azure DevOps.

Actual Behavior

KEDA doesn't respect the license count and already assigns IPs to the pods it creates, although Azure Pipelines cannot process the jobs because there are not enough licenses available to handle them all.

Steps to Reproduce the Problem

  1. Create an agent pool in Azure DevOps and a corresponding namespace in a cluster with KEDA
  2. Set the parallel license count for self-hosted agents in the Azure DevOps organisation to a number, e.g. 1
  3. Queue a dummy pipeline 10 times with a sleep in it, e.g. for 10 minutes

Logs from KEDA operator

example

KEDA Version

2.12.0

Kubernetes Version

< 1.26

Platform

Amazon Web Services

Scaler Details

Azure Pipelines

Anything else?

No response

JorTurFer commented 3 months ago

Hello, currently KEDA doesn't check the license, and I'm not sure whether it should. How should KEDA handle overcommitting? I mean, imagine that you have 100 slots and you deploy 4 ScaledJobs with a max of 40 each, for example: which one gets preference if all of them need more than 25 replicas?

jashan05 commented 3 months ago

@JorTurFer Yes, you are right if we have a single ScaledJob spec. But let's consider the following scenario:

No. of ScaledJob specs                                        License count
8, with different flavours of images and different demands    100

That means I have to set maxReplicaCount = 100 for each ScaledJob spec, as users can use any one of them and it is hard to predict which. But this means KEDA is still querying and queuing 800 pods, and if the license count is 100 then it is blocking 800 IPs.
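The arithmetic in this scenario can be sketched as below. This is only an illustration: the function name and the assumption that every pod holds exactly one IP are hypothetical and do not come from KEDA or Azure DevOps.

```python
def ip_reservation(scaled_job_specs: int, max_replica_count: int, license_count: int) -> dict:
    """Worst-case pod demand versus what the license count lets Azure DevOps run."""
    # Assumption: every ScaledJob spec scales to its own maxReplicaCount at once.
    worst_case_pods = scaled_job_specs * max_replica_count
    # Only as many agents as there are parallel licenses can pick up a job.
    runnable = min(worst_case_pods, license_count)
    return {
        "worst_case_pods": worst_case_pods,
        "runnable": runnable,
        "idle": worst_case_pods - runnable,
    }


print(ip_reservation(scaled_job_specs=8, max_replica_count=100, license_count=100))
```

With the numbers from this comment, 700 of the 800 pods would each hold an IP while waiting for a license.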

Best Regards Jashan Sidhu

JorTurFer commented 3 months ago

Yeah, I get your point, but I still don't see how to solve the overcommitting. Let's say that you have 5 ScaledJobs with max 100 because that's the license count, but all of them require 100 because you are at a peak. How should KEDA balance the requirements between them? I mean, you need 500 but you can have just 100, which means that KEDA has to decide the priorities and weights of each ScaledJob. It's not just an autoscaling decision but a management decision.

Although we could measure the number of pods across all the ScaledJobs, now imagine that one of the ScaledJobs is locking all the licenses and you have other jobs queued for other agents. What should KEDA do here? Kill some jobs to make space for the others? Block them until the others finish? I mean, there are several decisions here unrelated to the autoscaling itself.

WDYT @tomkerkhove @zroubalik ?

jashan05 commented 3 months ago

Hello @JorTurFer, I think it should lock them until others are finished, or until license count available - license count used > 0. Otherwise it is always going to commit to more resources.

Best Regards Jashan Sidhu

JorTurFer commented 3 months ago

I think it should lock them until others are finished

Even if all the licenses are locked by a single pool? It could be risky IMHO, but I'd like to see other folks' thoughts. @zroubalik @tomkerkhove @Eldarrin ?

tomkerkhove commented 3 months ago

The only option I'd see is that KEDA reports the maximum allowed number of licenses to Kubernetes to prevent it from adding more jobs; if we can even do that

Eldarrin commented 3 months ago

The only possible scenario I can see is that the scalers use a shared state model, but the problem here is that KEDA is just queuing what ADO pipelines says to queue. ADO says 10 agents are required, KEDA queues 10 agents. If ADO has a license issue saying you can't queue 10 agents, then why is ADO stating that 10 agents are required?

So KEDA is just doing what it's told, and any solution we provide is actually just fixing ADO.

jashan05 commented 3 months ago

@JorTurFer No, licenses are not locked by a single agent pool. With a single API call we can check the used and free license counts at the org level.

@tomkerkhove Yes I agree.

@Eldarrin I think KEDA does not behave the same way as Azure DevOps at the moment. There can be jobs in the queue, but Azure DevOps always checks licenses before assigning jobs to an agent. If there are not enough licenses, jobs sit in the queue. IMHO, if KEDA also does that, it solves the problem. I think to achieve this KEDA needs to check the license count along with the queue and then decide whether to add a job or not.

Eldarrin commented 3 months ago

The problem is with state. Keda scalers are stateless. It just checks the length of the queue and creates enough agents for it; it is not for Keda to check whether items should be in the queue. Also, being stateless even if we checked the licence count each agent will spin up to max of license count; this is the same behaviour as you get by just making maxReplicaCount = Licence Count.
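The stateless concern can be illustrated with a small simulation. It is a sketch under stated assumptions: the function names are hypothetical, and it assumes every scaler reads the same org-wide free-license number in one polling round without knowing what the other scalers are about to start.

```python
def stateless_scaler_decision(queue_length: int, free_licenses: int, max_replica_count: int) -> int:
    """Pods one scaler would request, seeing only its own queue and the org-wide free count."""
    return min(queue_length, free_licenses, max_replica_count)


def total_requested(queue_lengths, free_licenses: int, max_replica_count: int) -> int:
    # Each scaler decides independently, so the same free licenses are
    # counted once per scaler instead of being shared between them.
    return sum(
        stateless_scaler_decision(q, free_licenses, max_replica_count)
        for q in queue_lengths
    )


# Five ScaledJobs, each with 100 queued jobs, and 100 free licenses in the org.
print(total_requested([100] * 5, free_licenses=100, max_replica_count=100))
```

In this example the five scalers together request 500 pods against 100 licenses, which matches the outcome of setting maxReplicaCount to the license count on each of them.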

HTH

jashan05 commented 3 months ago

The problem with setting maxReplicaCount = Licence Count is that if you have multiple (n) ScaledJobs with different demands, then KEDA is doing n x maxReplicaCount calls to Azure DevOps to check the queue.

What I think KEDA should do is make 2 API calls every time: check the queued jobs and the license details, and start the pods accordingly.
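The decision described above could look roughly like the sketch below. The function name is hypothetical, and it assumes the used count returned by Azure DevOps already covers the jobs that are currently running; it is not how the KEDA scaler is implemented.

```python
def pods_to_start(queued_jobs: int, used_count: int, total_license_count: int) -> int:
    """Number of new agent pods worth starting in one polling round."""
    # Free licenses can never be negative, even if usage is reported above the limit.
    free_licenses = max(total_license_count - used_count, 0)
    # Starting more pods than free licenses only reserves IPs for idle agents.
    return min(max(queued_jobs, 0), free_licenses)


print(pods_to_start(queued_jobs=100, used_count=0, total_license_count=1))
print(pods_to_start(queued_jobs=10, used_count=1, total_license_count=1))
```

For the case in the report (license count 1, 100 queued jobs) this starts a single pod, and no pod at all once that license is in use.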

E.g. the API call to get the license details is below (in Python):

import requests

def get_license_count(org, oauth_token):
    headers={'Authorization': f'Bearer {oauth_token}',
             'accept': 'application/json;api-version=7.0-preview',
             'Content-type': 'application/json'}
    org_license = requests.get(url=f'https://dev.azure.com/{org}/_apis/distributedtask/resourceusage?parallelismTag=Private&poolIsHosted=false&includeRunningRequests=true',
                             headers=headers)
    org_license.raise_for_status()
    data = org_license.json()  # the counts are in the JSON body, not attributes of the response
    return {'used_count': data['usedCount'], 'total_license_count': data['resourceLimit']['totalCount']}

tomkerkhove commented 2 months ago

I think that makes sense though. Any concerns of adding this call?

JorTurFer commented 2 months ago

Won't we hit the rate limit? Maybe making it an optional parameter would be a good idea.

Eldarrin commented 2 months ago

Optional will be good, and it will double the API calls, so rate limits are a concern if you have many scaler variants running.
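One way to limit the extra calls is to share a short-lived cached license lookup between scalers, sketched below. Everything here is an assumption for illustration: the class name, the 30-second TTL and the injected `fetch` callable are hypothetical, and nothing like this exists in KEDA today. The sketch is also not thread-safe.

```python
import time
from typing import Callable, Optional


class CachedLicenseLookup:
    """Caches the org-level license counts so many scalers share one API call."""

    def __init__(self, fetch: Callable[[], dict], ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch          # hypothetical callable that queries Azure DevOps
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[dict] = None
        self._fetched_at = 0.0

    def get(self) -> dict:
        now = self._clock()
        # Refresh only when nothing is cached yet or the cached value has expired.
        if self._value is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


calls = []

def fake_fetch() -> dict:
    # Stand-in for the real API call; records how often it is invoked.
    calls.append(1)
    return {"used_count": 40, "total_license_count": 100}

lookup = CachedLicenseLookup(fake_fetch, ttl_seconds=30.0)
for _ in range(8):  # e.g. eight scaler variants polling in the same round
    lookup.get()
print(len(calls))
```

The trade-off is that scalers may act on a license count that is up to one TTL old, so a shorter TTL means fresher data at the cost of more API calls.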

jashan05 commented 1 month ago

Hello everyone,

Could you please let me know how we can proceed on this? Are there any plans to add this functionality?

Best Regards Jashan

JorTurFer commented 6 days ago

Could you please let me know how we can proceed on this? Are there any plans to add this functionality?

We agreed on the approach, but for the implementation we probably need someone willing to contribute it :)