kedacore / keda

KEDA is a Kubernetes-based Event Driven Autoscaling component. It provides event driven scale for any container running in Kubernetes
https://keda.sh
Apache License 2.0

Job-based Service Bus Scaler scales to too many instances #4554

Closed eugen-nw closed 1 year ago

eugen-nw commented 1 year ago

Report

Say that I configure KEDA with minReplicaCount > 0. If I then send Messages to the Queue, KEDA creates as many new Pods as there are Messages in the Queue, with no regard for the count of Jobs that are already running, i.e. those created because minReplicaCount > 0.

Expected Behavior

Let's say that I configure KEDA to have 2 Jobs running permanently. If I send 5 Messages to the Queue, I'd expect KEDA to create only 3 new Pods. Instead it is creating 5 new Pods, so they match the count of Messages in the Queue. Below is the scaling behavior stated in the documentation at https://keda.sh/docs/2.9/concepts/scaling-jobs/.


Actual Behavior

Please see above.

Steps to Reproduce the Problem

  1. Configure a KEDA Job deployment in a manner similar to the script below.

    apiVersion: keda.sh/v1alpha1
    kind: ScaledJob
    metadata:
      name: aks-aci-boldiq-workforce-gozen-dev
      labels:
        app: aks-aci-boldiq-workforce-gozen-dev
        deploymentName: aks-aci-boldiq-workforce-gozen-dev
    spec:
      jobTargetRef:
        template:
          spec:
            containers:  # this section is identical to a "kind: Deployment"
            - image: <removed>
              imagePullPolicy: Always
              name: boldiq-workforce-gozen-dev
              resources:
                requests:
                  memory: 8G
                  cpu: 4
                limits:
                  memory: 8G
                  cpu: 4
              env:
              - name: KEDA_SERVICEBUS_CONNECTIONSTRING_GOZEN_DEV
                value: <removed>
            nodeSelector:
              kubernetes.io/os: windows
            tolerations:
            - key: virtual-kubelet.io/provider
              operator: Exists
            - key: azure.com/aci
              effect: NoSchedule
            imagePullSecrets:
            - name: docker-registry-secret
            nodeName: virtual-kubelet
      successfulJobsHistoryLimit: 0
      failedJobsHistoryLimit: 0
      pollingInterval: 1  # 1 second polling for max. responsiveness
      minReplicaCount: 2  # keep two instances running permanently to improve performance at low load
      maxReplicaCount: 10
      triggers:
      - type: azure-servicebus
        # metricType: Value  # The default AverageValue with messageCount: '1' starts up a new Container for each Message in the Queue. We want that for responsiveness.
        metadata:
          queueName: gozen-dev-requests
          connectionFromEnv: KEDA_SERVICEBUS_CONNECTIONSTRING_GOZEN_DEV
          messageCount: '1'
  2. Deploy the script and check the count of Pods created. It should be 2.

  3. Send N Messages into the Queue.

  4. Check the count of Pods created. It will be N + 2.
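
The arithmetic in these steps can be sketched as a small model (hypothetical helper names, not KEDA code; the counts come from the report above):

```python
# Model of the reported scaling arithmetic (helper names are hypothetical).

def observed_total_pods(min_replica_count: int, queued_messages: int) -> int:
    # Reported behavior: one new Job per queued Message is added on top
    # of the minReplicaCount Jobs that are already running.
    return min_replica_count + queued_messages

def reporter_expected_total_pods(min_replica_count: int, queued_messages: int) -> int:
    # Expected behavior per the report: already-running Jobs absorb
    # Messages first, so new Jobs are created only beyond the minimum.
    return max(min_replica_count, queued_messages)

print(observed_total_pods(2, 5))           # 7 (N + 2 with N = 5)
print(reporter_expected_total_pods(2, 5))  # 5
```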

Logs from KEDA operator

Please email edaroczy@boldiq.com for the .ZIP file.

KEDA Version

2.10.1

Kubernetes Version

1.25

Platform

Microsoft Azure

Scaler Details

Azure Service Bus

Anything else?

AKS 1.25.6, KEDA 2.10.2. The Containers run on the virtual-node-aci-linux virtual node.

JorTurFer commented 1 year ago

Hi, I believe the problem could be related to the short pollingInterval and the pod statuses. Since KEDA checks every second, the pods may not yet be in a running state, so KEDA thinks that jobs are missing. You can try increasing the pollingInterval or adding more states to pendingPodConditions.

eugen-nw commented 1 year ago

pollingInterval should have no relationship to the count of Pods that are already running. If I have 2 Pods already running and 5 Messages in the Queue, then I need the scale-out to fire up only 3 new Pods.

JorTurFer commented 1 year ago

Could you enable the debug logs and share them? The operator logs in debug expose the queue length and the current job count

eugen-nw commented 1 year ago

Please tell me how to go about enabling the debug logs. I'd be more than glad to do so.

Thank you, Eugen


JorTurFer commented 1 year ago

https://github.com/kedacore/keda/issues/4541#issuecomment-1553724189

eugen-nw commented 1 year ago

I have the bandwidth now to address this issue. What exactly would you like me to do - perhaps the steps below?

  1. Set minReplicaCount: 1 and a maxReplicaCount in the Job's deployment script, then deploy the script.
  2. Verify that the single minReplicaCount Pod has started up.
  3. Send a Message in the Queue and verify that there are 2 Pods running instead of one.
  4. Provide the current log of the keda-operator-* Pod that resides in the keda namespace.

The behavior I'd expect is that if I already have a Job running and I send a Message into the Queue, a second Job won't start up; the currently running Job will handle that one Message.

JorTurFer commented 1 year ago

3. Send a Message in the Queue and verify that there are 2 Pods running instead of one.

I think that this shouldn't happen.

The behavior I'd expect is that if I already have a Job running and I send a Message into the Queue, a second Job won't start up; the currently running Job will handle that one Message.

This is exactly the behavior I'd expect. Isn't this happening?

eugen-nw commented 1 year ago

@JorTurFer No, it does not happen. If I have one Pod running - as per the minReplicaCount setting - and then send a Message, I see a second Pod starting up.

I've tried it as well with minReplicaCount set to 2 and sending 5 Messages. The end result was 7 running Pods, whereas 5 would have been sufficient to process the 5 Messages.

JorTurFer commented 1 year ago

@zroubalik, @tomkerkhove, is this behavior intended and I'm missing something, or is this a bug? I have checked the e2e tests and they cover this scenario.

eugen-nw commented 1 year ago

I thought about this a bit more and it may rather be a feature than a bug. Let's say that I configure a ScaledJob to have a minReplicaCount of 4. By this I express my desire to always have 4 Jobs on stand-by, ready to receive Messages. 2 Messages pop up, so two of my initial 4 Jobs are busy processing them, and by doing so those two are no longer available. In response to that, the ScaledJob starts up two new Jobs immediately, in order to ensure that 4 Jobs will be available soon.
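
A minimal sketch of this replenishment reading (my paraphrase, with a hypothetical function name): minReplicaCount acts as a floor on idle Jobs, so any Job made busy by a Message is immediately replaced.

```python
def jobs_to_start(min_replica_count: int, idle_jobs: int, pending_messages: int) -> int:
    """Sketch of the 'stand-by pool' reading of minReplicaCount.

    Each pending Message occupies one Job; KEDA then tops the pool back
    up so min_replica_count Jobs stay idle. (Hypothetical model, not
    KEDA's actual code.)
    """
    idle_after = max(0, idle_jobs - pending_messages)  # idle Jobs left once Messages land
    overflow = max(0, pending_messages - idle_jobs)    # Messages no idle Job could take
    return (min_replica_count - idle_after) + overflow

print(jobs_to_start(4, 4, 2))  # 2: replace the two Jobs that became busy
print(jobs_to_start(2, 2, 5))  # 5: 3 for overflow Messages + 2 replacements (7 Pods total)
```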

Does this reasoning sound right to you guys?

JorTurFer commented 1 year ago

Does this reasoning sound right to you guys?

I thought so too; that's why I asked other teammates, because that's the behavior covered by the e2e tests. Maybe it's just a documentation gap, but I'm not sure.

eugen-nw commented 1 year ago

Thank you. Let's see what response we'll receive.

However, since there are tests covering the behavior, it should be safe to update the documentation. And the behavior is indeed present; I've tested it several times in the past two weeks and it works very well :-))

zroubalik commented 1 year ago

If you set minReplicaCount for a ScaledJob, then it is basically a minimum number of jobs (a base); anything else should trigger more jobs. See the linked PR/issue: https://github.com/kedacore/keda/issues/3426
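
In other words (a sketch of my reading of these semantics, with hypothetical names, capped by maxReplicaCount):

```python
def desired_total_jobs(min_replica_count: int, max_replica_count: int, queue_length: int) -> int:
    # minReplicaCount is a base: queued work scales *on top of* it,
    # bounded by maxReplicaCount. (Sketch of the described semantics,
    # not KEDA's actual implementation.)
    return min(max_replica_count, min_replica_count + queue_length)

print(desired_total_jobs(2, 10, 5))   # 7, matching the behavior observed above
print(desired_total_jobs(2, 10, 20))  # 10, clamped by maxReplicaCount
```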

eugen-nw commented 1 year ago

Thanks very much @zroubalik!

Would it be possible to enhance the documentation of minReplicaCount at https://keda.sh/docs/2.9/concepts/scaling-jobs/ to explain the scale-out behavior dictated by the minReplicaCount parameter? In its current state, the documentation explains only that minReplicaCount Jobs will be created by default.


JorTurFer commented 1 year ago

Would it be possible to enhance the documentation of minReplicaCount at keda.sh/docs/2.9/concepts/scaling-jobs to explain the scale-out behavior dictated by the minReplicaCount parameter?

It'd be amazing because it's true that it could be a bit confusing. Would you open a PR in docs with the change?

eugen-nw commented 1 year ago

I'll give it a try. My first open source contribution...

zroubalik commented 1 year ago

I'll give it a try. My first open source contribution...

It's never too late to start 😄 Just fork the docs repo, create a new branch, add the information, and submit the PR. You might take some info or diagrams from the PR/issue I linked, if you find that useful. Thanks 🙏

eugen-nw commented 1 year ago

Done: https://github.com/kedacore/keda-docs/pull/1144

LewisJackson1 commented 1 year ago

@JorTurFer @zroubalik we were just reading the docs kindly added by @eugen-nw and this really confused me. I can understand that someone may want this behaviour, but it feels like the expected behaviour here:

Let's say that I configure KEDA to have 2 Jobs running permanently. If I send 5 Messages to the Queue, I'd expect KEDA to create only 3 new Pods. Instead it is creating 5 new Pods, so they match the count of Messages in the Queue.

is going to be a more common use case, or at least desired by some users.

Scaling out too much will cost us a considerable amount of money as we're processing videos on GPU Nodes.

eugen-nw commented 1 year ago

You can limit the max. desired/allowed count of containers in the .yaml script. That will limit your expenses. In your example you would get 5 Jobs created to handle your 5 Messages, plus 2 other Jobs on stand-by to handle whatever may come in - all of this once the 5 new Pods are up and functional.

My scale-out scenario has to accommodate sudden bursts in demand. The current operation mode enables me to have N containers (more or less) ready to immediately handle a burst.

LewisJackson1 commented 1 year ago

You can limit the max. desired / allowed count of containers in the .yaml script.

No matter what we set the max to we're always going to be spinning up containers for no reason. If two items come into our queue we don't need to spin up two additional Jobs with their own GPU Nodes and pay the minimum charge for that when we have two Jobs ready for them. If we set the maximum to the same as the minimum this wouldn't happen but we also would not be autoscaling.

My scale-out scenario has to accommodate sudden bursts in demand. The current operation mode enables me to have N containers (more or less) ready to immediately handle a burst.

I understand that this is a desirable use case for you and some others, but I doubt it's what most people would think the behaviour is when they see this parameter (which is why this issue was created).

JorTurFer commented 1 year ago

Hi @LewisJackson1. So, you would like to have minReplicaCount instances always (let's say 2, for example), but when jobs arrive you want one of those 2 to handle the job, without extra instances being spun up as ready-to-work, right? In that case, you want pre-warmed instances for the first jobs, but for the next jobs, is waiting acceptable? I mean, you already have some ready pods to process those jobs when there isn't any pending job. I'm probably missing something important in the middle, because I don't get your use case :(

If waiting is not a problem and you prefer to save as much money as possible, you can set minReplicaCount: 0 (or just not set anything) and you will have 0 pending jobs

LewisJackson1 commented 1 year ago

So, you would like to have minReplicaCount instances always (let's say 2, for example), but when jobs arrive you want one of those 2 to handle the job, without extra instances being spun up as ready-to-work, right?

Hello @JorTurFer, I'm not sure that I understand the question here, apologies!

In that case, you want pre-warmed instances for the first jobs, but for next jobs, is waiting acceptable for them? I mean, you already have some ready pods to process those jobs when there isn't any pending job.

Yeah, if additional jobs came in after the minimum replicas then they would have to wait for scaling and that's acceptable.

I guess the simplest way that I can think of to illustrate this is to compare the behaviour to a ScaledObject. If we configure a ScaledObject to track an SQS queue with 2 minimum replicas and 2 items enter the queue, the ScaledObject does not spin up 2 more Pods - is that correct?

We're looking at migrating a queue processor from ScaledObject to ScaledJob and I'm just finding this inconsistency between the two defined behaviours quite weird. I think that we could work around this with a static Deployment that would always be warm, then set the ScaledJob to track additional queue items?
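
The ScaledObject comparison above can be sketched as the usual HPA-style AverageValue math (hypothetical names; a simplified model, not KEDA's actual code):

```python
from math import ceil

def scaledobject_desired_replicas(queue_length: int, messages_per_replica: int,
                                  min_replicas: int, max_replicas: int) -> int:
    # HPA-style AverageValue math (sketch): replicas proportional to load,
    # clamped to [min_replicas, max_replicas]. No stand-by buffer is added.
    raw = ceil(queue_length / messages_per_replica)
    return max(min_replicas, min(max_replicas, raw))

print(scaledobject_desired_replicas(2, 1, 2, 10))  # 2: the minimum replicas absorb the items
print(scaledobject_desired_replicas(5, 1, 2, 10))  # 5: scaling only beyond the minimum
```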

JorTurFer commented 1 year ago

We're looking at migrating a queue processor from ScaledObject to ScaledJob and I'm just finding this inconsistency between the two defined behaviours quite weird.

Yes, you are right, they aren't consistent, but they aren't comparable either IMHO. In a ScaledObject, the workload can process multiple items, so right after finishing one message the workload starts on the next without any cooldown. In a ScaledJob, your job usually takes a single message and ends, so after finishing the current message the pod terminates and KEDA spins up another job, which isn't instant. That's why the minimum replicas for a ScaledJob is the minimum number of replicas ready to work (idle).

This is an interesting discussion, and maybe the best place is in a GH discussion, where other maintainers and any other community folk can give their 2 cents. Would you open a discussion about this?

In any case, to solve your use case, you could create your own REST API (or gRPC server) with the business logic you want, and use the Metrics API Scaler (or External Scaler) to connect KEDA to it. With this approach, you could set minReplicaCount: 0 and have your server provide the desired amount of instances at each moment.
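
A hypothetical sketch of that approach: a tiny HTTP endpoint KEDA's Metrics API scaler could poll. The endpoint name, port, warm-pool size, and the stubbed queue length are all assumptions for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

WARM_POOL = 2  # replicas kept warm elsewhere, e.g. by a static Deployment (assumption)

def desired_instances(queue_length: int, warm_pool: int = WARM_POOL) -> int:
    # Only ask KEDA for the replicas the warm pool cannot absorb.
    return max(0, queue_length - warm_pool)

def current_queue_length() -> int:
    return 5  # stub; a real server would query Service Bus here

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"desiredReplicas": desired_instances(current_queue_length())})
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body.encode())

# To serve: HTTPServer(("0.0.0.0", 8080), MetricsHandler).serve_forever()
```

The metrics-api trigger would then point its url at this endpoint and read the value from the JSON response (check the scaler docs for the exact metadata field names).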

eugen-nw commented 1 year ago

You may want to give the Job scale-out method some time to settle, and spend some time experimenting with both scale-out alternatives. Use Linux containers (vs. Windows) for faster Pod start-up times. Jobs handle even very long processing cleanly, should that be a concern. With ScaledObject scale-out you'll pay for unused capacity. The best scenario is to have no Pods running 24x7 and use ScaledJob to fire up Pods whenever necessary, should that setup accommodate your use cases.

I operate in the Azure cloud. Taking scale-out to the next level, I run no Pods in the Azure Kubernetes cluster but delegate them to the Azure Container Instances service via a Virtual Kubelet. Thus we pay only for each second a Pod runs, and we can scale out indefinitely.

LewisJackson1 commented 1 year ago

This is an interesting discussion, and maybe the best place is in a GH discussion, where other maintainers and any other community folk can give their 2 cents. Would you open a discussion about this?

I've opened a discussion here: https://github.com/kedacore/keda/discussions/4885

In a ScaledJob, your job usually takes a single message and ends, so after finishing the current message the pod terminates and KEDA spins up another job, which isn't instant. That's why the minimum replicas for a ScaledJob is the minimum number of replicas ready to work (idle).

I feel like it is quite an opinionated stance for the scaler to assume that the user wants a buffer here because their Jobs are slow to start up/terminate. I don't think there's that much difference between a Job and a persistent Pod; both have start-up latency, so the over-provisioning behaviour could be useful for ScaledObject too. I can understand that this might be desirable for some people, and it'd be great to have this behaviour available for both ScaledJob and ScaledObject as an opt-in/opt-out.