With pendingPodConditions Keda counts Failed Jobs as Pending

sanjinp commented 10 months ago

Report

We have:

Long-running ScaledJobs
Triggered when a single message arrives in the SQS queue (queueLength: "1")
A short polling interval (5s), which caused KEDA to spin up multiple Pods because the first ones didn't have time to delete the message from the queue

Our solution was to add pendingPodConditions:
```
scalingStrategy:
  strategy: "accurate"
  pendingPodConditions:
    - "Ready"
    - "PodScheduled"
```
However, we have noticed that if some of the Jobs fail (reach backoffLimit), KEDA starts counting their Pods as Pending and never triggers new Jobs.

Expected Behavior

KEDA should trigger additional Pods even when some of the previous Jobs have failed, and should not count the failed Jobs as Pending.

Actual Behavior

KEDA is counting Failed Jobs as Pending.

Steps to Reproduce the Problem

Create Jobs that fail with a non-zero exit code
After backoffLimit is reached, the Failed Job is not cleared immediately, so KEDA counts it as Pending
Check the KEDA operator logs, which show the same behaviour as the output pasted in the Logs section
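
A manifest along the following lines might reproduce the steps above. It is only a sketch, not the reporter's actual configuration: the ScaledJob name, container name, image, command, queue URL, and region are placeholders, the backoffLimit value is assumed, and trigger authentication is omitted.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: failing-job-repro            # hypothetical name
spec:
  pollingInterval: 5                 # seconds, as in the report
  jobTargetRef:
    backoffLimit: 1                  # assumed value; the Job is marked Failed once retries are used up
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: worker             # hypothetical container
            image: busybox           # placeholder image
            command: ["sh", "-c", "exit 1"]   # always exits with a non-zero code
  scalingStrategy:
    strategy: "accurate"
    pendingPodConditions:
      - "Ready"
      - "PodScheduled"
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: "https://sqs.<region>.amazonaws.com/<account-id>/<queue-name>"   # placeholder
        queueLength: "1"
        awsRegion: "<region>"        # placeholder
```

With a message waiting in the queue, the first Job should fail and remain in the namespace, at which point the operator logs can be compared with the ones in the Logs section.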

Logs from KEDA operator

2023-12-04T12:52:17Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "object-detection", "scaledJob.Namespace": "prod", "Number of running Jobs": 1}
2023-12-04T12:52:17Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "object-detection", "scaledJob.Namespace": "prod", "Number of pending Jobs ": 1}
2023-12-04T12:52:17Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "object-detection", "scaledJob.Namespace": "prod", "Effective number of max jobs": 0}
2023-12-04T12:52:17Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "object-detection", "scaledJob.Namespace": "prod", "Number of jobs": 0}
2023-12-04T12:52:17Z    INFO    scaleexecutor   Created jobs    {"scaledJob.Name": "object-detection", "scaledJob.Namespace": "prod", "Number of jobs": 0}

and from kubectl:

object-detection-d7wzn-kdf9t                           0/1     Error     0          10m
object-detection-d7wzn-mrgth                           1/1     Running   0          7m56s

KEDA Version

2.11.1

Kubernetes Version

1.24

Platform

Amazon Web Services

Scaler Details

AWS SQS

Anything else?

Manually clearing the Jobs lets KEDA trigger new Pods again based on the SQS queue message count. As a workaround, we have set:

ttlSecondsAfterFinished: 30

This makes sure that Jobs are cleared whether they succeeded or failed, but it adds another 30s of scaling delay whenever it happens, which slows down our system's reaction time.
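
The report does not say where the field was set. Since jobTargetRef takes a Kubernetes Job spec, one plausible placement is directly under it, as in this fragment. The container name and image are placeholders, and the rest of the ScaledJob is assumed to be unchanged.

```yaml
# Fragment of a ScaledJob; fields not shown are assumed unchanged.
spec:
  jobTargetRef:
    ttlSecondsAfterFinished: 30      # Kubernetes deletes the Job 30s after it finishes (Complete or Failed)
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: worker                   # hypothetical container
            image: example/worker:latest   # placeholder image
```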

Question: when pendingPodConditions is used, would it be possible to also check the Pod status [Running|Failed|...] alongside the other conditions and, if any Pods are Failed, deduct them from the Pending Pod count? Or should that be done somewhere else? Thanks!

stale[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 months ago

This issue has been automatically closed due to inactivity.

sanjinp commented 3 months ago

This one hasn't been addressed yet. Unfortunately, I cannot reopen it.

kedacore / keda