kedacore / keda

KEDA is a Kubernetes-based Event Driven Autoscaling component. It provides event driven scale for any container running in Kubernetes
https://keda.sh
Apache License 2.0

With pendingPodConditions Keda counts Failed Jobs as Pending #5264

Closed sanjinp closed 3 months ago

sanjinp commented 10 months ago

Report

We have:

Expected Behavior

KEDA should trigger additional Pods even if some previous Jobs have failed, and should not count those failed Jobs as Pending.

Actual Behavior

KEDA counts Failed Jobs as Pending.

Steps to Reproduce the Problem

  1. Create Jobs that fail with a non-zero exit code.
  2. After backoffLimit is exhausted, the Failed Job is not cleared immediately, so KEDA counts it as Pending.
  3. Check the KEDA operator logs, which show the same behaviour as the logs pasted below.

Logs from KEDA operator

2023-12-04T12:52:17Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "object-detection", "scaledJob.Namespace": "prod", "Number of running Jobs": 1}
2023-12-04T12:52:17Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "object-detection", "scaledJob.Namespace": "prod", "Number of pending Jobs ": 1}
2023-12-04T12:52:17Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "object-detection", "scaledJob.Namespace": "prod", "Effective number of max jobs": 0}
2023-12-04T12:52:17Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "object-detection", "scaledJob.Namespace": "prod", "Number of jobs": 0}
2023-12-04T12:52:17Z    INFO    scaleexecutor   Created jobs    {"scaledJob.Name": "object-detection", "scaledJob.Namespace": "prod", "Number of jobs": 0}

and from kubectl:

object-detection-d7wzn-kdf9t                           0/1     Error     0          10m
object-detection-d7wzn-mrgth                           1/1     Running   0          7m56s

KEDA Version

2.11.1

Kubernetes Version

1.24

Platform

Amazon Web Services

Scaler Details

AWS SQS

Anything else?

Manually clearing the Jobs lets KEDA trigger new Pods based on the SQS queue message count. As a workaround we have set:

ttlSecondsAfterFinished: 30

This ensures that Jobs are cleared whether they succeeded or failed, but it adds an extra 30s of scaling delay whenever this happens, which slows our system's reaction time.

Question: when pendingPodConditions is used, could KEDA also check the Pod status [Running|Failed|...] alongside the other conditions and, if any Pods are Failed, deduct them from the Pending Pod count? Or should that be handled somewhere else? Thanks!
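To make the question concrete, here is a minimal sketch in Go of the counting rule being asked for. It is not KEDA's actual code: the `Pod` type, `allConditionsMet`, and `countPending` are hypothetical names, the condition list `Ready`/`PodScheduled` is an assumed example of a pendingPodConditions setting, and the Pod phases are guessed from the kubectl STATUS column above.

```go
package main

import "fmt"

// PodPhase mirrors the phase strings Kubernetes reports for a Pod.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// Pod is a simplified, hypothetical stand-in for a Kubernetes Pod.
type Pod struct {
	Name       string
	Phase      PodPhase
	Conditions map[string]bool // condition type -> true when its status is "True"
}

// allConditionsMet reports whether every required condition is true on the pod.
func allConditionsMet(p Pod, required []string) bool {
	for _, c := range required {
		if !p.Conditions[c] {
			return false
		}
	}
	return true
}

// countPending counts pods that have not yet met the required conditions,
// skipping pods in a terminal phase (Failed or Succeeded). The skip is the
// behaviour proposed in this issue, not something KEDA is confirmed to do.
func countPending(pods []Pod, required []string) int {
	pending := 0
	for _, p := range pods {
		if p.Phase == PodFailed || p.Phase == PodSucceeded {
			continue // a finished pod can never become ready, so it is not pending
		}
		if !allConditionsMet(p, required) {
			pending++
		}
	}
	return pending
}

func main() {
	// Assumed example of a pendingPodConditions list.
	required := []string{"Ready", "PodScheduled"}
	pods := []Pod{
		{Name: "object-detection-d7wzn-kdf9t", Phase: PodFailed,
			Conditions: map[string]bool{"PodScheduled": true, "Ready": false}},
		{Name: "object-detection-d7wzn-mrgth", Phase: PodRunning,
			Conditions: map[string]bool{"PodScheduled": true, "Ready": true}},
	}
	fmt.Println("pending pods:", countPending(pods, required))
}
```

Under these assumptions the Failed Pod is left out of the pending count, so only Pods that could still become ready would hold back new Jobs.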

stale[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 months ago

This issue has been automatically closed due to inactivity.

stale[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 3 months ago

This issue has been automatically closed due to inactivity.

sanjinp commented 3 months ago

This one hasn't been addressed yet; unfortunately, I cannot reopen it.