kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0

Workload can get stuck indefinitely when using external AdmissionCheck #3543

Open mimowo opened 1 week ago

mimowo commented 1 week ago

What happened:

A workload can get stuck forever with Evicted=True if the external controller sets the state of the admission check to Retry while Evicted=True.

The scenario does not seem to happen consistently, but this is the root cause of the issue here: https://github.com/kubernetes-sigs/kueue/discussions/3365#discussioncomment-11259602. As a consequence the workload could not get re-admitted.

The issue has a workaround at the level of the external admission check: guard against setting Retry for the AC state while Evicted=True, as here.
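The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked workaround: the `Condition` type is a local stand-in for the Kubernetes condition struct, and `isConditionTrue` and `canSetRetry` are hypothetical helper names.

```go
package main

import "fmt"

// Condition mirrors only the fields of a Kubernetes status condition that
// this sketch needs. It is a local stand-in, not the real API type.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// isConditionTrue reports whether the named condition is present and True.
func isConditionTrue(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false
}

// canSetRetry is a hypothetical guard for an external admission check
// controller: it refuses the Retry transition while Evicted=True.
func canSetRetry(conds []Condition) bool {
	return !isConditionTrue(conds, "Evicted")
}

func main() {
	evicting := []Condition{{Type: "Evicted", Status: "True"}}
	fmt.Println(canSetRetry(evicting)) // false: eviction still in progress

	reserved := []Condition{{Type: "QuotaReserved", Status: "True"}}
	fmt.Println(canSetRetry(reserved)) // true: no Evicted=True condition
}
```

In an actual controller the check would run right before the status update that writes Retry, so that the update is skipped (and the request requeued) while the eviction is still in flight.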

After the external controller sets Retry, Kueue flips the state back to Pending, but the workload remains stuck with Evicted=True forever. This is the final status:

Status:
  Admission Checks:
    Last Transition Time:  2024-11-14T18:09:17Z
    Message:               The workload is pending on Prefetch Admission Check
    Name:                  custom-ac
    State:                 Pending
  Conditions:
    Last Transition Time:  2024-11-14T18:08:57Z
    Message:               The workload has failed admission checks
    Observed Generation:   1
    Reason:                Pending
    Status:                False
    Type:                  QuotaReserved
    Last Transition Time:  2024-11-14T18:08:57Z
    Message:               At least one admission check is false
    Observed Generation:   1
    Reason:                AdmissionCheck
    Status:                True
    Type:                  Evicted
    Last Transition Time:  2024-11-14T18:08:57Z
    Message:               The workload backoff was finished
    Observed Generation:   1
    Reason:                BackoffFinished
    Status:                True
    Type:                  Requeued
Events:
  Type     Reason                      Age   From                       Message
  ----     ------                      ----  ----                       -------
  Normal   QuotaReserved               13m   kueue-admission            Quota reserved in ClusterQueue cluster-queue, wait time since queued was 0s
  Normal   EvictedDueToAdmissionCheck  13m   kueue-workload-controller  At least one admission check is false
  Warning  Pending                     13m   kueue-admission            The workload has failed admission checks

Some observations: the workload gets re-admitted when we manually set Evicted=False. I expect Kueue should do this on its own.
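For diagnostics, the combination seen in the final status above (Evicted=True, QuotaReserved=False, admission checks back in Pending) could be detected with a small predicate. This is only a heuristic sketch: the types are local stand-ins for the workload status, `looksStuck` is a hypothetical name, and a workload that is legitimately mid-eviction may match briefly, so a real check would also need to look at how long the state has persisted.

```go
package main

import "fmt"

// Local stand-ins for the parts of the workload status shown in the report.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

type AdmissionCheckState struct {
	Name  string
	State string // e.g. "Pending" or "Retry"
}

type WorkloadStatus struct {
	Conditions      []Condition
	AdmissionChecks []AdmissionCheckState
}

// conditionStatus returns the status of the named condition, or "" if absent.
func conditionStatus(conds []Condition, condType string) string {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status
		}
	}
	return ""
}

// looksStuck is a hypothetical heuristic: Evicted=True, QuotaReserved=False,
// and every admission check already back in Pending.
func looksStuck(s WorkloadStatus) bool {
	if conditionStatus(s.Conditions, "Evicted") != "True" {
		return false
	}
	if conditionStatus(s.Conditions, "QuotaReserved") != "False" {
		return false
	}
	for _, ac := range s.AdmissionChecks {
		if ac.State != "Pending" {
			return false
		}
	}
	return true
}

func main() {
	// Values copied from the final status in the report.
	reported := WorkloadStatus{
		Conditions: []Condition{
			{Type: "QuotaReserved", Status: "False"},
			{Type: "Evicted", Status: "True"},
			{Type: "Requeued", Status: "True"},
		},
		AdmissionChecks: []AdmissionCheckState{{Name: "custom-ac", State: "Pending"}},
	}
	fmt.Println(looksStuck(reported)) // true
}
```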

What you expected to happen:

I think Kueue should be able to recover from the situation on its own, and finalize eviction of the workload, allowing it to get re-admitted.

How to reproduce it (as minimally and precisely as possible):

More details in the issue or @leipanhz can share, but basically the external AC was setting Retry while Kueue was evicting the workload. I think we should be able to reproduce this with integration tests.

mimowo commented 1 week ago

cc @mbobrovskyi @PBundyra

mszadkow commented 1 week ago

/assign

leipanhz commented 1 week ago

@mimowo Thanks for creating a ticket tracking this.

I observed some unexpected behavior after applying the workaround, so I am commenting here. In the custom controller, the requeue interval after setting "Retry" is 5 seconds; however, the log shows the reconciler trying to set the AC status from Pending to Retry 28 times within 2 seconds. It seems that although Kueue evicts the workload once the AC is in the Retry state, it un-evicts it and reserves quota right away, so the state goes back to Pending, and then the reconciler sets it back to Retry... It looks like a race condition.