Open mimowo opened 1 week ago
cc @mbobrovskyi @PBundyra
/assign
@mimowo Thanks for creating a ticket tracking this.
I observed some unexpected behavior after applying the workaround, so I am commenting here. In the custom controller, the requeue interval after setting the state to "Retry" is 5 seconds; however, the log shows the reconciler trying to set the AC status from Pending to Retry 28 times within 2 seconds. It seems that although Kueue evicts the workload once the AC is in the Retry state, it un-evicts it and reserves quota right away, so the status goes back to Pending. Then the reconciler sets it back to Retry, and the cycle repeats. It looks like a race condition.
What happened:
A workload can get stuck forever with Evicted=True if the external controller sets the state of the admission check to Retry while Evicted=True.
The scenario does not seem to happen consistently, but this is the root cause of the issue here: https://github.com/kubernetes-sigs/kueue/discussions/3365#discussioncomment-11259602. As a consequence the workload could not get re-admitted.
The issue has a workaround at the level of the external admission check controller: guard against setting the AC state to Retry while Evicted=True, as here.
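The guard described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the linked workaround and not Kueue's actual API: `Condition`, `CheckState`, `isEvicted` and `nextCheckState` are hypothetical stand-ins defined locally so the example runs without the Kueue module. A real controller would read the workload's status conditions and admission check states from the Workload object instead.

```go
package main

import "fmt"

// Condition is a hypothetical stand-in for a workload status condition.
type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// CheckState is a hypothetical stand-in for an admission check state.
type CheckState string

const (
	CheckStatePending CheckState = "Pending"
	CheckStateRetry   CheckState = "Retry"
)

// isEvicted reports whether the conditions contain Evicted=True.
func isEvicted(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "Evicted" && c.Status == "True" {
			return true
		}
	}
	return false
}

// nextCheckState returns the state the external controller should write.
// When the controller wants Retry but the workload is still Evicted=True,
// it keeps the current state, so the caller can requeue and try again later.
func nextCheckState(current CheckState, wantRetry bool, conds []Condition) CheckState {
	if !wantRetry {
		return current
	}
	if isEvicted(conds) {
		// Guard: do not set Retry while eviction is still in progress.
		return current
	}
	return CheckStateRetry
}

func main() {
	evicted := []Condition{{Type: "Evicted", Status: "True"}}
	notEvicted := []Condition{{Type: "Evicted", Status: "False"}}

	fmt.Println(nextCheckState(CheckStatePending, true, evicted))
	fmt.Println(nextCheckState(CheckStatePending, true, notEvicted))
}
```

Keeping the decision in a pure function makes the guard easy to unit test separately from the reconcile loop. It does not by itself explain the rapid Pending/Retry flipping reported in the comment above.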
Then Kueue flips the Retry to Pending, but it is stuck with Evicted=True forever. This is the final status:
Some observations: the workload gets re-admitted when we manually set Evicted=False. I expect Kueue should do it on its own.
What you expected to happen:
I think Kueue should be able to recover from the situation on its own, and finalize eviction of the workload, allowing it to get re-admitted.
How to reproduce it (as minimally and precisely as possible):
More details in the issue or @leipanhz can share, but basically the external AC was setting Retry while Kueue was evicting the workload. I think we should be able to reproduce this with integration tests.