Open asm582 opened 1 year ago
Are you sure the mismatch is the problem? The fact that this line, `Insufficient number of Running and Completed pods, minimum=1, running=0, completed=0.`, is showing `completed=0` is why the completed AppWrapper is being requeued. This was the issue we had yesterday. So are you seeing that when the statuses are not the same, the AppWrapper loses the count of running or completed pods?
I think the jobs here are short (`sleep 10`). When they complete very quickly and at scale, the status updates are not propagated accurately, causing the preemption thread to take down AppWrappers that have finished execution.
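To make the suspected failure mode concrete, here is a minimal sketch of the kind of check the log message implies. It is not the controller's actual code: `PodCounts` and `should_preempt` are hypothetical names, and the rule "preempt when running plus completed is below the minimum" is an assumption inferred from the wording of the log line.

```python
from dataclasses import dataclass


@dataclass
class PodCounts:
    """Pod counts as recorded in the AppWrapper status (hypothetical model)."""
    running: int
    completed: int


def should_preempt(min_available: int, counts: PodCounts) -> bool:
    """Assumed rule: preempt when running + completed pods fall below the minimum."""
    return counts.running + counts.completed < min_available


# Counts seen before the completion of a short-lived pod has been recorded.
stale = PodCounts(running=0, completed=0)
# Counts once the completion has been recorded.
fresh = PodCounts(running=0, completed=1)

print(should_preempt(1, stale))  # True: the finished AppWrapper looks under-provisioned
print(should_preempt(1, fresh))  # False: the completed pod satisfies the minimum
```

Under this assumption, the decision depends entirely on how current the recorded counts are, which would explain why very short jobs at scale are the ones that get hit.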
When `queuejobstate` and `State` do not agree, any AW that has completed pods is preempted. Finding completed pods tied to an AW is flaky at scale when thousands of AWs are preempted. Here is an example: