Closed · mszadkow closed this issue 3 months ago
@tenzen-y this is related to my recent comment https://github.com/kubeflow/training-operator/commit/64e39f2aa8ca51112dc02a93db85a5391cdf6cd7#r145851795
It only reproduces if ttlSecondsAfterFinished is set.
/assign
/triage accepted
@tenzen-y: The label(s) triage/accepted cannot be applied, because the repository doesn't have them.
/triage accepted
/remove-lifecycle needs-triage
Preempted TFJob status should be suspended and cleaned up.
@mszadkow I guess that the expected behavior is only that the Jobs are cleaned up, right? The training-operator does not suspend Jobs that have reached their TTL.
@tenzen-y this is related to my recent comment 64e39f2#r145851795
@alculquicondor The training-operator TTL mechanism is not related to the Suspension mechanism. So, the above comment is not related to this bug, right?
They shouldn't be related, but they are... that's the bug.
When suspending, we call this function:
If there is no TTL, it returns quickly. Otherwise, it goes on to check the completion time, which a suspended job doesn't have. I guess we simply shouldn't be calling CleanupJob for suspend? It's not doing anything in general.
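For reference, the TTL path being described behaves roughly like the following. This is a paraphrased, self-contained sketch with stand-in parameters, not the actual CleanupJob code, and the error message is only illustrative:

// Paraphrased sketch of the TTL path in CleanupJob (illustrative, not the
// actual training-operator code). With a TTL set, the function needs a
// completion time to decide when the job expires; a suspended job never has
// one, so the call errors out instead of being a no-op.
package main

import (
	"fmt"
	"time"
)

func cleanupJobSketch(ttlSeconds *int32, completionTime *time.Time) error {
	if ttlSeconds == nil {
		// No TTL configured: return quickly, nothing to clean up.
		return nil
	}
	if completionTime == nil {
		// A suspended (preempted) job has no completion time, so with a TTL
		// set we land here on every reconcile and cleanup never happens.
		return fmt.Errorf("job completion time is nil, cannot cleanup")
	}
	if time.Since(*completionTime) >= time.Duration(*ttlSeconds)*time.Second {
		// TTL elapsed: the real controller deletes the job at this point.
		return nil
	}
	return nil
}

func main() {
	ttl := int32(30)
	// TTL set, but no completion time: this is the state a suspended job is in.
	fmt.Println(cleanupJobSketch(&ttl, nil))
}

Running this with a TTL set and a nil completion time reproduces the error path that a suspended job hits on every reconcile.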
/remove-label lifecycle/needs-triage
If there is no TTL, it returns quickly. Otherwise, it goes on to check the completion time, which a suspended job doesn't have. I guess we simply shouldn't be calling CleanupJob for suspend? It's not doing anything in general.
That's a good point. I believe that we should just return quickly like this:
func (jc *JobController) CleanupJob(runPolicy *apiv1.RunPolicy, jobStatus apiv1.JobStatus, job interface{}) error {
	[...]
	ttl := runPolicy.TTLSecondsAfterFinished
	if ttl == nil || commonutil.IsSuspended(jobStatus) {
		return nil
	}
	[...]
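For context, commonutil.IsSuspended presumably just reports whether the job status carries a Suspended condition. Below is a minimal stand-in sketch of the proposed guard; the simplified types and the "Suspended"/"True" condition check are assumptions for illustration, not the real training-operator API:

// Minimal stand-in sketch of the proposed early return (simplified types,
// not the real API; IsSuspended is assumed to look for a "Suspended"
// condition with status "True" on the job).
package main

import "fmt"

type jobCondition struct {
	Type   string
	Status string // "True" / "False"
}

type jobStatus struct {
	Conditions     []jobCondition
	CompletionTime *string // nil for suspended (preempted) jobs
}

func isSuspended(s jobStatus) bool {
	for _, c := range s.Conditions {
		if c.Type == "Suspended" && c.Status == "True" {
			return true
		}
	}
	return false
}

func cleanupJob(ttlSeconds *int32, s jobStatus) error {
	// Proposed guard: nothing to do without a TTL, and a suspended job is
	// not finished, so there is nothing to clean up either.
	if ttlSeconds == nil || isSuspended(s) {
		return nil
	}
	if s.CompletionTime == nil {
		return fmt.Errorf("job completion time is nil, cannot cleanup")
	}
	// ... TTL-based deletion would happen here ...
	return nil
}

func main() {
	ttl := int32(30)
	suspended := jobStatus{Conditions: []jobCondition{{Type: "Suspended", Status: "True"}}}
	fmt.Println(cleanupJob(&ttl, suspended)) // <nil>: no spurious error for suspended jobs
}

With the extra suspended check, a preempted job exits CleanupJob without hitting the nil completion time error, while TTL-based cleanup still applies once the job actually finishes.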
What happened?
Preemption of a TFJob is not handled properly and the job is never cleaned up. That might prevent other jobs from being admitted when resources are scarce.
To reproduce: create a TFJob first, then a second Job with a higher priority to trigger preemption. Both jobs should request the same limited resources, enough for only one of them at a time. The TFJob has to have runPolicy.ttlSecondsAfterFinished set above 0. Cleanup never happens because the completion time is nil: https://github.com/kubeflow/training-operator/blame/6900714c39dcb4991b6cf3bc793e73fc7386e478/pkg/controller.v1/common/job.go#L429
What did you expect to happen?
Preempted TFJob status should be suspended and cleaned up.
Environment
Kubernetes version: v1.30.0
Training Operator version: kubeflow/training-operator:v1-855e096
Training Operator Python SDK version:
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.