kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Broken preemption on TFJob with non-default runPolicy.ttlSecondsAfterFinished #2239

Closed mszadkow closed 3 months ago

mszadkow commented 3 months ago

What happened?

Preemption of a TFJob is not handled properly and the job is never cleaned up. This can prevent other jobs from being admitted when resources are scarce.

│ Status:                                                                                                                                                        │
│   Conditions:                                                                                                                                                  │
│     Last Transition Time:  2024-08-28T10:21:23Z                                                                                                                │
│     Last Update Time:      2024-08-28T10:21:23Z                                                                                                                │
│     Message:               TFJob default/tensorflow-dist-mnist2 is running.                                                                                    │
│     Reason:                TFJobRunning                                                                                                                        │
│     Status:                True                                                                                                                                │
│     Type:                  Running                                                                                                                             │
│   Replica Statuses:                                                                                                                                            │
│     PS:                                                                                                                                                        │
│       Active:  1                                                                                                                                               │
│     Worker:                                                                                                                                                    │
│       Active:  1                                                                                                                                               │
│   Start Time:  2024-08-28T10:21:22Z  

│   Normal  CreatedWorkload          88s   kubeflow.org/tfjob-kueue-controller  Created Workload: default/tfjob-tensorflow-dist-mnist2-c6ba7                     │
│   Normal  Started                  88s   kubeflow.org/tfjob-kueue-controller  Admitted by clusterQueue cluster-queue                                           │
│   Normal  SuccessfulCreatePod      88s   tfjob-controller                     Created pod: tensorflow-dist-mnist2-worker-0                                     │
│   Normal  SuccessfulCreateService  88s   tfjob-controller                     Created service: tensorflow-dist-mnist2-worker-0                                 │
│   Normal  SuccessfulCreatePod      88s   tfjob-controller                     Created pod: tensorflow-dist-mnist2-ps-0                                         │
│   Normal  SuccessfulCreateService  88s   tfjob-controller                     Created service: tensorflow-dist-mnist2-ps-0                                     │
│   Normal  Stopped                  64s   kubeflow.org/tfjob-kueue-controller  Preempted to accommodate a workload (UID: c8eb139e-b434-41b3-9b30-4292ec91f36a)  │
│ due to prioritization in the ClusterQueue                                                                                                                      │
│   Normal  SuccessfulDeletePod      64s   tfjob-controller                     Deleted pod: tensorflow-dist-mnist2-worker-0                                     │
│   Normal  SuccessfulDeleteService  64s   tfjob-controller                     Deleted service: tensorflow-dist-mnist2-worker-0                                 │
│   Normal  SuccessfulDeletePod      64s   tfjob-controller                     Deleted pod: tensorflow-dist-mnist2-ps-0                                         │
│   Normal  SuccessfulDeleteService  64s   tfjob-controller                     Deleted service: tensorflow-dist-mnist2-ps-0     

In order to reproduce: create a TFJob first, then a second Job with a higher priority to trigger a preemption. Both jobs should request the same limited resources, which are only enough for one of them at a time. The TFJob must have runPolicy.ttlSecondsAfterFinished set above 0. A sketch of such a TFJob manifest is shown below.
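A minimal TFJob sketch for this setup (assuming Kueue manages TFJobs in the cluster; the queue name user-queue, the image, and the resource requests are illustrative assumptions, not values from this report):

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tensorflow-dist-mnist2
  namespace: default
  labels:
    # Kueue picks the job up via this label; the queue name is an assumption.
    kueue.x-k8s.io/queue-name: user-queue
spec:
  runPolicy:
    # Any value > 0 puts the job on the broken cleanup path after preemption.
    ttlSecondsAfterFinished: 60
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:latest   # illustrative image
              resources:
                requests:
                  cpu: "1"
    Worker:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:latest   # illustrative image
              resources:
                requests:
                  cpu: "1"

The second, higher-priority Job then only needs to request enough of the same resource (cpu here) that both cannot be admitted at once.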

Cleanup never happens as the completion time is nil: https://github.com/kubeflow/training-operator/blame/6900714c39dcb4991b6cf3bc793e73fc7386e478/pkg/controller.v1/common/job.go#L429
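For context, the cleanup logic that fails here is roughly of the following shape (a paraphrased Go sketch of the linked CleanupJob function, consistent with the discussion below; it is not a verbatim copy and the error message is an approximation):

// Paraphrased sketch of CleanupJob in pkg/controller.v1/common/job.go; details approximate.
func (jc *JobController) CleanupJob(runPolicy *apiv1.RunPolicy, jobStatus apiv1.JobStatus, job interface{}) error {
	ttl := runPolicy.TTLSecondsAfterFinished
	if ttl == nil {
		// No TTL configured, nothing to clean up.
		return nil
	}
	if jobStatus.CompletionTime == nil {
		// A preempted (suspended) TFJob never finished, so CompletionTime stays nil
		// and reconciliation keeps failing here instead of skipping cleanup.
		return fmt.Errorf("cleanup job must have a completion time")
	}
	// [...] otherwise delete the job once CompletionTime + ttl has elapsed.
	return nil
}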

What did you expect to happen?

The preempted TFJob should be suspended and cleaned up.

│   Replica Statuses:                                                                                                                                            │
│     PS:                                                                                                                                                        │
│     Worker:                                                                                                                                                    │
│   Start Time:  2024-08-28T11:12:02Z                                                                                                                            │
│ Events:                                                                                                                                                        │
│   Type    Reason                   Age                From                                 Message                                                             │
│   ----    ------                   ----               ----                                 -------                                                             │
│   Normal  SuccessfulCreatePod      19s                tfjob-controller                     Created pod: tensorflow-dist-mnist2-worker-0                        │
│   Normal  CreatedWorkload          19s                kubeflow.org/tfjob-kueue-controller  Created Workload: default/tfjob-tensorflow-dist-mnist2-c3c39        │
│   Normal  Started                  19s                kubeflow.org/tfjob-kueue-controller  Admitted by clusterQueue cluster-queue                              │
│   Normal  TFJobResumed             19s (x2 over 19s)  tfjob-controller                     TFJob tensorflow-dist-mnist2 is resumed.                            │
│   Normal  SuccessfulCreatePod      19s                tfjob-controller                     Created pod: tensorflow-dist-mnist2-ps-0                            │
│   Normal  SuccessfulCreateService  19s                tfjob-controller                     Created service: tensorflow-dist-mnist2-ps-0                        │
│   Normal  SuccessfulCreateService  19s                tfjob-controller                     Created service: tensorflow-dist-mnist2-worker-0                    │
│   Normal  Stopped                  10s                kubeflow.org/tfjob-kueue-controller  Preempted to accommodate a workload (UID: a8028237-e9b0-4bfb-aa13-9 │
│ 6163d74c7e2) due to prioritization in the ClusterQueue                                                                                                         │
│   Normal  SuccessfulDeletePod      10s                tfjob-controller                     Deleted pod: tensorflow-dist-mnist2-ps-0                            │
│   Normal  SuccessfulDeleteService  10s                tfjob-controller                     Deleted service: tensorflow-dist-mnist2-ps-0                        │
│   Normal  SuccessfulDeletePod      10s                tfjob-controller                     Deleted pod: tensorflow-dist-mnist2-worker-0                        │
│   Normal  SuccessfulDeleteService  10s                tfjob-controller                     Deleted service: tensorflow-dist-mnist2-worker-0                    │
│   Normal  TFJobSuspended           9s (x13 over 19s)  tfjob-controller                     TFJob tensorflow-dist-mnist2 is suspended. 

Environment

Kubernetes version: v1.30.0
Training Operator version: kubeflow/training-operator:v1-855e096
Training Operator Python SDK version:

$ pip show kubeflow-training

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

alculquicondor commented 3 months ago

@tenzen-y this is related to my recent comment https://github.com/kubeflow/training-operator/commit/64e39f2aa8ca51112dc02a93db85a5391cdf6cd7#r145851795

alculquicondor commented 3 months ago

It only reproduces if ttlSecondsAfterFinished is set

mszadkow commented 3 months ago

/assign

tenzen-y commented 3 months ago

/triage accepted

google-oss-prow[bot] commented 3 months ago

@tenzen-y: The label(s) triage/accepted cannot be applied, because the repository doesn't have them.

In response to [this](https://github.com/kubeflow/training-operator/issues/2239#issuecomment-2315701399):

> /triage accepted

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

tenzen-y commented 3 months ago

/triage accepted

tenzen-y commented 3 months ago

/remove-lifecycle needs-triage

tenzen-y commented 3 months ago

The preempted TFJob should be suspended and cleaned up.

@mszadkow I guess that the expected behavior is only that the Job is cleaned up, right? The training-operator does not suspend Jobs that have reached their TTL.

tenzen-y commented 3 months ago

@tenzen-y this is related to my recent comment 64e39f2#r145851795

@alculquicondor The training-operator TTL mechanism is not related to the Suspension mechanism. So, the above comment is not related to this bug, right?

alculquicondor commented 3 months ago

They shouldn't be related, but they are; that's the bug.

alculquicondor commented 3 months ago

When suspending, we call this function:

https://github.com/kubeflow/training-operator/blob/6900714c39dcb4991b6cf3bc793e73fc7386e478/pkg/controller.v1/common/job.go#L420-L428

If there is no TTL, it returns quickly. Otherwise, it goes on to check the completion time, which a suspended job doesn't have. I guess we simply shouldn't be calling CleanupJob for suspend? It's not doing anything in general.

andreyvelich commented 3 months ago

/remove-label lifecycle/needs-triage

tenzen-y commented 3 months ago

If there is no TTL, it returns quickly. Otherwise, it goes on to check the completion time, which a suspended job doesn't have. I guess we simply shouldn't be calling CleanupJob for suspend? It's not doing anything in general.

That's a good point. I believe that we should just return quickly like this:

func (jc *JobController) CleanupJob(runPolicy *apiv1.RunPolicy, jobStatus apiv1.JobStatus, job interface{}) error { 
    [...] 
    ttl := runPolicy.TTLSecondsAfterFinished 
    if ttl == nil || commonutil.IsSuspended(jobStatus) { 
        return nil 
    }
    [...]