kubeflow / common

Common APIs and libraries shared by other Kubeflow operator repositories.
Apache License 2.0

Add job suspend RunPolicy #193

Open PeterChg opened 2 years ago

PeterChg commented 2 years ago

add job partial success status

PeterChg commented 2 years ago

/assign @terrytangyuan

PeterChg commented 2 years ago

> I am not sure if this is a common use case. Could you elaborate?

The ability to suspend and resume Jobs is often desired when cluster resources are limited and a higher-priority Job needs to run in place of another Job. Because of the kubeflow/training-operator project architecture, kubeflow/common needs to be modified first.
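For context, a minimal sketch of what such a change to kubeflow/common could look like, modeled on the `suspend` field of Kubernetes batch/v1 Job. The struct below shows only the new field; the existing fields are elided in a comment, and this is illustrative rather than the merged API.

```go
// Illustrative sketch only: an optional Suspend flag added to the shared
// RunPolicy in kubeflow/common, mirroring the `suspend` field of batch/v1
// Job. Existing RunPolicy fields are elided.
package v1

type RunPolicy struct {
	// ... existing fields (CleanPodPolicy, TTLSecondsAfterFinished,
	// ActiveDeadlineSeconds, BackoffLimit, SchedulingPolicy) elided ...

	// Suspend tells the job controller whether pods should exist for this
	// job. Setting it to true deletes the job's pods; setting it back to
	// false lets the controller recreate them.
	// +optional
	Suspend *bool `json:"suspend,omitempty"`
}
```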

gaocegege commented 2 years ago

/ok-to-test

terrytangyuan commented 2 years ago

What are the changes you are trying to make to the training operator?

PeterChg commented 2 years ago

> What are the changes you are trying to make to the training operator?

Add logic to the PyTorchJob lifecycle: delete the pods when the job is suspended and recreate them when it is resumed. Also rework the PyTorchJob status management so that it remains correct after the suspend/resume states are added.
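A rough sketch of the reconcile logic described above, assuming a `Suspend` flag on the run policy. The types below are simplified stand-ins, and the `deletePods`/`createPods` callbacks stand in for the controller's real pod management; none of this is the actual training-operator or kubeflow/common API.

```go
package controller

// Simplified stand-ins for the real API types; only the fields needed for
// this sketch are included.
type RunPolicy struct {
	Suspend *bool
}

type JobStatus struct {
	Suspended bool
}

type PyTorchJob struct {
	RunPolicy RunPolicy
	Status    JobStatus
}

// reconcileSuspend captures the lifecycle change described above: when the
// job is marked suspended, its pods are deleted and the status records the
// suspension; when the flag is cleared, the pods are recreated.
func reconcileSuspend(job *PyTorchJob, deletePods, createPods func(*PyTorchJob) error) error {
	suspended := job.RunPolicy.Suspend != nil && *job.RunPolicy.Suspend

	if suspended {
		if err := deletePods(job); err != nil {
			return err
		}
		job.Status.Suspended = true
		return nil
	}

	if err := createPods(job); err != nil {
		return err
	}
	job.Status.Suspended = false
	return nil
}
```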

google-oss-prow[bot] commented 2 years ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has not yet been approved by anyone. To complete the pull request process, please ask for approval from gaocegege after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubeflow/common/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.

terrytangyuan commented 2 years ago

> What are the changes you are trying to make to the training operator?
>
> Add logic to the PyTorchJob lifecycle: delete the pods when the job is suspended and recreate them when it is resumed. Also rework the PyTorchJob status management so that it remains correct after the suspend/resume states are added.

I am not sure if suspend is common in distributed training jobs. There will be side effects depending on the training framework, especially when pods are deleted and recreated.

alculquicondor commented 1 year ago

This is not about the training job itself; it is about a cluster with scarce resources. If a higher-priority job needs the resources, suspend provides a way to free them. The training job gets a chance to checkpoint, if it has support for that; otherwise it simply fails and is retried later.
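To make that workflow concrete, a short continuation of the earlier sketch, reusing the same stand-in types; the function names are illustrative only, not part of any real API.

```go
package controller

// Continuing the stand-in types from the earlier sketch: a cluster-level
// component (for example a quota or queueing controller) could free
// resources for a higher-priority workload by suspending a training job,
// and hand them back by resuming it.
func suspendJob(job *PyTorchJob) {
	suspended := true
	// The job controller reacts by deleting the job's pods; training code
	// that supports it can checkpoint while its pods are terminating.
	job.RunPolicy.Suspend = &suspended
}

func resumeJob(job *PyTorchJob) {
	suspended := false
	// The job controller recreates the pods and training resumes from the
	// latest checkpoint, or restarts from scratch otherwise.
	job.RunPolicy.Suspend = &suspended
}
```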