kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0
306 stars 143 forks

support cleanPodPolicy is Running, same as tf operator #288

Closed jiaqianjing closed 4 years ago

k8s-ci-robot commented 4 years ago

Hi @jiaqianjing. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

gaocegege commented 4 years ago

/ok-to-test

/assign @johnugeorge

Can you please explain more about why we need the PR?

jiaqianjing commented 4 years ago

> /ok-to-test
>
> /assign @johnugeorge
>
> Can you please explain more about why we need the PR?

  1. Consistent experience with the TF operator. We can set cleanPodPolicy to "Running" and get the same behavior.
  2. When I submit a distributed job and one of the nodes fails, the whole job enters the completed state, but some instances are in "Error" while others are still "Running". We should clean up those "Running" pods and release their resources. At the same time, we can still view the logs of the failed pods.
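
To illustrate the second point, here is a minimal sketch of the selection logic, written in Python for brevity rather than in the operator's own language. It is not the code from this PR: the `Pod` class and the `pods_to_delete` helper are hypothetical, and the "All" and "None" policy values are assumed to match the TF operator alongside "Running". A pod shown as "Error" is modeled here with the phase "Failed", which is also an assumption.

```python
from dataclasses import dataclass
from typing import List

# Assumed policy values; only "Running" is discussed in this PR.
CLEAN_POD_POLICY_ALL = "All"
CLEAN_POD_POLICY_RUNNING = "Running"
CLEAN_POD_POLICY_NONE = "None"


@dataclass
class Pod:
    """Hypothetical stand-in for a Kubernetes pod: just a name and a phase."""
    name: str
    phase: str  # e.g. "Running", "Failed", "Succeeded"


def pods_to_delete(policy: str, pods: List[Pod]) -> List[Pod]:
    """Return the pods that would be cleaned up once the job has finished."""
    if policy == CLEAN_POD_POLICY_NONE:
        # Keep everything, including pods that are still consuming resources.
        return []
    if policy == CLEAN_POD_POLICY_RUNNING:
        # Free resources held by live pods, but keep finished and failed pods
        # so that their logs stay available.
        return [p for p in pods if p.phase == "Running"]
    if policy == CLEAN_POD_POLICY_ALL:
        return list(pods)
    raise ValueError(f"unknown cleanPodPolicy: {policy!r}")


if __name__ == "__main__":
    job_pods = [
        Pod("master-0", "Failed"),
        Pod("worker-0", "Running"),
        Pod("worker-1", "Running"),
    ]
    for pod in pods_to_delete(CLEAN_POD_POLICY_RUNNING, job_pods):
        print("would delete", pod.name)
```

With "Running", the failed pod stays in place for log inspection while the two live workers are removed.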

gaocegege commented 4 years ago

SGTM

Can you add corresponding test for the new feature?

jiaqianjing commented 4 years ago

> Can you add corresponding test for the new feature?

[Image: test_01]

jiaqianjing commented 4 years ago

PTAL @jinchihe @andreyvelich

gaocegege commented 4 years ago

/assign @johnugeorge

gaocegege commented 4 years ago

Thanks for your contribution! :tada: :+1:

gaocegege commented 4 years ago

/approve

@johnugeorge @andreyvelich They need this feature urgently, so I am merging it. If you have any comments, feel free to leave them. We can fix any issues in another PR.

k8s-ci-robot commented 4 years ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubeflow/pytorch-operator/blob/master/OWNERS)~~ [gaocegege]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment