kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0
307 stars 143 forks source link

use init container for worker pod to wait master pod ready #187

Closed zlcnju closed 5 years ago

zlcnju commented 5 years ago

use init container for worker pod to wait master pod ready

fix 186

k8s-ci-robot commented 5 years ago

Hi @zlcnju. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
zlcnju commented 5 years ago

add a init container default use busybox image, add wait master running by check dns master addr. And this initcontainer config can be configured in the pytorch configmap. If aggred, I will add the configmap config in the kubeflow project

johnugeorge commented 5 years ago

/ok-to-test

coveralls commented 5 years ago

Coverage Status

Coverage remained the same at 85.217% when pulling a762f6ddf3d4f962d0cd3ae519a4b901f4f92647 on zlcnju:master into 6aa39a411eb9fe5e62b54b0a3cf09225abe7d551 on kubeflow:master.

johnugeorge commented 5 years ago

/cc @gaocegege /cc @richardsliu

zlcnju commented 5 years ago

@johnugeorge @gaocegege any other suggestions?

johnugeorge commented 5 years ago

/lgtm

gaocegege commented 5 years ago

@zlcnju Thanks for your contribution! :tada: :+1:

@tossmilestone Thanks for your review!

richardsliu commented 5 years ago

/approve

k8s-ci-robot commented 5 years ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardsliu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/kubeflow/pytorch-operator/blob/master/OWNERS)~~ [richardsliu] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
chansonzhang commented 5 years ago

May I ask which release will include this fix? If I use the 0.6 release, will I meet the same problem described in issue#186 ?

johnugeorge commented 5 years ago

Can you use latest 0.7 release

chansonzhang commented 5 years ago

Can you use latest 0.7 release

I use the 0.7.0 image from gcr.io, do this image contain this fix?

zlcnju commented 5 years ago

Can you use latest 0.7 release

I use the 0.7.0 image from gcr.io, do this image contain this fix?

you can check the pod create by pyjob, If init container exists, this fix included

chansonzhang commented 4 years ago

Can you use latest 0.7 release

I use the 0.7.0 image from gcr.io, do this image contain this fix?

you can check the pod create by pyjob, If init container exists, this fix included

it do contain this fix, thank you very much