kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0

Fix nslookup cannot work well in initContainerTemplate #216

Closed hougangliu closed 5 years ago

hougangliu commented 5 years ago

PyTorchJob workers' initContainer checks whether the master pod is up by running the nslookup command. However, the nslookup in the default image, busybox:1.31.0, seems to be too old to work well. On ppc64le its exit code is always 1, even when it can resolve the master service's DNS name, and on amd64 it does not work reliably, as shown below. When I change the image to alpine:3.10, it works well on both amd64 and ppc64le.

/ # nslookup katib-suggestion-hyperband
Server:         10.0.0.10
Address:        10.0.0.10:53

Name:   katib-suggestion-hyperband.kubeflow.svc.cluster.local
Address: 10.0.223.142

*** Can't find katib-suggestion-hyperband.svc.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.fyre.ibm.com: No answer
*** Can't find katib-suggestion-hyperband.kubeflow.svc.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.svc.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.fyre.ibm.com: No answer

/ # echo $?
0
/ # nslookup katib-suggestion-hyperband
Server:         10.0.0.10
Address:        10.0.0.10:53

** server can't find katib-suggestion-hyperband.kubeflow.svc.cluster.local: NXDOMAIN

*** Can't find katib-suggestion-hyperband.svc.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.fyre.ibm.com: No answer
*** Can't find katib-suggestion-hyperband.kubeflow.svc.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.svc.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.fyre.ibm.com: No answer

/ # echo $?
1
/ # nslookup katib-suggestion-hyperband
Server:         10.0.0.10
Address:        10.0.0.10:53

** server can't find katib-suggestion-hyperband.kubeflow.svc.cluster.local: NXDOMAIN

*** Can't find katib-suggestion-hyperband.svc.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.fyre.ibm.com: No answer
*** Can't find katib-suggestion-hyperband.kubeflow.svc.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.svc.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.fyre.ibm.com: No answer

/ # echo $?
1
/ # nslookup katib-suggestion-hyperband
Server:         10.0.0.10
Address:        10.0.0.10:53

Name:   katib-suggestion-hyperband.kubeflow.svc.cluster.local
Address: 10.0.223.142

*** Can't find katib-suggestion-hyperband.svc.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.fyre.ibm.com: No answer
*** Can't find katib-suggestion-hyperband.kubeflow.svc.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.svc.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.cluster.local: No answer
*** Can't find katib-suggestion-hyperband.fyre.ibm.com: No answer

/ # echo $?
0
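
For context, here is a minimal sketch of a wait loop that judges the lookup by nslookup's printed output instead of its exit code, which is the part reported as unreliable above. This is not the change made in this PR (the PR switches the initContainer image to alpine:3.10). The function names `lookup_succeeded` and `wait_for_master`, the attempt count, and the 2-second sleep are all hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of pytorch-operator).
# Reads nslookup output on stdin and returns 0 only if a "Name:" line
# is later followed by an "Address" line, i.e. an answer was printed.
# The resolver's own "Address:" line appears before any "Name:" line,
# so it is not counted as an answer.
lookup_succeeded() {
    awk '
        /^Name:/ { seen_name = 1; next }
        seen_name && /^Address/ { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Hypothetical wait loop: poll until the host resolves or attempts run out.
# $1 = host name to resolve, $2 = maximum attempts (assumed default: 30).
wait_for_master() {
    host=$1
    attempts=${2:-30}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # The pipeline status is that of lookup_succeeded, so the exit
        # code of nslookup itself is ignored.
        if nslookup "$host" 2>&1 | lookup_succeeded; then
            return 0
        fi
        echo "waiting for master ($host)"
        i=$((i + 1))
        sleep 2
    done
    return 1
}
```

In an initContainer this could be called as `wait_for_master "$MASTER_ADDR"`, where `MASTER_ADDR` is an assumed variable holding the master service name. Parsing human-readable output is fragile across nslookup implementations, so switching to an image whose nslookup behaves consistently, as this PR does, is the simpler fix.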
hougangliu commented 5 years ago

/cc @johnugeorge

coveralls commented 5 years ago

Coverage remained the same at 85.345% when pulling 7c7eeaf28bf2effaae1f22f8536d99b209edd76f on hougangliu:fix-worker-init into c53647cb87e6e08a1f72528d251673c6eaebc33f on kubeflow:master.

johnugeorge commented 5 years ago

Thanks @hougangliu /approve

k8s-ci-robot commented 5 years ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubeflow/pytorch-operator/blob/master/OWNERS)~~ [johnugeorge]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment