kubeflow / mxnet-operator

A Kubernetes operator for mxnet jobs

Cannot start up example training job #50

Closed MenglingD closed 5 years ago

MenglingD commented 5 years ago

Kubernetes version: v1.15.2
Kubeflow version: v0.6.1
mxnet-operator version: v1beta1

I cannot start up the example training job after following the steps in README.md.

First, I installed Kubeflow in the k8s cluster and confirmed that the MXJob CRD is available:

$ kubectl get crd | grep kube            
experiments.kubeflow.org                      2019-09-04T04:05:29Z
mpijobs.kubeflow.org                          2019-09-18T04:08:45Z
mxjobs.kubeflow.org                           2019-09-04T04:05:30Z
notebooks.kubeflow.org                        2019-09-04T04:05:29Z
poddefaults.kubeflow.org                      2019-09-04T04:05:28Z
profiles.kubeflow.org                         2019-09-04T04:05:31Z
pytorchjobs.kubeflow.org                      2019-09-04T04:05:29Z
scheduledworkflows.kubeflow.org               2019-09-04T04:05:31Z
tfjobs.kubeflow.org                           2019-09-04T04:05:30Z
trials.kubeflow.org                           2019-09-04T04:05:29Z
viewers.kubeflow.org                          2019-09-04T04:05:31Z
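
For completeness, the operator controller itself can also be checked; a quick sketch (the namespace and deployment name here are assumptions and may differ between installs):

$ kubectl get pods -n kubeflow | grep mxnet-operator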

After creating the MXJob as follows, the pods reached this status:

$ kubectl create -f examples/v1beta1/train/mx_job_dist_gpu.yaml
$ kubectl get pods -n dml -o wide
NAME                    READY   STATUS    RESTARTS   AGE     IP               NODE    NOMINATED NODE   READINESS GATES
mxnet-job-scheduler-0   2/2     Running   0          4h36m   192.168.107.59   p40-4   <none>           <none>
mxnet-job-server-0      2/2     Running   0          4h36m   192.168.107.60   p40-4   <none>           <none>
mxnet-job-worker-0      2/2     Running   0          4h36m   192.168.107.61   p40-4   <none>           <none>

All pods of the created MXJob are running, and the containers in those pods report healthy conditions:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2019-09-24T08:47:58Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2019-09-24T08:48:02Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2019-09-24T08:48:02Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2019-09-24T08:47:55Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://91d6fccccc646b70022e4dc1e840140096d2566e12a547ed01c31a7f2fff5e62
    image: istio/proxyv2:1.2.5
    imageID: docker-pullable://istio/proxyv2@sha256:8f210c3d09beb6b8658a4255d9ac30e25549295834a44083ed67d652ad7453e4
    lastState: {}
    name: istio-proxy
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2019-09-24T08:47:59Z"
  - containerID: docker://b1892b0750dbae31b77c53ddf5883f35a1245b6d165f22ad5c6358ec4ced830b
    image: mxjob/mxnet:gpu
    imageID: docker-pullable://mxjob/mxnet@sha256:f0ab7315578dbcddab6af926855d2586190f4f0c3dd5f4bb34f28a5a15ac7c84
    lastState: {}
    name: mxnet
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2019-09-24T08:47:59Z"
  hostIP: 10.1.2.14
  initContainerStatuses:
  - containerID: docker://fd7d003ef7ae3a8da1b12ebdbb64e731ed2aab767a9f230fc4cd840207f9f7fb
    image: istio/proxy_init:1.2.5
    imageID: docker-pullable://istio/proxy_init@sha256:c9964a8c1c28b85cc631bbc90390eac238c90f82c8f929495d1e9f9a9135b724
    lastState: {}
    name: istio-init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: docker://fd7d003ef7ae3a8da1b12ebdbb64e731ed2aab767a9f230fc4cd840207f9f7fb
        exitCode: 0
        finishedAt: "2019-09-24T08:47:58Z"
        reason: Completed
        startedAt: "2019-09-24T08:47:57Z"
  phase: Running
  podIP: 192.168.107.61
  qosClass: Burstable
  startTime: "2019-09-24T08:47:55Z"

After that, I stepped into the mxnet container of the mxnet-job-worker-0 pod and found that the training script was blocked at kv = mx.kvstore.create(args.kv_store), around line 149 of /incubator-mxnet/example/image-classification/common/fit.py:

141 def fit(args, network, data_loader, **kwargs):
142     """
143     train a model
144     args : argparse returns
145     network : the symbol definition of the nerual network
146     data_loader : function that returns the train and val data iterators
147     """
148     # kvstore
149     logging.info('creating kvstore')  # added to locate the blocked point
150     kv = mx.kvstore.create(args.kv_store)
151     logging.info('created kvstore')   # added to locate the blocked point  
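
Since creating a dist kvstore has to register the worker with the scheduler at DMLC_PS_ROOT_URI:DMLC_PS_ROOT_PORT before it returns, a raw TCP reachability check from inside the worker container is a quick sanity test. A minimal sketch, assuming bash and coreutils' timeout exist in the mxjob/mxnet:gpu image (note a successful connect only shows the port is reachable, not that the parameter-server handshake completes):

$ kubectl exec -n dml mxnet-job-worker-0 -c mxnet -- /bin/bash -c \
    'timeout 5 bash -c "echo > /dev/tcp/$DMLC_PS_ROOT_URI/$DMLC_PS_ROOT_PORT" && echo "scheduler port reachable" || echo "scheduler port NOT reachable"'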

The only cause I can think of for the situation above is that communication between the pods or containers is blocked, so I manually pinged the mxnet-job-scheduler-0 and mxnet-job-server-0 pods, which are exposed by Kubernetes services bound to the mxnet containers, and the connection between those containers is fine. However, I cannot ssh into those containers using hostnames like mxnet-job-server-0, which makes me suspect a precondition for MXNet distributed training is not met, since no ssh service is available.

Did I do anything wrong when starting the MXJob with kubectl, or is there an incorrect configuration in mx_job_dist_gpu.yaml? If so, please let me know; thank you for any suggestions.

KingOnTheStar commented 5 years ago

Looks like a kube-dns problem. Can your pods communicate with each other without mxnet-operator?

MenglingD commented 5 years ago

> Looks like a kube-dns problem. Can your pods communicate with each other without mxnet-operator?

Thank you very much for your reply. kube-dns seems to work well, because I can ping mxnet-job-scheduler-0 from the mxnet container of the mxnet-job-worker-0 pod:

work@mxnet-job-worker-0:~/mx_distributed_training$ ping mxnet-job-scheduler-0
PING mxnet-job-scheduler-0.dml.svc.cluster.local (192.168.226.208) 56(84) bytes of data.
64 bytes from 192-168-226-208.mxnet-job-scheduler-0.dml.svc.cluster.local (192.168.226.208): icmp_seq=1 ttl=62 time=0.268 ms
64 bytes from 192-168-226-208.mxnet-job-scheduler-0.dml.svc.cluster.local (192.168.226.208): icmp_seq=2 ttl=62 time=0.162 ms
64 bytes from 192-168-226-208.mxnet-job-scheduler-0.dml.svc.cluster.local (192.168.226.208): icmp_seq=3 ttl=62 time=0.156 ms
64 bytes from 192-168-226-208.mxnet-job-scheduler-0.dml.svc.cluster.local (192.168.226.208): icmp_seq=4 ttl=62 time=0.205 ms
64 bytes from 192-168-226-208.mxnet-job-scheduler-0.dml.svc.cluster.local (192.168.226.208): icmp_seq=5 ttl=62 time=0.188 ms
64 bytes from 192-168-226-208.mxnet-job-scheduler-0.dml.svc.cluster.local (192.168.226.208): icmp_seq=6 ttl=62 time=0.175 ms
64 bytes from 192-168-226-208.mxnet-job-scheduler-0.dml.svc.cluster.local (192.168.226.208): icmp_seq=7 ttl=62 time=0.192 ms

Do I need to test communication between pods again without mxnet-operator to check the status of kube-dns?
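
A sketch of such a test, in case it helps: run a throwaway pod in the same namespace (with no mxnet-operator involvement) and ping one of the job's services by name, which exercises kube-dns and the pod network together. The pod name "net-test" and the busybox image are just assumptions; any small image with ping or nslookup would do:

$ kubectl run net-test --rm -it --restart=Never -n dml --image=busybox:1.28 -- \
    ping -c 3 mxnet-job-scheduler-0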

KingOnTheStar commented 5 years ago

Can you step into mxnet-job-worker-0 and run env | grep -E "MX_CONFIG|DMLC"? I want to check the environment variables inside.

MenglingD commented 5 years ago

> Can you step into mxnet-job-worker-0 and run env | grep -E "MX_CONFIG|DMLC"? I want to check the environment variables inside.

The environment variables are set correctly for mxnet-job-worker-0, mxnet-job-server-0, and mxnet-job-scheduler-0:

$ k exec -it -n dml mxnet-job-worker-0 -c mxnet -- /bin/bash
work@mxnet-job-worker-0:~/mx_distributed_training$ env | grep -E "MX_CONFIG|DMLC"
DMLC_PS_ROOT_URI=mxnet-job-scheduler-0
DMLC_USE_KUBERNETES=1
DMLC_PS_ROOT_PORT=9091
DMLC_ROLE=worker
MX_CONFIG={"cluster":{"scheduler":[{"url":"mxnet-job-scheduler-0","port":9091}],"server":[{"url":"mxnet-job-server-0","port":9091}],"worker":[{"url":"mxnet-job-worker-0","port":9091}]},"labels":{"scheduler":"","server":"","worker":""},"task":{"type":"worker","index":0}}
DMLC_NUM_SERVER=1
DMLC_NUM_WORKER=1
work@mxnet-job-worker-0:~/mx_distributed_training$ exit

dml at vm10-1-0-10 in ~/k8s 
$ k exec -it -n dml mxnet-job-server-0 -c mxnet -- /bin/bash 
work@mxnet-job-server-0:~/mx_distributed_training$ env | grep -E "MX_CONFIG|DMLC"
DMLC_PS_ROOT_URI=mxnet-job-scheduler-0
DMLC_USE_KUBERNETES=1
DMLC_PS_ROOT_PORT=9091
DMLC_ROLE=server
MX_CONFIG={"cluster":{"scheduler":[{"url":"mxnet-job-scheduler-0","port":9091}],"server":[{"url":"mxnet-job-server-0","port":9091}],"worker":[{"url":"mxnet-job-worker-0","port":9091}]},"labels":{"scheduler":"","server":"","worker":""},"task":{"type":"server","index":0}}
DMLC_NUM_SERVER=1
DMLC_NUM_WORKER=1
work@mxnet-job-server-0:~/mx_distributed_training$ exit

dml at vm10-1-0-10 in ~/k8s 
$ k exec -it -n dml mxnet-job-scheduler-0 -c mxnet -- /bin/bash 
work@mxnet-job-scheduler-0:~/mx_distributed_training$ env | grep -E "MX_CONFIG|DMLC"
DMLC_PS_ROOT_URI=mxnet-job-scheduler-0
DMLC_USE_KUBERNETES=1
DMLC_PS_ROOT_PORT=9091
DMLC_ROLE=scheduler
MX_CONFIG={"cluster":{"scheduler":[{"url":"mxnet-job-scheduler-0","port":9091}],"server":[{"url":"mxnet-job-server-0","port":9091}],"worker":[{"url":"mxnet-job-worker-0","port":9091}]},"labels":{"scheduler":"","server":"","worker":""},"task":{"type":"scheduler","index":0}}
DMLC_NUM_SERVER=1
DMLC_NUM_WORKER=1
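
As a side note, a compact way to dump these variables from all three replicas at once (just a convenience sketch using the pod names and namespace above):

$ for p in mxnet-job-scheduler-0 mxnet-job-server-0 mxnet-job-worker-0; do \
    echo "== $p =="; kubectl exec -n dml "$p" -c mxnet -- env | grep -E "MX_CONFIG|DMLC"; done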

KingOnTheStar commented 5 years ago

Theoretically, mxnet-operator is only responsible for adding environment variables and configuring the start parameters of the containers. Now the containers start up properly, there is no communication problem between them, and you are not doing anything special, right?

MenglingD commented 5 years ago

That's right; the only modification to 'mx_job_dist_gpu.yaml' is that the MXJob is created in the dml namespace:

apiVersion: "kubeflow.org/v1beta1"
kind: "MXJob"
metadata:
  name: "mxnet-job"
  namespace: dml        # added: the only change to the example
spec:
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: mxjob/mxnet:gpu
    Server:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: mxjob/mxnet:gpu
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: mxjob/mxnet:gpu
              command: ["python"]
              args: ["/incubator-mxnet/example/image-classification/train_mnist.py","--num-epochs","10","--num-layers","2","--kv-store","dist_device_sync","--gpus","0"]
              resources:
                limits:
                  nvidia.com/gpu: 1
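
For completeness, the resulting custom resource itself can be inspected as well (a sketch; assumes the mxjobs CRD shown earlier):

$ kubectl get mxjobs -n dml
$ kubectl describe mxjob mxnet-job -n dml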

MenglingD commented 5 years ago

It seems that Istio sidecar injection conflicts with mxnet-operator. The dml namespace was created by a Kubeflow profile for 'Multi-user Isolation', and Kubeflow was installed on the existing Kubernetes cluster using the kfctl_k8s_istio config. A namespace created by a Kubeflow profile is labeled with istio-injection=enabled:

Name:         dml
Labels:       istio-injection=enabled
...

Each pod created by the mxnet-job in this namespace is composed of the 'mxnet' container, the 'istio-proxy' sidecar container, and the 'istio-init' initContainer, which exits after it finishes. It seems that the 'istio-proxy' container causes the communication problem for the mxnet-job:

spec:
  containers:
  - args:
    ...
    name: mxnet
    ports:
    - containerPort: 9091
      name: mxjob-port
      protocol: TCP
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        nvidia.com/gpu: "1"
     ...
  - args:
    ...
    name: istio-proxy
    ports:
    - containerPort: 15090
      name: http-envoy-prom
      protocol: TCP
    ...
initContainers:
  - args:
    ...
    image: docker.io/istio/proxy_init:1.2.5
    imagePullPolicy: IfNotPresent
    name: istio-init
    ...

If the mxnet-job is created in a namespace that is not labeled with 'istio-injection=enabled', everything works fine.
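
Two possible workarounds, sketched here as assumptions rather than verified fixes: either run the job in a namespace without the injection label, or keep the dml namespace and opt the job pods out of injection with the standard Istio annotation (the second option relies on the operator propagating pod template metadata, which may depend on the operator version):

# Option 1: create the job in a namespace that is not labeled istio-injection=enabled.
$ kubectl create namespace mx-train          # hypothetical namespace name
$ kubectl create -n mx-train -f examples/v1beta1/train/mx_job_dist_gpu.yaml   # unmodified example, no namespace field

# Option 2: keep the dml namespace, but add the Istio opt-out annotation to each
# replica's pod template in mx_job_dist_gpu.yaml:
#   template:
#     metadata:
#       annotations:
#         sidecar.istio.io/inject: "false"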

KingOnTheStar commented 5 years ago

Thank you very much for your feedback. We will pay attention to this bug.