Looks like a problem with kube-dns. Can your pods communicate with each other without mxnet-operator?
Thank you very much for your reply. kube-dns seems to work well, because I can ping mxnet-job-scheduler-0 from the mxnet container in the mxnet-job-worker-0 pod:
work@mxnet-job-worker-0:~/mx_distributed_training$ ping mxnet-job-scheduler-0
PING mxnet-job-scheduler-0.dml.svc.cluster.local (192.168.226.208) 56(84) bytes of data.
64 bytes from 192-168-226-208.mxnet-job-scheduler-0.dml.svc.cluster.local (192.168.226.208): icmp_seq=1 ttl=62 time=0.268 ms
64 bytes from 192-168-226-208.mxnet-job-scheduler-0.dml.svc.cluster.local (192.168.226.208): icmp_seq=2 ttl=62 time=0.162 ms
64 bytes from 192-168-226-208.mxnet-job-scheduler-0.dml.svc.cluster.local (192.168.226.208): icmp_seq=3 ttl=62 time=0.156 ms
64 bytes from 192-168-226-208.mxnet-job-scheduler-0.dml.svc.cluster.local (192.168.226.208): icmp_seq=4 ttl=62 time=0.205 ms
64 bytes from 192-168-226-208.mxnet-job-scheduler-0.dml.svc.cluster.local (192.168.226.208): icmp_seq=5 ttl=62 time=0.188 ms
64 bytes from 192-168-226-208.mxnet-job-scheduler-0.dml.svc.cluster.local (192.168.226.208): icmp_seq=6 ttl=62 time=0.175 ms
64 bytes from 192-168-226-208.mxnet-job-scheduler-0.dml.svc.cluster.local (192.168.226.208): icmp_seq=7 ttl=62 time=0.192 ms
Do I need to test the communication between pods again without mxnet-operator to check the status of kube-dns?
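For what it's worth, a quick way to check kube-dns independently of mxnet-operator is a throwaway pod in the same namespace; a rough sketch (the image tag and pod name here are only illustrative):

$ kubectl run dns-test -n dml --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default

If that resolves, kube-dns itself is healthy and pod-level DNS works without the operator involved.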
Can you step into mxnet-job-worker-0 and run env | grep -E "MX_CONFIG|DMLC"? I want to check the environment variables inside.
The environment variables are set correctly for mxnet-job-worker-0, mxnet-job-server-0, and mxnet-job-scheduler-0:
$ k exec -it -n dml mxnet-job-worker-0 -c mxnet -- /bin/bash
work@mxnet-job-worker-0:~/mx_distributed_training$ env | grep -E "MX_CONFIG|DMLC"
DMLC_PS_ROOT_URI=mxnet-job-scheduler-0
DMLC_USE_KUBERNETES=1
DMLC_PS_ROOT_PORT=9091
DMLC_ROLE=worker
MX_CONFIG={"cluster":{"scheduler":[{"url":"mxnet-job-scheduler-0","port":9091}],"server":[{"url":"mxnet-job-server-0","port":9091}],"worker":[{"url":"mxnet-job-worker-0","port":9091}]},"labels":{"scheduler":"","server":"","worker":""},"task":{"type":"worker","index":0}}
DMLC_NUM_SERVER=1
DMLC_NUM_WORKER=1
work@mxnet-job-worker-0:~/mx_distributed_training$ exit
dml at vm10-1-0-10 in ~/k8s
$ k exec -it -n dml mxnet-job-server-0 -c mxnet -- /bin/bash
work@mxnet-job-server-0:~/mx_distributed_training$ env | grep -E "MX_CONFIG|DMLC"
DMLC_PS_ROOT_URI=mxnet-job-scheduler-0
DMLC_USE_KUBERNETES=1
DMLC_PS_ROOT_PORT=9091
DMLC_ROLE=server
MX_CONFIG={"cluster":{"scheduler":[{"url":"mxnet-job-scheduler-0","port":9091}],"server":[{"url":"mxnet-job-server-0","port":9091}],"worker":[{"url":"mxnet-job-worker-0","port":9091}]},"labels":{"scheduler":"","server":"","worker":""},"task":{"type":"server","index":0}}
DMLC_NUM_SERVER=1
DMLC_NUM_WORKER=1
work@mxnet-job-server-0:~/mx_distributed_training$ exit
dml at vm10-1-0-10 in ~/k8s
$ k exec -it -n dml mxnet-job-scheduler-0 -c mxnet -- /bin/bash
work@mxnet-job-scheduler-0:~/mx_distributed_training$ env | grep -E "MX_CONFIG|DMLC"
DMLC_PS_ROOT_URI=mxnet-job-scheduler-0
DMLC_USE_KUBERNETES=1
DMLC_PS_ROOT_PORT=9091
DMLC_ROLE=scheduler
MX_CONFIG={"cluster":{"scheduler":[{"url":"mxnet-job-scheduler-0","port":9091}],"server":[{"url":"mxnet-job-server-0","port":9091}],"worker":[{"url":"mxnet-job-worker-0","port":9091}]},"labels":{"scheduler":"","server":"","worker":""},"task":{"type":"scheduler","index":0}}
DMLC_NUM_SERVER=1
DMLC_NUM_WORKER=1
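Since ping only exercises ICMP, it may also be worth confirming that the scheduler is reachable over TCP on DMLC_PS_ROOT_PORT, which is what the kvstore actually uses. A rough check from inside the worker (assuming python is available in the mxnet container, which it should be since the worker runs train_mnist.py):

$ k exec -it -n dml mxnet-job-worker-0 -c mxnet -- python -c "import socket; socket.create_connection(('mxnet-job-scheduler-0', 9091), timeout=5); print('scheduler port 9091 reachable')"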
Theoretically, mxnet-operator is only responsible for adding environment variables and configuring the start parameters of the container. Now that the containers start up properly, there is no communication problem between the containers, and you're not doing anything special, right?
That's right; the only modification to 'mx_job_dist_gpu.yaml' is that the MXJob is created in the dml namespace:
apiVersion: "kubeflow.org/v1beta1"
kind: "MXJob"
metadata:
  name: "mxnet-job"
  namespace: dml    # added
spec:
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: mxjob/mxnet:gpu
    Server:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: mxjob/mxnet:gpu
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: mxjob/mxnet:gpu
              command: ["python"]
              args: ["/incubator-mxnet/example/image-classification/train_mnist.py","--num-epochs","10","--num-layers","2","--kv-store","dist_device_sync","--gpus","0"]
              resources:
                limits:
                  nvidia.com/gpu: 1
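For reference, applying the manifest and checking the resulting job would look roughly like this (exact invocations may vary):

$ kubectl apply -f mx_job_dist_gpu.yaml
$ kubectl get mxjobs -n dml
$ kubectl get pods -n dml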
It seems that the Istio plugin conflicts with mxnet-operator. The dml namespace was created by a profile for 'Multi-user Isolation' in Kubeflow, and the Istio plugin was installed because Kubeflow was deployed on an existing Kubernetes cluster using the kfctl_k8s_istio config. A namespace created by a Kubeflow profile is labeled with istio-injection=enabled:
Name: dml
Labels: istio-injection=enabled
...
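For reference, the labels on the namespace can also be listed directly with:

$ kubectl get namespace dml --show-labels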
Each pod created by the mxnet-job in this namespace is composed of the 'mxnet' container, the 'istio-proxy' container, and the 'istio-init' initContainer (which exits after it finishes). It seems that the 'istio-proxy' container causes the communication problem for the mxnet-job:
spec:
  containers:
  - args:
    ...
    name: mxnet
    ports:
    - containerPort: 9091
      name: mxjob-port
      protocol: TCP
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        nvidia.com/gpu: "1"
    ...
  - args:
    ...
    name: istio-proxy
    ports:
    - containerPort: 15090
      name: http-envoy-prom
      protocol: TCP
    ...
  initContainers:
  - args:
    ...
    image: docker.io/istio/proxy_init:1.2.5
    imagePullPolicy: IfNotPresent
    name: istio-init
    ...
If the mxnet-job is created in a namespace that is not labeled 'istio-injection=enabled', everything works.
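If the job has to stay in the dml namespace, a common Istio mechanism (not verified here with mxnet-operator) is to opt the job's pods out of sidecar injection by annotating the pod template in the MXJob spec, for example on the Worker replica:

    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: mxnet
              image: mxjob/mxnet:gpu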
Thank you very much for your feedback. We will pay attention to this bug.
kubernetes version: v1.15.2 kubeflow version: v0.6.1 mxnet-operator version: v1beta1
I cannot start the example training job after following the steps in README.md.

First, I installed Kubeflow in the k8s cluster and got mxjob support. After creating the mxjob, I got the pod status: all pods of the mxjob are running, and the containers in those pods are healthy. I then stepped into the mxnet container in the mxnet-job-worker-0 pod and found that the training script was blocked on the line kv = mx.kvstore.create(args.kv_store) at L149 of /incubator-mxnet/example/image-classification/common/fit.py.

The possible reason I can think of for this situation is that communication is disabled between the pods or containers, so I manually pinged the mxnet-job-scheduler-0 and mxnet-job-server-0 pods, which are exposed by k8s services bound to the inner mxnet-job containers, and the connection between those containers is fine. But I cannot ssh into those containers using hostnames like mxnet-job-server-0, which means the precondition for mxnet distributed training is not met, as there is no ssh service for mxnet distributed training.

Is there any incorrect operation in starting up the mxjob with kubectl, or any wrong configuration in mx_job_dist_gpu.yaml? If so, please let me know, and thank you for any suggestions.
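For completeness, the point where the worker blocks can also be watched without exec'ing into the container, e.g. by tailing its log (namespace and container names as above):

$ kubectl logs -n dml mxnet-job-worker-0 -c mxnet --follow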