kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0

Example PytorchJob is not starting #264

Open natalytvinova opened 4 years ago

natalytvinova commented 4 years ago

I set up Kubeflow v0.6.0 on MicroK8s v1.17. After running kubectl create -f pytorch_job_mnist_gloo.yaml from the example, I can see that the PyTorchJob was created, but no events are recorded for it and no new pods are created. Is this example still relevant?
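
When a PyTorchJob is created but produces no events or pods, one plausible first step is to confirm that the CRD is registered and that the operator itself is running. The sketch below assumes the operator is a deployment named pytorch-operator in the kubeflow namespace, which is typical of Kubeflow installs but may differ in yours; check_operator is a hypothetical helper name.

```shell
# check_operator [NAMESPACE]: hypothetical helper that looks for the pieces
# needed to turn a PyTorchJob into pods.
# Assumption: the operator runs as deployment "pytorch-operator" in "kubeflow".
check_operator() {
  ns="${1:-kubeflow}"
  # The CRD must exist, otherwise the job object is never reconciled.
  kubectl get crd pytorchjobs.kubeflow.org
  # The operator deployment should report at least one ready replica.
  kubectl get deployment pytorch-operator -n "$ns"
  # Recent operator logs often show why a job is being ignored.
  kubectl logs deployment/pytorch-operator -n "$ns" --tail=50
}
```

Calling check_operator with no argument uses the kubeflow namespace; pass another namespace as the first argument if the operator is installed elsewhere.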

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label: bug (probability 0.69)

maartenpants commented 4 years ago

I have the same issue with Kubeflow v1.0 and MicroK8s v1.18.

sakaia commented 4 years ago

I hit the same issue in my environment (Kubeflow v1.0.1 with MicroK8s 1.15). The pod status is as follows:

$ kubectl get pods 
NAME                               READY   STATUS    RESTARTS   AGE
pytorch-dist-mnist-gloo-master-0   0/1     Pending   0          27m
pytorch-dist-mnist-gloo-worker-0   0/1     Pending   0          27m

sakaia commented 4 years ago

In my environment, the logs are as follows. I edited v1/pytorch_job_mnist_gloo.yaml to set the image to gcr.io/kubeflow-ci/pytorch_dist_mnist:latest and commented out the GPU settings. Is anything else needed to run the sample?

$ kubectl logs pytorch-dist-mnist-gloo-master-0
Error from server (BadRequest): container "pytorch" in pod "pytorch-dist-mnist-gloo-master-0" is waiting to start: trying and failing to pull image
$ kubectl logs pytorch-dist-mnist-gloo-worker-0
Error from server (BadRequest): container "pytorch" in pod "pytorch-dist-mnist-gloo-worker-0" is waiting to start: PodInitializing

sakaia commented 4 years ago

I edited the image attribute in pytorch_job_mnist_gloo.yaml as follows, and it works fine.

image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
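
The same edit can be scripted instead of made by hand. This is a minimal sketch, assuming every image: line in the manifest should point at the same image (true when master and worker share one image); set_job_image is a hypothetical helper name, not part of the operator or its examples.

```shell
# set_job_image FILE IMAGE: rewrite the value of every "image:" key in FILE.
# Assumption: all containers in the manifest should use the same image.
set_job_image() {
  file="$1"
  image="$2"
  tmp="$(mktemp)"
  # Match leading indentation (and an optional list dash), keep it, and
  # replace whatever follows "image:". Writing through a temp file avoids
  # the non-portable "sed -i".
  sed "s|^\([[:space:]-]*image:\).*|\1 $image|" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

For example, set_job_image pytorch_job_mnist_gloo.yaml gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0 applies the change described above, after which the job can be recreated with kubectl create -f.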

sakaia commented 4 years ago

You should check the PyTorchJob status with:

kubectl get events
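
Events are easier to read when sorted by time and combined with a description of the job and the stuck pod. The sketch below is one way to gather that in a single step; diagnose_job is a hypothetical helper, and the job name pytorch-dist-mnist-gloo is inferred from the pod names earlier in this thread rather than confirmed.

```shell
# diagnose_job JOB POD: hypothetical helper that collects the usual
# evidence for a job whose pods stay Pending.
diagnose_job() {
  job="$1"
  pod="$2"
  # Cluster events in chronological order (image pull and scheduling
  # failures normally show up here).
  kubectl get events --sort-by=.metadata.creationTimestamp
  # Conditions and replica status recorded on the job itself.
  kubectl describe pytorchjob "$job"
  # Per-pod detail, including the exact image pull error if there is one.
  kubectl describe pod "$pod"
}
```

A call such as diagnose_job pytorch-dist-mnist-gloo pytorch-dist-mnist-gloo-master-0 covers the master pod; repeat with the worker pod name as needed.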

jvujjini commented 4 years ago

After running docker build, the resulting image exists only in the local Docker image store. It must be loaded into the MicroK8s cluster before creating the job. The instructions for this are available here.
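
As a rough sketch of that step: export the image from Docker and import it into the containerd instance that MicroK8s uses. The helper name load_image_into_microk8s is hypothetical, and the command is written as microk8s ctr; some older MicroK8s releases expose it as microk8s.ctr instead, so adjust for your version.

```shell
# load_image_into_microk8s IMAGE: hypothetical helper that copies a locally
# built Docker image into MicroK8s' containerd.
# Assumption: "microk8s ctr" is available (older releases: "microk8s.ctr").
load_image_into_microk8s() {
  image="$1"
  archive="$(mktemp)"
  # Export the image to a tar archive, then import that archive.
  docker save -o "$archive" "$image" && microk8s ctr image import "$archive"
  status=$?
  rm -f "$archive"
  return "$status"
}
```

Note that Kubernetes defaults to always pulling images tagged :latest, so a locally imported image may still fail to start unless the manifest sets imagePullPolicy to IfNotPresent or Never, or the image uses a specific tag.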

sakaia commented 4 years ago

Thank you for your suggestion. My expectation was that the example would work by just following the steps in README.md. I hope the documentation is updated to say that you need to edit image: