kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0

PyTorchJob worker pods crash-loop in non-default namespace #258

Open jobvarkey opened 4 years ago

jobvarkey commented 4 years ago

Hello,

I am running Kubernetes v1.15.7 and Kubeflow 0.7.0 on an on-prem cluster with 6 worker nodes. Each node has 2 GPUs.

The provided mnist.py works fine when run under the default namespace (kubectl apply -f pytorch_job_mnist_gloo.yaml).

But the worker pod(s) crash-loop when the job is submitted under a non-default namespace (for example: kubectl apply -f pytorch_job_mnist_gloo.yaml -n i70994). The master pod is in the Running state.

    root@0939-jdeml-m01:/tmp# kubectl get pods -n i70994
    NAME                               READY   STATUS             RESTARTS   AGE
    jp-nb1-0                           2/2     Running            0          18h
    pytorch-dist-mnist-gloo-master-0   2/2     Running            1          33m
    pytorch-dist-mnist-gloo-worker-0   1/2     CrashLoopBackOff   11         33m

Attachments:

- kubectl_describe_pod_pytorch-dist-mnist-gloo-master-0.txt
- kubectl_describe_pod_pytorch-dist-mnist-gloo-worker-0.txt
- kubectl_logs_pytorch-dist-mnist-gloo-worker-0_container_istio-system.txt
- kubectl_logs_pytorch-dist-mnist-gloo-worker-0_container_pytorch.txt
- pytorch_job_mnist_gloo.yaml.txt
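For anyone reproducing this, the attached describe/log output can be collected with commands along these lines (the container names assume the example's pytorch container and Istio's istio-proxy sidecar):

    kubectl describe pod pytorch-dist-mnist-gloo-master-0 -n i70994
    kubectl describe pod pytorch-dist-mnist-gloo-worker-0 -n i70994
    kubectl logs pytorch-dist-mnist-gloo-worker-0 -n i70994 -c istio-proxy
    kubectl logs pytorch-dist-mnist-gloo-worker-0 -n i70994 -c pytorch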

Can anyone please help with this issue?

Thanks, Job

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

| Label | Probability |
| --- | --- |
| bug | 0.74 |

Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.

gaocegege commented 4 years ago

It seems that the Istio proxy is injected into the training pod. Are you running the job in the kubeflow namespace?

jobvarkey commented 4 years ago

The job is running in the namespace 'i70994'. This namespace was created when I logged in to the Kubeflow UI for the first time. Thanks

gaocegege commented 4 years ago

Can you show me the result of kubectl describe ns i70994?

jobvarkey commented 4 years ago

    root@0939-jdeml-m01:~# kubectl describe ns i70994
    Name:         i70994
    Labels:       istio-injection=enabled
                  katib-metricscollector-injection=enabled
                  serving.kubeflow.org/inferenceservice=enabled
    Annotations:  owner: I70994@verisk.com
    Status:       Active

No resource quota.

No resource limits.

636 commented 4 years ago

Hi @jobvarkey, I guess the cause is that istio-injection is enabled on your namespace. Could you try appending the snippet below to the template section in pytorch_job_mnist_gloo.yaml? It disables istio-injection for your PyTorchJob.

        metadata:
          annotations:
            sidecar.istio.io/inject: "false"

see: https://istio.io/docs/setup/additional-setup/sidecar-injection/
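For context, here is a sketch of where the annotation ends up in the full manifest, assuming the standard mnist gloo example layout (the image name is taken from the example and may differ in your setup):

    apiVersion: "kubeflow.org/v1"
    kind: "PyTorchJob"
    metadata:
      name: "pytorch-dist-mnist-gloo"
    spec:
      pytorchReplicaSpecs:
        Master:
          replicas: 1
          restartPolicy: OnFailure
          template:
            metadata:
              annotations:
                # keep the Istio sidecar out of the master pod
                sidecar.istio.io/inject: "false"
            spec:
              containers:
                - name: pytorch
                  image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
                  args: ["--backend", "gloo"]
        Worker:
          replicas: 1
          restartPolicy: OnFailure
          template:
            metadata:
              annotations:
                # keep the Istio sidecar out of the worker pods
                sidecar.istio.io/inject: "false"
            spec:
              containers:
                - name: pytorch
                  image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
                  args: ["--backend", "gloo"]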

But I don't know if this change affects anything else. Could anyone explain?

With this change, I was able to run mnist_gloo.

shawnzhu commented 4 years ago

> But I don't know if this change affects anything else. Could anyone explain?

This comment provides details https://github.com/kubeflow/kubeflow/issues/4935#issuecomment-615256808

Basically, if you disable Istio sidecar injection, ANY pod within the cluster can reach your PyTorchJob pods by pod name without mTLS, e.g. pytorch-dist-mnist-gloo-master-0.<namespace>.
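For example, with the sidecar (and therefore mTLS) removed, any pod in the cluster can reach the master's rendezvous port directly, assuming the operator's default port 23456:

    # run from any pod in the cluster; the per-replica Service shares the pod's name
    nc -vz pytorch-dist-mnist-gloo-master-0.i70994 23456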

By default, when running a PyTorchJob in a user namespace profile that has Istio sidecar injection enabled, the worker pods fail with an error like RuntimeError: Connection reset by peer.
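A quick way to check whether the sidecar was injected into a worker pod (the Istio sidecar container is normally named istio-proxy):

    kubectl get pod pytorch-dist-mnist-gloo-worker-0 -n i70994 \
      -o jsonpath='{.spec.containers[*].name}'
    # with injection enabled you should see:   pytorch istio-proxy
    # with the annotation added you should see: pytorch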