jobvarkey opened this issue 4 years ago
Issue-Label Bot is automatically applying the labels:
Label | Probability
---|---
bug | 0.74
It seems that the Istio proxy is injected into the training pod. Are you running the job in the kubeflow namespace?
The job is running in namespace 'i70994'. This namespace was created when I logged in to the Kubeflow UI for the first time. Thanks
Can you show me the result of `kubectl describe ns i70994`?
```
root@0939-jdeml-m01:~# kubectl describe ns i70994
Name:         i70994
Labels:       istio-injection=enabled
              katib-metricscollector-injection=enabled
              serving.kubeflow.org/inferenceservice=enabled
Annotations:  owner: I70994@verisk.com
Status:       Active

No resource quota.

No resource limits.
```
Hi @jobvarkey, I guess the cause is istio-injection being enabled on your namespace. Could you try appending the code below to the template section in pytorch_job_mnist_gloo.yaml? It disables istio-injection for your PyTorchJob.
```yaml
metadata:
  annotations:
    sidecar.istio.io/inject: "false"
```
see: https://istio.io/docs/setup/additional-setup/sidecar-injection/
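For context, here is roughly where the annotation lands in the manifest. This is a sketch only, not the exact upstream file: the image and args mirror the mnist example but may differ in your copy, and depending on your operator version the apiVersion may be kubeflow.org/v1 or an earlier v1beta revision. Note that the annotation goes under both the Master and Worker pod templates:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-mnist-gloo
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"   # disable sidecar injection for this pod
        spec:
          containers:
            - name: pytorch
              image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0  # illustrative
              args: ["--backend", "gloo"]
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"   # disable sidecar injection for this pod
        spec:
          containers:
            - name: pytorch
              image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0  # illustrative
              args: ["--backend", "gloo"]
```

Annotating per-job like this keeps sidecar injection enabled for everything else in the namespace, unlike removing the istio-injection label from the namespace itself.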
With this change, I was able to run mnist_gloo. But I don't know whether it affects anything else. Could anyone explain?
This comment provides details https://github.com/kubeflow/kubeflow/issues/4935#issuecomment-615256808
Basically, if you disable Istio sidecar injection, any pod within the cluster can access your PyTorchJob pods by pod name without mTLS, e.g. `pytorch-dist-mnist-gloo-master-0.<namespace>`.
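A minimal sketch of what that exposure means in practice, assuming the operator's default master port 23456 and a pod that has nc available (the hostname form is the per-pod one mentioned above):

```sh
# Hypothetical check from any other pod in the cluster: with the
# sidecar (and therefore mTLS) disabled, the master's gloo rendezvous
# port is reachable directly by DNS name.
nc -vz pytorch-dist-mnist-gloo-master-0.i70994 23456
```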
By default, when running a PyTorchJob in a user namespace profile that has Istio sidecar injection enabled, the worker pods fail with an error like `RuntimeError: Connection reset by peer`.
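One quick way to confirm whether the sidecar was actually injected into the job pods (a sketch; it assumes the sidecar keeps Istio's default container name, istio-proxy):

```sh
# List the containers in the worker pod; with injection enabled the
# istio-proxy sidecar should appear next to the pytorch container.
kubectl get pod pytorch-dist-mnist-gloo-worker-0 -n i70994 \
  -o jsonpath='{.spec.containers[*].name}'
```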
Hello,
I am running Kubernetes v1.15.7 and Kubeflow 0.7.0 on a 6-worker-node on-prem cluster. Each node has 2 GPUs.
The provided mnist.py works fine when run in the default namespace (`kubectl apply -f pytorch_job_mnist_gloo.yaml`).
But the worker pod(s) crash-loop when the job is submitted in a non-default namespace (for example: `kubectl apply -f pytorch_job_mnist_gloo.yaml -n i70994`), while the master pod stays in the Running state.
```
root@0939-jdeml-m01:/tmp# kubectl get pods -n i70994
NAME                               READY   STATUS             RESTARTS   AGE
jp-nb1-0                           2/2     Running            0          18h
pytorch-dist-mnist-gloo-master-0   2/2     Running            1          33m
pytorch-dist-mnist-gloo-worker-0   1/2     CrashLoopBackOff   11         33m
```
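For reference, the attached describe/log files below were gathered with commands along these lines (the sidecar container name here is my assumption; adjust it to whatever `kubectl describe pod` reports):

```sh
kubectl describe pod pytorch-dist-mnist-gloo-worker-0 -n i70994
kubectl logs pytorch-dist-mnist-gloo-worker-0 -n i70994 -c pytorch
kubectl logs pytorch-dist-mnist-gloo-worker-0 -n i70994 -c istio-proxy
```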
Attachments:
- kubectl_describe_pod_pytorch-dist-mnist-gloo-master-0.txt
- kubectl_describe_pod_pytorch-dist-mnist-gloo-worker-0.txt
- kubectl_logs_pytorch-dist-mnist-gloo-worker-0_container_istio-system.txt
- kubectl_logs_pytorch-dist-mnist-gloo-worker-0_container_pytorch.txt
- pytorch_job_mnist_gloo.yaml.txt
Can anyone please help with this issue?
Thanks, Job