kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0
306 stars 143 forks

allocating master and worker on different GPU nodes #224

Closed mengdong closed 4 years ago

mengdong commented 4 years ago

When allocating the master and workers to GPU nodes on GKE, I notice that everything works when the master and workers are on the same node. However, if a worker gets allocated to a different GPU node, it gets stuck in the ContainerCreating stage forever.

gaocegege commented 4 years ago

Can you run kubectl describe to get the pod info and show the output here?

mengdong commented 4 years ago

I realized it is due to a separate persistent volume issue.