kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0

PytorchJob DDP training will stop if I delete a worker pod #364

Open Shuai-Xie opened 3 years ago

Shuai-Xie commented 3 years ago

Hi, everyone.

I want to test the fault tolerance of PytorchJob.

I started a PytorchJob with 1 master and 3 workers.

$ kubectl get pods -o wide
NAME                 READY   STATUS    RESTARTS   AGE     IP           NODE
mnist-ddp-master-0   1/1     Running   0          2m55s   11.80.0.36   11.71.1.160
mnist-ddp-worker-0   1/1     Running   0          2m55s   11.80.0.37   11.71.1.160
mnist-ddp-worker-1   1/1     Running   0          2m55s   11.80.0.38   11.71.1.160
mnist-ddp-worker-2   1/1     Running   0          89s     11.80.0.46   11.71.1.160
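The original manifest was not posted, but a minimal PyTorchJob spec matching this setup (1 master, 3 workers, `restartPolicy: OnFailure`) would look roughly like the sketch below; the job name matches the pods above, while the image is illustrative:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: mnist-ddp
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch                         # container must be named "pytorch"
              image: my-registry/mnist-ddp:latest   # illustrative image
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/mnist-ddp:latest
```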

It trains fine.

Then I deleted a worker.

$ kubectl delete pod mnist-ddp-worker-1

Because I set restartPolicy: OnFailure, the pod restarted quickly with the same name, mnist-ddp-worker-1.

But sadly, the restarted worker never rejoins the DDP training.
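One likely reason: the operator injects a static rendezvous into every pod via the standard `env://` variables (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`), and classic `torch.distributed.init_process_group` cannot re-join a group whose other members are already past initialization. A minimal sketch of what a restarted pod sees (the values below are illustrative, not taken from the actual job):

```python
import os

# Simulate the environment the operator injects into a worker pod.
# The variable names are the standard env:// rendezvous variables;
# the concrete values here are illustrative assumptions.
os.environ.update({
    "MASTER_ADDR": "mnist-ddp-master-0",
    "MASTER_PORT": "23456",
    "WORLD_SIZE": "4",   # 1 master + 3 workers, fixed at job creation
    "RANK": "2",         # this pod's rank, also fixed
})

def rendezvous_config():
    """Read the static rendezvous settings a restarted pod would see."""
    return {
        "master": f'{os.environ["MASTER_ADDR"]}:{os.environ["MASTER_PORT"]}',
        "world_size": int(os.environ["WORLD_SIZE"]),
        "rank": int(os.environ["RANK"]),
    }

cfg = rendezvous_config()
# The restarted worker gets exactly the same rank and world_size, but the
# surviving ranks are mid-training, so a fresh init_process_group call in
# this pod has no live rendezvous to join and training does not resume.
print(cfg)
```

Elastic membership (workers leaving and rejoining a running job) is what TorchElastic / `torchrun` was built for; plain DDP with a fixed world size does not support it.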

Thanks.

gaocegege commented 3 years ago

This repository will be deprecated soon, please open an issue at github.com/kubeflow/training-operator

Shuai-Xie commented 3 years ago

Alright, will do, @gaocegege