Open Shuai-Xie opened 3 years ago
Hi, everyone.
I want to test the fault tolerance of PyTorchJob.
I started a PyTorchJob with 1 master and 3 workers.
$ kubectl get pods -o wide
NAME                 READY   STATUS    RESTARTS   AGE     IP           NODE
mnist-ddp-master-0   1/1     Running   0          2m55s   11.80.0.36   11.71.1.160
mnist-ddp-worker-0   1/1     Running   0          2m55s   11.80.0.37   11.71.1.160
mnist-ddp-worker-1   1/1     Running   0          2m55s   11.80.0.38   11.71.1.160
mnist-ddp-worker-2   1/1     Running   0          89s     11.80.0.46   11.71.1.160
It trains fine.
Then I deleted a worker.
$ kubectl delete pod mnist-ddp-worker-1
Because I set restartPolicy: OnFailure, the pod was recreated quickly under the same name, mnist-ddp-worker-1.
Unfortunately, the restarted worker never rejoins the DDP training.
Thanks.
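For context, here is a minimal sketch of how each replica in a job like this typically joins the process group. The training script is not shown in this issue, so everything here is an assumption: the helper names `RendezvousConfig`, `read_rendezvous_config`, and `join_process_group` are hypothetical, and the script is assumed to read the `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` environment variables that the operator injects into each pod.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RendezvousConfig:
    """Rendezvous settings for one replica (hypothetical helper type)."""
    master_addr: str
    master_port: int
    rank: int
    world_size: int


def read_rendezvous_config(env=None):
    """Read the rendezvous settings from the environment (or a dict)."""
    env = os.environ if env is None else env
    return RendezvousConfig(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        rank=int(env["RANK"]),
        world_size=int(env["WORLD_SIZE"]),
    )


def join_process_group(cfg, backend="gloo"):
    """Join the process group described by cfg (requires PyTorch)."""
    # Imported lazily so read_rendezvous_config works without torch installed.
    import torch.distributed as dist

    # Blocks until world_size ranks have connected to the master.
    dist.init_process_group(
        backend=backend,
        init_method=f"tcp://{cfg.master_addr}:{cfg.master_port}",
        rank=cfg.rank,
        world_size=cfg.world_size,
    )
    return dist


if __name__ == "__main__":
    # Example values only; inside a pod these would come from os.environ.
    example_env = {
        "MASTER_ADDR": "mnist-ddp-master-0",
        "MASTER_PORT": "23456",
        "RANK": "2",
        "WORLD_SIZE": "4",
    }
    print(read_rendezvous_config(example_env))
```

If the script follows this pattern, the group membership is fixed once `init_process_group` returns on every rank. A recreated mnist-ddp-worker-1 that calls it again would likely not be picked up by the ranks that are already training unless they re-initialize the group as well.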
This repository will be deprecated soon. Please open an issue at github.com/kubeflow/training-operator
haolei, gege @gaocegege