kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0

The training hangs after reloading one of the master/worker pods #359

Open dmitsf opened 3 years ago

dmitsf commented 3 years ago

Hello! I'm setting up training with PyTorchJobs and I have a problem: if one of the pods (master or worker, it doesn't matter) restarts, the whole training process hangs. The reasons for the restart vary; usually it's Google Cloud Engine rescheduling the node. I also tried killing pods myself, and the behavior was the same. Can I avoid this and make the training tolerant to pod restarts?
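
For context, here is a rough sketch of what an mnist-example-style entrypoint typically looks like; the model, the training loop, and the gloo backend are placeholders, and the env-var based initialization is an assumption about how the operator wires up the pods:

```python
import torch
import torch.distributed as dist
import torch.nn as nn

def main():
    # The operator is assumed to inject MASTER_ADDR, MASTER_PORT, RANK and
    # WORLD_SIZE into each pod, so env:// initialization picks them up.
    dist.init_process_group(backend="gloo", init_method="env://")

    model = nn.Linear(10, 1)                       # placeholder model
    ddp_model = nn.parallel.DistributedDataParallel(model)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for step in range(1000):                       # placeholder loop
        optimizer.zero_grad()
        loss = ddp_model(torch.randn(32, 10)).sum()
        # DDP's gradient allreduce is a blocking collective: if a peer pod is
        # rescheduled mid-step, the surviving ranks sit here until the
        # collective times out, which in practice looks like a permanent hang.
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With this static setup, a restarted pod rejoins with the same rank but the old process group is gone, so neither side can make progress.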

gaocegege commented 3 years ago

Can you tell us the PyTorch version?

dmitsf commented 3 years ago

I use PyTorch 1.9.0.

gaocegege commented 3 years ago

Are you using torch.distributed.run?

dmitsf commented 3 years ago

I don't use it at the moment. I followed the mnist example to adapt my training script.
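
For reference, a rough sketch of how the same training function could instead be run under PyTorch 1.9's elastic launcher (the machinery that `torch.distributed.run` wraps), which can re-form the rendezvous after a worker comes back. The endpoint, run id, node counts, and restart budget below are placeholders, and the `LaunchConfig` field names should be checked against your exact PyTorch version:

```python
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def launch(train_fn):
    config = LaunchConfig(
        min_nodes=2,                                  # placeholder: smallest group to keep training
        max_nodes=2,                                  # placeholder: full replica count of the PyTorchJob
        nproc_per_node=1,                             # one process per pod, as in the mnist example
        rdzv_backend="c10d",                          # c10d rendezvous needs no external etcd
        rdzv_endpoint="pytorch-job-master-0:29400",   # placeholder rendezvous address
        run_id="pytorch-job",                         # placeholder job id
        max_restarts=3,                               # re-form the group a few times before giving up
    )
    # elastic_launch returns a callable that starts train_fn in managed workers
    # and restarts the group when membership changes.
    elastic_launch(config, train_fn)()
```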

gaocegege commented 3 years ago

Can you please show us the script and the YAML file? PyTorch 1.9 introduced elastic training, and that may be related to the hang.
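
As a side note, one hedged mitigation is to give the process group an explicit timeout so a vanished peer surfaces as an error (which the operator's restartPolicy can then act on) instead of an apparent hang. The timeout value and the NCCL env var name are illustrative and should be verified against your PyTorch build:

```python
import datetime
import os

import torch.distributed as dist

# For NCCL, async error handling must be enabled before init so that stuck
# collectives are aborted when the timeout expires (env var name assumed
# for the PyTorch 1.9 era).
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(minutes=10),  # fail collectives that stall this long
)
```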