kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Pytorch job running with pod exception unable to recover after retry #2300

Open shaoqingyang opened 3 days ago

shaoqingyang commented 3 days ago

What happened?

I created a PyTorchJob that uses three pods. When I delete one of the pods (a worker), the pod is recreated, but it cannot rejoin the training cluster.
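For illustration, the deletion described above can be reproduced with the Kubernetes Python client. This is only a sketch: the job name and namespace are placeholders, and the pod name assumes the operator's usual `<job-name>-worker-<index>` naming convention.

```python
# Illustrative sketch: delete one worker pod of a running PyTorchJob to
# reproduce the scenario. Job name, namespace, and index are placeholders.
from kubernetes import client, config

def delete_worker_pod(job_name: str, worker_index: int, namespace: str = "default") -> None:
    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
    core = client.CoreV1Api()
    # Assumes the training-operator's "<job-name>-worker-<index>" pod naming.
    pod_name = f"{job_name}-worker-{worker_index}"
    core.delete_namespaced_pod(name=pod_name, namespace=namespace)

if __name__ == "__main__":
    delete_worker_pod("my-pytorch-job", worker_index=0, namespace="kubeflow")
```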

What did you expect to happen?

The recreated pod should rejoin the cluster and continue training.

Environment

Kubernetes version:

$ kubectl version

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"

Training Operator Python SDK version:

$ pip show kubeflow-training

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

shaoqingyang commented 3 days ago

(screenshot attached)

Syulin7 commented 20 hours ago

@shaoqingyang Are you using DeepSpeed for model training?

This issue is not caused by the training-operator. You need to confirm whether the training framework you are using can recover a job when one of its processes exits and is restarted. IIRC, DeepSpeed does not support this.
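To illustrate what "supports recovery" means for plain PyTorch: with torchrun's elastic mode, a single worker failure causes the whole process group to be restarted, and the training script itself has to reload the latest checkpoint so the group can resume. The launcher flags, checkpoint path, and toy model below are illustrative assumptions, not taken from this job.

```python
# Minimal checkpoint-aware DDP script, intended to be launched with something like:
#   torchrun --nnodes=3 --nproc_per_node=1 --max-restarts=3 \
#            --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29400 train.py
# so that a restarted worker group resumes from the last saved step.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

CKPT_PATH = "checkpoint.pt"  # in a real multi-node job this should live on shared storage

def main():
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU nodes
    rank = dist.get_rank()

    model = DDP(torch.nn.Linear(10, 1))      # toy model for illustration
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # On (re)start, every worker reloads the last checkpoint so the whole
    # group resumes from the same epoch after an elastic restart.
    start_epoch = 0
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH, map_location="cpu")
        model.module.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        start_epoch = ckpt["epoch"] + 1

    for epoch in range(start_epoch, 100):
        data, target = torch.randn(32, 10), torch.randn(32, 1)
        loss = torch.nn.functional.mse_loss(model(data), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if rank == 0:  # only rank 0 writes the checkpoint
            torch.save({"model": model.module.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "epoch": epoch}, CKPT_PATH)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```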

Syulin7 commented 20 hours ago

Another solution is to re-create all of the PyTorchJob's pods, as discussed in https://github.com/kubeflow/training-operator/issues/2269.
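A minimal sketch of that approach with the Kubernetes Python client, assuming pods created by the training-operator carry the `training.kubeflow.org/job-name` label (job name and namespace below are placeholders):

```python
# Hypothetical helper: delete every pod belonging to a PyTorchJob so the
# controller recreates the whole group and the workers re-rendezvous together.
from kubernetes import client, config

def restart_pytorchjob_pods(job_name: str, namespace: str = "default") -> None:
    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
    core = client.CoreV1Api()
    # Assumes the training-operator labels its pods with training.kubeflow.org/job-name.
    core.delete_collection_namespaced_pod(
        namespace=namespace,
        label_selector=f"training.kubeflow.org/job-name={job_name}",
    )

if __name__ == "__main__":
    restart_pytorchjob_pods("my-pytorch-job", namespace="kubeflow")
```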