shaoqingyang opened 3 days ago
@shaoqingyang Are you using DeepSpeed for model training?
This issue is not caused by training-operator. You need to confirm whether the training framework you are using supports job recovery if one of the processes exits and is restarted. IIRC, DeepSpeed does not support it.
Re-creating all of the PyTorchJob's pods, as proposed in https://github.com/kubeflow/training-operator/issues/2269, is another solution.
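As a rough sketch of that workaround, you can delete every pod of the job at once so the controller re-creates them together and all ranks re-join the rendezvous from scratch. The job name `my-job` is a placeholder, and the label key is an assumption based on the pod labels the training-operator applies (`training.kubeflow.org/job-name=<job>`) — double-check the labels on your pods first.

```shell
# Hedged sketch, untested against a live cluster: restart all pods of a
# PyTorchJob together. JOB_NAME and the label key are assumptions.
JOB_NAME="my-job"
SELECTOR="training.kubeflow.org/job-name=${JOB_NAME}"

# Delete every master/worker pod; the controller re-creates them:
#   kubectl delete pods -l "${SELECTOR}"
# Then watch them come back up and re-join:
#   kubectl get pods -l "${SELECTOR}" -w
echo "kubectl delete pods -l ${SELECTOR}"
```

Because all processes restart together, this avoids the single-restarted-worker case that DeepSpeed cannot recover from.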
What happened?
I created a PyTorchJob that uses three pods. When I delete a pod (a worker), it is recovered, but it cannot re-join the cluster.
What did you expect to happen?
The recovered pod re-joins the cluster and training continues.
Environment
Kubernetes version:
Training Operator version:
Training Operator Python SDK version:
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.