kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

mpi job bug #2334

Open fyxemmmm opened 6 days ago

fyxemmmm commented 6 days ago

What happened?

Version: v1.8.1

training-operator/pkg/controller.v1/mpi/mpijob_controller.go -> UpdateJobStatus

When a job has failed, the check if spec.RestartPolicy == commonv1.RestartPolicyExitCode seems to be problematic: it matches only the ExitCode policy. It should be if spec.RestartPolicy != commonv1.RestartPolicyNever...
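
To make the difference between the two conditions concrete, here is a minimal, self-contained sketch. It does not reproduce the controller code: the RestartPolicy type and its constants are local stand-ins, assumed to mirror the commonv1 values (Always, OnFailure, Never, ExitCode), and the helper names currentCheck and proposedCheck are hypothetical. It only shows for which policies each condition evaluates to true.

```go
package main

import "fmt"

// RestartPolicy is a local stand-in for commonv1.RestartPolicy.
// The four values below are assumed to mirror the upstream constants.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
	RestartPolicyExitCode  RestartPolicy = "ExitCode"
)

// currentCheck mirrors the condition quoted in this report (hypothetical helper).
func currentCheck(p RestartPolicy) bool {
	return p == RestartPolicyExitCode
}

// proposedCheck mirrors the condition suggested in this report (hypothetical helper).
func proposedCheck(p RestartPolicy) bool {
	return p != RestartPolicyNever
}

func main() {
	policies := []RestartPolicy{
		RestartPolicyAlways,
		RestartPolicyOnFailure,
		RestartPolicyNever,
		RestartPolicyExitCode,
	}
	// Print one row per policy so the two conditions can be compared side by side.
	for _, p := range policies {
		fmt.Printf("%-10s current=%-5t proposed=%t\n", p, currentCheck(p), proposedCheck(p))
	}
}
```

Under these assumptions, the two conditions agree for Never and ExitCode and differ for Always and OnFailure, which are the policies affected by the suggested change.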

andreyvelich commented 1 day ago

cc @tenzen-y @alculquicondor @terrytangyuan