kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0
417 stars 209 forks

Fix missing Job status when worker pod spec is invalid #606

Open congpeiqing opened 7 months ago

congpeiqing commented 7 months ago

close #604

When a worker pod fails to be created, the current behavior is to retry later. However, retrying does not help if the failure is caused by an invalid Pod spec. In this PR, I check the failure reason first; if it is an invalid Pod spec, the Job's status is updated to "Failed" without any retries.

google-oss-prow[bot] commented 7 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign alculquicondor for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:
- **[OWNERS](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.