kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

Worker pods not cleaned up upon `MPIJobEvicted` event #647

Open shaowei-su opened 3 months ago

shaowei-su commented 3 months ago

If a worker pod is evicted, the entire MPIJob transitions to the Failed state:

status:
  conditions:
  - lastTransitionTime: "2024-08-14T19:45:39Z"
    lastUpdateTime: "2024-08-14T19:45:39Z"
    message: MPIJob xxx is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2024-08-14T19:48:02Z"
    lastUpdateTime: "2024-08-14T19:48:02Z"
    message: MPIJob xxx is running.
    reason: MPIJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2024-08-15T04:01:42Z"
    lastUpdateTime: "2024-08-15T04:01:42Z"
    message: 1/8 workers are evicted
    reason: MPIJobEvicted
    status: "True"
    type: Failed
  replicaStatuses:
    Launcher:
      failed: 1
    Worker:
      active: 7
      failed: 1
  startTime: "2024-08-14T19:45:39Z"

However, the run policy is not honored when this happens: the remaining worker pods are left running instead of being cleaned up.

  runPolicy:
    backoffLimit: 1
    cleanPodPolicy: Running
    ttlSecondsAfterFinished: 10800
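
To make the expected behavior concrete, here is a minimal sketch of the cleanup decision I would expect once the job has a terminal condition. It is not the operator's actual code: `is_finished`, `pods_to_delete`, and the `worker-N` pod names are hypothetical, and it assumes that `cleanPodPolicy: Running` means "delete pods whose phase is still Running once the job has finished".

```python
from typing import Dict, List


def is_finished(conditions: List[dict]) -> bool:
    """Return True if a Succeeded or Failed condition has status "True"."""
    return any(
        c.get("type") in ("Succeeded", "Failed") and c.get("status") == "True"
        for c in conditions
    )


def pods_to_delete(
    conditions: List[dict],
    clean_pod_policy: str,
    pod_phases: Dict[str, str],
) -> List[str]:
    """Hypothetical helper: pick the pods that should be cleaned up.

    Assumed semantics (not taken from the operator source):
      - "All":     delete every pod once the job is finished
      - "Running": delete only pods whose phase is still "Running"
      - "None":    delete nothing
    """
    if not is_finished(conditions):
        return []
    if clean_pod_policy == "All":
        return sorted(pod_phases)
    if clean_pod_policy == "Running":
        return sorted(
            name for name, phase in pod_phases.items() if phase == "Running"
        )
    return []


# Conditions as reported in the status above.
conditions = [
    {"type": "Created", "status": "True", "reason": "MPIJobCreated"},
    {"type": "Running", "status": "False", "reason": "MPIJobRunning"},
    {"type": "Failed", "status": "True", "reason": "MPIJobEvicted"},
]

# 7 workers still active, 1 evicted (pod names are made up for illustration).
pod_phases = {f"worker-{i}": "Running" for i in range(7)}
pod_phases["worker-7"] = "Failed"

to_delete = pods_to_delete(conditions, "Running", pod_phases)
print(to_delete)
```

With the status above, the `Failed` condition is `"True"`, so under these assumptions the 7 workers that are still running would be selected for deletion. What I observe instead is that none of them are removed.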