Open shaowei-su opened 3 months ago
If a worker pod is evicted, the entire MPIJob goes into the Failed state:
status:
  conditions:
  - lastTransitionTime: "2024-08-14T19:45:39Z"
    lastUpdateTime: "2024-08-14T19:45:39Z"
    message: MPIJob xxx is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2024-08-14T19:48:02Z"
    lastUpdateTime: "2024-08-14T19:48:02Z"
    message: MPIJob xxx is running.
    reason: MPIJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2024-08-15T04:01:42Z"
    lastUpdateTime: "2024-08-15T04:01:42Z"
    message: 1/8 workers are evicted
    reason: MPIJobEvicted
    status: "True"
    type: Failed
  replicaStatuses:
    Launcher:
      failed: 1
    Worker:
      active: 7
      failed: 1
  startTime: "2024-08-14T19:45:39Z"
However, the run policy is not honored when this happens, and the remaining worker pods are left in the Running state.
runPolicy:
  backoffLimit: 1
  cleanPodPolicy: Running
  ttlSecondsAfterFinished: 10800
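
For anyone trying to confirm they are hitting the same thing, here is a rough sketch of a check on the job status. It assumes the status has already been loaded into a plain Python dict (for example from `kubectl get mpijob xxx -o json`); the helper name `has_stale_workers` is made up for illustration and is not part of the operator. It only flags the inconsistency described above: a Failed condition set to "True" while `replicaStatuses.Worker.active` is still non-zero.

```python
def has_stale_workers(status: dict) -> bool:
    """Return True if the job is marked Failed but workers are still active.

    Hypothetical helper: `status` is the `.status` object of the MPIJob,
    already parsed into a dict.
    """
    # A condition only counts when its status field is the string "True".
    failed = any(
        c.get("type") == "Failed" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )
    # Missing keys are treated as zero active workers.
    active_workers = (
        status.get("replicaStatuses", {}).get("Worker", {}).get("active", 0)
    )
    return failed and active_workers > 0


# Trimmed copy of the status reported in this issue.
status = {
    "conditions": [
        {"type": "Created", "status": "True", "reason": "MPIJobCreated"},
        {"type": "Running", "status": "False", "reason": "MPIJobRunning"},
        {"type": "Failed", "status": "True", "reason": "MPIJobEvicted"},
    ],
    "replicaStatuses": {
        "Launcher": {"failed": 1},
        "Worker": {"active": 7, "failed": 1},
    },
}

print(has_stale_workers(status))
```

With `cleanPodPolicy: Running` I would expect this check to stop returning true shortly after the job fails, since the running worker pods should be cleaned up.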