kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0
420 stars 211 forks source link

Support exitCode restartPolicy #537

Open Syulin7 opened 1 year ago

Syulin7 commented 1 year ago

MPIJob doesn't support restartPolicy=ExitCode

ref: https://github.com/kubeflow/training-operator/issues/1768

tenzen-y commented 1 year ago

/kind feature

alculquicondor commented 1 year ago

Can you clarify what is the expected behavior?

I wonder if we should have an API that is closer to Job's PodFailurePolicy instead https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy

alculquicondor commented 1 year ago

If we add exitCode, we might be further from this goal https://github.com/kubeflow/training-operator/issues/1718

tenzen-y commented 1 year ago

Can you clarify what is the expected behavior?

I wonder if we should have an API that is closer to Job's PodFailurePolicy instead https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy

Probably, he expects a similar behavior to other training controllers (e.g., tfjob).

https://github.com/kubeflow/common/blob/4dbf18c85417b54158914d1d01c1744b4e3542f2/pkg/apis/common/v1/types.go#L167-L173

If we add exitCode, we might be further from this goal https://github.com/kubeflow/training-operator/issues/1718

In the future, it would be good to support PodFailurePolicy once we introduce batch/v1 Job to workers.

However, it might be worth supporting a similar logic to other CustomJobs in the short term. This doesn't mean the changes break current logic and API.

WDYT?

alculquicondor commented 1 year ago

I guess we'll need a new API version anyways.

Since the launcher already uses Job, it might be worth translating the behavior of ExitCode into a PodFailurePolicy (I believe it should be possible), instead of manually watching the pods.

tenzen-y commented 1 year ago

I guess we'll need a new API version anyways.

I don't think we need a new API version since we can support restartPolicy=ExitCode, the same as other CustomJobs, using the following logic:

            // Get the exit code of the container.
            var exitCode int32 = 0xbeef // magic number
            for _, status := range pod.Status.ContainerStatuses {
                state := status.State
                if status.Name == r.GetDefaultContainerName() && state.Terminated != nil {
                    exitCode = state.Terminated.ExitCode
                    logger.Infof("Pod: %v.%v exited with code %v", pod.Namespace, pod.Name, exitCode)
                    r.GetRecorder().Eventf(job, corev1.EventTypeNormal, "ExitedWithCode", "Pod: %v.%v exited with code %v", pod.Namespace, pod.Name, exitCode)
                }
            }
            // Check if the pod is retryable.
            if spec.RestartPolicy == commonv1.RestartPolicyExitCode {
                if pod.Status.Phase == corev1.PodFailed && trainutil.IsRetryableExitCode(exitCode) {
                    failedPodsCount.Inc()
                    logger.Infof("Need to restart the pod: %v.%v", pod.Namespace, pod.Name)
                    if err = r.DeletePod(ctx, pod.Namespace, pod.Name); err != nil {
                        return err
                    }
                }
            }

https://github.com/kubeflow/common/blob/4dbf18c85417b54158914d1d01c1744b4e3542f2/pkg/reconciler.v1/common/pod.go#L172-L191

If we support a similarly logic to other CustomJobs, we don't provide a configurable ExitCode similar to PodFailurePolicy.

    // RestartPolicyExitCode policy means that user should add exit code by themselves,
    // The job operator will check these exit codes to
    // determine the behavior when an error occurs:
    // - 1-127: permanent error, do not restart.
    // - 128-255: retryable error, will restart the pod.
    RestartPolicyExitCode RestartPolicy = "ExitCode"

https://github.com/kubeflow/common/blob/4dbf18c85417b54158914d1d01c1744b4e3542f2/pkg/apis/common/v1/types.go#L167-L173

tenzen-y commented 1 year ago

Since the launcher already uses Job, it might be worth translating the behavior of ExitCode into a PodFailurePolicy (I believe it should be possible), instead of manually watching the pods.

I agree.

alculquicondor commented 1 year ago

Sorry, what I meant is that we might need a new apiversion once we plan to migrate everything to batch/Job.

tenzen-y commented 1 year ago

Sorry, what I meant is that we might need a new apiversion once we plan to migrate everything to batch/Job.

Ah, I see. I agree. We need a new API version if we migrate workers to batch/Jobs.

In the short term, I would like to support the restartPolicy=ExitCode, similar to the other CustomJobs in MPIJob v2beta1. Since if we migrate everything to batch/job, we might need to use Elastic Indexed Job for the elastic training (e.g., elastic horovod).

WDYT?

alculquicondor commented 1 year ago

sounds good

tenzen-y commented 1 year ago

/assign