kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

Replace the plain pod workers with Indexed Job #613

Open tenzen-y opened 6 months ago

tenzen-y commented 6 months ago

Part-of: #373

Currently, the mpi-operator manages the workers as plain pods. However, this management mechanism is similar to what the Kubernetes batch/job already provides, so it reinvents the wheel. I understand that, in the past, batch/job did not have all the features needed to replace the plain pods.

Because Indexed Jobs have supported elastic scaling (Elastic Indexed Job) by default since Kubernetes v1.27, we can still support MPIJob with elastic semantics, as Horovod does, even if we replace the plain pod management with an Indexed Job.
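To illustrate the idea, here is a rough sketch of what the worker set could look like as an Indexed Job, and how elastic scaling would work. It builds the manifests as plain dicts rather than using the operator's Go types. The helper names (`worker_indexed_job`, `scale_workers`), the `-worker` name suffix, and the image are hypothetical and not taken from the mpi-operator code base. The one Kubernetes rule it relies on is that an Elastic Indexed Job is scaled by changing `.spec.completions` and `.spec.parallelism` together so that they stay equal.

```python
import copy
import json


def worker_indexed_job(name: str, image: str, replicas: int) -> dict:
    """Build a batch/v1 Indexed Job manifest for the workers (hypothetical helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"{name}-worker"},  # naming scheme is an assumption
        "spec": {
            # Each pod receives a stable completion index: 0..replicas-1.
            "completionMode": "Indexed",
            "completions": replicas,
            "parallelism": replicas,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "worker", "image": image}],
                }
            },
        },
    }


def scale_workers(job: dict, replicas: int) -> dict:
    """Return a copy of the Job scaled to `replicas` workers.

    Elastic Indexed Jobs require completions and parallelism to be
    changed together and to remain equal.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    scaled = copy.deepcopy(job)
    scaled["spec"]["completions"] = replicas
    scaled["spec"]["parallelism"] = replicas
    return scaled


job = worker_indexed_job("pi", "example.com/mpi-worker:latest", replicas=2)
scaled = scale_workers(job, replicas=4)
print(json.dumps(scaled["spec"], indent=2))
```

In the operator itself this would be a patch to the existing Job object rather than a fresh manifest; the sketch only shows which fields would change.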

So, I propose replacing the plain pod workers with an Indexed Job once Kubernetes v1.26 (EoL: 2024-02-28) is no longer supported.

Let me know what you think. @alculquicondor @terrytangyuan

alculquicondor commented 6 months ago

Is this something that is happening in the training-operator too? If not, could it make it harder to merge them in the future? I suppose not, as the plan is to use the mpi-operator as a library in the training-operator, right?

tenzen-y commented 6 months ago

> Is this something that is happening in the training-operator too?

Yes, the training-operator also plans to migrate to Indexed Job: https://github.com/kubeflow/training-operator/issues/1718

However, we (training-operator) haven't decided yet which one (using the mpi-operator as a library or migrating to Indexed Job) we should work on first.

tenzen-y commented 6 months ago

Ah, in the training-operator, the last missing piece for migrating to Indexed Job is JobSuccessPolicy (KEP-3998).
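For context, a minimal sketch of what a success policy adds to an Indexed Job spec. The `successPolicy.rules[].succeededIndexes` field comes from the batch/v1 Job API introduced by KEP-3998. Using index `0` as the "leader" index and the helper name `with_success_policy` are assumptions for illustration only, not a decided design for the training-operator.

```python
import json


def with_success_policy(job_spec: dict, succeeded_indexes: str) -> dict:
    """Return a copy of an Indexed Job spec with a success policy attached.

    The Job is declared succeeded once the listed indexes succeed,
    without waiting for every other index to complete.
    """
    spec = dict(job_spec)  # shallow copy; the input spec is left untouched
    spec["successPolicy"] = {
        "rules": [{"succeededIndexes": succeeded_indexes}]
    }
    return spec


base_spec = {"completionMode": "Indexed", "completions": 4, "parallelism": 4}
# Assumption: index 0 plays the leader role, so its success ends the Job.
policy_spec = with_success_policy(base_spec, "0")
print(json.dumps(policy_spec, indent=2))
```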

terrytangyuan commented 6 months ago

> Because Indexed Jobs have supported elastic scaling (Elastic Indexed Job) by default since Kubernetes v1.27, we can still support MPIJob with elastic semantics, as Horovod does, even if we replace the plain pod management with an Indexed Job.

This is great. Good to know that elastic semantics can be maintained.

> So, I propose replacing the plain pod workers with an Indexed Job once Kubernetes v1.26 (EoL: 2024-02-28) is no longer supported.

I am ok with the timeline.