Open tenzen-y opened 6 months ago
Is this something that is happening in the training-operator too? If not, could it make it harder to merge them in the future? I suppose not, as the plan is to use the mpi-operator as a library in the training-operator, right?
Is this something that is happening in the training-operator too?
Yes, the training-operator also plans to migrate to Indexed Job: https://github.com/kubeflow/training-operator/issues/1718
However, we (training-operator) haven't decided yet which of the two (using the mpi-operator as a library, or migrating to Indexed Job) we should work on first.
Ah, in the training-operator, the last missing piece for migrating to Indexed Job is JobSuccessPolicy (KEP-3998).
Because Indexed Job has supported elasticity (Elastic Indexed Job) by default since Kubernetes v1.27, we can still support MPIJob with elastic semantics, as Horovod does, even if we replace the plain pod management with Indexed Job.
This is great. Good to know that elastic semantics can be maintained.
So, I would propose replacing the plain pod workers with Indexed Job once Kubernetes v1.26 (EoL: 2024-02-28) has reached end of life.
I am ok with the timeline.
Part-of: #373
Currently, the mpi-operator manages the workers as plain pods. However, this management mechanism is similar to the Kubernetes batch/job, so it reinvents the wheel, although I understand that in the past batch/job did not have all the features needed to replace the plain pods.
Because Indexed Job has supported elasticity (Elastic Indexed Job) by default since Kubernetes v1.27, we can still support MPIJob with elastic semantics, as Horovod does, even if we replace the plain pod management with Indexed Job.
So, I would propose replacing the plain pod workers with Indexed Job once Kubernetes v1.26 (EoL: 2024-02-28) has reached end of life.
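To make the idea concrete, here is a rough sketch of what the worker side could look like as an Indexed Job manifest, and how elastic scaling would map onto it. This is only an illustration, not the mpi-operator's actual implementation: the helper names `worker_indexed_job` and `scale_workers`, the `-worker` name suffix, and the image are hypothetical. The manifest fields themselves (`completionMode: Indexed`, `completions`, `parallelism`) are regular batch/v1 Job fields, and with Elastic Indexed Job the two counts are changed together and kept equal.

```python
import copy
import json


def worker_indexed_job(name: str, image: str, replicas: int) -> dict:
    """Build a batch/v1 Job manifest in Indexed completion mode.

    The "-worker" suffix and the container name are placeholders chosen
    for this sketch, not names used by the mpi-operator.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"{name}-worker"},
        "spec": {
            # Each pod gets a stable completion index (0..replicas-1).
            "completionMode": "Indexed",
            "completions": replicas,
            "parallelism": replicas,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "worker", "image": image}],
                }
            },
        },
    }


def scale_workers(job: dict, replicas: int) -> dict:
    """Return a copy of the manifest scaled to `replicas` workers.

    Elastic Indexed Job requires completions and parallelism to be
    updated together and to stay equal, so both are set here.
    """
    scaled = copy.deepcopy(job)
    scaled["spec"]["completions"] = replicas
    scaled["spec"]["parallelism"] = replicas
    return scaled


if __name__ == "__main__":
    # Hypothetical job name and image, for illustration only.
    job = worker_indexed_job("pi", "example.com/mpi-worker:latest", 2)
    print(json.dumps(scale_workers(job, 4), indent=2))
```

With this shape, the operator would no longer need to create, count, and delete worker pods itself; scaling the workers up or down becomes a single update to the Job spec.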
Let me know what you think. @alculquicondor @terrytangyuan