kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

KEP-2170: Support hundreds and thousands of worker nodes for a single training Job #2318

Open tenzen-y opened 2 weeks ago

tenzen-y commented 2 weeks ago

What would you like to be added?

We should support multiple replicas per replicatedJob, like:

[...]
spec:
  replicatedJobs:
  - name:
    replicas: 5
[...]
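
For context, JobSet expands each replicatedJob into replicas independent batch/v1 Jobs built from the same Job template, so a filled-in sketch of the fragment above (the name here is illustrative, not part of the proposal) would look like:

spec:
  replicatedJobs:
  - name: training-node    # illustrative name
    replicas: 5            # JobSet creates five separate batch/v1 Jobs from this one template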

Why is this needed?

Currently, we force the JobSet ReplicatedJob replicas to 1:

https://github.com/kubeflow/training-operator/blob/9e46f9d422e71f258679c5edd306c7eddf9004f1/pkg/runtime.v2/core/trainingruntime.go#L108-L110

However, when a single worker replicatedJob contains a batch/v1 Job with hundreds or thousands of completions (.spec.completions), reconciliation is delayed significantly: the job-controller (built into kube-controller-manager) takes much longer to reconcile the thousands of Pods of that one Job, and the Jobs behind it get stuck in the workqueue. For example:

spec:
  replicatedJobs:
  - name: training-node
    replicas: 1
    template:
      spec:
        completions: 2000
        parallelism: 2000

As a result, the kube-controller-manager workqueue grows much deeper, which could potentially cause a memory leak. Eventually, kube-controller-manager keeps restarting, and every kind of workload (even StatefulSets and Deployments) goes unhandled.
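
With multiple replicas allowed, the same 2000-Pod workload could be split across several smaller batch/v1 Jobs so that each one stays cheap for the job-controller to reconcile. A rough sketch of such a split (the 5 x 400 numbers are only an example, not a recommendation):

spec:
  replicatedJobs:
  - name: training-node
    replicas: 5          # five Jobs of 400 Pods each instead of one Job of 2000 Pods
    template:
      spec:
        completions: 400
        parallelism: 400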

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

tenzen-y commented 2 weeks ago

/remove-label lifecycle/needs-triage