kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

KEP-2170: Support hundreds and thousands of worker nodes for a single training Job #2318

Open tenzen-y opened 2 weeks ago

tenzen-y commented 2 weeks ago

What would you like to be added?

We should support multiple replicas per replicatedJob, like:

[...]
spec:
  replicatedJobs:
  - name:
    replicas: 5
[...]
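
For context, JobSet expands each replicatedJob into replicas independent batch/v1 Jobs built from the same Job template, so a filled-in sketch of the fragment above (the name here is illustrative, not part of the proposal) would look like:

spec:
  replicatedJobs:
  - name: training-node    # illustrative name
    replicas: 5            # JobSet creates five separate batch/v1 Jobs from this one template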

Why is this needed?

Currently, we force the JobSet ReplicatedJob replicas to 1:

https://github.com/kubeflow/training-operator/blob/9e46f9d422e71f258679c5edd306c7eddf9004f1/pkg/runtime.v2/core/trainingruntime.go#L108-L110

However, when a single worker replicatedJob contains a batch/v1 Job with hundreds or thousands of completions (.spec.completions), reconciliation is delayed significantly: the job-controller (built into kube-controller-manager) takes much longer to reconcile the thousands of Pods of that one Job, and the Jobs behind it get stuck in the workqueue. For example:

spec:
  replicatedJobs:
  - name: training-node
    replicas: 1
    template:
      spec:
        completions: 2000
        parallelism: 2000

As a result, the kube-controller-manager workqueue grows much deeper, which could potentially cause a memory leak. Eventually, kube-controller-manager keeps restarting, and every kind of workload (even StatefulSets and Deployments) goes unhandled.
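
With multiple replicas allowed, the same 2000-Pod workload could be split across several smaller batch/v1 Jobs so that each one stays cheap for the job-controller to reconcile. A rough sketch of such a split (the 5 x 400 numbers are only an example, not a recommendation):

spec:
  replicatedJobs:
  - name: training-node
    replicas: 5          # five Jobs of 400 Pods each instead of one Job of 2000 Pods
    template:
      spec:
        completions: 400
        parallelism: 400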

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

tenzen-y commented 2 weeks ago

/remove-label lifecycle/needs-triage