What you would like to be added?
We should support multiple replicas per replicatedJob, i.e. allow the `replicas` field of a JobSet ReplicatedJob to be greater than 1.
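As a rough illustration of the request (a sketch under assumptions, not a confirmed API shape for this proposal), the `replicatedJobs` section of the JobSet template could then look like the fragment below. The replicatedJob name, container name, image, and all counts are hypothetical; `replicas`, `parallelism`, `completions`, and `completionMode` are existing JobSet and batch/v1 Job fields.

```yaml
# Illustrative fragment only: names, image, and counts are assumptions.
replicatedJobs:
  - name: node                  # hypothetical replicatedJob name
    replicas: 4                 # greater than the currently enforced value of 1
    template:
      spec:
        parallelism: 250
        completions: 250        # 4 Jobs x 250 completions = 1000 Pods in total
        completionMode: Indexed
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: trainer                       # hypothetical container name
                image: example.com/trainer:latest   # placeholder image
```

With a layout like this, each batch/v1 Job would own 250 Pods instead of a single Job owning all 1000, so no single Job reconciliation has to walk the full Pod set.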
Why is this needed?
Currently, we enforce `replicas: 1` on every JobSet ReplicatedJob:
https://github.com/kubeflow/training-operator/blob/9e46f9d422e71f258679c5edd306c7eddf9004f1/pkg/runtime.v2/core/trainingruntime.go#L108-L110
However, when the single worker replicatedJob has a batch/v1 Job with hundreds or thousands of completions (`.spec.completions`), reconciliation is delayed significantly: the job-controller (part of kube-controller-manager) takes much longer to reconcile a Job that owns thousands of Pods, so subsequent Jobs get stuck in the workqueue.

The kube-controller-manager workqueue then grows much deeper, which could potentially cause a memory leak. Eventually, the kube-controller-manager keeps restarting, and every kind of workload (even StatefulSets and Deployments) is left unreconciled.
Love this feature?
Give it a 👍 We prioritize the features with most 👍