What is the problem you're trying to solve
Volcano should support multi-cloud AI job scheduling.
Describe the solution you'd like
Introduction
Volcano provides comprehensive scheduling features for AI workloads within a single cluster domain. As more users manage workloads across multiple Kubernetes clusters, especially in large-scale model training scenarios, a single cluster often cannot meet the computational power demands of AI tasks. Users are seeking the ability to submit large AI model training tasks across multiple clusters in a unified way. To address these issues, Volcano needs to offer scheduling capabilities for multi-cluster AI tasks, including multi-cluster Gang scheduling and queue management.
Karmada, a multi-cluster orchestration system, is gradually becoming the industry standard. Volcano can build upon Karmada's existing capabilities to develop AI job scheduling in multi-cluster scenarios, while also addressing the gaps in Karmada, such as queue management.
I am currently working on this project. In July this year, I applied to the Open Source Promotion Plan (OSPP, also known as Summer of Open Source). The goal of my task is to support queue capacity management in multi-cluster AI workload scheduling.
OSPP: Volcano supports queue capacity management capabilities in multi-cluster AI workload scheduling
volcano-global Project
Mentor: @lowang-bh
Efforts Made
Components
volcano-global-webhook-manager
The webhook manager uses a mutating webhook to pause the scheduling of every ResourceBinding, which lets us implement queue capabilities in a manner similar to Kueue.
This approach avoids implementing queue management directly in Karmada, which focuses on multi-cluster orchestration rather than task scheduling, and the loose coupling makes the work easier to advance.
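As a rough illustration of the pausing step, the sketch below builds the AdmissionReview response a mutating webhook could return for a ResourceBinding. The `uid`, `allowed`, `patchType`, and base64-encoded `patch` fields follow the Kubernetes admission API; the patched path `/spec/suspend` and the function name are assumptions made for this example, not the field or code used by volcano-global or Karmada.

```python
import base64
import json

# Hypothetical field path: this issue does not name the field that pauses scheduling.
SUSPEND_PATH = "/spec/suspend"


def build_admission_response(review: dict) -> dict:
    """Return an AdmissionReview response that pauses a ResourceBinding."""
    request = review["request"]
    response = {"uid": request["uid"], "allowed": True}

    # Only ResourceBinding objects are patched; everything else passes through.
    if request.get("kind", {}).get("kind") == "ResourceBinding":
        patch = [{"op": "add", "path": SUSPEND_PATH, "value": True}]
        response["patchType"] = "JSONPatch"
        # The admission API expects the JSON Patch document base64-encoded.
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

Because the webhook only sets a flag, the decision about when to resume stays entirely in the dispatcher, which is what keeps the coupling with Karmada loose.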
volcano-global-controller-manager
The controller-manager consists of two parts: the controllers and the dispatcher.
Controllers
The controllers create an associated PodGroup for each Volcano Job, Deployment, or Pod so that the dispatcher can schedule it, which matches Volcano's approach in single-cluster scenarios.
Dispatcher
The dispatcher monitors all pending, paused tasks (Volcano Job/Deployment/Pod). Currently, it can resume task scheduling (i.e., dispatch) based on task priority. Future queue-related capabilities will also be implemented on top of it.
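The priority-based resume described above can be pictured as a priority queue over paused tasks. The following is a minimal sketch under stated assumptions: the `Dispatcher` class and its method names are hypothetical, a larger number means higher priority, and ties are broken by arrival order. It is not the volcano-global implementation.

```python
import heapq
import itertools


class Dispatcher:
    """Toy dispatcher: resumes paused tasks in priority order (hypothetical API)."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker: earlier arrivals first

    def add_paused(self, name: str, priority: int) -> None:
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), name))

    def dispatch_next(self):
        """Return the name of the next task to resume, or None if none is paused."""
        if not self._heap:
            return None
        _, _, name = heapq.heappop(self._heap)
        return name


dispatcher = Dispatcher()
dispatcher.add_paused("deployment/web", priority=1)
dispatcher.add_paused("vcjob/train-a", priority=10)
dispatcher.add_paused("vcjob/train-b", priority=10)
order = [dispatcher.dispatch_next() for _ in range(4)]
```

Here `order` ends with `None` once the three paused tasks have been resumed. Queue-aware admission, such as capacity checks, would sit in front of `dispatch_next` in a fuller design.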
volcano-global-scheduler
The scheduler is implemented through the extension points of the Karmada Scheduler. We will inject some necessary capabilities during the AssignReplica (ReplicaScheduling) phase, such as Gang scheduling, capacity management, and the ability to dispatch tasks that cannot be split across multiple clusters.
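To make the intended behavior concrete, here is a small sketch of an all-or-nothing (Gang) replica assignment that also handles tasks that cannot be split across clusters. The function name, its parameters, and the "free slots per cluster" model are assumptions for illustration only; the Karmada Scheduler does not expose this interface.

```python
from typing import Dict, Optional


def assign_replicas(
    min_available: int,
    free_slots: Dict[str, int],
    splittable: bool = True,
) -> Optional[Dict[str, int]]:
    """Return a cluster -> replica count map, or None if the gang cannot be placed."""
    # Prefer clusters with the most free slots; sort by name to stay deterministic.
    clusters = sorted(free_slots.items(), key=lambda kv: (-kv[1], kv[0]))

    if not splittable:
        # The whole task must land in one cluster.
        for name, free in clusters:
            if free >= min_available:
                return {name: min_available}
        return None

    # Gang semantics: place every replica or place none.
    if sum(free for _, free in clusters) < min_available:
        return None

    assignment: Dict[str, int] = {}
    remaining = min_available
    for name, free in clusters:
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            assignment[name] = take
            remaining -= take
    return assignment
```

Returning `None` instead of a partial assignment is the point of the sketch: a job that cannot be fully placed should stay paused rather than occupy resources in some clusters while waiting for others.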
Building the scheduler on the Karmada Scheduler was not our initial plan. We first intended to develop a complete scheduler ourselves and implement every required capability directly in it. That approach is highly complex and would be hard to keep in sync with upstream Karmada Scheduler updates, although it would probably progress much faster than the current approach because it would not depend on the community to add the capabilities Karmada lacks.
Related Issues
[Feature] Support merge GetDependensis result
For capacity management we are considering two approaches, one based on PodGroup and one based on ResourceBinding. With the first approach, dispatching a task to sub-clusters also requires dispatching its associated PodGroup. Karmada currently propagates only resources that are explicitly linked to the workload, which does not meet our needs, so our idea is to customize the GetDependencies method and add the associated PodGroup to the dependency list. This has a flaw today: the results of the default GetDependencies and a customized GetDependencies cannot be merged, so any customization has to reimplement everything the default does. We raised this issue in the hope that the two results can be merged automatically.
The PodGroup-based approach may not necessarily be our final implementation, and other approaches might be considered in the future.
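The merge behavior requested in this issue could look roughly like the sketch below: the default result is kept, the customized result is appended, and duplicates are dropped. `DependentRef` and `merge_dependencies` are hypothetical names, and the fields only loosely model a dependency reference; this is not Karmada's interpreter API.

```python
from typing import Iterable, List, NamedTuple


class DependentRef(NamedTuple):
    """Hypothetical, simplified reference to a dependent resource."""
    api_version: str
    kind: str
    namespace: str
    name: str


def merge_dependencies(
    default: Iterable[DependentRef],
    custom: Iterable[DependentRef],
) -> List[DependentRef]:
    """Union of both results, default entries first, without duplicates."""
    merged: List[DependentRef] = []
    seen = set()
    for ref in list(default) + list(custom):
        if ref not in seen:
            seen.add(ref)
            merged.append(ref)
    return merged


config = DependentRef("v1", "ConfigMap", "training", "job-config")
pod_group = DependentRef("scheduling.volcano.sh/v1beta1", "PodGroup", "training", "job-pg")
merged = merge_dependencies([config], [pod_group, config])
```

With such a merge, the customization only has to return the extra PodGroup reference and does not need to reproduce what the default interpreter already finds.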
[Feature] Add priority field to ResourceBinding
Karmada currently dispatches workloads to sub-clusters on a simple first-come, first-served basis. Our goal is for each workload to carry its own priority so that the Karmada Scheduler can dispatch workloads one by one in the order determined by our queue capabilities.
The proposal is currently being refined and is expected to be merged in September.
[Feature] Support suspend ResourceBinding when create
As mentioned earlier, we aim to implement queue capabilities in a manner similar to Kueue, rather than directly in the Karmada Scheduler. However, Karmada's ResourceBinding resources do not have a scheduling pause gate similar to Deployment/Pod. Therefore, we raised this issue and are continuously pushing it forward.
The proposal for this feature has been merged, and the capability is currently being implemented. It is expected to be available in the next version of Karmada as a FeatureGate.
[Feature] Karmada-scheduler support custom-plugin when ReplicaScheduling
As previously mentioned, the volcano-global-scheduler is to be based on the Karmada Scheduler, and we intend to inject our needed capabilities, such as Gang scheduling, during the AssignReplica (ReplicaScheduling) phase through plugins. However, the Karmada Scheduler currently only allows plugins during the FilterCluster and ScoreCluster stages, which does not meet our needs. Therefore, we aim to enhance the extensibility of the Karmada Scheduler while also meeting our requirements.
A proposal for this capability has already been submitted. It also sketches ideas for user-defined scheduling strategies and for splitting multi-template resources, which Karmada does not support at all today. More broadly, we hope to make Karmada more extensible, for example by managing future custom plugins in a manner similar to Kubernetes scheduler-plugins, so that users can contribute their own strategies and plugins to Karmada instead of being limited to the current Duplicated and Divided strategies.
Current Situation
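Capabilities we have achieved:
Pausing the dispatch of ResourceBinding (Pod, Deployment, Volcano Job) to sub-clusters
Dispatching tasks to sub-clusters based on the priority of each ResourceBinding (Pod, Deployment, Volcano Job)
Job status synchronization (from sub-cluster to control plane cluster)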
Capabilities being implemented: