volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

OSPP 2024: Volcano Support Multi-Cloud AI Job Scheduling (queue capacity management) #3731

Vacant2333 opened this issue 1 week ago

What is the problem you're trying to solve

Volcano should support multi-cloud AI job scheduling.

Describe the solution you'd like

Introduction

Volcano provides comprehensive scheduling features for AI workloads within a single cluster domain. As more users manage workloads across multiple Kubernetes clusters, especially in large-scale model training scenarios, a single cluster often cannot meet the computational power demands of AI tasks. Users are seeking the ability to submit large AI model training tasks across multiple clusters in a unified way. To address these issues, Volcano needs to offer scheduling capabilities for multi-cluster AI tasks, including multi-cluster Gang scheduling and queue management.

Karmada, a multi-cluster orchestration system, is gradually becoming the industry standard. Volcano can build upon Karmada's existing capabilities to develop AI job scheduling in multi-cluster scenarios, while also addressing the gaps in Karmada, such as queue management.

I am currently working on this project. In July this year, I applied through the Open Source Promotion Plan (OSPP). The task goal is to support queue capacity management in multi-cluster AI workload scheduling.

OSPP: Volcano supports queue capacity management capabilities in multi-cluster AI workload scheduling

volcano-global Project

Mentor: @lowang-bh

Architecture Design

Efforts Made

Components

volcano-global-webhook-manager

It pauses the scheduling of all ResourceBindings through a mutating webhook, allowing us to implement queue capabilities in a manner similar to Kueue.

This approach avoids implementing queue management directly in Karmada, whose focus is multi-cluster orchestration rather than task scheduling; keeping the two loosely coupled makes this work easier to advance.
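To illustrate the webhook's job, here is a minimal Go sketch of the JSON Patch a mutating webhook would return to pause a newly created ResourceBinding. The patch path `/spec/suspension` and its shape are placeholders, since the actual field depends on the Karmada suspend proposal discussed below.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is a single JSON Patch (RFC 6902) operation, the format a
// Kubernetes mutating webhook returns in its AdmissionResponse.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// suspendPatch builds the patch that pauses scheduling of a newly created
// ResourceBinding. The field name "suspension" is illustrative only.
func suspendPatch() ([]byte, error) {
	patch := []patchOp{{
		Op:    "add",
		Path:  "/spec/suspension",
		Value: map[string]bool{"scheduling": true},
	}}
	return json.Marshal(patch)
}

func main() {
	p, err := suspendPatch()
	if err != nil {
		panic(err)
	}
	// Prints: [{"op":"add","path":"/spec/suspension","value":{"scheduling":true}}]
	fmt.Println(string(p))
}
```

The dispatcher later removes this gate (the reverse `remove` patch) when the queue admits the workload.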

volcano-global-controller-manager

The controller-manager consists of two parts: controllers and dispatcher.

Controllers

Controllers create associated PodGroups for each Volcano Job/Deployment/Pod, facilitating scheduling by the dispatcher and aligning with Volcano's approach in single-cluster scenarios.
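The PodGroup derivation can be sketched as follows; the `Job` and `PodGroup` structs are trimmed stand-ins for the real Volcano types, and the fallback from `MinAvailable` to the total replica count mirrors Volcano's single-cluster convention rather than quoting its actual code.

```go
package main

import "fmt"

// Minimal stand-ins for the real Volcano API types; field names are illustrative.
type JobTask struct {
	Replicas int32
}

type Job struct {
	Name         string
	Namespace    string
	Queue        string
	MinAvailable int32
	Tasks        []JobTask
}

type PodGroupSpec struct {
	MinMember int32
	Queue     string
}

type PodGroup struct {
	Name      string
	Namespace string
	Spec      PodGroupSpec
}

// podGroupForJob mirrors what the controllers do: derive one PodGroup per
// Job so the dispatcher can gang-schedule it. If MinAvailable is unset (0),
// fall back to the total replica count across all tasks.
func podGroupForJob(j Job) PodGroup {
	min := j.MinAvailable
	if min == 0 {
		for _, t := range j.Tasks {
			min += t.Replicas
		}
	}
	return PodGroup{
		Name:      j.Name, // convention: the PodGroup shares the Job's name
		Namespace: j.Namespace,
		Spec:      PodGroupSpec{MinMember: min, Queue: j.Queue},
	}
}

func main() {
	j := Job{Name: "train", Namespace: "ai", Queue: "default",
		Tasks: []JobTask{{Replicas: 1}, {Replicas: 4}}}
	pg := podGroupForJob(j)
	fmt.Println(pg.Name, pg.Spec.Queue, pg.Spec.MinMember) // train default 5
}
```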

Dispatcher

The dispatcher monitors all pending, paused tasks (Volcano Job/Deployment/Pod). Currently, it can resume task scheduling (i.e., dispatch) based on task priority. Future queue-related capabilities will also be implemented on top of it.

volcano-global-scheduler

The scheduler is implemented through the extension points of the Karmada Scheduler. We will inject some necessary capabilities during the AssignReplica (ReplicaScheduling) phase, such as Gang scheduling, capacity management, and the ability to dispatch tasks that cannot be split across multiple clusters.

Implementing the scheduler on top of the Karmada Scheduler was not our initial plan. Originally, we aimed to develop a complete scheduler ourselves and implement all required capabilities directly on top of it. That approach is highly complex, and it would be difficult to stay in sync with upstream Karmada Scheduler updates; on the other hand, it would likely have progressed much faster than the current approach, since it would not depend on the community to fill the missing capabilities in Karmada.

Related Issues

[Feature] Support merging GetDependencies results

In implementing capacity management, we considered two approaches, based on PodGroup and on ResourceBinding. For the PodGroup-based approach, when dispatching a task to a sub-cluster we must also dispatch its associated PodGroup resource. However, Karmada currently only dispatches explicitly linked resources automatically, which does not meet our needs. Our idea is therefore to customize the GetDependencies hook to add the associated PodGroup to the dependency list. This method currently has a flaw: the results of the default GetDependencies and a customized GetDependencies cannot be merged, so overriding it would require us to reimplement the default behavior in full. We therefore raised this issue, asking Karmada to merge GetDependencies results automatically.

The PodGroup-based approach may not necessarily be our final implementation, and other approaches might be considered in the future.
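The merge behavior the issue asks for amounts to deduplicating a union of the two result lists. A minimal Go sketch, with `DependentObjectReference` as a trimmed stand-in for Karmada's actual resource-interpreter type:

```go
package main

import "fmt"

// DependentObjectReference loosely mirrors the shape Karmada's resource
// interpreter returns from GetDependencies (fields trimmed for the sketch).
type DependentObjectReference struct {
	APIVersion string
	Kind       string
	Namespace  string
	Name       string
}

// mergeDependencies combines the default interpreter's result with a
// customized one, dropping duplicates, so a custom hook only needs to add
// the PodGroup reference instead of reimplementing the default behavior.
func mergeDependencies(defaults, custom []DependentObjectReference) []DependentObjectReference {
	seen := make(map[DependentObjectReference]bool)
	var out []DependentObjectReference
	for _, d := range append(defaults, custom...) {
		if !seen[d] {
			seen[d] = true
			out = append(out, d)
		}
	}
	return out
}

func main() {
	defaults := []DependentObjectReference{
		{"v1", "ConfigMap", "ai", "train-cm"},
	}
	custom := []DependentObjectReference{
		{"scheduling.volcano.sh/v1beta1", "PodGroup", "ai", "train"},
		{"v1", "ConfigMap", "ai", "train-cm"}, // duplicate, dropped by the merge
	}
	for _, d := range mergeDependencies(defaults, custom) {
		fmt.Println(d.Kind, d.Name)
	}
}
```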

[Feature] Add priority field to ResourceBinding

When Karmada dispatches workloads to sub-clusters, it currently does so on a simple first-come, first-served basis. Our goal is for each workload to carry its own priority, so that the Karmada Scheduler can dispatch workloads one by one in accordance with our queue capabilities.

The proposal is currently being refined and is expected to be merged in September.

[Feature] Support suspend ResourceBinding when create

As mentioned earlier, we aim to implement queue capabilities in a manner similar to Kueue, rather than directly in the Karmada Scheduler. However, Karmada's ResourceBinding resources do not have a scheduling pause gate similar to Deployment/Pod. Therefore, we raised this issue and are continuously pushing it forward.

The proposal for this feature has been merged, and the capability is currently being implemented. It is expected to be available in the next version of Karmada as a FeatureGate.

[Feature] Karmada-scheduler support custom-plugin when ReplicaScheduling

As previously mentioned, the volcano-global-scheduler is to be based on the Karmada Scheduler, and we intend to inject our needed capabilities, such as Gang scheduling, during the AssignReplica (ReplicaScheduling) phase through plugins. However, the Karmada Scheduler currently only allows plugins during the FilterCluster and ScoreCluster stages, which does not meet our needs. Therefore, we aim to enhance the extensibility of the Karmada Scheduler while also meeting our requirements.

A proposal has already been made to achieve this capability, and it also provides some ideas for user-defined scheduling strategies and thoughts on splitting multi-template resources. Currently, Karmada does not support any multi-template resources. We hope to enhance the extensibility of Karmada, including but not limited to managing future custom plugins in a manner similar to Kubernetes scheduler-plugins, allowing users to submit their custom strategies and plugins to Karmada, rather than being limited to the current Duplicated and Divided strategies.
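The kind of extension point the proposal envisions can be sketched as a plugin interface for the AssignReplica phase. Everything here is hypothetical (the interface name, signature, and a gang plugin that refuses to split a workload across clusters); it shows the intended behavior, not Karmada's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// ClusterReplicas is one cluster's share of a workload's replicas.
type ClusterReplicas struct {
	Cluster  string
	Replicas int32
}

// ReplicaSchedulingPlugin is a hypothetical extension point: given the total
// replica count and each candidate cluster's free capacity, produce an
// assignment or refuse (gang scheduling refuses any partial placement).
type ReplicaSchedulingPlugin interface {
	Name() string
	AssignReplicas(total int32, free map[string]int32) ([]ClusterReplicas, error)
}

// gangPlugin places the whole workload on a single cluster that can hold all
// replicas, never splitting it; this is the "cannot be split across multiple
// clusters" behavior described above.
type gangPlugin struct{}

func (gangPlugin) Name() string { return "Gang" }

func (gangPlugin) AssignReplicas(total int32, free map[string]int32) ([]ClusterReplicas, error) {
	best, bestFree := "", int32(-1)
	for c, f := range free {
		// Prefer the cluster with the most headroom among those that fit.
		if f >= total && f > bestFree {
			best, bestFree = c, f
		}
	}
	if best == "" {
		return nil, errors.New("gang: no single cluster fits all replicas")
	}
	return []ClusterReplicas{{Cluster: best, Replicas: total}}, nil
}

func main() {
	var p ReplicaSchedulingPlugin = gangPlugin{}
	got, err := p.AssignReplicas(8, map[string]int32{"member1": 6, "member2": 10})
	fmt.Println(got, err) // [{member2 8}] <nil>
}
```

Capacity management and queue-aware plugins would register alongside this one, replacing the fixed Duplicated/Divided strategies with a pluggable chain.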

Current Situation

Capabilities we have achieved:

Capabilities being implemented:

Additional context

No response