kubewharf / godel-scheduler

a unified scheduler for online and offline tasks
Apache License 2.0

[Enhancement] Alleviate optimistic concurrent scheduling conflict rates. #50

Open NickrenREN opened 1 month ago

NickrenREN commented 1 month ago

Currently, when several scheduler instances are deployed, they work concurrently. In some scenarios the probability of conflicts can be relatively high, for example when cluster resource utilization is high or during batch scheduling. We need to optimize Godel Scheduler to reduce these optimistic concurrency conflicts.

binacs commented 1 month ago

1. Description

Godel Scheduler (https://github.com/kubewharf/godel-scheduler) is a distributed parallel scheduler built on a shared-state architecture and optimistic concurrency. When multiple Scheduler shards work in parallel, each shard has a complete view of cluster resources.

As a result, the scheduling decisions made by different Scheduler shards at the same moment may conflict with each other (resource conflicts on a single node, affinity conflicts within a topology domain, etc.). For this reason a centralized Binder was introduced, which resolves conflicts through a serial verification process. When the number of scheduler shards increases, when cluster resource utilization is extremely high, or when the number of Unit Pods is extremely large, the probability of conflict grows significantly, causing a large number of wasted scheduling attempts and even ping-pong between components.
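To make the conflict concrete, here is a minimal sketch in Go. It is not Godel's actual code: the `Placement` and `Binder` types, the `pickNode` policy, and the CPU-only resource model are all assumptions made for illustration. Two shards decide from the same full-cluster snapshot, and a binder then checks the proposals one by one.

```go
package main

import "fmt"

// Placement is a scheduling decision proposed by one scheduler shard.
// All names here are hypothetical and not taken from the Godel code base.
type Placement struct {
	Shard string
	Pod   string
	Node  string
	CPU   int // requested CPU in millicores
}

// Binder serially verifies placements against the authoritative free capacity.
type Binder struct {
	freeCPU map[string]int // node name -> free CPU in millicores
}

// Bind accepts the placement if the node still has enough CPU. Otherwise it
// reports a conflict, and the pod would have to be scheduled again.
func (b *Binder) Bind(p Placement) bool {
	if b.freeCPU[p.Node] < p.CPU {
		return false
	}
	b.freeCPU[p.Node] -= p.CPU
	return true
}

// pickNode is a toy policy: choose the feasible node with the most free CPU
// in the shard's snapshot, breaking ties by name so the result is deterministic.
func pickNode(snapshot map[string]int, cpu int) (string, bool) {
	best := ""
	for name, free := range snapshot {
		if free < cpu {
			continue
		}
		if best == "" || free > snapshot[best] || (free == snapshot[best] && name < best) {
			best = name
		}
	}
	return best, best != ""
}

func main() {
	binder := &Binder{freeCPU: map[string]int{"node-a": 1000, "node-b": 400}}

	// Both shards schedule from the same full-cluster snapshot, which goes
	// stale as soon as the other shard's decision is bound.
	snapshot := map[string]int{"node-a": 1000, "node-b": 400}
	proposals := []Placement{
		{Shard: "shard-1", Pod: "pod-1", CPU: 800},
		{Shard: "shard-2", Pod: "pod-2", CPU: 800},
	}
	for i := range proposals {
		if node, ok := pickNode(snapshot, proposals[i].CPU); ok {
			proposals[i].Node = node
		}
	}

	// The binder processes proposals one at a time; the second one conflicts.
	for _, p := range proposals {
		if binder.Bind(p) {
			fmt.Printf("%s: bound %s to %s\n", p.Shard, p.Pod, p.Node)
		} else {
			fmt.Printf("%s: conflict for %s on %s, needs rescheduling\n", p.Shard, p.Pod, p.Node)
		}
	}
}
```

In this toy setup both shards pick the same node, so one scheduling attempt is wasted. The more shards there are and the fewer nodes have room left, the more often this happens.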

Previously, we introduced the Node Partition mechanism, which reduces conflicts by constraining each shard's view of cluster resources (a shard schedules onto the nodes in its own partition first), but this costs some scheduling quality. We would like to explore other ways to handle such conflicts better.
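The trade-off of partition-first scheduling can be sketched as follows. This is an illustrative toy, not the real Node Partition implementation: the `Node` type, its `Owner` field, the `pickWithPartition` function, and the "most free CPU" scoring are all assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a simplified cluster node. Owner names the shard whose partition
// contains the node (a hypothetical field for this sketch).
type Node struct {
	Name    string
	FreeCPU int // free CPU in millicores
	Owner   string
}

// pickWithPartition prefers feasible nodes inside the shard's own partition
// and falls back to the rest of the cluster only when none of them fits.
// Within each group it picks the node with the most free CPU.
func pickWithPartition(nodes []Node, shard string, cpu int) (Node, bool) {
	var in, out []Node
	for _, n := range nodes {
		if n.FreeCPU < cpu {
			continue // not feasible for this request
		}
		if n.Owner == shard {
			in = append(in, n)
		} else {
			out = append(out, n)
		}
	}
	for _, cands := range [][]Node{in, out} {
		if len(cands) == 0 {
			continue
		}
		sort.Slice(cands, func(i, j int) bool {
			if cands[i].FreeCPU != cands[j].FreeCPU {
				return cands[i].FreeCPU > cands[j].FreeCPU
			}
			return cands[i].Name < cands[j].Name
		})
		return cands[0], true
	}
	return Node{}, false
}

func main() {
	nodes := []Node{
		{Name: "node-a", FreeCPU: 1000, Owner: "shard-1"},
		{Name: "node-b", FreeCPU: 600, Owner: "shard-2"},
		{Name: "node-c", FreeCPU: 300, Owner: "shard-2"},
	}
	for _, cpu := range []int{500, 800} {
		if n, ok := pickWithPartition(nodes, "shard-2", cpu); ok {
			fmt.Printf("shard-2 places a %dm pod on %s (partition of %s)\n", cpu, n.Name, n.Owner)
		}
	}
}
```

For the smaller request, shard-2 stays inside its partition even though a node elsewhere scores higher under this toy policy, which is one way to read the scheduling quality loss. For the larger request it has to leave its partition, and conflicts with the other shard become possible again.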

This task requires a deep understanding of the scheduler's multi-shard architecture and aims to further reduce the probability of scheduler conflicts in various scenarios, ultimately improving the operating efficiency of the whole system.

2. Tasks

3. Skill requirements and programming languages

4. Expected results

Complete the corresponding solution design and code implementation.


ipsum-0320 commented 4 weeks ago

Hello, I'd like to give this a try. @binacs

ipsum-0320 commented 4 weeks ago

I would be extremely grateful if you could share more information about this project. Is there an online meeting where it is discussed? @binacs @NickrenREN

NickrenREN commented 3 weeks ago

@ipsum-0320 thanks for your interest. This will actually be a task in GLCC 2024; you can go to https://www.gitlink.org.cn/glcc for more information.