koordinator-sh / koordinator

A QoS-based scheduling system brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, etc.
https://koordinator.sh
Apache License 2.0

[proposal] Provide an evolvable End to End Solution for Koordinator Device Management #2181

Open ZiMengSheng opened 3 months ago

ZiMengSheng commented 3 months ago

What is your proposal:

Provide an evolvable End to End Solution for Koordinator Device Management

Why is this needed:

The Koordinator scheduler already supports two device features: GPU shared scheduling and GPU & RDMA joint allocation. Users request GPU or RDMA resources through Kubernetes extended resources and hints defined in Pod annotations. Extended resources were originally introduced into Kubernetes mainly to describe discrete, countable node resources, and the Kubelet Device Plugin interface is the main way the Kubernetes community supports reporting and allocating such resources.

However, the allocation logic of the Kubelet Device Manager does not support fine-grained joint allocation of multiple resources according to device topology, for example when a GPU and an RDMA device need to be allocated under the same PCIe switch. The only topology-aware allocation Kubelet supports is NUMA alignment. Even when NUMA alignment is all that is required, Kubelet intervenes too late: it acts only after the pod has already been scheduled to a node, so users have to accept performance degradation caused by a topology mismatch.
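To make the joint-allocation requirement concrete, here is a minimal sketch of the selection rule described above: prefer a free GPU and a free RDMA device under the same PCIe switch, and fall back to a pair that only shares a NUMA node. The `Device` type, its fields, and `joint_allocate` are hypothetical names made up for this illustration; they are not Koordinator or Kubelet APIs.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Device:
    # Hypothetical device record, for illustration only.
    kind: str         # "gpu" or "rdma"
    minor: int        # device index on the node
    numa_node: int    # NUMA node the device is attached to
    pcie_switch: str  # identifier of the PCIe switch above the device
    free: bool = True


def joint_allocate(devices: List[Device]) -> Optional[Tuple[Device, Device]]:
    """Pick one free GPU and one free RDMA device with matching topology.

    A pair under the same PCIe switch is preferred. If none exists, a pair
    on the same NUMA node is returned. Returns None when neither is possible.
    """
    by_switch = defaultdict(lambda: {"gpu": [], "rdma": []})
    by_numa = defaultdict(lambda: {"gpu": [], "rdma": []})
    for d in devices:
        if d.free and d.kind in ("gpu", "rdma"):
            by_switch[d.pcie_switch][d.kind].append(d)
            by_numa[d.numa_node][d.kind].append(d)

    # Try the tightest topology level first, then the looser one.
    for groups in (by_switch, by_numa):
        for key in sorted(groups):
            group = groups[key]
            if group["gpu"] and group["rdma"]:
                return group["gpu"][0], group["rdma"][0]
    return None


# Example node: the GPU under sw0 is already taken, so the pair under sw1 wins.
node = [
    Device("gpu", 0, 0, "sw0", free=False),
    Device("gpu", 1, 0, "sw1"),
    Device("rdma", 0, 0, "sw0"),
    Device("rdma", 1, 0, "sw1"),
]
pair = joint_allocate(node)
```

A scheduler can evaluate a rule like this across candidate nodes before binding the pod, whereas a node-local allocator can only apply it after the node has been chosen.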

To solve this problem, Koordinator moved the device allocation logic from Kubelet to the scheduler and used cri-runtime-proxy on the node side to set up device isolation and visibility. However, the cri-runtime-proxy approach is heavyweight and inconvenient to install. In addition, although the Koordinator scheduler provides GPU and RDMA joint allocation, there is no complete end-to-end solution; in particular, the node side is not yet connected to the community-standard RDMA logic. This proposal attempts to solve these problems and provide a feasible end-to-end solution for Koordinator.

Finally, in the field of device management, the community proposed Dynamic Resource Allocation (DRA) after the Device Plugin interface to overcome the limitations of the current Device Plugin solution. This proposal will also show how Koordinator's GPU sharing and GPU & RDMA joint allocation can be implemented in DRA mode, and how the current solution evolves toward DRA.

Key Results:

ZiMengSheng commented 3 months ago

/area koord-scheduler

saintube commented 2 months ago

ref #2187 GPU & RDMA Joint Allocation

ZiMengSheng commented 2 months ago

ref #2171 GPU monitoring cannot perceive the central scheduling results

ZiMengSheng commented 2 months ago

ref #583 GPU sharing and isolation solution

ZiMengSheng commented 2 months ago

/area koordlet /area koord-manager

wawa0210 commented 1 month ago

I am the maintainer of HAMi and I look forward to in-depth cooperation with koordinator in the area of device management.

ZiMengSheng commented 1 month ago

> I am the maintainer of HAMi and I look forward to in-depth cooperation with koordinator in the area of device management.

Hello, I'm the one planning this issue. Nice to meet you on GitHub! I have some questions about HAMi:

  1. Through which component does HAMi expose GPU utilization metrics?
  2. Do GPUs from different manufacturers use the same device plugin (DP)?
  3. How does HAMi core participate in GPU isolation?
  4. Is HAMi core a library or an environment variable? Does it work through the standard DP interface or through a customized container runtime?