koordinator-sh / koordinator

A QoS-based scheduling system brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, etc.
https://koordinator.sh
Apache License 2.0
1.31k stars, 327 forks

[proposal] Extended resource RDMA resource registration, scheduling, and allocation #2187

Open ferris-cx opened 4 weeks ago

ferris-cx commented 4 weeks ago

Koordinator does not currently support end-to-end management, scheduling, and allocation of RDMA PF or VF (RDMA virtual NIC) resources. It does appear to support RDMA in scheduling and to implement allocation algorithms, but only the algorithms. As far as I know, this framework changes where allocation is decided: the allocation algorithm moves from the kubelet to the scheduler. When creating a pod, the kubelet is only responsible for reading the VF device ID from the pod as marked by the scheduler, and then calling the device plugin's Allocate method to complete the final VF allocation. The entire framework needs to implement:

  1. A device plugin (dp) for RDMA PF/VF devices.
  2. An extension to the scheduling plug-in to support RDMA.
  3. An extension to the scheduling plug-in to support joint scheduling of RDMA, GPU, CPU, memory, and other resources.
  4. An RDMA PF/VF device allocation algorithm, that is, one that selects which PF/VF to use and then marks the pod (note: this is not the device allocation itself).
  5. When the kubelet creates a pod, it parses the PF/VF device ID marked on the pod and then calls the device plugin to assign the VF devices.
  6. After the pod is allocated a GPU device, the GPU must be told to use the VF from step 5 for collective network communication. This likely requires changes to the CNI plug-in.
  7. If a pod requests multiple GPUs, for example two GPUs, and requests two VFs at the same time, GPU-1 should use VF1 for collective communication and GPU-2 should use VF2, that is, GPUs and VFs are paired 1:1.

I hope the community can consider the above requirements and help move this functionality forward in open source.
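To make steps 4, 5, and 7 more concrete, here is a minimal sketch in Python. It is only an illustration, not Koordinator's implementation. The annotation key `example.com/gpu-vf-pairing`, the `Gpu` and `Vf` types, the `pcie_root` topology field, and all function names are hypothetical. The scheduler side picks one free VF per GPU, preferring the same PCIe root and then the same NUMA node, and marks the pod. The node side only reads the marked VF IDs back, which it would then hand to the device plugin.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical annotation key, used only for this illustration.
PAIRING_ANNOTATION = "example.com/gpu-vf-pairing"


@dataclass(frozen=True)
class Gpu:
    id: str
    numa_node: int
    pcie_root: str  # hypothetical topology label, e.g. a PCIe switch ID


@dataclass(frozen=True)
class Vf:
    id: str
    pf: str  # parent physical function
    numa_node: int
    pcie_root: str


def topology_score(gpu: Gpu, vf: Vf) -> int:
    """Higher is closer: same PCIe root > same NUMA node > neither."""
    if gpu.pcie_root == vf.pcie_root:
        return 2
    if gpu.numa_node == vf.numa_node:
        return 1
    return 0


def pair_gpus_with_vfs(gpus: List[Gpu], free_vfs: List[Vf]) -> Optional[Dict[str, str]]:
    """Scheduler side (steps 4 and 7): give each GPU its own VF, 1:1.

    Greedy: each GPU takes the closest VF still free. Returns
    {gpu_id: vf_id}, or None if there are fewer free VFs than GPUs.
    """
    if len(free_vfs) < len(gpus):
        return None
    remaining = list(free_vfs)
    pairing: Dict[str, str] = {}
    for gpu in gpus:
        best = max(remaining, key=lambda vf: topology_score(gpu, vf))
        pairing[gpu.id] = best.id
        remaining.remove(best)  # a VF is never shared between GPUs
    return pairing


def to_annotation(pairing: Dict[str, str]) -> Dict[str, str]:
    """Mark the pod: encode the chosen pairing as a pod annotation."""
    return {PAIRING_ANNOTATION: json.dumps(pairing, sort_keys=True)}


def vf_ids_from_annotations(annotations: Dict[str, str]) -> List[str]:
    """Node side (step 5): read the VF IDs the scheduler marked on the pod."""
    raw = annotations.get(PAIRING_ANNOTATION)
    if raw is None:
        return []
    pairing = json.loads(raw)
    return [pairing[gpu_id] for gpu_id in sorted(pairing)]
```

The greedy choice keeps the sketch short but is not guaranteed to find the best overall pairing. A real allocator would probably score all GPU/VF combinations together and also account for the CPU and memory topology mentioned in step 3.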