koordinator-sh / koordinator

A QoS-based scheduling system brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, etc.
https://koordinator.sh
Apache License 2.0
1.36k stars 331 forks

A lightweight method to make the dp aware of center-side scheduling and device allocation results #2194

Open ferris-cx opened 2 months ago

ferris-cx commented 2 months ago

Both scheduling and device assignment are implemented in the scheduling plug-in, and the result is recorded in the Pod annotations. The dp (device plugin), however, is not aware of the scheduling result. When a Pod is created, kubelet calls the dp's Allocate method. Because kubelet's native code is not aware of the scheduling result either, kubelet itself picks the device IDs according to its own allocation algorithm. The AllocateRequest carries no Pod identity, so the dp cannot look up the device IDs in the annotations by Pod name, even though the scheduler has already assigned them.

For these reasons, a node lock scheme can be considered. Check whether the Pod requests device resources. If it does, write a node lock in the Bind phase of the Pod, so that only one Pod at a time can hold the lock. The dp side can then query the Pod that is Pending on the current Node and parse the device ID list assigned by the center side from the Pod annotations. After further processing, it builds and returns the AllocateResponse.

In the GPU scheduling scenario, a server has at most eight GPU cards, so the number of Pods per server is small and the performance loss caused by frequent Pod creation is small. Therefore, this option can be considered. The general code idea:

  1. On the center (scheduler) side, once scheduling has succeeded and the assignment result is available, call the following in Bind:
     current, err := kubeClient.CoreV1().Pods(args.PodNamespace).Get(context.Background(), args.PodName, metav1.GetOptions{}) // Get the current Pod object
     LockNode(node, current) // Add the node lock
  2. In the dp's Allocate(ctx context.Context, reqs *kubeletdevicepluginv1beta1.AllocateRequest) (*kubeletdevicepluginv1beta1.AllocateResponse, error) method:
     current, err := util.GetPendingPod(nodename) // Traverse all Pods on the current Node and find the Pending Pod
  3. Parse the assigned device IDs from current.Annotations.
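
Steps 2 and 3 could look roughly like the sketch below. It is only an illustration under stated assumptions: `Pod` is a minimal stand-in for the Kubernetes Pod object, the annotation key `example.com/device-allocated` and its JSON-array value format are hypothetical (the issue does not name the real key or format), and `findPendingPod` and `deviceIDsFromAnnotations` are illustrative helpers rather than existing Koordinator functions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deviceAllocatedKey is a hypothetical annotation key; the key actually
// written by the scheduler is not given in the issue.
const deviceAllocatedKey = "example.com/device-allocated"

// Pod is a minimal stand-in for the Kubernetes Pod object (assumption).
type Pod struct {
	Name        string
	NodeName    string
	Phase       string
	Annotations map[string]string
}

// findPendingPod returns the Pending pod on nodeName that carries a device
// allocation annotation. With the node lock in place, at most one such pod
// is expected, so more than one match is reported as an error.
func findPendingPod(pods []Pod, nodeName string) (*Pod, error) {
	var found *Pod
	for i := range pods {
		p := &pods[i]
		if p.NodeName != nodeName || p.Phase != "Pending" {
			continue
		}
		if _, ok := p.Annotations[deviceAllocatedKey]; !ok {
			continue
		}
		if found != nil {
			return nil, fmt.Errorf("more than one pending pod with a device allocation on node %q", nodeName)
		}
		found = p
	}
	if found == nil {
		return nil, fmt.Errorf("no pending pod with a device allocation on node %q", nodeName)
	}
	return found, nil
}

// deviceIDsFromAnnotations decodes the device ID list from the pod
// annotations. The value is assumed to be a JSON array of strings.
func deviceIDsFromAnnotations(pod *Pod) ([]string, error) {
	raw, ok := pod.Annotations[deviceAllocatedKey]
	if !ok {
		return nil, fmt.Errorf("pod %q has no %s annotation", pod.Name, deviceAllocatedKey)
	}
	var ids []string
	if err := json.Unmarshal([]byte(raw), &ids); err != nil {
		return nil, fmt.Errorf("decoding device IDs of pod %q: %w", pod.Name, err)
	}
	return ids, nil
}
```

In a real dp, the pod list would come from the API server (filtered by node name), and the decoded IDs would be used to build the AllocateResponse in place of the IDs kubelet picked.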

Please consider this solution and, if it is feasible, consider implementing it.

ZiMengSheng commented 2 months ago

Will the scheduling binding cycle wait or fail if it can't get the lock?

ferris-cx commented 2 months ago

> Will the scheduling binding cycle wait or fail if it can't get the lock?

It means the binding fails.
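
A minimal sketch of that fail-fast behaviour, assuming the lock is stored as a node annotation: `Node` is a stand-in for the Kubernetes Node object, the key `example.com/device-node-lock` is hypothetical, and the mutex only imitates the atomic update that the API server would provide. When and by whom the lock is released is not stated above; here it is assumed that the holder releases it once Allocate has finished.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeLockKey is a hypothetical annotation key used to store the lock holder.
const nodeLockKey = "example.com/device-node-lock"

// Node is a minimal stand-in for the Kubernetes Node object (assumption).
// The mutex imitates the atomicity of an update through the API server.
type Node struct {
	mu          sync.Mutex
	Annotations map[string]string
}

// tryLockNode records podKey as the lock holder. If another pod already
// holds the lock it returns an error immediately instead of waiting, so the
// caller (the Bind phase) fails and the pod can be rescheduled.
func tryLockNode(node *Node, podKey string) error {
	node.mu.Lock()
	defer node.mu.Unlock()
	if holder, ok := node.Annotations[nodeLockKey]; ok && holder != podKey {
		return fmt.Errorf("node is locked by pod %q", holder)
	}
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[nodeLockKey] = podKey
	return nil
}

// unlockNode releases the lock only if podKey is the current holder.
func unlockNode(node *Node, podKey string) {
	node.mu.Lock()
	defer node.mu.Unlock()
	if node.Annotations[nodeLockKey] == podKey {
		delete(node.Annotations, nodeLockKey)
	}
}
```

With this shape, a second Pod that reaches Bind while the lock is held gets an error back at once, and it can only succeed after the first Pod's lock has been released.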