koordinator-sh / koordinator

A QoS-based scheduling system brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, etc.
https://koordinator.sh
Apache License 2.0

A lightweight method to make dp aware of center-side scheduling results and device allocation results #2198

Open ferris-cx opened 2 months ago

ferris-cx commented 2 months ago

Summary

This proposal provides a mechanism for the end-side device plugin (dp) to detect that a Pod is being scheduled onto the current node, read the device IDs from the Pod annotations, and allocate resources accordingly.

Motivation

Both scheduling and device assignment are implemented in the scheduling plug-in, which records the result in the Pod annotations. The dp, however, is not aware of the scheduling result. When a Pod is created, kubelet calls the Allocate method of the dp. Because kubelet's native code is not aware of the scheduling result either, kubelet itself picks the device IDs according to its own allocation algorithm. As a result, the device IDs that the scheduler has already written to the Pod annotations cannot be looked up by Pod name inside Allocate. For these reasons, we need a mechanism that makes the dp aware of the scheduler's device allocation results.

Goals

Ensure that the dp can allocate device resources (GPU, RDMA, etc.) according to the Pod scheduling sequence.

Proposal

Locking time: Scheduler.Bind()

The overall flow chart of locking

Flow chart: locking in Bind

Detailed description of the process

1. The node lock is taken once the Pod has been scheduled successfully and its resources have been allocated, that is, during the Bind phase of the scheduling extension;

2. Get the Pod object based on the Pod name;

3. Execute the node lock method;

4. Check whether the lock was taken. On success, the method returns extenderv1.ExtenderBindingResult{Error: ""}, meaning the binding succeeded; otherwise it calls ReleaseNodeLock and returns ExtenderBindingResult{Error: err.Error()}, meaning the binding failed.
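The four steps above can be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: BindingResult only mirrors the shape of extenderv1.ExtenderBindingResult, and Binder, its function fields, and the two error values are hypothetical stand-ins for the real client and lock helpers. The failure path calls ReleaseNodeLock exactly as step 4 states.

```go
package main

import "errors"

// Hypothetical errors that GetPod and LockNode implementations might return.
var (
	ErrPodNotFound  = errors.New("pod not found")
	ErrLockOccupied = errors.New("lock occupied")
)

// BindingResult mirrors the shape of extenderv1.ExtenderBindingResult:
// an empty Error means the binding succeeded.
type BindingResult struct {
	Error string
}

// Pod is a simplified stand-in for the Kubernetes Pod object.
type Pod struct {
	Name     string
	NodeName string
}

// Binder groups the hypothetical hooks that the Bind phase needs.
type Binder struct {
	GetPod          func(podName string) (*Pod, error)
	LockNode        func(nodeName string) error
	ReleaseNodeLock func(nodeName string) error
}

// Bind follows the steps listed above.
func (b Binder) Bind(podName, nodeName string) BindingResult {
	// Step 2: get the Pod object based on the Pod name.
	if _, err := b.GetPod(podName); err != nil {
		return BindingResult{Error: err.Error()}
	}
	// Step 3: execute the node lock method.
	if err := b.LockNode(nodeName); err != nil {
		// Step 4, failure path: release the node lock and report BindingFailed.
		_ = b.ReleaseNodeLock(nodeName)
		return BindingResult{Error: err.Error()}
	}
	// Step 4, success path: an empty Error means BindingSucceed.
	return BindingResult{Error: ""}
}
```

Passing the helpers in as function fields keeps the sketch independent of any Kubernetes client and makes each branch easy to exercise on its own.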

LockNode method flow chart

Flow chart: the LockNode locking method

Detailed description of the process

1. Get the node object by the node name;

2. Check whether the node object was obtained successfully;

3. If the check in step 2 fails, the method simply returns; otherwise continue with the next step;

4. Get the annotation from the node object, using the lock name as the key;

5. Determine whether the annotation exists;

6. If the annotation does not exist, call the SetNodeLock method directly; otherwise continue with the next step;

7. Parse the annotation value, which is a timestamp;

8. Determine whether the timestamp is more than 5 minutes old, that is, whether the lock has timed out;

9. If the lock has not timed out, return an error because the lock has not been released; otherwise continue with the next step;

10. Call the unlocking method;

11. Check whether the lock was released successfully. If the release fails, an error is returned. Otherwise, the SetNodeLock method is invoked and the node lock is added.

SetNodeLock method flow chart

Flow chart: the SetNodeLock locking method

Detailed description of the process

1. Get the node object by the node name;

2. Check whether the node object was obtained successfully;

3. If the check in step 2 fails, the method simply returns; otherwise continue with the next step;

4. Get the annotation from the node object, using the lock name as the key;

5. Determine whether the annotation exists;

6. If the annotation already exists, the method returns the error "Lock occupied". Otherwise, set the annotation: the key is fixed and the value is the current system timestamp;

7. Update the Node object;

8. Determine whether the node annotation update succeeded. If it succeeded, the lock is held and the method returns directly. If the update failed, retry up to 5 times, with an interval of 100 milliseconds between retries;

9. Each retry repeats the same locking steps as above: get the node object, set the annotation, update the node object.

Unlocking time: dp#Allocate()

1. When a Pod is created on the end side, kubelet calls Allocate, the device allocation interface of the dp. Allocate then reads the most recent Pending Pod scheduled onto the current node;

2. If the Pod cannot be obtained, the node lock is released;

3. The list of requested devices is parsed from the Pod annotation. If annotation parsing fails, the node lock is released;

4. The rest of the logic stays the same as before.
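The failure paths described above can be sketched as follows. This only covers the two release cases the steps spell out; what happens to the lock after a successful allocation is left to "the rest of the logic". Plugin and its hooks are hypothetical, and the annotation key and its comma-separated value format are assumptions made for illustration, not Koordinator's actual annotation format.

```go
package main

import (
	"errors"
	"strings"
)

// devicesKey and the comma-separated format are assumptions for this sketch.
const devicesKey = "example.com/allocated-devices"

var (
	ErrNoPendingPod  = errors.New("no pending pod on this node")
	ErrBadAnnotation = errors.New("device annotation missing or empty")
)

// Pod is a simplified stand-in for the Kubernetes Pod object.
type Pod struct {
	Name        string
	Annotations map[string]string
}

// Plugin groups the hypothetical hooks the dp needs inside Allocate.
type Plugin struct {
	NodeName         string
	LatestPendingPod func(nodeName string) (*Pod, error)
	ReleaseNodeLock  func(nodeName string) error
}

// parseDevices reads the device IDs that the scheduler wrote to the Pod.
func parseDevices(pod *Pod) ([]string, error) {
	raw, ok := pod.Annotations[devicesKey]
	if !ok {
		return nil, ErrBadAnnotation
	}
	var ids []string
	for _, id := range strings.Split(raw, ",") {
		if id = strings.TrimSpace(id); id != "" {
			ids = append(ids, id)
		}
	}
	if len(ids) == 0 {
		return nil, ErrBadAnnotation
	}
	return ids, nil
}

// Allocate follows steps 1 to 3 above and returns the device IDs to assign.
func (p Plugin) Allocate() ([]string, error) {
	// Step 1: read the latest Pending Pod scheduled onto this node.
	pod, err := p.LatestPendingPod(p.NodeName)
	if err != nil {
		// Step 2: the Pod cannot be obtained, so release the node lock.
		_ = p.ReleaseNodeLock(p.NodeName)
		return nil, err
	}
	// Step 3: parse the requested devices from the Pod annotation.
	ids, err := parseDevices(pod)
	if err != nil {
		_ = p.ReleaseNodeLock(p.NodeName)
		return nil, err
	}
	return ids, nil
}
```

Releasing the lock on both failure paths keeps a failed Allocate call from blocking later Pods on the node until the 5-minute timeout expires.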