koordinator-sh / koordinator

A QoS-based scheduling system brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, etc.
Apache License 2.0

[BUG] deviceshare plugin not handle AddPod\RemovePod correctly #1959

Open buptcozy opened 5 months ago

buptcozy commented 5 months ago

What happened:

In general, when the AddPod logic runs here, the pod may still be in the scheduling state, so it does not yet exist in nodeDeviceCache's used map. This is a bug: when the framework executes RunFilterPluginsWithNominatedPods with AddPod for high-priority pods, the plugin cannot reserve resources for these high-priority pods. In RDMA\VF\nv-switch scenarios, this can cause high-priority pods to fail assignment because some resources have already been assigned to low-priority pods. So we reuse the "Reserve" logic to generate an assignment placement and save it in the nominator. We clear the nominator cache in both "Reserve" and "UnReserve", which means the cleanup happens whether or not the assignment succeeds; this matches the nomination process of the original Kubernetes framework.
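
A minimal sketch of the nominator cache idea described above, not the actual koordinator code. The `nominator` type and its method names are hypothetical; the only behavior taken from the description is that a planned placement is stored at nomination time and dropped in both "Reserve" and "UnReserve".

```go
package main

import "fmt"

// nominator is a hypothetical cache of planned device placements,
// keyed by pod name. It is not the real koordinator type.
type nominator struct {
	placements map[string][]int // pod name -> GPU minors planned for it
}

func newNominator() *nominator {
	return &nominator{placements: map[string][]int{}}
}

// nominate stores the placement produced by the reused Reserve logic.
func (n *nominator) nominate(pod string, minors []int) {
	n.placements[pod] = minors
}

// reserve and unreserve both drop the cached placement, so the entry is
// cleaned up whether or not the assignment succeeds.
func (n *nominator) reserve(pod string)   { delete(n.placements, pod) }
func (n *nominator) unreserve(pod string) { delete(n.placements, pod) }

// isNominated reports whether a planned placement is still cached.
func (n *nominator) isNominated(pod string) bool {
	_, ok := n.placements[pod]
	return ok
}

func main() {
	n := newNominator()
	n.nominate("pod-b", []int{0, 1, 2, 3})
	fmt.Println(n.isNominated("pod-b")) // true
	n.unreserve("pod-b")
	fmt.Println(n.isNominated("pod-b")) // false
}
```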

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:


ZiMengSheng commented 5 months ago

Problem Description

Let's analyze the following example.

  1. PodA, with low priority, requests 8 GPUs and is scheduled to node1.
  2. PodB, with high priority, requests 4 GPUs and preempts PodA. It expects to reserve GPUs 0-3; status.nominatedNodeName is updated to node1, and PodB enters the backoffQ.
  3. PodC, with mid priority, enters the scheduling cycle and requests 4 GPUs. It is scheduled to node1 without considering PodB's preemption result, so it may unexpectedly use GPUs 0-3.
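
The three steps above can be reproduced with a toy model. This is only a sketch: `deviceCache` is a hypothetical stand-in for the per-node device cache, and the "lowest free minor first" policy is an assumption made to keep the example deterministic.

```go
package main

import "fmt"

// deviceCache is a hypothetical, simplified per-node device cache.
type deviceCache struct {
	total int
	used  map[int]string // GPU minor -> owning pod
}

func newDeviceCache(total int) *deviceCache {
	return &deviceCache{total: total, used: map[int]string{}}
}

// allocate hands out the lowest free minors. It only looks at the used map,
// so it knows nothing about pods that are nominated but not yet assumed.
func (c *deviceCache) allocate(pod string, count int) []int {
	var picked []int
	for minor := 0; minor < c.total && len(picked) < count; minor++ {
		if _, busy := c.used[minor]; !busy {
			picked = append(picked, minor)
		}
	}
	if len(picked) < count {
		return nil // not enough free GPUs
	}
	for _, m := range picked {
		c.used[m] = pod
	}
	return picked
}

// release frees every GPU held by the pod.
func (c *deviceCache) release(pod string) {
	for m, owner := range c.used {
		if owner == pod {
			delete(c.used, m)
		}
	}
}

func main() {
	node1 := newDeviceCache(8)
	node1.allocate("pod-a", 8) // step 1: PodA takes all 8 GPUs
	node1.release("pod-a")     // step 2: PodB preempts PodA and expects 0-3, but nothing records that
	fmt.Println(node1.allocate("pod-c", 4)) // step 3: prints [0 1 2 3]
}
```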

Suggested Proposal

Let's work out the design using the following examples.


  1. PodA, with low priority, requests 8 GPUs and is scheduled to node1.
  2. PodB, with high priority, requests 4 GPUs and preempts PodA. We invoke ReserveNominatedPod(PodB) to reserve PodB's nominated resource, GPUs 0-3; status.nominatedNodeName is updated to node1, and PodB enters the backoffQ.
  3. PodC, with mid priority, enters the scheduling cycle and requests 4 GPUs. In the filter phase, the framework invokes RunPreFilterExtensionAddPod for higher-priority pods such as PodB. This gives us the chance to treat PodB's nominated resource as reserved in the current scheduling cycle, so PodC can't use GPUs 0-3.
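
A minimal sketch of the filter-phase behavior in this first example. The `nomination` type and `pickMinors` function are hypothetical and are not the plugin's real API. Blocking nominations of equal or higher priority follows what I understand of the upstream framework's handling of nominated pods, so treat that comparison as an assumption.

```go
package main

import "fmt"

// nomination is a hypothetical record of the GPUs planned for a preempting pod.
type nomination struct {
	pod      string
	priority int
	minors   []int
}

// pickMinors returns count free GPU minors for a pod with the given priority,
// or nil if the node cannot fit it. Minors nominated for pods of equal or
// higher priority are treated as occupied for this scheduling cycle only;
// the used map itself is not modified.
func pickMinors(total int, used map[int]bool, noms []nomination, priority, count int) []int {
	blocked := make(map[int]bool, len(used))
	for m, busy := range used {
		if busy {
			blocked[m] = true
		}
	}
	for _, n := range noms {
		if n.priority >= priority {
			for _, m := range n.minors {
				blocked[m] = true
			}
		}
	}
	var picked []int
	for m := 0; m < total && len(picked) < count; m++ {
		if !blocked[m] {
			picked = append(picked, m)
		}
	}
	if len(picked) < count {
		return nil
	}
	return picked
}

func main() {
	// PodB (priority 100) was nominated GPUs 0-3 on node1 after preempting PodA.
	noms := []nomination{{pod: "pod-b", priority: 100, minors: []int{0, 1, 2, 3}}}
	// PodC (priority 50) asks for 4 GPUs and is steered away from 0-3.
	fmt.Println(pickMinors(8, map[int]bool{}, noms, 50, 4)) // [4 5 6 7]
}
```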


  1. PodA, with low priority, requests 8 GPUs and is scheduled to node1.
  2. PodB, with high priority, requests 4 GPUs and preempts PodA. We invoke ReserveNominatedPod(PodB) to reserve PodB's nominated resource, GPUs 0-3; status.nominatedNodeName is updated to node1, and PodB enters the backoffQ.
  3. PodC, with high+ priority, requests 4 GPUs and is scheduled to node1, where it is allocated GPUs 0-3 as usual. This overlaps with PodB's nominated resource, so we need to invalidate PodB's outdated nomination here. This makes the fix best-effort.
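
The invalidation step in this second example could look roughly like the following. This is a sketch under stated assumptions: `invalidateOverlapping` and the map layout are hypothetical, and `pod-d` is an invented second nomination used only to show that non-overlapping entries survive.

```go
package main

import (
	"fmt"
	"sort"
)

// invalidateOverlapping removes every nomination that shares at least one GPU
// minor with a real allocation and returns the names of the dropped pods,
// sorted for stable output. noms maps pod name -> nominated GPU minors.
func invalidateOverlapping(noms map[string][]int, allocated []int) []string {
	taken := make(map[int]bool, len(allocated))
	for _, m := range allocated {
		taken[m] = true
	}
	var dropped []string
	for pod, minors := range noms {
		for _, m := range minors {
			if taken[m] {
				delete(noms, pod) // deleting during range is safe in Go
				dropped = append(dropped, pod)
				break
			}
		}
	}
	sort.Strings(dropped)
	return dropped
}

func main() {
	noms := map[string][]int{
		"pod-b": {0, 1, 2, 3}, // nominated after preempting PodA
		"pod-d": {4, 5, 6, 7}, // hypothetical, does not overlap
	}
	// PodC (high+ priority) was really allocated GPUs 0-3.
	fmt.Println(invalidateOverlapping(noms, []int{0, 1, 2, 3})) // [pod-b]
	fmt.Println(len(noms))                                      // 1
}
```

After invalidation, PodB has to compute a fresh placement the next time it is scheduled, which is why the fix is only best-effort.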

Scheduling Interpretability

  1. We need sufficient metrics or a debug service to help us diagnose why a pod is pending and explain it to users.

songtao98 commented 1 month ago

/milestone someday

koordinator-bot[bot] commented 1 month ago

@songtao98: You must be a member of the koordinator-sh/milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your and have them propose you as an additional delegate for this responsibility.
