koordinator-sh / koordinator

A QoS-based scheduling system brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, etc.
https://koordinator.sh
Apache License 2.0

[question] batch resource calculation fluctuation #1906

Open j4ckstraw opened 4 months ago

j4ckstraw commented 4 months ago
          why use Request if it has no metric, how about skipping it?

_Originally posted by @j4ckstraw in https://github.com/koordinator-sh/koordinator/pull/1559#discussion_r1495421865_

j4ckstraw commented 4 months ago

We observed a steep drop in batch-cpu allocatable.

metric koordlet_node_resource_allocatable{resource="kubernetes.io/batch-cpu",node=~"$node"}/1000

(screenshot: graph of the batch-cpu allocatable metric showing the steep drop)

One pod requesting 10 cores of normal CPU was scheduled onto the node at the time of the problem, while batch-cpu usage showed no significant change. After investigation, I believe it is related to the batch resource allocatable calculation.

j4ckstraw commented 4 months ago

Here is my question: why is the pod request added to HPUsed when no metric is found? How about just skipping it?
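
For context, a minimal sketch of the accumulation in question, assuming a simplified shape (sumHPUsedCPU and podCPUUsageCores are illustrative names, not the actual koordlet/slo-controller code):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// sumHPUsedCPU is a simplified illustration (not the actual koordlet/slo-controller
// code) of the accumulation being questioned: for every high-priority pod, count its
// measured CPU usage if a metric exists; otherwise fall back to its CPU request
// instead of skipping the pod.
func sumHPUsedCPU(pods []*corev1.Pod, podCPUUsageCores map[string]float64) float64 {
	hpUsed := 0.0
	for _, pod := range pods {
		if usage, ok := podCPUUsageCores[string(pod.UID)]; ok {
			// Metric available: count the measured usage (in cores).
			hpUsed += usage
			continue
		}
		// No metric found (e.g. the pod was just created): fall back to the
		// sum of its containers' CPU requests (in cores).
		for _, c := range pod.Spec.Containers {
			hpUsed += float64(c.Resources.Requests.Cpu().MilliValue()) / 1000.0
		}
	}
	return hpUsed
}
```

With this fallback, a freshly created HP pod contributes its full CPU request to HPUsed until its metrics are reported.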

saintube commented 4 months ago

          Here is my question: why is the pod request added to HPUsed when no metric is found? How about just skipping it?

@j4ckstraw If the HP pod has no metric but does appear in the PodList (e.g. the pod is newly created), it should not cause a steep drop of batch allocatable, since HPRequest also increases with the request of that HP pod. The drop could instead be due to an HP pod that has a metric but does not appear in the PodList (e.g. the pod is deleted). We can skip the pod request if it is deleted, but we cannot be sure whether it is dangling and still running on the node.

saintube commented 4 months ago

As discussed with @j4ckstraw offline, the current calculation formula does not consider the HP Request when calculatePolicy="usage", so the steep-drop issue does exist. Furthermore, this fluctuation can cause unexpected evictions when the BECPUEvict strategy with policy="evictByAllocatable" is also in use; that is the real concern from @j4ckstraw. However, IMO, the decrease in batch allocatable when a new HP pod is created can help mitigate the case where too many batch pods have just been scheduled at that moment. So the issues can be resolved separately as follows:

  1. [ ] To reduce the unexpected evictions in the BECPUEvict strategy, which is the real problem, add the batch allocatable calculation logic to the koordlet (refer to BECPUSuppress), and have the QoS plugins (e.g. BECPUEvict) retrieve this real-time result instead of looking up node.status.allocatable, which always lags.
  2. [ ] To smooth the batch allocatable calculation in the slo-controller, add the parameters podWarmupDurationSeconds and podWarmupReclaimPercent to the ColocationStrategy for pod warm-up/cold-start cases. They would adjust the weight of the usage of pods that have no reported metrics or have just started with inaccurate metrics, as opposed to pods with long-running metrics; e.g. podWarmupReclaimPercent=0 ignores the missing-metric pods. A rough sketch of these fields follows below.
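
A minimal sketch of how the proposed fields from item 2 might look on the ColocationStrategy, assuming pointer-typed fields like the existing colocation options (the exact shape, defaults, and validation are not decided here):

```go
package config

// ColocationStrategy sketch: only the two proposed warm-up fields are shown;
// the existing fields (thresholds, reclaim ratios, etc.) are omitted.
type ColocationStrategy struct {
	// ... existing colocation fields ...

	// PodWarmupDurationSeconds is how long a newly started pod is treated as
	// "warming up", i.e. its metrics are considered missing or inaccurate.
	PodWarmupDurationSeconds *int64 `json:"podWarmupDurationSeconds,omitempty"`

	// PodWarmupReclaimPercent is the weight (0-100) applied to the usage of
	// warming-up or missing-metric pods in the batch allocatable calculation.
	// 0 means such pods are ignored; 100 means they are fully counted.
	PodWarmupReclaimPercent *int64 `json:"podWarmupReclaimPercent,omitempty"`
}
```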