j4ckstraw opened 4 months ago
We observed a steep drop in batch-cpu allocatable.
metric: `koordlet_node_resource_allocatable{resource="kubernetes.io/batch-cpu",node=~"$node"}/1000`
At the time of the problem, one pod requesting 10 cores of normal CPU was scheduled onto the node, while batch-cpu usage showed no significant change. After investigation, I believe this is related to the batch resource allocatable calculation.
Here's my question: why is the pod request added to HPUsed when no metric is found? How about just skipping it?
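To make the question concrete, here is a minimal sketch of the behavior being discussed, assuming a simplified usage-based formula; the names (`hpUsed`, `batchAllocatable`) and the reserve/system terms are illustrative, not koordlet's actual code:

```go
package main

import "fmt"

// Pod is a simplified high-priority (HP) pod with its CPU request in
// milli-cores and an optional usage metric.
type Pod struct {
	Name         string
	RequestMilli int64
	UsageMilli   int64 // reported CPU usage metric
	HasMetric    bool  // false when no metric was collected for this pod
}

// hpUsed sums HP CPU usage; when a pod has no metric, it falls back to
// the pod's full request, which is the behavior questioned in this issue.
func hpUsed(pods []Pod) int64 {
	var used int64
	for _, p := range pods {
		if p.HasMetric {
			used += p.UsageMilli
		} else {
			used += p.RequestMilli // fallback: count the full request
		}
	}
	return used
}

// batchAllocatable applies a simplified usage policy:
//
//	batchAllocatable = nodeAllocatable*(1-reserveRatio) - systemUsed - HPUsed
func batchAllocatable(nodeMilli int64, reserveRatio float64, sysMilli int64, hpPods []Pod) int64 {
	alloc := int64(float64(nodeMilli)*(1-reserveRatio)) - sysMilli - hpUsed(hpPods)
	if alloc < 0 {
		alloc = 0
	}
	return alloc
}

func main() {
	node := int64(32000) // 32-core node, in milli-cores
	pods := []Pod{{Name: "hp-old", RequestMilli: 4000, UsageMilli: 2000, HasMetric: true}}
	before := batchAllocatable(node, 0.1, 1000, pods)

	// A new 10-core HP pod is scheduled; no metric is reported yet, so its
	// full 10000m request is added to HPUsed and batch allocatable drops.
	pods = append(pods, Pod{Name: "hp-new", RequestMilli: 10000, HasMetric: false})
	after := batchAllocatable(node, 0.1, 1000, pods)

	fmt.Println(before, after, before-after) // the drop equals the new pod's request
}
```

Under this sketch, scheduling one 10-core HP pod with no metric yet drops batch allocatable by the full 10000m at once, matching the steep drop observed above.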
@j4ckstraw If an HP pod has no metric but appears in the PodList (e.g. the pod is newly created), it should not cause a steep drop in batch allocatable, since HPRequest also increases by that pod's request. The drop could instead be due to an HP pod that has a metric but does not appear in the PodList (e.g. the pod was deleted). We can skip the pod request if the pod is deleted, but we cannot be sure whether it is dangling and still running on the node.
As discussed with @j4ckstraw offline, the current calculation formula does not consider the HP Request when `calculatePolicy="usage"`, so the steep drop issue does exist.
Furthermore, this fluctuation can cause unexpected evictions when the BECPUEvict strategy is also in use with `policy="evictByAllocatable"`. That is the real concern from @j4ckstraw.
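To illustrate why the fluctuation matters, here is a hedged sketch of an allocatable-based eviction check; the function name and threshold semantics are illustrative, not the actual BECPUEvict implementation:

```go
package main

import "fmt"

// shouldEvict is a simplified version of an evict-by-allocatable policy:
// best-effort (BE) pods are evicted when BE CPU usage exceeds the batch
// allocatable. Real strategies add thresholds and smoothing windows.
func shouldEvict(beUsageMilli, batchAllocatableMilli int64) bool {
	return beUsageMilli > batchAllocatableMilli
}

func main() {
	beUsage := int64(16000) // BE usage stays constant throughout

	// Before the drop: plenty of batch allocatable, no eviction.
	fmt.Println(shouldEvict(beUsage, 20000)) // false

	// After a steep drop in batch allocatable, the same BE usage now
	// triggers eviction even though BE behavior did not change at all.
	fmt.Println(shouldEvict(beUsage, 10000)) // true
}
```

This is the mechanism behind the concern: the eviction decision flips purely because allocatable dipped, not because BE pods consumed more.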
However, IMO, the decrease in batch allocatable when a new HP pod is created could help mitigate the problem of too many batch pods being scheduled at that moment. So the issues can be resolved separately by adding `podWarmupDurationSeconds` and `podWarmupReclaimPercent` to the ColocationStrategy for pod warm-up/cold-start cases. These fields can adjust the weight of the usage for pods that have no reported metric or are just starting with inaccurate metrics, as opposed to long-running pods with reliable metrics. e.g. set `podWarmupReclaimPercent=0` to ignore the missing-metric pods.
_Originally posted by @j4ckstraw in https://github.com/koordinator-sh/koordinator/pull/1559#discussion_r1495421865_