Closed: ffromani closed this issue 1 year ago
/sig node
This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
/cc @smarterclayton @bobbypage
as they are working to improve the kubelet behavior in this area
/cc @swatisehgal
(node/resource management)
VERY rough PoC of what I meant in the issue description. A fix could look like:
diff --git a/pkg/kubelet/kubelet.go b/pkg/kubelet/kubelet.go
index 709e35015fb..076db5070a5 100644
--- a/pkg/kubelet/kubelet.go
+++ b/pkg/kubelet/kubelet.go
@@ -2518,7 +2518,7 @@ func handleProbeSync(kl *Kubelet, update proberesults.Update, handler SyncHandle
// a config source.
func (kl *Kubelet) HandlePodAdditions(pods []*v1.Pod) {
start := kl.clock.Now()
- sort.Sort(sliceutils.PodsByCreationTime(pods))
+ sort.Sort(sliceutils.PodsByPriority(pods))
if utilfeature.DefaultFeatureGate.Enabled(features.InPlacePodVerticalScaling) {
kl.podResizeMutex.Lock()
defer kl.podResizeMutex.Unlock()
diff --git a/pkg/kubelet/util/sliceutils/sliceutils.go b/pkg/kubelet/util/sliceutils/sliceutils.go
index ab341a42a78..75a5f116dd1 100644
--- a/pkg/kubelet/util/sliceutils/sliceutils.go
+++ b/pkg/kubelet/util/sliceutils/sliceutils.go
@@ -37,6 +37,41 @@ func (s PodsByCreationTime) Less(i, j int) bool {
return s[i].CreationTimestamp.Before(&s[j].CreationTimestamp)
}
+// PodsByPriority makes an array of pods sortable by their priority
+// in descending order, then by their creation timestamps in
+// ascending order
+type PodsByPriority []*v1.Pod
+
+func (s PodsByPriority) Len() int {
+ return len(s)
+}
+
+func (s PodsByPriority) Swap(i, j int) {
+ s[i], s[j] = s[j], s[i]
+}
+
+func (s PodsByPriority) Less(i, j int) bool {
+ iPrio := getPodPriority(s[i])
+ jPrio := getPodPriority(s[j])
+ if iPrio != jPrio {
+ // Higher priority sorts first (descending order), matching the comment above.
+ return iPrio > jPrio
+ }
+ // Equal priority: keep sorting by creation time, oldest first.
+ return s[i].CreationTimestamp.Before(&s[j].CreationTimestamp)
+}
+
+func getPodPriority(pod *v1.Pod) int32 {
+ if pod == nil {
+ return 0
+ }
+ if pod.Spec.Priority == nil {
+ return 0
+ }
+ return *pod.Spec.Priority
+}
+
// ByImageSize makes an array of images sortable by their size in descending
// order.
type ByImageSize []kubecontainer.Image
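For clarity, here is a small, self-contained sketch (not part of the PoC; the pod names, priority values and the podPriority helper are made up for illustration) showing the ordering the comparator is meant to produce: highest priority first, ties broken by the oldest creation timestamp.

package main

import (
	"fmt"
	"sort"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podPriority mirrors getPodPriority from the diff above: a nil pod or an
// unset priority counts as 0.
func podPriority(pod *v1.Pod) int32 {
	if pod == nil || pod.Spec.Priority == nil {
		return 0
	}
	return *pod.Spec.Priority
}

func makePod(name string, prio int32, created time.Time) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:              name,
			CreationTimestamp: metav1.NewTime(created),
		},
		Spec: v1.PodSpec{Priority: &prio},
	}
}

func main() {
	now := time.Now()
	pods := []*v1.Pod{
		makePod("ordinary-workload", 0, now.Add(-1*time.Hour)),
		makePod("node-critical", 2000001000, now.Add(-30*time.Minute)),
		makePod("cluster-critical", 2000000000, now.Add(-2*time.Hour)),
	}

	// Sort by priority descending, then by creation time ascending,
	// which is the ordering this issue proposes for HandlePodAdditions.
	sort.SliceStable(pods, func(i, j int) bool {
		pi, pj := podPriority(pods[i]), podPriority(pods[j])
		if pi != pj {
			return pi > pj
		}
		return pods[i].CreationTimestamp.Before(&pods[j].CreationTimestamp)
	})

	for _, p := range pods {
		fmt.Printf("%s (priority %d)\n", p.Name, podPriority(p))
	}
	// Expected order: node-critical, cluster-critical, ordinary-workload.
}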
I managed to create a small PoC and, while it seems to work, it doesn't really help in the kubelet initialization flow, because that flow doesn't take into account dependencies between pods. A key example of such a dependency today is a device plugin: a pod consuming a device-plugin resource should wait for the devices to be available (which happens later and asynchronously with respect to pod admission) before attempting admission, otherwise admission will obviously fail.
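As a purely illustrative sketch of that dependency (the canAdmit helper, the availableDevices map and the vendor.example/gpu resource name are hypothetical, not actual kubelet code): even with priority-based sorting, a pod requesting an extended resource cannot be admitted until the corresponding device plugin has registered enough devices, which happens asynchronously after the plugin itself starts.

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// canAdmit is a toy model of the dependency: it returns false when a pod
// requests more of an extended resource than the device plugins have
// registered so far. Built-in resources are skipped because they are not
// served by device plugins.
func canAdmit(pod *v1.Pod, availableDevices map[v1.ResourceName]int64) bool {
	for _, c := range pod.Spec.Containers {
		for name, qty := range c.Resources.Requests {
			switch name {
			case v1.ResourceCPU, v1.ResourceMemory, v1.ResourceEphemeralStorage:
				continue
			}
			if availableDevices[name] < qty.Value() {
				// The device plugin has not (yet) registered enough devices,
				// so admitting this pod now would fail no matter how high
				// its priority is.
				return false
			}
		}
	}
	return true
}

func main() {
	gpu := v1.ResourceName("vendor.example/gpu") // hypothetical extended resource
	pod := &v1.Pod{
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Resources: v1.ResourceRequirements{
					Requests: v1.ResourceList{gpu: resource.MustParse("1")},
				},
			}},
		},
	}
	// No devices registered yet: admission would fail even for a high-priority pod.
	fmt.Println(canAdmit(pod, map[v1.ResourceName]int64{}))       // false
	fmt.Println(canAdmit(pod, map[v1.ResourceName]int64{gpu: 2})) // true
}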
As it stands today, this improvement seems too minor to invest time in, hence I'm closing it for now.
What happened?
This is a follow-up of https://github.com/kubernetes/kubernetes/issues/109595. We improved the kubelet behavior by making the devicemanager actually deliver the requested devices, failing admission otherwise. Workloads/controllers can now detect the inconsistencies and retry rather than silently crash.
However, looking more deeply at how the kubelet initializes itself after a restart, we see that it does not take pod priorities into account:
Some random examples from the logs: a higher-priority pod is processed after other pods, and even after a pod whose priority is lower than system-node-critical.

EDIT 20230605: this is because the kubelet actually does sort the pods it receives, but by creation time. It should sort pods first by priority, then by creation time.
What did you expect to happen?
When initializing after a restart, thus when effectively recovering the node state, the kubelet should process pods in decreasing priority order, starting from the highest priority down to the lowest. EDIT 20230605: within the same priority, the kubelet should keep sorting the pods by creation time.
This minimizes disruption, minimizes or avoids admission errors, and thus minimizes pod downtime.
How can we reproduce it (as minimally and precisely as possible)?
Pod priority is not taken into account when the kubelet initializes; EDIT 20230605: pods are sorted only by creation time. The behavior was observed from the logs and from a cursory look at the code.
Anything else we need to know?
EDIT 20230605: obsolete
I'm not sure whether this is a bug or an improvement. Looking at how the kubelet handles the config sources seems to suggest there is (or used to be?) a reason to relay events from the apiserver in order. However, in the important corner case of the kubelet initializing after a restart, I don't think this applies, because the updates are effectively received at the same time, so honoring pod priority looks like the best option.

This fix should compose nicely with the admission improvements mentioned in https://github.com/kubernetes/kubernetes/issues/109595#issuecomment-1540565113
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)