kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

the kubelet should respect the pod priority when initializing after restart #118452

Closed ffromani closed 1 year ago

ffromani commented 1 year ago

What happened?

This is a follow-up of https://github.com/kubernetes/kubernetes/issues/109595. We improved the kubelet behavior by making the devicemanager actually deliver the requested devices, and fail admission otherwise. Workloads/controllers can now detect the inconsistency and retry rather than silently crash.

However, looking more deeply at how the kubelet initializes itself after a restart, we see that it does not take pod priorities into account:

Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193478    2011 config.go:278] "Setting pods for source" source="api"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193526    2011 config.go:383] "Receiving a new pod" pod="openshift-cluster-node-tuning-operator/tuned-tql4x"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193541    2011 config.go:383] "Receiving a new pod" pod="openshift-sdn/sdn-fnls4"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193549    2011 config.go:383] "Receiving a new pod" pod="openshift-monitoring/node-exporter-jzprg"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193557    2011 config.go:383] "Receiving a new pod" pod="openshift-ingress-canary/ingress-canary-62kq2"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193565    2011 config.go:383] "Receiving a new pod" pod="openshift-multus/multus-87sbt"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193573    2011 config.go:383] "Receiving a new pod" pod="openshift-multus/network-metrics-daemon-5qb5q"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193580    2011 config.go:383] "Receiving a new pod" pod="openshift-image-registry/node-ca-frkhr"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193597    2011 config.go:383] "Receiving a new pod" pod="openshift-monitoring/prometheus-adapter-689c895f8f-zpsx2"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193605    2011 config.go:383] "Receiving a new pod" pod="openshift-operator-lifecycle-manager/collect-profiles-27497370-6vd5b"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193615    2011 config.go:383] "Receiving a new pod" pod="openshift-operator-lifecycle-manager/collect-profiles-27497385-mwtfp"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193624    2011 config.go:383] "Receiving a new pod" pod="openshift-machine-config-operator/machine-config-daemon-nkqsq"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193632    2011 config.go:383] "Receiving a new pod" pod="openshift-image-registry/image-pruner-27496800-n4qzq"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193640    2011 config.go:383] "Receiving a new pod" pod="openshift-cluster-csi-drivers/openstack-cinder-csi-driver-node-qklm6"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193649    2011 config.go:383] "Receiving a new pod" pod="openshift-marketplace/246a7cd23747989b8f475c6ffc04f7e523236656eaac2249a6d4ddf03bvpl6x"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193659    2011 config.go:383] "Receiving a new pod" pod="openshift-multus/multus-additional-cni-plugins-xwgkj"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193667    2011 config.go:383] "Receiving a new pod" pod="openshift-openstack-infra/coredns-TESTNODE"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193674    2011 config.go:383] "Receiving a new pod" pod="openshift-dns/node-resolver-6n2vp"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193682    2011 config.go:383] "Receiving a new pod" pod="openshift-openstack-infra/keepalived-TESTNODE"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193689    2011 config.go:383] "Receiving a new pod" pod="openshift-dns/dns-default-cnlrl"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193695    2011 config.go:383] "Receiving a new pod" pod="default/testpmd"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193703    2011 config.go:383] "Receiving a new pod" pod="openshift-operator-lifecycle-manager/collect-profiles-27497355-cdncm"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193711    2011 config.go:383] "Receiving a new pod" pod="openshift-sriov-network-operator/sriov-network-config-daemon-p94vm"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193721    2011 config.go:383] "Receiving a new pod" pod="openshift-sriov-network-operator/sriov-device-plugin-bw7rq"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193728    2011 config.go:383] "Receiving a new pod" pod="openshift-network-diagnostics/network-check-target-2tbjs"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.195257    2011 kubelet.go:2106] "SyncLoop ADD" source="api" pods=[openshift-cluster-node-tuning-operator/tuned-tql4x openshift-sdn/sdn-fnls4 openshift-monitoring/node-exporter-jzprg openshift-ingress-canary/ingress-canary-62kq2 openshift-multus/multus-87sbt openshift-multus/network-metrics-daemon-5qb5q openshift-image-registry/node-ca-frkhr openshift-monitoring/prometheus-adapter-689c895f8f-zpsx2 openshift-operator-lifecycle-manager/collect-profiles-27497370-6vd5b openshift-operator-lifecycle-manager/collect-profiles-27497385-mwtfp openshift-machine-config-operator/machine-config-daemon-nkqsq openshift-image-registry/image-pruner-27496800-n4qzq openshift-cluster-csi-drivers/openstack-cinder-csi-driver-node-qklm6 openshift-marketplace/246a7cd23747989b8f475c6ffc04f7e523236656eaac2249a6d4ddf03bvpl6x openshift-multus/multus-additional-cni-plugins-xwgkj openshift-openstack-infra/coredns-TESTNODE openshift-dns/node-resolver-6n2vp openshift-openstack-infra/keepalived-TESTNODE openshift-dns/dns-default-cnlrl default/testpmd openshift-operator-lifecycle-manager/collect-profiles-27497355-cdncm openshift-sriov-network-operator/sriov-network-config-daemon-p94vm openshift-sriov-network-operator/sriov-device-plugin-bw7rq openshift-network-diagnostics/network-check-target-2tbjs]

Some examples (pod name, priority class, priority value):

sriov-device-plugin-22z44    system-node-critical    2000001000

yet processed after

node-exporter-dhln9    system-cluster-critical    2000000000

and even after

collect-profiles-*    openshift-user-critical    1000000000

(note the priority value, which is lower than system-node-critical's)

EDIT 20230605: this is because the kubelet does sort the pods it receives, but only by creation time. It should sort pods first by priority, then by creation time.

What did you expect to happen?

When initializing after a restart, and thus effectively recovering the node state, the kubelet should process pods in decreasing priority order, starting from the highest priority and going down to the lowest. EDIT 20230605: within the same priority, the kubelet should keep sorting the pods by creation time.

This minimizes disruption, minimizes or avoids admission errors, and thus minimizes pod downtime.
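
For illustration only, here is a minimal, self-contained Go sketch of the intended ordering (this is not kubelet code; the pod names, priority values, and timestamps are hypothetical, loosely based on the examples above):

```go
package main

import (
	"fmt"
	"sort"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// mkPod builds a pod with the given name, priority value, and creation time.
func mkPod(name string, prio int32, created time.Time) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:              name,
			CreationTimestamp: metav1.NewTime(created),
		},
		Spec: v1.PodSpec{Priority: &prio},
	}
}

func main() {
	now := time.Now()
	pods := []*v1.Pod{
		mkPod("collect-profiles", 1000000000, now.Add(-3*time.Hour)),    // openshift-user-critical
		mkPod("node-exporter", 2000000000, now.Add(-2*time.Hour)),       // system-cluster-critical
		mkPod("sriov-device-plugin", 2000001000, now.Add(-1*time.Hour)), // system-node-critical
	}

	// Intended ordering: priority descending, then creation time ascending.
	sort.Slice(pods, func(i, j int) bool {
		pi, pj := *pods[i].Spec.Priority, *pods[j].Spec.Priority
		if pi != pj {
			return pi > pj // higher priority first
		}
		return pods[i].CreationTimestamp.Before(&pods[j].CreationTimestamp)
	})

	for _, p := range pods {
		fmt.Println(p.Name)
	}
	// Prints: sriov-device-plugin, node-exporter, collect-profiles
}
```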

How can we reproduce it (as minimally and precisely as possible)?

Pod priority is not taken into account when the kubelet initializes; EDIT 20230605: pods are ordered only by creation time. The behavior was observed from the logs and from a cursory look at the code.

Anything else we need to know?

EDIT 20230605, obsolete: I'm not sure whether this is a bug or an improvement. Looking at how the kubelet handles the config sources seems to suggest there is (or used to be?) a reason to relay events from the apiserver in order. However, in the important corner case of the kubelet initializing after a restart, I don't think this applies, because the updates are effectively received at the same time, so honoring pod priority looks like the best option.

This fix should compose nicely with admission improvements mentioned in https://github.com/kubernetes/kubernetes/issues/109595#issuecomment-1540565113

Kubernetes version

```console
$ kubectl version
# paste output here
```

Cloud provider

Not relevant

OS version

Not relevant

Install tools

Not relevant

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

ffromani commented 1 year ago

/sig node

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

ffromani commented 1 year ago

/cc @smarterclayton @bobbypage

as they are working on improving kubelet behavior in an area close to this one

ffromani commented 1 year ago

/cc @swatisehgal

(node/resource management)

ffromani commented 1 year ago

VERY rough PoC of what I meant in the issue description. A fix could look like:

diff --git a/pkg/kubelet/kubelet.go b/pkg/kubelet/kubelet.go
index 709e35015fb..076db5070a5 100644
--- a/pkg/kubelet/kubelet.go
+++ b/pkg/kubelet/kubelet.go
@@ -2518,7 +2518,7 @@ func handleProbeSync(kl *Kubelet, update proberesults.Update, handler SyncHandle
 // a config source.
 func (kl *Kubelet) HandlePodAdditions(pods []*v1.Pod) {
        start := kl.clock.Now()
-       sort.Sort(sliceutils.PodsByCreationTime(pods))
+       sort.Sort(sliceutils.PodsByPriority(pods))
        if utilfeature.DefaultFeatureGate.Enabled(features.InPlacePodVerticalScaling) {
                kl.podResizeMutex.Lock()
                defer kl.podResizeMutex.Unlock()
diff --git a/pkg/kubelet/util/sliceutils/sliceutils.go b/pkg/kubelet/util/sliceutils/sliceutils.go
index ab341a42a78..75a5f116dd1 100644
--- a/pkg/kubelet/util/sliceutils/sliceutils.go
+++ b/pkg/kubelet/util/sliceutils/sliceutils.go
@@ -37,6 +37,41 @@ func (s PodsByCreationTime) Less(i, j int) bool {
        return s[i].CreationTimestamp.Before(&s[j].CreationTimestamp)
 }

+// PodsByPriority makes an array of pods sortable by their priority
+// in descending order, then by their creation timestamps in
+// ascending order
+type PodsByPriority []*v1.Pod
+
+func (s PodsByPriority) Len() int {
+       return len(s)
+}
+
+func (s PodsByPriority) Swap(i, j int) {
+       s[i], s[j] = s[j], s[i]
+}
+
+func (s PodsByPriority) Less(i, j int) bool {
+       iPrio := getPodPriority(s[i])
+       jPrio := getPodPriority(s[j])
+       if iPrio > jPrio {
+               return true
+       }
+       if iPrio == jPrio {
+               return s[i].CreationTimestamp.Before(&s[j].CreationTimestamp)
+       }
+       return false
+}
+
+func getPodPriority(pod *v1.Pod) int32 {
+       if pod == nil {
+               return 0
+       }
+       if pod.Spec.Priority == nil {
+               return 0
+       }
+       return *pod.Spec.Priority
+}
+
 // ByImageSize makes an array of images sortable by their size in descending
 // order.
 type ByImageSize []kubecontainer.Image
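
For completeness, a rough, hypothetical test sketch for the comparator above (it assumes the PodsByPriority type from the diff lands in pkg/kubelet/util/sliceutils; pod names and priority values are illustrative):

```go
package sliceutils

import (
	"sort"
	"testing"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podWithPriority is a test helper building a pod with a given priority
// value and creation timestamp.
func podWithPriority(name string, prio int32, created time.Time) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name, CreationTimestamp: metav1.NewTime(created)},
		Spec:       v1.PodSpec{Priority: &prio},
	}
}

func TestPodsByPriority(t *testing.T) {
	now := time.Now()
	pods := PodsByPriority{
		podWithPriority("user-critical", 1000000000, now.Add(-3*time.Hour)),
		podWithPriority("node-critical", 2000001000, now.Add(-1*time.Hour)),
		podWithPriority("cluster-critical-old", 2000000000, now.Add(-2*time.Hour)),
		podWithPriority("cluster-critical-new", 2000000000, now.Add(-1*time.Hour)),
	}
	sort.Sort(pods)

	// Expected: priority descending, ties broken by creation time ascending.
	want := []string{"node-critical", "cluster-critical-old", "cluster-critical-new", "user-critical"}
	for i, name := range want {
		if pods[i].Name != name {
			t.Errorf("pods[%d] = %q, want %q", i, pods[i].Name, name)
		}
	}
}
```
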
SataQiu commented 1 year ago

/cc

ffromani commented 1 year ago

I managed to create a small PoC, and while it seems to work, it doesn't really help in the kubelet initialization flow, because that flow doesn't take dependencies between pods into account. A key example of such a dependency that exists today is a device plugin: a pod consuming devices from a device plugin should wait for the devices to become available (which happens later and asynchronously with respect to pod admission) before attempting admission, otherwise admission will obviously fail.
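
To make the dependency concrete, here is a purely hypothetical toy model (it uses no real kubelet or device plugin APIs): even if priority ordering admits the device plugin pod first, the devices it advertises only appear asynchronously, so a consumer admitted right after it can still fail.

```go
package main

import "fmt"

// node is a toy stand-in for per-node state: which extended resources
// have been registered by a device plugin so far.
type node struct {
	registered map[string]bool
}

// admit fails if the pod needs an extended resource that no plugin has
// registered yet.
func (n *node) admit(pod string, needs []string) error {
	for _, r := range needs {
		if !n.registered[r] {
			return fmt.Errorf("pod %s: resource %s not available yet", pod, r)
		}
	}
	return nil
}

func main() {
	n := &node{registered: map[string]bool{}}

	// Priority-sorted order: the device plugin pod is admitted first...
	fmt.Println(n.admit("sriov-device-plugin", nil)) // <nil>

	// ...but its registration happens asynchronously, so a consumer
	// admitted immediately afterwards still fails:
	fmt.Println(n.admit("testpmd", []string{"example.com/netdevice"})) // error

	// Only after the plugin has registered would admission succeed:
	n.registered["example.com/netdevice"] = true
	fmt.Println(n.admit("testpmd", []string{"example.com/netdevice"})) // <nil>
}
```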

As it stands today, this improvement seems too minor to invest time in, so I'm closing it for now.