kubernetes / kubernetes

GracefulNodeShutdown fail to update Pod status for system critical pods. #124448

Open · avestuk opened this issue 6 months ago

avestuk commented 6 months ago

What happened?

Pods with a system-cluster-critical priority class are not evicted by Graceful Node Shutdown when a node is slated to be shut down. Instead, they are evicted by the Taint Manager once the tolerationSeconds period expires.

What did you expect to happen?

I'd expect the presence of a priority class to be orthogonal to whether Graceful Node Shutdown evicts the pods. If Graceful Node Shutdown runs on node shutdown, then the pods should be evicted before the Taint Manager acts.

How can we reproduce it (as minimally and precisely as possible)?

Run a deployment with the priority class below:

priorityClassName: system-cluster-critical
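
For example, a Deployment manifest along these lines reproduces it (a minimal sketch; the namespace, image, names, and replica count are illustrative and mirror the Go definition further down this issue):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
  namespace: kube-system
  labels:
    foo: bar
spec:
  replicas: 1
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar
    spec:
      priorityClassName: system-cluster-critical
      containers:
      - name: busybox
        image: busybox:stable-glibc
        command: ["sleep"]
        args: ["infinity"]
```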

Schedule a node shutdown or trigger a shutdown event using gdbus

gdbus emit --system --object-path /org/freedesktop/login1 --signal 'org.freedesktop.login1.Manager.PrepareForShutdown' true

Observe that pods from the deployment enter a "Terminating" state and have conditions indicating they were evicted by the Taint manager

  - lastProbeTime: null
    lastTransitionTime: "2024-04-22T14:30:26Z"
    message: 'Taint manager: deleting due to NoExecute taint'
    reason: DeletionByTaintManager
    status: "True"
    type: DisruptionTarget

The pod status is never updated to show TerminationByKubelet, and the pod stays in the Running state until the Taint Manager evicts it.

Anything else we need to know?

Kubelet logs indicate that the node shutdown manager finished terminating the pod, but this is never reflected in the API:

shutdown_manager_linux.go:395] "Shutdown manager finished killing pod" pod="kube-system/coredns-5687d5bddb-fqlbh"

The pod as seen in the API server:

coredns-5687d5bddb-fqlbh                  1/1     Terminating   0          89m

Kubernetes version

```console
$ kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.3
```

Cloud provider

Akamai Connected Cloud (The cloud formerly known as Linode)

OS version

```console
# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

$ uname -a
Linux lke-e2e-bcbc70-test-1a751f3d 6.1.0-0.deb11.17-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1~bpo11+1 (2024-01-05) x86_64 GNU/Linux
```

Install tools

kubeadm

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

avestuk commented 6 months ago

/sig node

k8s-ci-robot commented 6 months ago

@avestuk: The label(s) sig/sig-node cannot be applied, because the repository doesn't have them.

In response to [this](https://github.com/kubernetes/kubernetes/issues/124448#issuecomment-2070036030):

> /sig sig-node

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

avestuk commented 6 months ago

/sig node

saschagrunert commented 6 months ago

cc @kwilczynski

I'm wondering whether we should treat this as a bug or handle it as an enhancement to the graceful node shutdown feature.

zhifei92 commented 6 months ago

This could be because the kubelet receives the node shutdown event and deletes the pod, while the KCM (kube-controller-manager) only initiates pod eviction after a 30-second delay. By the time the KCM starts the eviction process, the node has already shut down, but the API server has not been updated with the information that the pod was deleted, so the pod stays in the Terminating state.

If I understand correctly, I do think it can be improved.

AnishShah commented 6 months ago

The Graceful Node Shutdown feature does not evict pods based on taints & tolerations. It first evicts all regular pods and then the system pods. In this case, coredns might be running as a system pod?
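
For reference, the split between the two phases comes from the kubelet's graceful shutdown settings, e.g. (a sketch; the durations are illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# total time the node delays shutdown; regular pods get
# shutdownGracePeriod - shutdownGracePeriodCriticalPods
shutdownGracePeriod: 30s
# portion of the window reserved for critical (system) pods
shutdownGracePeriodCriticalPods: 10s
```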

AnishShah commented 6 months ago

/triage needs-information

avestuk commented 6 months ago

@AnishShah I found yesterday that my initial report was spurious: this does seem to be due to priority classes. CoreDNS was indeed running as a system-cluster-critical pod and so falls into the ShutdownGracePeriodCriticalPods timeout.

I'm able to consistently reproduce the failure using a minimal example deployment: when the PriorityClassName for the deployment is set to "system-cluster-critical", NodeGracefulShutdown fails.

Below are logs from a successful pod eviction due to graceful node shutdown. The pod is part of a deployment without a PriorityClassName specified.

Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.621576     718 nodeshutdown_manager_linux.go:375] "Shutdown manager killing pod with gracePeriod" pod="kube-system/test-68969998d9-7nqcn" gracePeriod=20
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.621746     718 pod_workers.go:867] "Pod is being removed by the kubelet, begin teardown" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" updateType="kill"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.621834     718 pod_workers.go:963] "Notifying pod of pending update" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" workType="terminating"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.621926     718 pod_workers.go:970] "Cancelling current pod sync" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" workType="terminating"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622017     718 pod_workers.go:1230] "Processing pod event" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" updateType="terminating"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622101     718 pod_workers.go:1352] "Pod worker has observed request to terminate" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622188     718 kubelet.go:2011] "SyncTerminatingPod enter" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622262     718 kubelet_pods.go:1673] "Generating pod status" podIsTerminal=false pod="kube-system/test-68969998d9-7nqcn"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622415     718 kubelet_pods.go:1686] "Got phase for pod" pod="kube-system/test-68969998d9-7nqcn" oldPhase="Running" phase="Running"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622546     718 status_manager.go:687] "updateStatusInternal" version=3 podIsFinished=false pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" containers="(busybox state=running previous=<none>)"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622648     718 kubelet.go:2021] "Pod terminating with grace period" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" gracePeriod=20
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622757     718 kuberuntime_container.go:749] "Killing container with a grace period override" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" containerName="busybox" containerID="containerd://771f1240ede543dccb522e67e8bb099e79bf716f6b5a06ef117e72a220ce7d60" gracePerio>
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622864     718 kuberuntime_container.go:770] "Killing container with a grace period" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" containerName="busybox" containerID="containerd://771f1240ede543dccb522e67e8bb099e79bf716f6b5a06ef117e72a220ce7d60" gracePeriod=20
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.623182     718 status_manager.go:227] "Syncing updated statuses"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.623306     718 status_manager.go:833] "Sync pod status" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" statusUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" version=3
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.623567     718 event.go:376] "Event occurred" object="kube-system/test-68969998d9-7nqcn" fieldPath="spec.containers{busybox}" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container busybox"

Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.630224     718 status_manager.go:874] "Patch status for pod" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" patch="{\"metadata\":{\"uid\":\"459ad118-8ed7-476b-b92f-5fa2753b49e6\"}}"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.632487     718 status_manager.go:881] "Status for pod is up-to-date" pod="kube-system/test-68969998d9-7nqcn" statusVersion=3

Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:18.729875     718 status_manager.go:984] "Pod status is inconsistent with cached status for pod, a reconciliation should be triggered" pod="kube-system/test-68969998d9-7nqcn" statusDiff=<
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:           &v1.PodStatus{
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         Phase: "Running",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         Phase: "Failed",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   Conditions: []v1.PodCondition{
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                           ... // 3 identical elements
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                           {Type: "ContainersReady", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:37:18 +0000 UTC"}},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                           {Type: "PodScheduled", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:37:16 +0000 UTC"}},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                 {
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         Type:               "DisruptionTarget",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         Status:             "True",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         LastTransitionTime: s"2024-04-30 09:38:16 +0000 UTC",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         Reason:             "TerminationByKubelet",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         Message:            "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                 },
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   },
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         Message:           "",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         Message:           "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         Reason:            "",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         Reason:            "Terminated",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   NominatedNodeName: "",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   HostIP:            "192.168.161.11",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   ... // 2 identical fields
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   PodIPs:                     {{IP: "10.2.0.131"}},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   StartTime:                  s"2024-04-30 09:37:16 +0000 UTC",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         InitContainerStatuses:      nil,
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         InitContainerStatuses:      []v1.ContainerStatus{},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   ContainerStatuses:          {{Name: "busybox", State: {Running: &{StartedAt: {Time: s"2024-04-30 09:37:17 +0000 UTC"}}}, Ready: true, Image: "docker.io/library/busybox:stable-glibc", ...}},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   QOSClass:                   "BestEffort",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         EphemeralContainerStatuses: nil,
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         EphemeralContainerStatuses: []v1.ContainerStatus{},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   Resize:                     "",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   ResourceClaimStatuses:      nil,
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:           }
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:  >

Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:18.730751     718 status_manager.go:833] "Sync pod status" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" statusUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" version=3
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:18.735733     718 status_manager.go:874] "Patch status for pod" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" patch="{\"metadata\":{\"uid\":\"459ad118-8ed7-476b-b92f-5fa2753b49e6\"}}"
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:18.735835     718 status_manager.go:881] "Status for pod is up-to-date" pod="kube-system/test-68969998d9-7nqcn" statusVersion=3

Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:28.730445     718 status_manager.go:984] "Pod status is inconsistent with cached status for pod, a reconciliation should be triggered" pod="kube-system/test-68969998d9-7nqcn" statusDiff=<
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:           &v1.PodStatus{
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         Phase: "Running",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         Phase: "Failed",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   Conditions: []v1.PodCondition{
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                           ... // 3 identical elements
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                           {Type: "ContainersReady", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:37:18 +0000 UTC"}},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                           {Type: "PodScheduled", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:37:16 +0000 UTC"}},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                 {
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         Type:               "DisruptionTarget",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         Status:             "True",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         LastTransitionTime: s"2024-04-30 09:38:16 +0000 UTC",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         Reason:             "TerminationByKubelet",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         Message:            "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                 },
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   },
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         Message:           "",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         Message:           "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         Reason:            "",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         Reason:            "Terminated",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   NominatedNodeName: "",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   HostIP:            "192.168.161.11",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   ... // 2 identical fields
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   PodIPs:                     {{IP: "10.2.0.131"}},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   StartTime:                  s"2024-04-30 09:37:16 +0000 UTC",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         InitContainerStatuses:      nil,
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         InitContainerStatuses:      []v1.ContainerStatus{},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   ContainerStatuses:          {{Name: "busybox", State: {Running: &{StartedAt: {Time: s"2024-04-30 09:37:17 +0000 UTC"}}}, Ready: true, Image: "docker.io/library/busybox:stable-glibc", ...}},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   QOSClass:                   "BestEffort",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         EphemeralContainerStatuses: nil,
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         EphemeralContainerStatuses: []v1.ContainerStatus{},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   Resize:                     "",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   ResourceClaimStatuses:      nil,
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:           }
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:  >
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:28.731178     718 status_manager.go:833] "Sync pod status" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" statusUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" version=3
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:28.732865     718 kubelet.go:2505] "SyncLoop (housekeeping) end" duration="5ms"
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.241307     718 kubelet.go:2885] "Container runtime status" status="Runtime Conditions: RuntimeReady=true reason: message:, NetworkReady=true reason: message:, ContainerdHasNoDeprecationWarnings=true reason: message:"
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.242771     718 generic.go:224] "GenericPLEG: Relisting"
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.243558     718 kuberuntime_manager.go:437] "Retrieved pods from runtime" all=true
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.362998     718 kubelet_node_status.go:538] "Updating node status"
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.592638     718 status_manager.go:874] "Patch status for pod" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" patch="{\"metadata\":{\"uid\":\"459ad118-8ed7-476b-b92f-5fa2753b49e6\"}}"
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.593094     718 status_manager.go:881] "Status for pod is up-to-date" pod="kube-system/test-68969998d9-7nqcn" statusVersion=3

Apr 30 09:38:30 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:30.727400     718 kubelet.go:2466] "SyncLoop (SYNC) pods" total=1 pods=["kube-system/test-68969998d9-7nqcn"]
Apr 30 09:38:30 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:30.727876     718 pod_workers.go:963] "Notifying pod of pending update" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" workType="terminating"

Apr 30 09:38:38 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:38.047774     718 kubelet.go:2428] "SyncLoop RECONCILE" source="api" pods=["kube-system/test-68969998d9-7nqcn"]
Apr 30 09:38:38 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:38.049563     718 status_manager.go:874] "Patch status for pod" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" patch="{\"metadata\":{\"uid\":\"459ad118-8ed7-476b-b92f-5fa2753b49e6\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"DisruptionTarget\"},{\"type\":>
Apr 30 09:38:38 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:38.049894     718 status_manager.go:883] "Status for pod updated successfully" pod="kube-system/test-68969998d9-7nqcn" statusVersion=4 status={"phase":"Failed","conditions":[{"type":"DisruptionTarget","status":"True","lastProbeTime":null,"lastTransitionTime":"2024-04-30T09:38:37Z","reason":"TerminationByKube>
Apr 30 09:38:38 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:38.050132     718 status_manager.go:833] "Sync pod status" podUID="8999dbb6-9b04-4300-acde-41e81f01c62a" statusUID="8999dbb6-9b04-4300-acde-41e81f01c62a" version=1

Now, logs from the same test but with the PriorityClassName set to system-cluster-critical. I believe the crucial difference is that in the example below we never see a success message from the status_manager where the pod conditions are set to reflect the TerminationByKubelet state.

Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.028743    1278 nodeshutdown_manager_linux.go:375] "Shutdown manager killing pod with gracePeriod" pod="kube-system/test-7db8c76ddc-mnjmg" gracePeriod=10
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.030611    1278 pod_workers.go:867] "Pod is being removed by the kubelet, begin teardown" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" updateType="kill"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.030688    1278 pod_workers.go:963] "Notifying pod of pending update" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" workType="terminating"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.030769    1278 pod_workers.go:970] "Cancelling current pod sync" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" workType="terminating"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.030852    1278 pod_workers.go:1230] "Processing pod event" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" updateType="terminating"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.030933    1278 pod_workers.go:1352] "Pod worker has observed request to terminate" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031024    1278 kubelet.go:2011] "SyncTerminatingPod enter" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031102    1278 kubelet_pods.go:1673] "Generating pod status" podIsTerminal=false pod="kube-system/test-7db8c76ddc-mnjmg"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031201    1278 kubelet_pods.go:1686] "Got phase for pod" pod="kube-system/test-7db8c76ddc-mnjmg" oldPhase="Running" phase="Running"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031306    1278 status_manager.go:687] "updateStatusInternal" version=3 podIsFinished=false pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" containers="(busybox state=running previous=<none>)"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031461    1278 kubelet.go:2021] "Pod terminating with grace period" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" gracePeriod=10
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031635    1278 kuberuntime_container.go:749] "Killing container with a grace period override" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" containerName="busybox" containerID="containerd://aad92529eec61d358f420531acfb67ac9213448057d41ceba090a98d08b4487a" gracePeri>
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031738    1278 kuberuntime_container.go:770] "Killing container with a grace period" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" containerName="busybox" containerID="containerd://aad92529eec61d358f420531acfb67ac9213448057d41ceba090a98d08b4487a" gracePeriod=10
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031877    1278 event.go:376] "Event occurred" object="kube-system/test-7db8c76ddc-mnjmg" fieldPath="spec.containers{busybox}" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container busybox"

Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:14.846215    1278 status_manager.go:984] "Pod status is inconsistent with cached status for pod, a reconciliation should be triggered" pod="kube-system/test-7db8c76ddc-mnjmg" statusDiff=<
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:           &v1.PodStatus{
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         -         Phase: "Running",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +         Phase: "Failed",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   Conditions: []v1.PodCondition{
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                           ... // 3 identical elements
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                           {Type: "ContainersReady", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:07:58 +0000 UTC"}},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                           {Type: "PodScheduled", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:07:40 +0000 UTC"}},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +                 {
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +                         Type:               "DisruptionTarget",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +                         Status:             "True",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +                         LastTransitionTime: s"2024-04-30 09:08:11 +0000 UTC",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +                         Reason:             "TerminationByKubelet",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +                         Message:            "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +                 },
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   },
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         -         Message:           "",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +         Message:           "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         -         Reason:            "",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +         Reason:            "Terminated",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   NominatedNodeName: "",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   HostIP:            "192.168.161.11",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   ... // 2 identical fields
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   PodIPs:                     {{IP: "10.2.0.131"}},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   StartTime:                  s"2024-04-30 09:07:40 +0000 UTC",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         -         InitContainerStatuses:      nil,
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +         InitContainerStatuses:      []v1.ContainerStatus{},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   ContainerStatuses:          {{Name: "busybox", State: {Running: &{StartedAt: {Time: s"2024-04-30 09:07:57 +0000 UTC"}}}, Ready: true, Image: "docker.io/library/busybox:stable-glibc", ...}},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   QOSClass:                   "BestEffort",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         -         EphemeralContainerStatuses: nil,
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +         EphemeralContainerStatuses: []v1.ContainerStatus{},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   Resize:                     "",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   ResourceClaimStatuses:      nil,
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:           }
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:  >

Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:14.848233    1278 status_manager.go:833] "Sync pod status" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" statusUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" version=3
Apr 30 09:08:15 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:15.038338    1278 request.go:629] Waited for 190.015469ms due to client-side throttling, not priority and fairness, request: GET:https://a89588f5-edcc-4788-b2ac-d1ed32f7e448.cpc-av8.avest.lkedevs.net:443/api/v1/namespaces/kube-system/pods/test-7db8c76ddc-mnjmg
Apr 30 09:08:15 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:15.043024    1278 status_manager.go:874] "Patch status for pod" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" patch="{\"metadata\":{\"uid\":\"ae416f84-b43f-4490-be95-a7d1b556ae9b\"}}"
Apr 30 09:08:15 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:15.043109    1278 status_manager.go:881] "Status for pod is up-to-date" pod="kube-system/test-7db8c76ddc-mnjmg" statusVersion=3

Apr 30 09:08:21 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:21.029637    1278 nodeshutdown_manager_linux.go:413] "Shutdown manager pod killing time out" gracePeriod=10 priority=2000000000
Apr 30 09:08:21 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:21.045581    1278 nodeshutdown_manager_linux.go:327] "Shutdown manager completed processing shutdown event, node will shutdown shortly"

My deployment definition:

// kubernetes is a client-go clientset, ctx a context.Context, and replicas an int32, all defined elsewhere in the test.
deployment, err := kubernetes.AppsV1().Deployments("kube-system").Create(ctx, &appsv1.Deployment{
    ObjectMeta: metaV1.ObjectMeta{
        Name:   "test",
        Labels: map[string]string{"foo": "bar"},
    },
    Spec: appsv1.DeploymentSpec{
        Replicas: &replicas,
        Selector: &metaV1.LabelSelector{
            MatchLabels: map[string]string{"foo": "bar"},
        },
        Template: v1.PodTemplateSpec{
            ObjectMeta: metaV1.ObjectMeta{
                Labels: map[string]string{"foo": "bar"},
            },
            Spec: v1.PodSpec{
                PriorityClassName: "system-cluster-critical",
                Containers: []v1.Container{
                    {
                        Name:    "busybox",
                        Image:   "busybox:stable-glibc",
                        Command: []string{"sleep"},
                        Args:    []string{"infinity"},
                    },
                },
            },
        },
    },
}, metaV1.CreateOptions{})
AnishShah commented 6 months ago

What do you mean by "NodeGracefulShutdown fails"? Do you not see a pod status update? Is it possible to share the full kubelet logs?

/retitle NodeGracefulShutdown fail to update Pod status for system critical pods

kwilczynski commented 6 months ago

@avestuk, so this is no longer a problem related to setting up tolerations for some Pods?

The Graceful Node Shutdown manager does not have any means to convey that it is about to terminate some Pods, other than setting an appropriate node status and taints. As such, a potential race to evict and/or terminate Pods with the Taint (Eviction?) manager could take place.

The KEP and then the implementation most likely did not take such a scenario into account.

However, you sound as if this is no longer an issue?

kwilczynski commented 6 months ago

@avestuk, regarding your recent update:

How is your Graceful Node Shutdown configured at the moment? Would you be able to share the relevant configuration?

Also, are you testing on a worker node or a member of the control plane?

There is nothing specific about how a Pod of a certain priority is terminated. The order in which Pods are terminated is different, but the actual process is the same. Thus, I am not sure why the problem you are seeing only applies to Pods with a priority class set.

_(taken from kubernetes/kubernetes/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L452)_

func groupByPriority(shutdownGracePeriodByPodPriority []kubeletconfig.ShutdownGracePeriodByPodPriority, pods []*v1.Pod) []podShutdownGroup {
    groups := make([]podShutdownGroup, 0, len(shutdownGracePeriodByPodPriority))
    for _, period := range shutdownGracePeriodByPodPriority {
        groups = append(groups, podShutdownGroup{
            ShutdownGracePeriodByPodPriority: period,
        })
    }

    for _, pod := range pods {
        var priority int32
        if pod.Spec.Priority != nil {
            priority = *pod.Spec.Priority
        }

        // Find the group index according to the priority.
        index := sort.Search(len(groups), func(i int) bool {
            return groups[i].Priority >= priority
        })

        // 1. Those higher than the highest priority default to the highest priority
        // 2. Those lower than the lowest priority default to the lowest priority
        // 3. Those boundary priority default to the lower priority
        // if priority of pod is:
        //   groups[index-1].Priority <= pod priority < groups[index].Priority
        // in which case we want to pick lower one (i.e index-1)
        if index == len(groups) {
            index = len(groups) - 1
        } else if index < 0 {
            index = 0
        } else if index > 0 && groups[index].Priority > priority {
            index--
        }

        groups[index].Pods = append(groups[index].Pods, pod)
    }
    return groups
}
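
As far as I can tell, when only shutdownGracePeriod and shutdownGracePeriodCriticalPods are set, the manager synthesizes two such groups internally, roughly equivalent to this by-priority configuration (a sketch using the default system-cluster-critical priority value and 30s/10s as example durations):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriodByPodPriority:
# pods below system-critical priority are terminated first, within
# shutdownGracePeriod - shutdownGracePeriodCriticalPods
- priority: 0
  shutdownGracePeriodSeconds: 20
# system-cluster-critical (2000000000) and above are terminated last
- priority: 2000000000
  shutdownGracePeriodSeconds: 10
```

So a Pod with priorityClassName: system-cluster-critical should simply land in the second group; the termination path itself is the same.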

What happens when you add another Pod and set the priority class to system-node-critical, which is even higher than system-cluster-critical, and then run your test again? Do both have the same issues you noticed?

What exactly is the issue you are seeing? Are the pods' statuses not updated? Are the pods not terminated? Is the order in which things are terminated incorrect?

Related:

avestuk commented 6 months ago

@kwilczynski

Graceful Node Shutdown settings are:

- shutdownGracePeriod: 30s
- shutdownGracePeriodCriticalPods: 10s

I am testing on a worker node.

The issue is that the pod status is not updated to reflect TerminationByKubelet. This means the Deployment controlling the pod never sees that another replica is needed, so no new pod is scheduled, and we end up waiting 300s for the Taint Manager eviction. My expectation is that when Graceful Node Shutdown is configured, all pods are evicted from the node and rescheduled elsewhere in the cluster.
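
(For context on the 300s: unless overridden, pods get default NoExecute tolerations for the not-ready/unreachable taints with tolerationSeconds: 300, added by admission, roughly:)

```yaml
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```

So once the node goes NotReady, the Taint Manager only evicts the pod after that 300s window, which is what we end up waiting for when the graceful shutdown status update is lost.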

I'll update the initial post to clarify what the current state of affairs is.

ndixita commented 6 months ago

/assign @rphillips

/triage accepted

kwilczynski commented 6 months ago

@avestuk, thank you!

We did have some issues in the past with a few types of workloads where statuses were not correctly reflected or synchronised. I wonder whether you have found a regression or a wholly new issue. For example:

Which looks somewhat similar to the issue you are seeing.

There are also other currently open issues against the Graceful Node Shutdown that might also be related.

tuibeovince commented 5 months ago

A little nitpicky but the title makes this issue a little hard to find. Shouldn't it say GracefulNodeShutdown instead of NodeGracefulShutdown?

AnishShah commented 5 months ago

/retitle GracefulNodeShutdown fail to update Pod status for system critical pods.

haircommander commented 1 month ago

@kwilczynski, do you think we have the information needed to address this bug? (Not asking you to address it, just checking whether we have the full context and can triage it.)

kwilczynski commented 1 month ago

@haircommander, I believe we do, yes.

haircommander commented 1 month ago

/triage accepted
/assign @kwilczynski
/priority backlog

SergeyKanzhelev commented 1 month ago

/remove-triage needs-information

SergeyKanzhelev commented 1 month ago

/unassign @rphillips