avestuk opened 6 months ago
/sig node
@avestuk: The label(s) sig/sig-node cannot be applied, because the repository doesn't have them.
/sig node
cc @kwilczynski
I'm wondering whether we should treat this as a bug or handle it as an enhancement to the graceful shutdown feature.
This could be due to the kubelet receiving a node shutdown event and deleting the pod, while the KCM (kube-controller-manager) only initiates pod eviction after a 30-second delay. By the time the KCM starts the eviction process, the node has already shut down, but the API server has not yet been updated with the information that the pod has been deleted, so the pod remains stuck in the Terminating state.
If I understand correctly, I do think it can be improved.
The graceful node shutdown feature does not evict pods based on taints and tolerations. It first terminates all the regular pods and then the system pods. In this case, coredns might be running as a system pod?
/triage needs-information
@AnishShah I found yesterday that my initial report was spurious. This does seem to be due to priority classes: CoreDNS was indeed running as a system-cluster-critical pod and so falls into the ShutdownGracePeriodCriticalPods timeout.
I'm able to consistently reproduce the failure using a minimal example deployment: when the deployment's PriorityClassName is set to "system-cluster-critical", NodeGracefulShutdown fails.
Below are logs from a successful pod eviction due to graceful node shutdown. The pod is part of a deployment without a PriorityClassName specified.
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.621576 718 nodeshutdown_manager_linux.go:375] "Shutdown manager killing pod with gracePeriod" pod="kube-system/test-68969998d9-7nqcn" gracePeriod=20
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.621746 718 pod_workers.go:867] "Pod is being removed by the kubelet, begin teardown" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" updateType="kill"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.621834 718 pod_workers.go:963] "Notifying pod of pending update" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" workType="terminating"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.621926 718 pod_workers.go:970] "Cancelling current pod sync" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" workType="terminating"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622017 718 pod_workers.go:1230] "Processing pod event" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" updateType="terminating"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622101 718 pod_workers.go:1352] "Pod worker has observed request to terminate" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622188 718 kubelet.go:2011] "SyncTerminatingPod enter" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622262 718 kubelet_pods.go:1673] "Generating pod status" podIsTerminal=false pod="kube-system/test-68969998d9-7nqcn"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622415 718 kubelet_pods.go:1686] "Got phase for pod" pod="kube-system/test-68969998d9-7nqcn" oldPhase="Running" phase="Running"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622546 718 status_manager.go:687] "updateStatusInternal" version=3 podIsFinished=false pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" containers="(busybox state=running previous=<none>)"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622648 718 kubelet.go:2021] "Pod terminating with grace period" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" gracePeriod=20
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622757 718 kuberuntime_container.go:749] "Killing container with a grace period override" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" containerName="busybox" containerID="containerd://771f1240ede543dccb522e67e8bb099e79bf716f6b5a06ef117e72a220ce7d60" gracePerio>
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622864 718 kuberuntime_container.go:770] "Killing container with a grace period" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" containerName="busybox" containerID="containerd://771f1240ede543dccb522e67e8bb099e79bf716f6b5a06ef117e72a220ce7d60" gracePeriod=20
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.623182 718 status_manager.go:227] "Syncing updated statuses"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.623306 718 status_manager.go:833] "Sync pod status" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" statusUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" version=3
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.623567 718 event.go:376] "Event occurred" object="kube-system/test-68969998d9-7nqcn" fieldPath="spec.containers{busybox}" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container busybox"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.630224 718 status_manager.go:874] "Patch status for pod" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" patch="{\"metadata\":{\"uid\":\"459ad118-8ed7-476b-b92f-5fa2753b49e6\"}}"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.632487 718 status_manager.go:881] "Status for pod is up-to-date" pod="kube-system/test-68969998d9-7nqcn" statusVersion=3
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:18.729875 718 status_manager.go:984] "Pod status is inconsistent with cached status for pod, a reconciliation should be triggered" pod="kube-system/test-68969998d9-7nqcn" statusDiff=<
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: &v1.PodStatus{
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: - Phase: "Running",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: + Phase: "Failed",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: Conditions: []v1.PodCondition{
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: ... // 3 identical elements
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: {Type: "ContainersReady", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:37:18 +0000 UTC"}},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: {Type: "PodScheduled", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:37:16 +0000 UTC"}},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: + {
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: + Type: "DisruptionTarget",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: + Status: "True",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: + LastTransitionTime: s"2024-04-30 09:38:16 +0000 UTC",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: + Reason: "TerminationByKubelet",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: + Message: "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: + },
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: },
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: - Message: "",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: + Message: "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: - Reason: "",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: + Reason: "Terminated",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: NominatedNodeName: "",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: HostIP: "192.168.161.11",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: ... // 2 identical fields
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: PodIPs: {{IP: "10.2.0.131"}},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: StartTime: s"2024-04-30 09:37:16 +0000 UTC",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: - InitContainerStatuses: nil,
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: + InitContainerStatuses: []v1.ContainerStatus{},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: ContainerStatuses: {{Name: "busybox", State: {Running: &{StartedAt: {Time: s"2024-04-30 09:37:17 +0000 UTC"}}}, Ready: true, Image: "docker.io/library/busybox:stable-glibc", ...}},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: QOSClass: "BestEffort",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: - EphemeralContainerStatuses: nil,
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: + EphemeralContainerStatuses: []v1.ContainerStatus{},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: Resize: "",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: ResourceClaimStatuses: nil,
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: }
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: >
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:18.730751 718 status_manager.go:833] "Sync pod status" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" statusUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" version=3
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:18.735733 718 status_manager.go:874] "Patch status for pod" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" patch="{\"metadata\":{\"uid\":\"459ad118-8ed7-476b-b92f-5fa2753b49e6\"}}"
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:18.735835 718 status_manager.go:881] "Status for pod is up-to-date" pod="kube-system/test-68969998d9-7nqcn" statusVersion=3
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:28.730445 718 status_manager.go:984] "Pod status is inconsistent with cached status for pod, a reconciliation should be triggered" pod="kube-system/test-68969998d9-7nqcn" statusDiff=<
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: &v1.PodStatus{
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: - Phase: "Running",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: + Phase: "Failed",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: Conditions: []v1.PodCondition{
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: ... // 3 identical elements
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: {Type: "ContainersReady", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:37:18 +0000 UTC"}},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: {Type: "PodScheduled", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:37:16 +0000 UTC"}},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: + {
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: + Type: "DisruptionTarget",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: + Status: "True",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: + LastTransitionTime: s"2024-04-30 09:38:16 +0000 UTC",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: + Reason: "TerminationByKubelet",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: + Message: "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: + },
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: },
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: - Message: "",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: + Message: "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: - Reason: "",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: + Reason: "Terminated",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: NominatedNodeName: "",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: HostIP: "192.168.161.11",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: ... // 2 identical fields
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: PodIPs: {{IP: "10.2.0.131"}},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: StartTime: s"2024-04-30 09:37:16 +0000 UTC",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: - InitContainerStatuses: nil,
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: + InitContainerStatuses: []v1.ContainerStatus{},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: ContainerStatuses: {{Name: "busybox", State: {Running: &{StartedAt: {Time: s"2024-04-30 09:37:17 +0000 UTC"}}}, Ready: true, Image: "docker.io/library/busybox:stable-glibc", ...}},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: QOSClass: "BestEffort",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: - EphemeralContainerStatuses: nil,
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: + EphemeralContainerStatuses: []v1.ContainerStatus{},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: Resize: "",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: ResourceClaimStatuses: nil,
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: }
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: >
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:28.731178 718 status_manager.go:833] "Sync pod status" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" statusUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" version=3
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:28.732865 718 kubelet.go:2505] "SyncLoop (housekeeping) end" duration="5ms"
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.241307 718 kubelet.go:2885] "Container runtime status" status="Runtime Conditions: RuntimeReady=true reason: message:, NetworkReady=true reason: message:, ContainerdHasNoDeprecationWarnings=true reason: message:"
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.242771 718 generic.go:224] "GenericPLEG: Relisting"
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.243558 718 kuberuntime_manager.go:437] "Retrieved pods from runtime" all=true
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.362998 718 kubelet_node_status.go:538] "Updating node status"
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.592638 718 status_manager.go:874] "Patch status for pod" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" patch="{\"metadata\":{\"uid\":\"459ad118-8ed7-476b-b92f-5fa2753b49e6\"}}"
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.593094 718 status_manager.go:881] "Status for pod is up-to-date" pod="kube-system/test-68969998d9-7nqcn" statusVersion=3
Apr 30 09:38:30 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:30.727400 718 kubelet.go:2466] "SyncLoop (SYNC) pods" total=1 pods=["kube-system/test-68969998d9-7nqcn"]
Apr 30 09:38:30 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:30.727876 718 pod_workers.go:963] "Notifying pod of pending update" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" workType="terminating"
Apr 30 09:38:38 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:38.047774 718 kubelet.go:2428] "SyncLoop RECONCILE" source="api" pods=["kube-system/test-68969998d9-7nqcn"]
Apr 30 09:38:38 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:38.049563 718 status_manager.go:874] "Patch status for pod" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" patch="{\"metadata\":{\"uid\":\"459ad118-8ed7-476b-b92f-5fa2753b49e6\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"DisruptionTarget\"},{\"type\":>
Apr 30 09:38:38 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:38.049894 718 status_manager.go:883] "Status for pod updated successfully" pod="kube-system/test-68969998d9-7nqcn" statusVersion=4 status={"phase":"Failed","conditions":[{"type":"DisruptionTarget","status":"True","lastProbeTime":null,"lastTransitionTime":"2024-04-30T09:38:37Z","reason":"TerminationByKube>
Apr 30 09:38:38 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:38.050132 718 status_manager.go:833] "Sync pod status" podUID="8999dbb6-9b04-4300-acde-41e81f01c62a" statusUID="8999dbb6-9b04-4300-acde-41e81f01c62a" version=1
Now, logs from the same test but with the PriorityClassName set to system-cluster-critical. I believe the crucial difference is that in the example below we never see a success message from the status_manager where the pod conditions are set to reflect the TerminationByKubelet state.
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.028743 1278 nodeshutdown_manager_linux.go:375] "Shutdown manager killing pod with gracePeriod" pod="kube-system/test-7db8c76ddc-mnjmg" gracePeriod=10
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.030611 1278 pod_workers.go:867] "Pod is being removed by the kubelet, begin teardown" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" updateType="kill"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.030688 1278 pod_workers.go:963] "Notifying pod of pending update" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" workType="terminating"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.030769 1278 pod_workers.go:970] "Cancelling current pod sync" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" workType="terminating"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.030852 1278 pod_workers.go:1230] "Processing pod event" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" updateType="terminating"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.030933 1278 pod_workers.go:1352] "Pod worker has observed request to terminate" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031024 1278 kubelet.go:2011] "SyncTerminatingPod enter" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031102 1278 kubelet_pods.go:1673] "Generating pod status" podIsTerminal=false pod="kube-system/test-7db8c76ddc-mnjmg"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031201 1278 kubelet_pods.go:1686] "Got phase for pod" pod="kube-system/test-7db8c76ddc-mnjmg" oldPhase="Running" phase="Running"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031306 1278 status_manager.go:687] "updateStatusInternal" version=3 podIsFinished=false pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" containers="(busybox state=running previous=<none>)"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031461 1278 kubelet.go:2021] "Pod terminating with grace period" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" gracePeriod=10
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031635 1278 kuberuntime_container.go:749] "Killing container with a grace period override" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" containerName="busybox" containerID="containerd://aad92529eec61d358f420531acfb67ac9213448057d41ceba090a98d08b4487a" gracePeri>
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031738 1278 kuberuntime_container.go:770] "Killing container with a grace period" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" containerName="busybox" containerID="containerd://aad92529eec61d358f420531acfb67ac9213448057d41ceba090a98d08b4487a" gracePeriod=10
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031877 1278 event.go:376] "Event occurred" object="kube-system/test-7db8c76ddc-mnjmg" fieldPath="spec.containers{busybox}" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container busybox"
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:14.846215 1278 status_manager.go:984] "Pod status is inconsistent with cached status for pod, a reconciliation should be triggered" pod="kube-system/test-7db8c76ddc-mnjmg" statusDiff=<
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: &v1.PodStatus{
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: - Phase: "Running",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: + Phase: "Failed",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: Conditions: []v1.PodCondition{
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: ... // 3 identical elements
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: {Type: "ContainersReady", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:07:58 +0000 UTC"}},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: {Type: "PodScheduled", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:07:40 +0000 UTC"}},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: + {
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: + Type: "DisruptionTarget",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: + Status: "True",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: + LastTransitionTime: s"2024-04-30 09:08:11 +0000 UTC",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: + Reason: "TerminationByKubelet",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: + Message: "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: + },
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: },
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: - Message: "",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: + Message: "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: - Reason: "",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: + Reason: "Terminated",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: NominatedNodeName: "",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: HostIP: "192.168.161.11",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: ... // 2 identical fields
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: PodIPs: {{IP: "10.2.0.131"}},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: StartTime: s"2024-04-30 09:07:40 +0000 UTC",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: - InitContainerStatuses: nil,
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: + InitContainerStatuses: []v1.ContainerStatus{},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: ContainerStatuses: {{Name: "busybox", State: {Running: &{StartedAt: {Time: s"2024-04-30 09:07:57 +0000 UTC"}}}, Ready: true, Image: "docker.io/library/busybox:stable-glibc", ...}},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: QOSClass: "BestEffort",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: - EphemeralContainerStatuses: nil,
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: + EphemeralContainerStatuses: []v1.ContainerStatus{},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: Resize: "",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: ResourceClaimStatuses: nil,
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: }
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: >
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:14.848233 1278 status_manager.go:833] "Sync pod status" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" statusUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" version=3
Apr 30 09:08:15 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:15.038338 1278 request.go:629] Waited for 190.015469ms due to client-side throttling, not priority and fairness, request: GET:https://a89588f5-edcc-4788-b2ac-d1ed32f7e448.cpc-av8.avest.lkedevs.net:443/api/v1/namespaces/kube-system/pods/test-7db8c76ddc-mnjmg
Apr 30 09:08:15 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:15.043024 1278 status_manager.go:874] "Patch status for pod" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" patch="{\"metadata\":{\"uid\":\"ae416f84-b43f-4490-be95-a7d1b556ae9b\"}}"
Apr 30 09:08:15 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:15.043109 1278 status_manager.go:881] "Status for pod is up-to-date" pod="kube-system/test-7db8c76ddc-mnjmg" statusVersion=3
Apr 30 09:08:21 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:21.029637 1278 nodeshutdown_manager_linux.go:413] "Shutdown manager pod killing time out" gracePeriod=10 priority=2000000000
Apr 30 09:08:21 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:21.045581 1278 nodeshutdown_manager_linux.go:327] "Shutdown manager completed processing shutdown event, node will shutdown shortly"
My deployment definition:
deployment, err := kubernetes.AppsV1().Deployments("kube-system").Create(ctx, &appsv1.Deployment{
    ObjectMeta: metaV1.ObjectMeta{
        Name:   "test",
        Labels: map[string]string{"foo": "bar"},
    },
    Spec: appsv1.DeploymentSpec{
        Replicas: &replicas,
        Selector: &metaV1.LabelSelector{
            MatchLabels: map[string]string{"foo": "bar"},
        },
        Template: v1.PodTemplateSpec{
            ObjectMeta: metaV1.ObjectMeta{
                Labels: map[string]string{"foo": "bar"},
            },
            Spec: v1.PodSpec{
                PriorityClassName: "system-cluster-critical",
                Containers: []v1.Container{
                    {
                        Name:    "busybox",
                        Image:   "busybox:stable-glibc",
                        Command: []string{"sleep"},
                        Args:    []string{"infinity"},
                    },
                },
            },
        },
    },
}, metaV1.CreateOptions{})
What do you mean by "NodeGracefulShutdown fails"? Do you not see a pod status update? Is it possible to share the full kubelet logs?
/retitle NodeGracefulShutdown fail to update Pod status for system critical pods
@avestuk, so this is no longer a problem related to setting up tolerations for some Pods?
The Graceful Node Shutdown manager does not have any means to convey that it will terminate some Pods other than setting an appropriate node status and taints; as such, a potential race to evict and/or terminate Pods with the Taint (Eviction?) manager could take place.
The KEP and then the implementation most likely did not take such a scenario into account.
However, you sound as if this is no longer an issue?
@avestuk, regarding your recent update:
How is your Graceful Node Shutdown configured at the moment? Would you be able to share the relevant configuration?
Also, are you testing on a worker node or a member of the control plane?
There is nothing specific about how a Pod of a certain priority is terminated. The order in which Pods are terminated differs, but the actual process is the same. Thus, I am not sure why the problem you are seeing applies only to Pods with a certain priority class set.
_(taken from kubernetes/kubernetes/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L452)_
func groupByPriority(shutdownGracePeriodByPodPriority []kubeletconfig.ShutdownGracePeriodByPodPriority, pods []*v1.Pod) []podShutdownGroup {
    groups := make([]podShutdownGroup, 0, len(shutdownGracePeriodByPodPriority))
    for _, period := range shutdownGracePeriodByPodPriority {
        groups = append(groups, podShutdownGroup{
            ShutdownGracePeriodByPodPriority: period,
        })
    }

    for _, pod := range pods {
        var priority int32
        if pod.Spec.Priority != nil {
            priority = *pod.Spec.Priority
        }

        // Find the group index according to the priority.
        index := sort.Search(len(groups), func(i int) bool {
            return groups[i].Priority >= priority
        })

        // 1. Those higher than the highest priority default to the highest priority
        // 2. Those lower than the lowest priority default to the lowest priority
        // 3. Those boundary priority default to the lower priority
        // if priority of pod is:
        // groups[index-1].Priority <= pod priority < groups[index].Priority
        // in which case we want to pick lower one (i.e index-1)
        if index == len(groups) {
            index = len(groups) - 1
        } else if index < 0 {
            index = 0
        } else if index > 0 && groups[index].Priority > priority {
            index--
        }

        groups[index].Pods = append(groups[index].Pods, pod)
    }
    return groups
}
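To make the grouping concrete, below is a minimal, self-contained sketch (not the kubelet's own code) of the same bucketing logic. It assumes that a configuration of shutdownGracePeriod: 30s and shutdownGracePeriodCriticalPods: 10s is translated into two groups, which is consistent with the gracePeriod=20 and gracePeriod=10 values in the logs above; the type and helper names here are illustrative only.

package main

import (
    "fmt"
    "sort"
)

// gracePeriodByPriority is a simplified stand-in for
// kubeletconfig.ShutdownGracePeriodByPodPriority, used only for this sketch.
type gracePeriodByPriority struct {
    Priority                   int32
    ShutdownGracePeriodSeconds int64
}

func main() {
    // Assumed translation of shutdownGracePeriod: 30s / shutdownGracePeriodCriticalPods: 10s.
    groups := []gracePeriodByPriority{
        {Priority: 0, ShutdownGracePeriodSeconds: 20},          // ordinary pods
        {Priority: 2000000000, ShutdownGracePeriodSeconds: 10}, // system-cluster-critical and above
    }

    // 0 = no priority class, 2000000000 = system-cluster-critical, 2000001000 = system-node-critical.
    for _, podPriority := range []int32{0, 2000000000, 2000001000} {
        // Same search as groupByPriority: find the first group whose priority is >= the
        // pod's priority, then step back one group on a strict boundary mismatch.
        index := sort.Search(len(groups), func(i int) bool {
            return groups[i].Priority >= podPriority
        })
        if index == len(groups) {
            index = len(groups) - 1
        } else if index > 0 && groups[index].Priority > podPriority {
            index--
        }
        fmt.Printf("pod priority %d -> shutdown grace period %ds\n",
            podPriority, groups[index].ShutdownGracePeriodSeconds)
    }
}

Both system-cluster-critical and system-node-critical pods land in the second group here, so, ordering aside, they should go through the same termination path with the 10-second grace period, which matches the failing logs.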
What happens when you add another Pod and set the priority class to system-node-critical, which is even higher than system-cluster-critical, and then run your test again? Do both have the same issues you noticed?
What is this issue you are seeing? Are the pods' statuses not updated? Are the pods not terminated? Is the order in which things are terminated incorrect?
Related:
@kwilczynski
Graceful Node Shutdown settings are:
- shutdownGracePeriod: 30s
- shutdownGracePeriodCriticalPods: 10s
I am testing on a worker node.
The issue is that the pod status is not being updated to reflect TerminationByKubelet. This means the deployment controlling the pod never sees that another replica is needed, so no new pod is scheduled, and we end up waiting 300s for Taint Manager eviction. My expectation is that when Graceful Node Shutdown is configured, all pods are evicted from the node and rescheduled elsewhere in the cluster.
I'll update the initial post to clarify what the current state of affairs is.
/assign @rphillips
/triage accepted
@avestuk, thank you!
We did have some issues in the past with a few types of workloads where statuses were not being correctly reflected or synchronised. I wonder if you found a regression or a wholly new issue. For example:
Which looks somewhat similar to the issue you are seeing.
There are also other currently open issues against the Graceful Node Shutdown that might also be related.
A little nitpicky, but the title makes this issue a little hard to find. Shouldn't it say GracefulNodeShutdown instead of NodeGracefulShutdown?
/retitle GracefulNodeShutdown fail to update Pod status for system critical pods.
@kwilczynski do you think we have the information needed to address this bug (not asking you to address it, just seeing if we have full context and can triage it)
@haircommander, I believe we do, yes.
/triage accepted
/assign @kwilczynski
/priority backlog
/remove-triage needs-information
/unassign @rphillips
What happened?
Pods with a system-cluster-critical priority class set are not evicted by Graceful node shutdown when a node is slated to be shut down. Instead they are evicted by the Taint Manager once the tolerationSeconds period expires.
What did you expect to happen?
I'd expect the presence of priority classes to be orthogonal to whether Graceful node shutdown evicts the pods. If graceful node shutdown runs on node shutdown, then pods should be evicted before the Taint Manager runs.
How can we reproduce it (as minimally and precisely as possible)?
Run a deployment with the below tolerations
Schedule a node shutdown or trigger a shutdown event using gdbus
Observe that pods from the deployment enter a "Terminating" state and have conditions indicating they were evicted by the Taint manager
The pod status is never updated to show TerminationByKubelet and stays in Running state until the eviction by Taint Manager
Anything else we need to know?
Kubelet logs will indicate that the node shutdown manager finished terminating the pod, but this doesn't get reflected in the API.
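To check what actually lands in the API server, a small client-go sketch like the following can be used to dump the pod's phase and conditions after the shutdown is triggered (the kubeconfig path is a placeholder, and the pod name is taken from the failing logs above). On a successful graceful shutdown the phase should flip to Failed with a DisruptionTarget condition whose reason is TerminationByKubelet; in the failing case the pod stays Running with no such condition:

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Placeholder kubeconfig path; point this at your cluster.
    config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
    if err != nil {
        panic(err)
    }
    client, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    // Pod name taken from the failing logs above; adjust as needed.
    pod, err := client.CoreV1().Pods("kube-system").Get(context.TODO(), "test-7db8c76ddc-mnjmg", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }

    // Expected on success: phase Failed plus a DisruptionTarget condition with
    // reason TerminationByKubelet. In the failing case the pod stays Running.
    fmt.Println("phase:", pod.Status.Phase, "reason:", pod.Status.Reason)
    for _, cond := range pod.Status.Conditions {
        fmt.Printf("condition %s=%s reason=%q message=%q\n", cond.Type, cond.Status, cond.Reason, cond.Message)
    }
}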
Pod in the API server
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)