kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

[Node Graceful Shutdown] kubelet sometimes doesn't finish killing pods before node shutdown within 30s and leaves the pods in Running state #110755

Open Dragoncell opened 2 years ago

Dragoncell commented 2 years ago

What happened?

In some cases, we saw that in a cluster with graceful node shutdown enabled, the kubelet started the shutdown for some pods but ended up not finishing it, and these pods then behave the same as they would without graceful shutdown on a preemptible VM.

2022-06-23 08:42:16.744 PDT
gke-dev-default-pool-XXX
I0623 15:42:16.744070    1748 nodeshutdown_manager_linux.go:316] "Shutdown manager killing pod with gracePeriod" pod="XX/client-64ccc5cc77-6jrfk" gracePeriod=15

>> Expected to see `Shutdown manager finished killing pod` logs

>> After the new node comes up, the pod skips the scheduling process and tries to start before the node is ready
2022-06-23 08:44:33.989 PDT
gke-dev-default-pool-XXX
E0623 15:44:33.989839    1746 pod_workers.go:951] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized" pod="XX/client-64ccc5cc77-6jrfk" podUID=4ee81e47-0e79-42fb-a7f6-40cd8e1a47a9

What did you expect to happen?

Pods should be set to a terminated or Failed status after the node is shut down.

How can we reproduce it (as minimally and precisely as possible)?

It depends on how fast the kubelet can kill pods. You could probably reproduce it by creating lots of pods that each have a preStop hook delaying termination, as sketched below.
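
A sketch of such a reproduction with client-go, assuming a hypothetical `slow-shutdown` Deployment whose pods spend most of their grace period in a `preStop` hook (all names, replica counts, and durations here are illustrative, not from the original report):

```go
package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func int32Ptr(i int32) *int32 { return &i }
func int64Ptr(i int64) *int64 { return &i }

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	labels := map[string]string{"app": "slow-shutdown"}
	deploy := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "slow-shutdown", Namespace: "default"},
		Spec: appsv1.DeploymentSpec{
			// Many replicas so the kubelet has a lot of pods to kill at once.
			Replicas: int32Ptr(50),
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					// Pod grace period roughly equal to the per-pod shutdown budget,
					// so termination uses up the whole window.
					TerminationGracePeriodSeconds: int64Ptr(15),
					Containers: []corev1.Container{{
						Name:    "sleeper",
						Image:   "busybox",
						Command: []string{"sh", "-c", "sleep 3600"},
						Lifecycle: &corev1.Lifecycle{
							// preStop hook that burns most of the grace period.
							PreStop: &corev1.LifecycleHandler{
								Exec: &corev1.ExecAction{Command: []string{"sh", "-c", "sleep 14"}},
							},
						},
					}},
				},
			},
		},
	}

	if _, err := client.AppsV1().Deployments("default").Create(context.TODO(), deploy, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("created slow-shutdown deployment; now trigger a node shutdown and watch the kubelet logs")
}
```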

Anything else we need to know?

No response

Kubernetes version

v1.23+

Cloud provider

GCP

OS version

```console
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
```

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

Dragoncell commented 2 years ago

/cc @bobbypage

wangyysde commented 2 years ago

Which container runtime (CRI) and version? It seems containerd has the same bug; see https://github.com/containerd/containerd/issues/7076

pacoxu commented 2 years ago

/sig node
/cc @wzshiming

wzshiming commented 2 years ago

It looks like the Kubelet didn't have enough time to report the Pod status to the API server.

wzshiming commented 2 years ago

When the graceful shutdown time of the Pod is the same as the graceful shutdown time of the node, the Pod is killed and the inhibit lock is released at the same time, so the Kubelet does not have enough time to report the Pod's exit status.
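
For intuition, the shutdown budget is split between regular and critical pods roughly as below (a simplified sketch of the idea only, not the actual nodeshutdown manager code; the durations mirror the kubelet's shutdownGracePeriod and shutdownGracePeriodCriticalPods settings, with values chosen to match the gracePeriod=15 seen in the log above):

```go
package main

import (
	"fmt"
	"time"
)

// gracePeriodFor sketches how the node shutdown budget is divided:
// regular pods get whatever is left after reserving time for critical pods.
func gracePeriodFor(critical bool, shutdownGracePeriod, shutdownGracePeriodCriticalPods time.Duration) time.Duration {
	if critical {
		return shutdownGracePeriodCriticalPods
	}
	return shutdownGracePeriod - shutdownGracePeriodCriticalPods
}

func main() {
	// Illustrative values: a 30s total budget with 15s reserved for critical pods.
	total := 30 * time.Second
	criticalSlice := 15 * time.Second

	regular := gracePeriodFor(false, total, criticalSlice)
	fmt.Println("regular pods are killed with gracePeriod =", regular) // 15s

	// If a pod's own terminationGracePeriodSeconds is >= this budget, the kill
	// only completes at the very end of the window, i.e. at the same moment the
	// inhibit lock is released, leaving no time to report the final Pod status.
}
```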

matthyx commented 2 years ago

I am not a big fan of solving race conditions with "wait a little bit more". Do you think there are other ways of protecting this status report?

matthyx commented 2 years ago

/triage accepted
/priority important-longterm

wzshiming commented 2 years ago

> I am not a big fan of solving race conditions with "wait a little bit more". Do you think there are other ways of protecting this status report?

@matthyx

If the Pods are cleaned up early, the Kubelet also finishes and releases the lock early; the configured period is only the maximum time the Kubelet will inhibit shutdown. Once the inhibit time is up, the Kubelet can't stop Systemd from killing it; it can only request more time from Systemd when it initializes itself. There is no other way.
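
For background, the inhibit lock here is a systemd-logind "delay"-type inhibitor, and logind honors it only up to its InhibitDelayMaxSec setting before shutting down anyway. A minimal sketch of taking such a lock over D-Bus, assuming github.com/godbus/dbus/v5 (an illustration of the mechanism, not the kubelet's actual implementation):

```go
package main

import (
	"fmt"
	"syscall"

	"github.com/godbus/dbus/v5"
)

func main() {
	// Connect to the system bus where systemd-logind is reachable.
	conn, err := dbus.SystemBus()
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Ask logind for a "delay" inhibitor on shutdown. logind will hold the
	// shutdown back for at most InhibitDelayMaxSec, then proceed regardless.
	obj := conn.Object("org.freedesktop.login1", "/org/freedesktop/login1")
	var fd dbus.UnixFD
	err = obj.Call("org.freedesktop.login1.Manager.Inhibit", 0,
		"shutdown",                 // what
		"kubelet",                  // who
		"Graceful pod termination", // why
		"delay",                    // mode
	).Store(&fd)
	if err != nil {
		panic(err)
	}
	fmt.Println("holding shutdown inhibitor, fd:", fd)

	// ... terminate pods here ...

	// Closing the returned file descriptor releases the lock early. If the
	// delay window expires first, systemd continues the shutdown whether or
	// not the work is finished.
	syscall.Close(int(fd))
}
```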

bobbypage commented 2 years ago

I'm also a bit uncertain about this approach of artificially adding extra time during shutdown (https://github.com/kubernetes/kubernetes/pull/110804#issuecomment-1240926997). In offline testing, I've seen cases where the pod status manager can have quite high latency even after pod resources are reclaimed. For example, when a large number of pods are being terminated, the latency from a pod actually being terminated to the time its status update is sent can be quite large (I've seen >= 10 seconds).

One of the reasons is that we only report Terminated state after the pod resources are reclaimed which can take some time (https://github.com/kubernetes/kubernetes/blob/1ad457bff582c0800bcfd98755dbb73aa3bce9d0/pkg/kubelet/status/status_manager.go#L735-L740).

In general, we never have a full guarantee that the status update will be sent -- the API server can be down, for example, or the kubelet's API QPS can be throttled. I think we need to figure out a better solution here.

Perhaps some other avenues to explore:

1) Find out what the general pod status update latency is - @smarterclayton has a PR in progress to add a metric for that: https://github.com/kubernetes/kubernetes/pull/107896
2) See if there are ways we can optimize and reduce the status update latency in the general case
3) Consider coming up with a new pod phase "terminating" (see https://github.com/kubernetes/kubernetes/issues/106884#issuecomment-1005074672 for prior discussion). If the node is deleted/shut down before the pod is "terminated", the status should be updated on the server side.

dghubble commented 2 years ago

In our v1.25.0 clusters, when a node is gracefully shut down, many non-critical pods end up in an Error or Completed state, with

Message:      Pod was terminated in response to imminent node shutdown.
Reason:       Terminated

after enabling graceful node shutdown. It's not rare tail latency, from my observations. As nodes auto-update, you end up with an increasing amount of Error/Completed Pod clutter. They're technically harmless, but manually cleaning them up is a non-starter. Inspecting the container runtime, those pods were indeed killed. Ultimately, disabling the kubelet's graceful node shutdown resolved the issue.

dghubble commented 2 years ago

Filed separately as https://github.com/kubernetes/kubernetes/issues/113278

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

dghubble commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 9 months ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten