Open Dragoncell opened 2 years ago
/cc @bobbypage
Container runtime (CRI) and version? It looks like containerd has the same bug; see: https://github.com/containerd/containerd/issues/7076
/sig node /cc @wzshiming
It looks like Kubelet didn't have enough time to report the Pod status to APIServer.
When the Pod's graceful shutdown period is the same as the node's graceful shutdown period, the Pod is killed and the inhibitor lock is released at the same time, so the Kubelet does not have enough time to report the Pod's exit status.
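To make the timing concrete, here is a minimal sketch in Go (with assumed example values, not taken from this issue) of the relationship between the node's shutdown budget for non-critical pods and a pod's own grace period; when the two are equal, there is no slack left for the final status update.

```go
// Sketch of the timing involved in kubelet graceful node shutdown.
// All values below are illustrative assumptions, not from the issue.
package main

import (
	"fmt"
	"time"
)

func main() {
	// KubeletConfiguration fields shutdownGracePeriod and
	// shutdownGracePeriodCriticalPods (example values).
	shutdownGracePeriod := 30 * time.Second
	shutdownGracePeriodCriticalPods := 10 * time.Second

	// Non-critical pods get the difference as their shutdown budget.
	nonCriticalBudget := shutdownGracePeriod - shutdownGracePeriodCriticalPods

	// A pod's own terminationGracePeriodSeconds (example value).
	podGracePeriod := 20 * time.Second

	// Time left to post the final pod status before the inhibitor lock
	// is released and systemd proceeds with the shutdown.
	slack := nonCriticalBudget - podGracePeriod
	fmt.Printf("non-critical budget: %v, pod grace period: %v, slack for status update: %v\n",
		nonCriticalBudget, podGracePeriod, slack)
	if slack <= 0 {
		fmt.Println("pod kill and inhibitor release coincide; the status update can be lost")
	}
}
```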
I am not a big fan of solving race conditions with "wait a little bit more". Do you think there are other ways of protecting this status report?
/triage accepted /priority important-longterm
> I am not a big fan of solving race conditions with "wait a little bit more". Do you think there are other ways of protecting this status report?
@matthyx
If the Pods are cleaned up early, the Kubelet will also exit early; the configured period is only the maximum time the Kubelet will inhibit shutdown. Once the inhibit delay is up, the Kubelet can't stop systemd from killing it; it can only request more time from systemd when it initializes itself. There is no other way.
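For context, the inhibitor lock in question is a systemd-logind "delay" lock. The following is a minimal sketch, not the kubelet's actual code, assuming github.com/godbus/dbus/v5: it reads InhibitDelayMaxUSec (the hard ceiling on how long shutdown can be delayed) and acquires a delay lock that is released when the returned file descriptor is closed.

```go
// Minimal sketch of taking a shutdown "delay" inhibitor lock from
// systemd-logind over D-Bus. Names and strings here are illustrative.
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	"github.com/godbus/dbus/v5"
)

func main() {
	conn, err := dbus.SystemBus()
	if err != nil {
		log.Fatalf("connect to system bus: %v", err)
	}
	logind := conn.Object("org.freedesktop.login1", "/org/freedesktop/login1")

	// InhibitDelayMaxUSec caps how long a "delay" lock can hold up shutdown;
	// no more time can be obtained once shutdown has started.
	prop, err := logind.GetProperty("org.freedesktop.login1.Manager.InhibitDelayMaxUSec")
	if err != nil {
		log.Fatalf("read InhibitDelayMaxUSec: %v", err)
	}
	maxDelay := time.Duration(prop.Value().(uint64)) * time.Microsecond
	fmt.Println("logind max inhibit delay:", maxDelay)

	// Take a shutdown delay lock; logind waits (up to maxDelay) for the
	// returned file descriptor to be closed before powering off.
	var fd dbus.UnixFD
	err = logind.Call("org.freedesktop.login1.Manager.Inhibit", 0,
		"shutdown", "example-inhibitor", "waiting for pod termination", "delay").Store(&fd)
	if err != nil {
		log.Fatalf("acquire inhibitor lock: %v", err)
	}
	lock := os.NewFile(uintptr(fd), "inhibitor")
	defer lock.Close() // closing the fd releases the lock
	fmt.Println("holding delay inhibitor lock")
}
```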
I'm also a bit uncertain about this approach of artificially adding extra time during shutdown (https://github.com/kubernetes/kubernetes/pull/110804#issuecomment-1240926997). In offline testing, I've seen cases where the pod status manager can have quite high latency even after pod resources are reclaimed. For example, when a large number of pods are being terminated, the latency from the pod actually being terminated to the status update being sent can be quite large (I've seen >= 10 seconds).
One of the reasons is that we only report the Terminated state after the pod resources are reclaimed, which can take some time (https://github.com/kubernetes/kubernetes/blob/1ad457bff582c0800bcfd98755dbb73aa3bce9d0/pkg/kubelet/status/status_manager.go#L735-L740).
In general, we never have a full guarantee that the status update will be sent -- the API server can be down, for example, or the kubelet's API QPS can be throttled. I think we need to figure out a better solution here.
Perhaps some other avenues to explore:
1) Find out what the general pod status update latency is - @smarterclayton has a PR in progress to add a metric for that: https://github.com/kubernetes/kubernetes/pull/107896
2) See if there are ways we can optimize and reduce the status update latency in the general case
3) Consider coming up with a new pod phase "terminating" (see https://github.com/kubernetes/kubernetes/issues/106884#issuecomment-1005074672 for prior discussion). If the node is deleted or shut down before the pod is "terminated", the status should be updated on the server side.
In our v1.25.0 clusters, when a node is gracefully shut down, many non-critical pods end up in an Error or Completed state, with
Message: Pod was terminated in response to imminent node shutdown.
Reason: Terminated
after enabling graceful node shutdown. From my observations, this is not just rare tail latency. As nodes auto-update, you end up with an increasing amount of Error/Completed Pod clutter. These pods are technically harmless, but manually cleaning them up is a non-starter. Inspecting the container runtime, those pods were indeed killed. Ultimately, disabling Kubelet graceful node shutdown resolved the issue.
Filed separately as https://github.com/kubernetes/kubernetes/issues/113278
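For anyone who needs to clear out the leftover pods in the meantime, below is a rough client-go sketch (a hypothetical helper, not existing project tooling). The Reason/Message filters are assumptions based on the strings reported above and may differ between releases.

```go
// Hypothetical cleanup helper: deletes terminal pods whose status message
// mentions node shutdown. The Reason string was "Terminated" in the report
// above; newer releases may use a different value, so treat the filters as
// assumptions and adjust for your cluster version.
package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// List every pod; filtering happens below. Fine for a sketch, but a
	// large cluster would want pagination or a field selector.
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range pods.Items {
		if p.Status.Phase != corev1.PodFailed && p.Status.Phase != corev1.PodSucceeded {
			continue
		}
		if p.Status.Reason != "Terminated" && p.Status.Reason != "NodeShutdown" {
			continue
		}
		if !strings.Contains(p.Status.Message, "node shutdown") {
			continue
		}
		fmt.Printf("deleting %s/%s (%s)\n", p.Namespace, p.Name, p.Status.Reason)
		if err := client.CoreV1().Pods(p.Namespace).Delete(context.TODO(), p.Name, metav1.DeleteOptions{}); err != nil {
			log.Printf("delete %s/%s: %v", p.Namespace, p.Name, err)
		}
	}
}
```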
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
What happened?
In some cases, we saw that in a cluster with graceful node shutdown enabled, the kubelet started the shutdown for some pods but ended up not finishing it, and these pods behaved the same as they would without graceful shutdown on a preemptible VM.
What did you expect to happen?
Pods are set to a terminated or failed status after the node is shut down.
How can we reproduce it (as minimally and precisely as possible)?
It depends on how fast the kubelet can kill pods. You could probably create lots of pods and give them preStop hooks.
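As a rough sketch of that reproduction idea (image, names, and timings are illustrative assumptions), the following creates many pods whose preStop hooks sleep for most of their termination grace period, so the kubelet is still busy killing pods when the node shutdown budget runs out:

```go
// Reproduction sketch: many pods that take almost their full grace period
// to terminate because of a preStop sleep. Values are illustrative.
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	grace := int64(30) // close to the node's non-critical shutdown budget
	for i := 0; i < 50; i++ {
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("shutdown-test-%d", i)},
			Spec: corev1.PodSpec{
				TerminationGracePeriodSeconds: &grace,
				Containers: []corev1.Container{{
					Name:    "sleeper",
					Image:   "busybox",
					Command: []string{"sleep", "3600"},
					Lifecycle: &corev1.Lifecycle{
						// preStop delays termination for most of the grace period.
						PreStop: &corev1.LifecycleHandler{
							Exec: &corev1.ExecAction{Command: []string{"sleep", "25"}},
						},
					},
				}},
			},
		}
		if _, err := client.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
			log.Printf("create %s: %v", pod.Name, err)
		}
	}
}
```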
Anything else we need to know?
No response
Kubernetes version
v1.23+
Cloud provider
GCP
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)