Closed: PaulFurtado closed this issue 8 months ago.
This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
/sig node
/kind support
/remove-kind support
Let's find out whether it is a bug or not before applying the support label.
/triage needs-information
Any chance you can try a supported k8s version? 1.23 is out of support in OSS.
With the load you have it may not be feasible, but if you can localize the issue and collect higher-verbosity kubelet logs (level 4), it will help.
/cc @harche
> Any chance you can try a supported k8s version? 1.23 is out of support in OSS.
We're working on it, but unfortunately 1.24 and 1.25 remove so many APIs that it will be months before we have resolved all of those deprecations and can actually upgrade our big clusters (multi-tenant; lots of different teams need to fix code).
> With the load you have it may not be feasible, but if you can localize the issue and collect higher-verbosity kubelet logs (level 4), it will help.
Happy to turn these on; we don't need these debug logs shipped off the machine, so there isn't a big cost impact.
I'd also love to help debug/fix this, but the pod lifecycle code in kubelet is pretty complicated and I could use a tip:
Given the pod status that I posted above, is it correct to say that the next step that should have occurred is that kubelet makes some API call to the apiserver that removes the deletionTimestamp field? What function in kubelet is responsible for that?
I was having a hard time tracking that piece down when reading the code. Since restarting kubelet does not remedy this state, the next time we have a pod stuck like this we can try to fix it, or at least add additional log statements (see the sketch below), since we can endlessly recompile kubelet and retry.
Thanks!
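For reference, here is a minimal sketch of the kind of verbosity-gated log statement we could add while recompiling. klog is the logging library kubelet uses, but the function name, message, and fields below are purely illustrative and not actual kubelet code:

```go
package main

import (
	"flag"

	"k8s.io/klog/v2"
)

// debugPodTermination is a hypothetical helper showing the pattern: the
// message is only emitted when the process runs with -v=4 or higher, which
// is the log level suggested above.
func debugPodTermination(podUID, phase string, hasDeletionTimestamp bool) {
	klog.V(4).InfoS("pod termination checkpoint",
		"podUID", podUID,
		"phase", phase,
		"hasDeletionTimestamp", hasDeletionTimestamp)
}

func main() {
	// klog registers its flags (including -v) on the standard flag set.
	klog.InitFlags(nil)
	_ = flag.Set("v", "4") // equivalent to starting the binary with -v=4
	flag.Parse()
	defer klog.Flush()

	debugPodTermination("example-uid", "Succeeded", true)
}
```

Anything gated behind klog.V(4) stays silent at the default verbosity, so statements like this only add output on nodes already running at the higher log level.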
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
What happened?
A job pod was terminating and kubelet restarted late in that process. Afterwards, the pod shows up with a Terminated status in kubectl get pods, but is in phase: Succeeded with all of its resources cleaned up.
kubectl shows:
The pod's metadata shows:
and the status shows:
It appears that the only difference between this pod and one that shows up as Completed in kubectl is that this one hasn't had its deletionTimestamp nulled out?
Logs for kubelet shutting down the pod:
Kubelet begins restarting at 20:18:27.944583 and comes back up at 20:18:28.128820. It then finishes the last volume after the restart:
After this point, it never logs anything about the pod or its containers ever again.
What did you expect to happen?
Pod should not have gotten stuck in this state.
How can we reproduce it (as minimally and precisely as possible)?
It is difficult to reproduce since you likely need to restart kubelet at exactly the right moment.
Anything else we need to know?
Feel free to mark this as a duplicate of another issue, but I sifted through many of the related stuck terminating issues and couldn't find one that seemed like an exact match.
We see this issue crop up 0-4 times per week in large clusters (300-700 nodes, 4,000-16,000 pods). We had refrained from filing an upstream issue before because we blamed docker/cri-dockerd for most termination issues, but this issue persists on kubernetes 1.23.16 with cri-o.
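In case it is useful to anyone else tracking how often this happens, below is a minimal client-go sketch that lists pods matching this signature, i.e. a deletionTimestamp that is still set even though the pod's phase is already Succeeded. It assumes a kubeconfig at the default location, is only a detection aid, and all names are illustrative:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig (~/.kube/config);
	// adjust as needed for in-cluster use.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List pods in every namespace and report the ones stuck in the state
	// described above: deletionTimestamp set, but phase already Succeeded.
	pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		if pod.DeletionTimestamp != nil && pod.Status.Phase == corev1.PodSucceeded {
			fmt.Printf("%s/%s: deletionTimestamp=%v, phase=%s\n",
				pod.Namespace, pod.Name, pod.DeletionTimestamp, pod.Status.Phase)
		}
	}
}
```

This only reports the affected pods; it does not modify them.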
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)