jgoeres opened this issue 1 year ago
/sig node
Related issue: #113606. The pod worker cannot cancel the context: https://github.com/kubernetes/kubernetes/blob/7efa62dfdf96890f7f3cf95d957c7561e09055c4/pkg/kubelet/pod_workers.go#L776
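The sketch below is not the kubelet's actual code; it is a minimal Go illustration of the pattern the linked line is about, assuming the fix direction is a cancellable, deadline-bound context flowing from the pod worker into blocking operations such as lifecycle hooks. All names and the timeout value are invented for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runPostStartHook stands in for the call that executes a postStart hook.
// It blocks until the context is cancelled, mimicking a hook that never
// exits on its own (as in this issue).
func runPostStartHook(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	// Hypothetical: derive a context bounded by the pod's termination grace
	// period (shortened here so the example finishes quickly). In the linked
	// pod worker code there is no such cancellable context, so a stuck hook
	// blocks termination indefinitely.
	gracePeriod := 2 * time.Second
	ctx, cancel := context.WithTimeout(context.Background(), gracePeriod)
	defer cancel()

	if err := runPostStartHook(ctx); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("postStart hook abandoned after grace period; container can now be stopped")
	}
}
```

With a context like this, a delete request or grace-period expiry could interrupt the blocked hook instead of waiting on it forever.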
/cc
/triage accepted /assign @smarterclayton
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
- /triage accepted (org members only)
- /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a further period of inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After a further period of inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- /remove-lifecycle stale
- /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a further period of inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After a further period of inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- /remove-lifecycle rotten
- /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a further period of inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After a further period of inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- /reopen
- /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
@HirazawaUi: Reopened this issue.
/triage accepted /priority backlog
This is a known issue with pod lifecycle, but it has not been solved yet. Leaving it open.
What happened?
We introduced a postStart hook to one of our pods. Eventually, we noticed these pods hanging in the "Terminating" state indefinitely when we tried to delete them. Trying `kubectl logs` yielded:

Error from server (BadRequest): container "reproducer" in pod "reproducer" is waiting to start: ContainerCreating

That was the first hint that the postStart hook was the problem (since a pod is considered PodInitializing until its postStart hook exits).
We could exec into the container just fine, and there we found that our postStart hook was indeed stuck. Killing all processes of the postStart hook made the pod terminate. Looking at the logs of our PID 1 process (luckily, we also write them to a file...) showed no sign of it receiving a SIGTERM (we have a shutdown hook in that application that would log if it were invoked).
From our observations, we conclude that as long as the pod has not finished its postStart hook, K8s does not send a SIGTERM to the container's PID 1 process. To make matters worse, even after the terminationGracePeriod expires, no kill happens, leading to the observed behaviour.
What did you expect to happen?
The pod terminates: if not with a SIGTERM to the PID 1 process, then at least with a kill after the terminationGracePeriod has expired.
How can we reproduce it (as minimally and precisely as possible)?
Apply the following pod spec:
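The pod spec referenced in the original report did not survive in this copy of the issue; below is a minimal sketch reconstructed from the steps that follow. The image, file locations, and exact shell commands are assumptions: the main process just appends "Faking work" to log.txt, and the postStart hook blocks until /postStartDone exists.

```yaml
# Sketch only: reconstructed from the reproduction steps below; the image and
# exact commands are assumptions, not the original spec from the report.
apiVersion: v1
kind: Pod
metadata:
  name: reproducer
spec:
  terminationGracePeriodSeconds: 30   # explicitly set to the default, for clarity
  containers:
  - name: reproducer
    image: ubuntu:22.04               # assumption: any image with bash (and ps for step 3) works
    command: ["/bin/bash", "-c"]
    args:
    - |
      # Main process: pretend to do work and append to log.txt forever.
      while true; do echo "Faking work" >> log.txt; sleep 1; done
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/bash
          - -c
          - |
            # Hook blocks until /postStartDone exists, then records completion.
            until [ -f /postStartDone ]; do sleep 1; done
            echo "poststart hook done" >> log.txt
```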
(Note that we explicitly set terminationGracePeriodSeconds to the default 30 for clarity.)

1. In one terminal, run `watch -d kubectl get pods`. After applying the spec, the pod shows up, but since the postStart hook never exits, it stays in ContainerCreating.
2. Confirm that you cannot see the logs with `kubectl logs reproducer` (error message: Error from server (BadRequest): container "reproducer" in pod "reproducer" is waiting to start: ContainerCreating).
3. Exec into the pod with `kubectl exec -it reproducer -- /bin/bash` and confirm with `ps aux` that both the root process and the postStart hook are running. Exit.
4. Delete the pod with `kubectl delete pod reproducer`. Observe that the pod goes from ContainerCreating to Terminating and stays in that state indefinitely, even after the termination grace period of 30 seconds has passed.
5. Exec into the pod again and run `tail -f log.txt`; you should only see "Faking work" lines.
6. In another shell, create the file that makes the postStart hook exit successfully: `kubectl exec reproducer -- /bin/touch /postStartDone`. "poststart hook done" will be logged and eventually the pod will terminate.

Anything else we need to know?
No response
Kubernetes version
Originally observed on EKS v1.24.8-eks-ffeb93d, reproduced on AKS v1.23.8 and v1.25.2, Minikube v1.24.10 and v1.25.6
Cloud provider
See above: AWS (EKS v1.24.8-eks-ffeb93d) and Azure (AKS v1.23.8 and v1.25.2)
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)