Open heartlock opened 1 year ago
/sig node
ref: #105204
ping @liggitt
/sig storage
Timeline of when the race condition occurs:
@liggitt related to https://github.com/kubernetes/kubernetes/pull/107821
https://github.com/kubernetes/kubernetes/pull/107821 didn't change anything about mounting or cleanup of the volumes, just when the kubelet starts/stops watching the API for updates to the secret or configmap object.
I don't doubt there's an issue here, but I don't think https://github.com/kubernetes/kubernetes/pull/107821 is related.
After #107821, a terminating pod is no longer registered with the configmap/secret manager when the kubelet restarts. This causes mounting of the configmap/secret volume to fail, so the volume is never added to the actual state of the world (ASW). The log below (and the sketch after it) shows the resulting error.
Nov 26 01:57:13 1.23-control-plane kubelet[63957]: E1126 01:57:13.915231 63957 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/configmap/b5ca3256-1f77-4ba8-a481-d3395b326dc3-config-volume podName:b5ca3256-1f77-4ba8-a481-d3395b326dc3 nodeName:}" failed. No retries permitted until 2022-11-26 01:57:45.915204598 +0000 UTC m=+71.914733599 (durationBeforeRetry 32s). Error: MountVolume.SetUp failed for volume "config-volume" (UniqueName: "kubernetes.io/configmap/b5ca3256-1f77-4ba8-a481-d3395b326dc3-config-volume") pod "coredns-75bb78cb68-z9vfh" (UID: "b5ca3256-1f77-4ba8-a481-d3395b326dc3") : object "kube-system"/"coredns" not registered
I think in this scenario the terminating pod should be registered with the secret/configmap manager; a rough sketch of that follows below. @liggitt
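Building on the toy sketch above, roughly the shape of that proposal (hypothetical names, not a patch against the real kubelet):

```go
// podInfo is a hypothetical, minimal view of a pod at kubelet restart.
type podInfo struct {
	namespace     string
	configMaps    []string
	isTerminating bool
}

// registerPodsOnRestart sketches the registration pass after a kubelet
// restart, using the toy manager from the previous snippet.
func registerPodsOnRestart(m *manager, pods []podInfo) {
	for _, p := range pods {
		// Current behavior (after #107821): a pod that is already
		// terminating when the kubelet restarts never reaches the
		// registration path, so later MountVolume.SetUp calls for it
		// fail with "not registered".
		if p.isTerminating {
			continue
			// Proposed: drop this skip, register the terminating pod
			// too, and unregister it once syncTerminatedPod finishes
			// cleanup.
		}
		m.RegisterPod(p.namespace, p.configMaps)
	}
}
```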
cc @bobbypage @smarterclayton for any insights into the sequence of calls to syncPod / syncTerminatedPod and how that interacts with kubelet restarts
ping @bobbypage @smarterclayton, I'd be glad to get your feedback
/cc
/assign
let me try to repro
/cc @smarterclayton
/assign @rphillips
/triage accepted
We also hit this bug on Kubernetes 1.23.13 today. Restarting the kubelet did not fix the problem.
Log lines from the node with the problem:
E0222 16:50:21.615499 2536251 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/projected/24e15a99-f1aa-450a-b292-3d81f092aba8-kube-api-access-q9jck podName:24e15a99-f1aa-450a-b292-3d81f092aba8 nodeName:}" failed. No retries permitted until 2023-02-22 16:52:23.615484035 +0000 UTC m=+871.878258815 (durationBeforeRetry 2m2s). Error: MountVolume.SetUp failed for volume "kube-api-access-q9jck" (UniqueName: "kubernetes.io/projected/24e15a99-f1aa-450a-b292-3d81f092aba8-kube-api-access-q9jck") pod "hs-infrawatch-runner-7d59c86dc-kpp29" (UID: "24e15a99-f1aa-450a-b292-3d81f092aba8") : object "infrawatch-beta"/"kube-root-ca.crt" not registered
E0222 16:50:21.615513 2536251 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/projected/3453602a-659a-46c9-b7ad-4c673c2fcc7d-kube-api-access-mq94b podName:3453602a-659a-46c9-b7ad-4c673c2fcc7d nodeName:}" failed. No retries permitted until 2023-02-22 16:52:23.615504959 +0000 UTC m=+871.878279739 (durationBeforeRetry 2m2s). Error: MountVolume.SetUp failed for volume "kube-api-access-mq94b" (UniqueName: "kubernetes.io/projected/3453602a-659a-46c9-b7ad-4c673c2fcc7d-kube-api-access-mq94b") pod "github-kube-deploy-config-commit-ttl-daemon-7dbfd99cbc-q6tbm" (UID: "3453602a-659a-46c9-b7ad-4c673c2fcc7d") : object "di-kube-common"/"kube-root-ca.crt" not registered
Maybe I am doing something wrong; I can't seem to reproduce the issue on my local 2-node minikube cluster.
suraj@suraj$ kubectl version --short
...
Server Version: v1.23.11
...
1. Make sure the coredns pods run on the non-master node (minikube-m02).
2. Stop the kubelet. The pod gets stuck in the Terminating state as expected. Pod UID: 722fc035-7626-4253-9584-b1099f83e9e6.
3. The pod is gone after starting the kubelet again.
4. The pod directory is gone as well.
Note that I have permission to view the pod directories.
kubelet logs:
Jun 20 06:23:33 minikube-m02 kubelet[4697]: I0620 06:23:33.317892 4697 reconciler.go:157] "Reconciler: start to sync state"
Jun 20 06:23:33 minikube-m02 kubelet[4697]: I0620 06:23:33.484087 4697 reconciler.go:192] "operationExecutor.UnmountVolume started for volume \"config-volume\" (UniqueName: \"kubernetes.io/configmap/722fc035-7626-4253-9584-b1099f83e9e6-config-volume\") pod \"722fc035-7626-4253-9584-b1099f83e9e6\" (UID: \"722fc035-7626-4253-9584-b1099f83e9e6\") "
Jun 20 06:23:33 minikube-m02 kubelet[4697]: I0620 06:23:33.484125 4697 reconciler.go:192] "operationExecutor.UnmountVolume started for volume \"kube-api-access-ctl5p\" (UniqueName: \"kubernetes.io/projected/722fc035-7626-4253-9584-b1099f83e9e6-kube-api-access-ctl5p\") pod \"722fc035-7626-4253-9584-b1099f83e9e6\" (UID: \"722fc035-7626-4253-9584-b1099f83e9e6\") "
Jun 20 06:23:33 minikube-m02 kubelet[4697]: W0620 06:23:33.484365 4697 empty_dir.go:517] Warning: Failed to clear quota on /var/lib/kubelet/pods/722fc035-7626-4253-9584-b1099f83e9e6/volumes/kubernetes.io~configmap/config-volume: clearQuota called, but quotas disabled
Jun 20 06:23:33 minikube-m02 kubelet[4697]: I0620 06:23:33.484435 4697 operation_generator.go:910] UnmountVolume.TearDown succeeded for volume "kubernetes.io/configmap/722fc035-7626-4253-9584-b1099f83e9e6-config-volume" (OuterVolumeSpecName: "config-volume") pod "722fc035-7626-4253-9584-b1099f83e9e6" (UID: "722fc035-7626-4253-9584-b1099f83e9e6"). InnerVolumeSpecName "config-volume". PluginName "kubernetes.io/configmap", VolumeGidValue ""
Jun 20 06:23:33 minikube-m02 kubelet[4697]: I0620 06:23:33.485221 4697 operation_generator.go:910] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/722fc035-7626-4253-9584-b1099f83e9e6-kube-api-access-ctl5p" (OuterVolumeSpecName: "kube-api-access-ctl5p") pod "722fc035-7626-4253-9584-b1099f83e9e6" (UID: "722fc035-7626-4253-9584-b1099f83e9e6"). InnerVolumeSpecName "kube-api-access-ctl5p". PluginName "kubernetes.io/projected", VolumeGidValue ""
Jun 20 06:23:33 minikube-m02 kubelet[4697]: I0620 06:23:33.584579 4697 reconciler.go:300] "Volume detached for volume \"config-volume\" (UniqueName: \"kubernetes.io/configmap/722fc035-7626-4253-9584-b1099f83e9e6-config-volume\") on node \"minikube-m02\" DevicePath \"\""
Jun 20 06:23:33 minikube-m02 kubelet[4697]: I0620 06:23:33.584633 4697 reconciler.go:300] "Volume detached for volume \"kube-api-access-ctl5p\" (UniqueName: \"kubernetes.io/projected/722fc035-7626-4253-9584-b1099f83e9e6-kube-api-access-ctl5p\") on node \"minikube-m02\" DevicePath \"\""
Jun 20 06:23:40 minikube-m02 kubelet[4697]: I0620 06:23:40.675366 4697 scope.go:110] "RemoveContainer" containerID="9426e42fa39b243e31f120d8a3d5b0c72981aad0c71039c690aea79966d96d07"
Jun 20 06:23:42 minikube-m02 kubelet[4697]: I0620 06:23:42.239286 4697 kubelet_volumes.go:160] "Cleaned up orphaned pod volumes dir" podUID=722fc035-7626-4253-9584-b1099f83e9e6 path="/var/lib/kubelet/pods/722fc035-7626-4253-9584-b1099f83e9e6/volumes"
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/unassign @rphillips @SergeyKanzhelev
/remove-sig node
Removing sig node since this got into the sig storage backlog; let sig storage remove the stale label and re-triage.
What happened?
The pod is successfully removed, but the configmap volume remains on the node (see the check sketched after the repro output below).
What did you expect to happen?
The pod is successfully removed and its volumes are cleaned up on the node.
How can we reproduce it (as minimally and precisely as possible)?
$ kubectl -n kube-system get po -owide
NAME                                         READY   STATUS        RESTARTS   AGE     IP            NODE                 NOMINATED NODE   READINESS GATES
coredns-75bb78cb68-pnrrh                     1/1     Terminating   0          6m25s   10.244.0.27   1.23-control-plane
etcd-1.23-control-plane                      1/1     Running       0          3d23h   172.18.0.2    1.23-control-plane
kindnet-vt9q5                                1/1     Running       0          22h     172.18.0.2    1.23-control-plane
kube-apiserver-1.23-control-plane            1/1     Running       0          3d23h   172.18.0.2    1.23-control-plane
kube-controller-manager-1.23-control-plane   1/1     Running       0          3d23h   172.18.0.2    1.23-control-plane
kube-proxy-4k9kz                             1/1     Running       0          3d23h   172.18.0.2    1.23-control-plane
kube-scheduler-1.23-control-plane            1/1     Running       0          3d23h   172.18.0.2    1.23-control-plane
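One way to check for the leftover volume described above is to list the pod's volumes directory under /var/lib/kubelet/pods (the same paths that appear in the kubelet logs earlier). A quick sketch, assuming root access on the node and using an illustrative pod UID:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Pod UID as reported by kubectl / the kubelet logs (illustrative value).
	podUID := "b5ca3256-1f77-4ba8-a481-d3395b326dc3"
	volumesDir := filepath.Join("/var/lib/kubelet/pods", podUID, "volumes")

	// After the pod object is gone, this directory should be gone too;
	// if the bug hit, kubernetes.io~configmap/<name> is still present.
	err := filepath.WalkDir(volumesDir, func(path string, d os.DirEntry, err error) error {
		if err != nil {
			return err
		}
		fmt.Println(path)
		return nil
	})
	if os.IsNotExist(err) {
		fmt.Println("volumes dir is gone; node is clean")
	} else if err != nil {
		fmt.Println("error:", err)
	}
}
```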
Anything else we need to know?
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)