kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

kubelet: volumeManager.WaitForUnmount does not wait for emptydir to be unmounted successfully #113563

Open bobbypage opened 1 year ago

bobbypage commented 1 year ago

What happened?

Kubelet has an internal syncTerminatedPod function which is called after pods are terminated. It performs the final pod cleanup and is responsible for ensuring that volumes mounted into the pod are unmounted. The function calls volumeManager.WaitForUnmount:

https://github.com/kubernetes/kubernetes/blob/7d9c0e0a78e519cac0f892a8be2f063bedc94bad/pkg/kubelet/kubelet.go#L1881-L1885

While testing for a different issue, I came across a problem with emptydir unmount handling -- it looks like volumeManager.WaitForUnmount will return successfully even if the emptydir was not unmounted successfully.

I chatted with @msau42 about this issue, and it seems this happens because WaitForUnmount only checks the mounted state: https://github.com/kubernetes/kubernetes/blob/7d9c0e0a78e519cac0f892a8be2f063bedc94bad/pkg/kubelet/volumemanager/cache/actual_state_of_world.go#L965

However, if there is an error during unmounting, the volume is marked as "uncertain" (https://github.com/kubernetes/kubernetes/blob/7d9c0e0a78e519cac0f892a8be2f063bedc94bad/pkg/volume/util/operationexecutor/operation_generator.go#L879), which results in WaitForUnmount succeeding despite the unmount failure. It's unclear if this is expected behavior.
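To make the failure mode concrete, here is a minimal, hypothetical Go sketch (not the actual kubelet code; the state model and function names are invented for illustration): if the "still mounted?" check only treats volumes in the mounted state as blocking, a volume whose teardown failed and was downgraded to "uncertain" no longer blocks, so a WaitForUnmount-style poll returns success.

```go
package main

import "fmt"

type mountState int

const (
	stateMounted   mountState = iota // volume is known to be mounted
	stateUncertain                   // teardown was attempted and failed; real state is unknown
	stateUnmounted                   // volume was torn down successfully
)

// allUnmountedLoose mirrors the reported behavior: only a volume that is
// still in stateMounted blocks; an "uncertain" volume is treated as gone.
func allUnmountedLoose(vols map[string]mountState) bool {
	for _, s := range vols {
		if s == stateMounted {
			return false
		}
	}
	return true
}

// allUnmountedStrict is one possible stricter check: anything that is not
// positively unmounted (including "uncertain") keeps the caller waiting.
func allUnmountedStrict(vols map[string]mountState) bool {
	for _, s := range vols {
		if s != stateUnmounted {
			return false
		}
	}
	return true
}

func main() {
	// TearDown failed (e.g. unlinkat: operation not permitted), so the
	// emptydir volume is marked uncertain rather than unmounted.
	vols := map[string]mountState{"cache-volume": stateUncertain}

	fmt.Println(allUnmountedLoose(vols))  // true  -> a WaitForUnmount-style poll returns early
	fmt.Println(allUnmountedStrict(vols)) // false -> the caller would keep waiting
}
```

The strict variant corresponds to the behavior expected below: block (or return an error) until the emptydir is actually unmounted.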

What did you expect to happen?

I expected volumeManager.WaitForUnmount to block (or return an error) if the emptydir failed to unmount.

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a kind cluster (I used the v1.25.2 node image):
kind_config="$(cat << EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  ipFamily: ipv4
nodes:
# the control plane node
- role: control-plane
- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        v: "4"
EOF
)"

kind create cluster --config <(printf '%s\n' "${kind_config}") --image kindest/node:v1.25.2@sha256:9be91e9e9cdf116809841fc77ebdb8845443c4c72fe5218f3ae9eb57fdb4bace

Create the following pod:

# test-pd.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pd
  namespace: default
spec:
  containers:
  - image: busybox
    command:
      - sleep
      - "999999"
    name: test-container
    volumeMounts:
    - mountPath: /cache
      name: cache-volume
  volumes:
  - name: cache-volume
    emptyDir: {}
$ kubectl apply -f test-pd.yaml

Get the pod UID:

$ kubectl get pod test-pd -o json | jq .metadata.uid
"5a8ef092-2173-4175-9557-947b73f9a8f9"

Enter the kind-worker node and run chattr +i on the emptydir. This makes the emptydir volume directory immutable and prevents it from being unmounted (and deleted); a standalone sketch of this failure mode follows the commands below.

$ docker exec -it kind-worker /bin/bash
root@kind-worker:/# cd /var/lib/kubelet/pods/5a8ef092-2173-4175-9557-947b73f9a8f9/volumes/kubernetes.io~empty-dir
root@kind-worker:/var/lib/kubelet/pods/5a8ef092-2173-4175-9557-947b73f9a8f9/volumes/kubernetes.io~empty-dir#
root@kind-worker:/var/lib/kubelet/pods/5a8ef092-2173-4175-9557-947b73f9a8f9/volumes/kubernetes.io~empty-dir# chattr +i cache-volume/
root@kind-worker:/var/lib/kubelet/pods/5a8ef092-2173-4175-9557-947b73f9a8f9/volumes/kubernetes.io~empty-dir# lsattr
----i---------e------- ./cache-volume
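
As a side note, here is a small standalone Go sketch (hypothetical path, not kubelet code) showing why the immutable bit breaks cleanup: removing an immutable directory fails with the same `operation not permitted` error that appears in the kubelet logs below.

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Assumes the directory was made immutable beforehand, e.g.:
	//   mkdir /tmp/imm && sudo chattr +i /tmp/imm
	if err := os.RemoveAll("/tmp/imm"); err != nil {
		// Typically a *PathError such as:
		//   unlinkat /tmp/imm: operation not permitted
		fmt.Println("RemoveAll failed:", err)
	}
}
```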

Now delete the pod:

kubectl delete pod test-pd --grace-period=5

Here's the logs:

Kubelet logs: https://gist.github.com/4f9b1fa0edda8d260c90a1f18c9dc6e5
Kubelet logs for the test-pd pod: https://gist.github.com/d63b8713b71bef1e712ca138fcb5d602

The notable logs:

# pod started
Nov 01 01:53:33 kind-worker kubelet[275]: I1101 01:53:33.808887     275 kubelet.go:2096] "SyncLoop ADD" source="api" pods=[default/test-pd]
Nov 01 01:53:33 kind-worker kubelet[275]: I1101 01:53:33.809011     275 pod_workers.go:585] "Pod is being synced for the first time" pod="default/test-pd" podUID=5a8ef092-2173-4175-9557-947b73f9a8f9

Termination (syncTerminatingPod):

Nov 01 01:56:34 kind-worker kubelet[275]: I1101 01:56:34.605701     275 kubelet.go:1765] "syncTerminatingPod enter" pod="default/test-pd" podUID=5a8ef092-2173-4175-9557-947b73f9a8f9
Nov 01 01:56:34 kind-worker kubelet[275]: I1101 01:56:34.605720     275 kubelet_pods.go:1447] "Generating pod status" pod="default/test-pd"
Nov 01 01:56:34 kind-worker kubelet[275]: I1101 01:56:34.605789     275 kubelet_pods.go:1457] "Got phase for pod" pod="default/test-pd" oldPhase=Running phase=Running
Nov 01 01:56:34 kind-worker kubelet[275]: I1101 01:56:34.605918     275 kubelet.go:1795] "Pod terminating with grace period" pod="default/test-pd" podUID=5a8ef092-2173-4175-9557-947b73f9a8f9 gracePeriod=5
Nov 01 01:56:34 kind-worker kubelet[275]: I1101 01:56:34.605970     275 kuberuntime_container.go:698] "Killing container with a grace period override" pod="default/test-pd" podUID=5a8ef092-2173-4175-9557-947b73f9a8f9 containerName="test-container" containerID="containerd://d3e6225e41896d5f4fc90a42840ef3d24bb35482676a2be7d4a61af6c6b4ea9d" gracePeriod=5
Nov 01 01:56:34 kind-worker kubelet[275]: I1101 01:56:34.605997     275 kuberuntime_container.go:702] "Killing container with a grace period" pod="default/test-pd" podUID=5a8ef092-2173-4175-9557-947b73f9a8f9 containerName="test-container" containerID="containerd://d3e6225e41896d5f4fc90a42840ef3d24bb35482676a2be7d4a61af6c6b4ea9d" gracePeriod=5
Nov 01 01:56:34 kind-worker kubelet[275]: I1101 01:56:34.606148     275 event.go:294] "Event occurred" object="default/test-pd" fieldPath="spec.containers{test-container}" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container test-container"
Nov 01 01:56:34 kind-worker kubelet[275]: I1101 01:56:34.617051     275 status_manager.go:688] "Patch status for pod" pod="default/test-pd" patch="{\"metadata\":{\"uid\":\"5a8ef092-2173-4175-9557-947b73f9a8f9\"}}"
Nov 01 01:56:34 kind-worker kubelet[275]: I1101 01:56:34.617099     275 status_manager.go:695] "Status for pod is up-to-date" pod="default/test-pd" statusVersion=3
Nov 01 01:56:34 kind-worker kubelet[275]: I1101 01:56:34.617120     275 kubelet_pods.go:938] "Pod is terminated, but some containers are still running" pod="default/test-pd"
Nov 01 01:56:39 kind-worker kubelet[275]: I1101 01:56:39.815024     275 kuberuntime_container.go:711] "Container exited normally" pod="default/test-pd" podUID=5a8ef092-2173-4175-9557-947b73f9a8f9 containerName="test-container" containerID="containerd://d3e6225e41896d5f4fc90a42840ef3d24bb35482676a2be7d4a61af6c6b4ea9d"

...

The volume was not unmounted (`operation not permitted`) due to `chattr +i`:

Nov 01 01:56:40 kind-worker kubelet[275]: E1101 01:56:40.075488     275 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/empty-dir/5a8ef092-2173-4175-9557-947b73f9a8f9-cache-volume podName:5a8ef092-2173-4175-9557-947b73f9a8f9 nodeName:}" failed. No retries permitted until 2022-11-01 01:56:40.575457522 +0000 UTC m=+246.937181939 (durationBeforeRetry 500ms). Error: UnmountVolume.TearDown failed for volume "cache-volume" (UniqueName: "kubernetes.io/empty-dir/5a8ef092-2173-4175-9557-947b73f9a8f9-cache-volume") pod "5a8ef092-2173-4175-9557-947b73f9a8f9" (UID: "5a8ef092-2173-4175-9557-947b73f9a8f9") : unlinkat /var/lib/kubelet/pods/5a8ef092-2173-4175-9557-947b73f9a8f9/volumes/kubernetes.io~empty-dir/cache-volume: operation not permitted

But syncTerminatedPod succeeded despite the volume not actually being unmounted, because WaitForUnmount incorrectly reported success:

Nov 01 01:56:40 kind-worker kubelet[275]: I1101 01:56:40.324119     275 kubelet.go:1863] "syncTerminatedPod enter" pod="default/test-pd" podUID=5a8ef092-2173-4175-9557-947b73f9a8f9
Nov 01 01:56:40 kind-worker kubelet[275]: I1101 01:56:40.324148     275 kubelet_pods.go:1447] "Generating pod status" pod="default/test-pd"
Nov 01 01:56:40 kind-worker kubelet[275]: I1101 01:56:40.324176     275 kubelet.go:2134] "SyncLoop (PLEG): event for pod" pod="default/test-pd" event=&{ID:5a8ef092-2173-4175-9557-947b73f9a8f9 Type:ContainerDied Data:d3e6225e41896d5f4fc90a42840ef3d24bb35482676a2be7d4a61af6c6b4ea9d}
Nov 01 01:56:40 kind-worker kubelet[275]: I1101 01:56:40.324199     275 kubelet_pods.go:1457] "Got phase for pod" pod="default/test-pd" oldPhase=Running phase=Running
Nov 01 01:56:40 kind-worker kubelet[275]: I1101 01:56:40.324243     275 kubelet.go:2134] "SyncLoop (PLEG): event for pod" pod="default/test-pd" event=&{ID:5a8ef092-2173-4175-9557-947b73f9a8f9 Type:ContainerDied Data:a34e590797c764b7bccf00fe20688406ad15474b2d02287703625e18125d9c43}
Nov 01 01:56:40 kind-worker kubelet[275]: I1101 01:56:40.324282     275 volume_manager.go:448] "Waiting for volumes to unmount for pod" pod="default/test-pd"

--> This shows `WaitForUnmount` succeeded, even though the volume was never actually unmounted:

Nov 01 01:56:40 kind-worker kubelet[275]: I1101 01:56:40.324432     275 volume_manager.go:475] "All volumes are unmounted for pod" pod="default/test-pd"
Nov 01 01:56:40 kind-worker kubelet[275]: I1101 01:56:40.324461     275 kubelet.go:1876] "Pod termination unmounted volumes" pod="default/test-pd" podUID=5a8ef092-2173-4175-9557-947b73f9a8f9

...

Nov 01 01:56:40 kind-worker kubelet[275]: I1101 01:56:40.333021     275 kubelet.go:1897] "Pod termination removed cgroups" pod="default/test-pd" podUID=5a8ef092-2173-4175-9557-947b73f9a8f9
Nov 01 01:56:40 kind-worker kubelet[275]: I1101 01:56:40.333135     275 kubelet.go:1904] "Pod is terminated and will need no more status updates" pod="default/test-pd" podUID=5a8ef092-2173-4175-9557-947b73f9a8f9
Nov 01 01:56:40 kind-worker kubelet[275]: I1101 01:56:40.333158     275 kubelet.go:1906] "syncTerminatedPod exit" pod="default/test-pd" podUID=5a8ef092-2173-4175-9557-947b73f9a8f9

The kubelet keeps retrying the unmount later:

Nov 01 01:57:11 kind-worker kubelet[275]: E1101 01:57:11.691319     275 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/empty-dir/5a8ef092-2173-4175-9557-947b73f9a8f9-cache-volume podName:5a8ef092-2173-4175-9557-947b73f9a8f9 nodeName:}" failed. No retries permitted until 2022-11-01 01:57:43.691274244 +0000 UTC m=+310.052998673 (durationBeforeRetry 32s). Error: UnmountVolume.TearDown failed for volume "cache-volume" (UniqueName: "kubernetes.io/empty-dir/5a8ef092-2173-4175-9557-947b73f9a8f9-cache-volume") pod "5a8ef092-2173-4175-9557-947b73f9a8f9" (UID: "5a8ef092-2173-4175-9557-947b73f9a8f9") : unlinkat /var/lib/kubelet/pods/5a8ef092-2173-4175-9557-947b73f9a8f9/volumes/kubernetes.io~empty-dir/cache-volume: operation not permitted
...
Nov 01 01:57:43 kind-worker kubelet[275]: I1101 01:57:43.789411     275 kubelet_pods.go:949] "Pod is terminated, but some volumes have not been cleaned up" pod="default/test-pd"

Anything else we need to know?

No response

Kubernetes version

1.25.2

Cloud provider

n/a

OS version


Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

bobbypage commented 1 year ago

/sig storage
/sig node

/cc @msau42

bobbypage commented 1 year ago

/kind bug

SergeyKanzhelev commented 1 year ago

/triage accepted
/priority important-longterm

doesn't look like a regression

/cc @xmcqueen

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

msau42 commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 8 months ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

msau42 commented 2 months ago

/remove lifecycle-rotten

BenTheElder commented 2 months ago

Another one for https://github.com/kubernetes/test-infra/issues/32957

This issue looks like it may be tricky to solve but remains a source of flakes and should stay tracked.