kubernetes-sigs / vsphere-csi-driver

vSphere storage Container Storage Interface (CSI) plugin
https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/index.html
Apache License 2.0

PVCs are "Bound" via dynamic provisioning -- MountVolume.MountDevice failed / not attached to node #2954

Open wiredcolony opened 1 month ago

wiredcolony commented 1 month ago

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: I am unable to get my PVCs to mount in their respective pods. The volumes are present under "Container Volumes" in vSphere, and the PVCs & PVs are "Bound", so provisioning doesn't seem to be the issue. I have the VM advanced parameters "disk.EnableUUID = TRUE" & "ctkEnabled = FALSE". The storage class used in K8s is set as the default and uses a custom vSphere storage policy, with "allowVolumeExpansion: true".

What you expected to happen: After the PVC is bound, I expect the PV to mount in the pod so the pod's containers leave the "ContainerCreating" state.

How to reproduce it (as minimally and precisely as possible): create storage class -> create PVC -> create pod / run a Helm chart requiring PVs
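The repro steps above can be sketched as two manifests. This is a minimal illustration, not the reporter's actual config: the names and the storage-policy name ("k8s-policy") are placeholders, and the parameters shown ("is-default-class", "allowVolumeExpansion") match what the report describes:

```yaml
# Hypothetical default StorageClass backed by a custom vSphere storage policy.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsphere-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: csi.vsphere.vmware.com
allowVolumeExpansion: true
parameters:
  storagepolicyname: "k8s-policy"   # placeholder policy name
---
# A PVC bound to that class; the pod then references task-pv-claim.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: task-pv-claim
spec:
  storageClassName: vsphere-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```

In the failure mode described here, both objects reach "Bound"; the error only appears once a pod tries to use the claim.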

Anything else we need to know?: vSphere file services are enabled.

Logs kubelet:

8054 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/csi.vsphere.vmware.com^fd5422b7-29b8-4d5c-aa36-a74eb2678b94 podName: nodeName:}" failed. No retries permitted until 2024-07-16 23:55:13.605158725 +0000 UTC m=+1236.629690811 (durationBeforeRetry 2m2s). Error: MountVolume.MountDevice failed for volume "pvc-4490fddc-6cb1-4305-8ec3-3d81b22e0a45" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^fd5422b7-29b8-4d5c-aa36-a74eb2678b94") pod "task-pv-pod" (UID: "11fd6ace-c0c7-4ac8-930a-679d2b10e778") : rpc error: code = NotFound desc = disk: 6000c29ba0cd1d1581a00e6f2f53df29 not attached to node
Jul 16 23:54:45 kadm-prac-wk-1 kubelet[8054]: E0716 23:54:45.056752    8054 pod_workers.go:1298] "Error syncing pod, skipping" err="unmounted volumes=[task-pv-storage], unattached volumes=[], failed to process volumes=[]: context deadline exceeded" pod="default/task-pv-pod" podUID="11fd6ace-c0c7-4ac8-930a-679d2b10e778"
Jul 16 23:55:13 kadm-prac-wk-1 kubelet[8054]: I0716 23:55:13.697301    8054 operation_generator.go:622] "MountVolume.WaitForAttach entering for volume \"pvc-4490fddc-6cb1-4305-8ec3-3d81b22e0a45\" (UniqueName: \"kubernetes.io/csi/csi.vsphere.vmware.com^fd5422b7-29b8-4d5c-aa36-a74eb2678b94\") pod \"task-pv-pod\" (UID: \"11fd6ace-c0c7-4ac8-930a-679d2b10e778\") DevicePath \"\"" pod="default/task-pv-pod"
Jul 16 23:55:13 kadm-prac-wk-1 kubelet[8054]: I0716 23:55:13.698733    8054 operation_generator.go:632] "MountVolume.WaitForAttach succeeded for volume \"pvc-4490fddc-6cb1-4305-8ec3-3d81b22e0a45\" (UniqueName: \"kubernetes.io/csi/csi.vsphere.vmware.com^fd5422b7-29b8-4d5c-aa36-a74eb2678b94\") pod \"task-pv-pod\" (UID: \"11fd6ace-c0c7-4ac8-930a-679d2b10e778\") DevicePath \"csi-adc9f2a040f8d03ab6654b5346eabd7a77d6ec61532fe9babc0abfc38b2c08f9\"" pod="default/task-pv-pod"
Jul 16 23:55:13 kadm-prac-wk-1 kubelet[8054]: E0716 23:55:13.701674    8054 csi_attacher.go:366] kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = NotFound desc = disk: 6000c29ba0cd1d1581a00e6f2f53df29 not attached to node

Logs CSI-Controller:

1 reflector.go:800] k8s.io/client-go/informers/factory.go:159: Watch close - *v1.PersistentVolume total 10 items received
I0716 23:54:35.994760       1 reflector.go:800] k8s.io/client-go/informers/factory.go:159: Watch close - *v1.CSINode total 9 items received
I0716 23:54:58.619468       1 csi_handler.go:123] Reconciling VolumeAttachments with driver backend state
I0716 23:54:58.633808       1 csi_handler.go:740] Found NodeID 420d48db-149f-3532-6595-40ccba2ed96a in CSINode kadm-prac-wk-1
I0716 23:55:58.633874       1 csi_handler.go:123] Reconciling VolumeAttachments with driver backend state
I0716 23:55:58.647525       1 csi_handler.go:740] Found NodeID 420d48db-149f-3532-6595-40ccba2ed96a in CSINode kadm-prac-wk-1
I0716 23:56:58.647774       1 csi_handler.go:123] Reconciling VolumeAttachments with driver backend state
I0716 23:56:58.661844       1 csi_handler.go:740] Found NodeID 420d48db-149f-3532-6595-40ccba2ed96a in CSINode kadm-prac-wk-1
I0716 23:57:57.991286       1 reflector.go:377] k8s.io/client-go/informers/factory.go:159: forcing resync
I0716 23:57:57.991332       1 controller.go:210] Started VA processing "csi-adc9f2a040f8d03ab6654b5346eabd7a77d6ec61532fe9babc0abfc38b2c08f9"
I0716 23:57:57.991343       1 csi_handler.go:224] CSIHandler: processing VA "csi-adc9f2a040f8d03ab6654b5346eabd7a77d6ec61532fe9babc0abfc38b2c08f9"
I0716 23:57:57.991349       1 csi_handler.go:246] "csi-adc9f2a040f8d03ab6654b5346eabd7a77d6ec61532fe9babc0abfc38b2c08f9" is already attached
I0716 23:57:57.991353       1 csi_handler.go:240] CSIHandler: finished processing "csi-adc9f2a040f8d03ab6654b5346eabd7a77d6ec61532fe9babc0abfc38b2c08f9"
I0716 23:57:57.992445       1 reflector.go:377] k8s.io/client-go/informers/factory.go:159: forcing resync
I0716 23:57:58.661952       1 csi_handler.go:123] Reconciling VolumeAttachments with driver backend state
I0716 23:57:58.675563       1 csi_handler.go:740] Found NodeID 420d48db-149f-3532-6595-40ccba2ed96a in CSINode kadm-prac-wk-1

Logs CSI Node (I just noticed the errors here; I am not sure what they mean):

I0716 23:17:42.078188       1 main.go:135] Version: v2.10.1
I0716 23:17:42.078253       1 main.go:136] Running node-driver-registrar in mode=
I0716 23:17:42.078258       1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0716 23:17:42.082382       1 connection.go:215] Connecting to unix:///csi/csi.sock
I0716 23:17:44.263539       1 main.go:164] Calling CSI driver to discover driver name
I0716 23:17:44.263563       1 connection.go:244] GRPC call: /csi.v1.Identity/GetPluginInfo
I0716 23:17:44.263566       1 connection.go:245] GRPC request: {}
I0716 23:17:44.266892       1 connection.go:251] GRPC response: {"name":"csi.vsphere.vmware.com","vendor_version":"v3.3.0"}
I0716 23:17:44.266903       1 connection.go:252] GRPC error: <nil>
I0716 23:17:44.266908       1 main.go:173] CSI driver name: "csi.vsphere.vmware.com"
I0716 23:17:44.266924       1 node_register.go:55] Starting Registration Server at: /registration/csi.vsphere.vmware.com-reg.sock
I0716 23:17:44.267028       1 node_register.go:64] Registration Server started at: /registration/csi.vsphere.vmware.com-reg.sock
I0716 23:17:44.267144       1 node_register.go:88] Skipping HTTP server because endpoint is set to: ""
I0716 23:17:44.462797       1 main.go:90] Received GetInfo call: &InfoRequest{}
I0716 23:17:58.301288       1 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
I0716 23:34:37.095442       1 main.go:90] Received GetInfo call: &InfoRequest{}
I0716 23:34:37.124833       1 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

Environment:

Thank you.

wiredcolony commented 1 month ago

I just finished creating another cluster with the vSphere CPI and CSI. All versions are within version-skew limits, but I am still getting a similar issue to the one above: all PVCs bind, but I still get a mounting error.

Here is the test pod events:

Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Normal   Scheduled               7m22s                 default-scheduler        Successfully assigned default/task-pv-pod to kadm-wk-1-a
  Normal   SuccessfulAttachVolume  7m21s                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-00ecf48b-5f4d-4998-9a3b-8e5a40713b08"
  Warning  FailedMount             64s (x11 over 7m16s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-00ecf48b-5f4d-4998-9a3b-8e5a40713b08" : rpc error: code = NotFound desc = disk: 6000c295272288d128b6e94eed08a9c9 not attached to node

I suspected permission issues, so I gave admin privileges to the user. I don't know what else to try. Thanks in advance for the help.

P.S. The Kubernetes version in the new cluster is now 1.30.3.

wiredcolony commented 1 month ago

Update: CSI version is 3.3.1.

UPDATE: I was able to rule out vSphere-side issues (permissions, configuration, etc.) by successfully creating, attaching, and mounting a PVC/PV to a pod in a cluster built with Rancher (RKE, 1.28.11).

I am trying to figure out what to do next. I noticed that in the Rancher cluster the CSI runs 3 controllers, whereas my cluster runs only 1. I tried running a couple more to mimic Rancher, but I believe I need to add 2 more control-plane nodes for that to work (I tried already, and the extra pods stay in a "Pending" state). That is odd, because Rancher runs 3 controllers even though that cluster also has only one control-plane node.

Next steps: Going to add 2 control-plane nodes and run 2 more replicas of the CSI-controller for a total of 3. Will see if that has any effect.

wiredcolony commented 1 month ago

Still curious about this issue; I would like some guidance on how to troubleshoot or resolve it. Adding two more control-plane nodes and CSI controllers had zero effect. Thank you.

zeddit commented 1 month ago

@wiredcolony I am facing the same issue; have you found a solution? I am running RKE2 1.28.11 with vsphere-csi version rancher-vsphere-csi-3.1.2-rancher400 (3.1.2-rancher4).

wiredcolony commented 1 month ago

@wiredcolony I am facing the same issue; have you found a solution?

I am running RKE2 1.28.11 with vsphere-csi version rancher-vsphere-csi-3.1.2-rancher400 (3.1.2-rancher4)

I have not. However, my issue is with vanilla K8s via kubeadm.

I also have a Rancher server, and PV provisioning and mounting worked perfectly there with the same vSphere setup. I tested it with RKE2 and the embedded Rancher VMware CSI/CPI (not the Rancher apps/Helm CSI/CPI).

Double-check your vSphere permissions, or try using the vSphere admin username and password to rule out permission issues quickly: if the admin account works, you know it is a permission issue.

Also, make sure the username is the full username, "username@vsphere.local", when entering it in Rancher.
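For reference, the fully qualified username also goes into the vSphere CSI driver's config secret on vanilla clusters. A minimal sketch of that csi-vsphere.conf file follows; the vCenter hostname, datacenter, and cluster-id are placeholders, not values from this thread:

```ini
[Global]
cluster-id = "my-cluster"          ; placeholder, must be unique per cluster

[VirtualCenter "vcenter.example.com"]
insecure-flag = "true"             ; only for labs without valid certs
user = "administrator@vsphere.local"
password = "REPLACE_ME"
port = "443"
datacenters = "DC1"
```

Note the user value is the full "user@domain" form; a bare "administrator" will fail authentication even when the password is correct.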

Hopefully someone from the CSI team takes the case. I would love to figure out what is going on. Let me know if you have any questions or get it working. Thanks!

zeddit commented 1 month ago

@wiredcolony thanks very much for your patient reply; I learned a lot! My problem is solved: the root cause was that I hadn't given my VMs the disk.enableUUID=TRUE option.

Also, have you ever tried a hybrid cluster, by which I mean one where some nodes are not running on top of vSphere? Could those nodes be added to the k8s cluster and use the vsphere-csi storage class? If so, are there any network-connectivity requirements for those non-vSphere nodes? Thanks again.

wiredcolony commented 1 month ago

@wiredcolony thanks very much for your patient reply; I learned a lot! My problem is solved: the root cause was that I hadn't given my VMs the disk.enableUUID=TRUE option.

Also, have you ever tried a hybrid cluster, by which I mean one where some nodes are not running on top of vSphere? Could those nodes be added to the k8s cluster and use the vsphere-csi storage class? If so, are there any network-connectivity requirements for those non-vSphere nodes? Thanks again.

Glad you got things up and running. A missing disk.enableUUID=TRUE will definitely cause mounting issues. All my nodes are part of VMware. I thought the issue might be versioning, but it persists. Hopefully someone can give me some troubleshooting tips soon.

pbs-jyu commented 1 month ago

I'm having the same problem with "MountVolume.MountDevice failed for volume "pvc-xx" : rpc error: code = NotFound desc = disk: xxx not attached to node". I have not figured out why yet, but I found something that may lead to a fix. While testing a cluster restore with Velero, I ended up having to restore the cluster nodes from a node/VM backup, and I saw that the disk the error complains about (disk: xxx) was actually still attached to the original, bad cluster nodes/VMs. I had also used the same datastore for two different K8s clusters for testing.

pbs-jyu commented 1 month ago

I also saw the same issue here: https://www.reddit.com/r/kubernetes/comments/1e8v7nr/vsphere_cpicsi_error_pod_event/

pbs-jyu commented 1 month ago

The following fixed part of my problem.

kubectl describe pod showed:

Warning FailedMount 2m50s (x781 over 26h) kubelet MountVolume.MountDevice failed for volume "pvc-xx9" : rpc error: code = NotFound desc = disk: xx not attached to node

kubectl describe volumeattachments csi-xxx showed:

Status: Attach Error: Message: rpc error: code = Internal desc = failed to attach disk: "sss" with node: "xx" err failed to attach cns volume: ... "Volume with ID xx is not registered as CNS Volume. Error message: The input volume xx is not registered as a CNS volume."

The fix:

1. kubectl get volumeattachments
2. kubectl edit volumeattachments csi-xxx and remove the finalizer:

   finalizers:
     • external-attacher/csi-vsphere-vmware-com

3. kubectl delete volumeattachments csi-xxx

The volume attachment will then be recreated automatically. Run kubectl describe volumeattachments csi-xxx again; you should not see the errors any more.

Reference: https://knowledge.broadcom.com/external/article/327470/persistent-volumes-cannot-attach-to-a-ne.html

pbs-jyu commented 1 month ago

Actually, kubectl delete volumeattachments csi-xxx did not work; the error comes back again. I am not sure why the volumes sometimes show under datastore > Monitor > Container Volumes in vCenter and sometimes disappear on their own while I do nothing.

wiredcolony commented 2 days ago

My issue was solved by making sure the VM advanced parameters include "disk.EnableUUID = TRUE". Mine was set to "disk.EnabledUUID = TRUE" (note 'Enabled' instead of 'Enable'). Credit to reddit user PlexingtonSteel @ https://www.reddit.com/r/kubernetes/comments/1e8v7nr/comment/llbcsko/?context=3.
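For anyone hitting the same thing: the exact key spelling matters. Set on each node VM (VM > Edit Settings > Advanced Parameters, or the .vmx file), the parameter should read:

```
disk.EnableUUID = "TRUE"
```

A misspelled key such as disk.EnabledUUID is simply ignored rather than rejected, so the guest never sees the disk's UUID as its serial number and the CSI node plugin cannot match the attached disk to the volume, which is consistent with the "disk: ... not attached to node" error in this thread.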