rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

MountDevice failed for volume pvc-f631... An operation with the given Volume ID already exists #4896

Open NicolaiSchmid opened 4 years ago

NicolaiSchmid commented 4 years ago

Is this a bug report or feature request?

Deviation from expected behavior: Kubernetes tries to attach the PVC to a pod and fails:

  Normal   SuccessfulAttachVolume  25m                  attachdetach-controller       AttachVolume.Attach succeeded for volume "pvc-f631ef53-35d6-438b-a496-d2ba77adb57d"
  Warning  FailedMount             23m                  kubelet, node3  MountVolume.MountDevice failed for volume "pvc-f631ef53-35d6-438b-a496-d2ba77adb57d" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             4m59s (x5 over 18m)  kubelet, node3  Unable to attach or mount volumes: unmounted volumes=[volume], unattached volumes=[volume default-token-4dbg8]: timed out waiting for the condition
  Warning  FailedMount             2m41s (x5 over 23m)  kubelet, node3  Unable to attach or mount volumes: unmounted volumes=[volume], unattached volumes=[default-token-4dbg8 volume]: timed out waiting for the condition
  Warning  FailedMount             32s (x18 over 23m)   kubelet, node3  MountVolume.MountDevice failed for volume "pvc-f631ef53-35d6-438b-a496-d2ba77adb57d" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000001-3e7b0d61-5335-11ea-a0a0-3e8b30a597e0 already exists

On other nodes in the cluster, the attach and mount work as expected.

How to reproduce it (minimal and precise):

Create an example cluster with an rbd-csi StorageClass, then create a PVC and a pod attaching that PVC. I think the issue lies somewhere in mismatched configuration, software, kernel modules, etc.
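For reference, a minimal reproducer along these lines might look like the sketch below (the StorageClass name rook-ceph-block and the pinned node name node3 are assumptions; substitute your own rbd-csi class and the node that fails to mount):

# Assumes the Rook example RBD StorageClass "rook-ceph-block" exists.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-rbd-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-ceph-block
---
apiVersion: v1
kind: Pod
metadata:
  name: test-rbd-pod
spec:
  nodeName: node3   # pin the pod to the node that fails to mount
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: volume
      mountPath: /data
  volumes:
  - name: volume
    persistentVolumeClaim:
      claimName: test-rbd-pvc
EOF

# Watch the attach/mount events for the pod:
kubectl describe pod test-rbd-pod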

Environment (of the node trying to mount):

efenex commented 2 years ago

Wanted to add that this is on Ubuntu 20.04.3 LTS. I checked for any active firewalling just in case, but there is nothing besides the default Docker/Kubernetes rules; every chain policy is set to ACCEPT. I'm including the complete rule set here in case the KUBE-FIREWALL drop rules prove relevant (unlikely, since they have no hits):

Chain INPUT (policy ACCEPT 4002 packets, 2255K bytes)
 pkts bytes target     prot opt in     out     source               destination         
1395K  113M ACCEPT     udp  --  *      *       0.0.0.0/0            169.254.25.10        udp dpt:53
    0     0 ACCEPT     tcp  --  *      *       0.0.0.0/0            169.254.25.10        tcp dpt:53
  51M   27G KUBE-NODE-PORT  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes health check rules */
  51M   27G KUBE-FIREWALL  all  --  *      *       0.0.0.0/0            0.0.0.0/0           

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
  31M   34G KUBE-FORWARD  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding rules */
 247K   15M ACCEPT     all  --  *      *       10.233.64.0/18       0.0.0.0/0           
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            10.233.64.0/18      

Chain OUTPUT (policy ACCEPT 3737 packets, 1570K bytes)
 pkts bytes target     prot opt in     out     source               destination         
1395K  231M ACCEPT     udp  --  *      *       169.254.25.10        0.0.0.0/0            udp spt:53
    0     0 ACCEPT     tcp  --  *      *       169.254.25.10        0.0.0.0/0            tcp spt:53
  50M   19G KUBE-FIREWALL  all  --  *      *       0.0.0.0/0            0.0.0.0/0           

Chain DOCKER (0 references)
 pkts bytes target     prot opt in     out     source               destination         

Chain DOCKER-ISOLATION-STAGE-1 (0 references)
 pkts bytes target     prot opt in     out     source               destination         

Chain DOCKER-ISOLATION-STAGE-2 (0 references)
 pkts bytes target     prot opt in     out     source               destination         

Chain DOCKER-USER (0 references)
 pkts bytes target     prot opt in     out     source               destination         

Chain KUBE-FIREWALL (2 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000
    0     0 DROP       all  --  *      *      !127.0.0.0/8          127.0.0.0/8          /* block incoming localnet connections */ ! ctstate RELATED,ESTABLISHED,DNAT

Chain KUBE-FORWARD (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding rules */ mark match 0x4000/0x4000
 2171 2238K ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding conntrack rule */ ctstate RELATED,ESTABLISHED

Chain KUBE-KUBELET-CANARY (0 references)
 pkts bytes target     prot opt in     out     source               destination         

Chain KUBE-NODE-PORT (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* Kubernetes health check node port */ match-set KUBE-HEALTH-CHECK-NODE-PORT dst
github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

mayank-reynencourt commented 2 years ago

Hi all,

I'm also having the same issue with Rook (v1.8.8) and an external Ceph (16.2.7):

controller.go:1337] provision "default/pvc-4" class "rc-fs-storage": started
I0422 17:54:44.196780 1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"pvc-4", UID:"9c09aa0d-cfc3-484c-b536-f3e686e54f52", APIVersion:"v1", ResourceVersion:"71143", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/pvc-4"
W0422 17:54:44.200513 1 controller.go:934] Retrying syncing claim "9c09aa0d-cfc3-484c-b536-f3e686e54f52", failure 5
E0422 17:54:44.200544 1 controller.go:957] error syncing claim "9c09aa0d-cfc3-484c-b536-f3e686e54f52": failed to provision volume with StorageClass "rc-fs-storage": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-9c09aa0d-cfc3-484c-b536-f3e686e54f52 already exists
I0422 17:54:44.200560 1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"pvc-4", UID:"9c09aa0d-cfc3-484c-b536-f3e686e54f52", APIVersion:"v1", ResourceVersion:"71143", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "rc-fs-storage": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-9c09aa0d-cfc3-484c-b536-f3e686e54f52 already exists

Please help.

I checked and there is no connectivity issue between rke2 (Rook) and the external Ceph cluster.

On the Ceph side I observed that I can't run ceph fs subvolumegroup ls rc-mayank-cc-fs.
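In case it helps anyone hitting the CephFS variant of this, these are the checks I would start with, roughly (the rook-ceph namespace, the app=csi-cephfsplugin-provisioner label and the csi-provisioner/csi-cephfsplugin container names assume a default Rook install; rc-mayank-cc-fs is the filesystem name from the comment above):

# Ceph-side checks (run wherever you have admin access to the external cluster):
ceph health detail
ceph fs ls
ceph fs status
ceph mds stat
# The command reported as failing above:
ceph fs subvolumegroup ls rc-mayank-cc-fs

# Provisioner-side logs:
kubectl -n rook-ceph logs -l app=csi-cephfsplugin-provisioner -c csi-provisioner --tail=100
kubectl -n rook-ceph logs -l app=csi-cephfsplugin-provisioner -c csi-cephfsplugin --tail=100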

mancubus77 commented 1 year ago

I had the same issues as others in this thread. I'm not sure what exactly triggers it, but I was not able to deploy ACS on OpenShift. To fix it I had to do:

vyom-soft commented 1 year ago

I am facing this situation now and am not sure what the solution is. Any further updates?

Pivert commented 1 year ago

Hi, in my case the problem was... the firewall! To be more precise: the CSI plugin uses the v2 protocol (msgr2) on port TCP/3300 instead of the legacy protocol on TCP/6789. This took me a while to understand, since all the other clients were using the legacy protocol and working smoothly. I was not using Rook but an external Ceph cluster, and got the same error. The revelation came when I looked at the firewall logs. :-)
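For anyone who wants to verify this quickly, a rough check from the affected node (MON_IP below is a placeholder for one of your monitor addresses):

# List the monitor addresses; v2 endpoints are on 3300, legacy v1 on 6789:
ceph mon dump

# From the node that fails to mount, test both ports:
nc -vz -w 5 MON_IP 6789   # legacy msgr v1
nc -vz -w 5 MON_IP 3300   # msgr v2, which the CSI plugin was using here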

vsoch commented 1 year ago

I hit this problem today too https://github.com/rook/rook/issues/11617

morganchenjp commented 1 year ago

I had the same issue today. My cluster had a planned power outage last night, so we powered off all nodes gracefully. When I rebooted all the nodes today, I got the same issue.

Still no hint how to fix it.

mancubus77 commented 1 year ago

> I had the same issue today. My cluster had a planned power outage last night, so we powered off all nodes gracefully. When I rebooted all the nodes today, I got the same issue.
>
> Still no hint how to fix it.

This can be caused by the OSDs having picked up the wrong IP. Check ceph osd dump and make sure each OSD has the right IP.
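Roughly, something like this (run from the toolbox pod or anywhere with admin ceph CLI access):

# Show each OSD's registered addresses ([v2:IP:port,v1:IP:port]) and confirm
# they match the hosts' real IPs:
ceph osd dump | grep '^osd\.'

# Compare with the addresses actually configured on the node:
ip -4 addr show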

thanhtoan1196 commented 6 months ago

any updates?

zhucan commented 6 months ago

> any updates?

Please check whether the device is mapped. Is there a /dev/rbd* device under /dev?
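A quick way to check, directly on the host that reports the mount error (just a sketch):

# Is any RBD device mapped on this node?
ls -l /dev/rbd* 2>/dev/null
lsblk -o NAME,SIZE,MOUNTPOINT | grep rbd

# Kernel messages from the rbd/libceph modules often show why a map attempt hung:
dmesg | grep -iE 'rbd|libceph' | tail -n 20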

zhucan commented 6 months ago

There is a troubleshooting doc at https://www.mrajanna.com/troubleshooting-cephcsi/; maybe it can help you.

blackliner commented 5 months ago

I had the exact same error messages ("an operation with the given Volume xxx already exists") while also seeing FS_DEGRADED. After restarting all MDS daemons, dealing with a crashed Ganesha NFS pod, and a few hours of waiting without knowing what else to do, FS_DEGRADED vanished and all PVCs mounted again.
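If you are in a similar FS_DEGRADED situation, the checks and MDS restart look roughly like this (the rook-ceph namespace and the app=rook-ceph-mds label are Rook defaults; adjust if yours differ):

# Confirm the filesystem health problem first:
ceph health detail
ceph fs status

# Restart the MDS deployments one at a time and wait for each to come back:
for d in $(kubectl -n rook-ceph get deploy -l app=rook-ceph-mds -o name); do
  kubectl -n rook-ceph rollout restart "$d"
  kubectl -n rook-ceph rollout status "$d" --timeout=300s
done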

66545shiwo commented 5 months ago

It can be solved by deleting all the pods whose names start with csi-. It may be caused by the k8s nodes' time being out of sync.
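For reference, roughly what that looks like (the rook-ceph namespace and the standard Rook CSI labels are assumptions; adjust to your install):

# Delete all Ceph CSI pods; their DaemonSets/Deployments recreate them immediately:
kubectl -n rook-ceph delete pod \
  -l 'app in (csi-rbdplugin,csi-rbdplugin-provisioner,csi-cephfsplugin,csi-cephfsplugin-provisioner)'

# Check that the node clocks are actually in sync:
timedatectl status | grep -i synchronized
chronyc tracking 2>/dev/null || ntpq -p 2>/dev/null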

johanssone commented 4 months ago

This is still an issue, FYI. Zero firewalling is applied in the test cluster at the moment, running Cilium (non-host networking). I've tried all suggestions in this issue with zero luck getting this fixed.

toabi commented 3 months ago

Had this issue this morning. What I did: delete all the csi-* pods (at once), restart all OSD pods (gracefully), and restart all pods that showed the issue. At some point it started working again, but I have no idea what exactly helped; at some point it looked like a network issue was also involved.
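In case it helps the next person: the "restart all OSD pods gracefully" step can be done one deployment at a time so only a single OSD is down at any moment (a sketch, assuming the default rook-ceph namespace and app=rook-ceph-osd label):

for d in $(kubectl -n rook-ceph get deploy -l app=rook-ceph-osd -o name); do
  kubectl -n rook-ceph rollout restart "$d"
  kubectl -n rook-ceph rollout status "$d" --timeout=600s
done

# Watch recovery between/after restarts:
ceph -s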
