psavva closed this issue 2 years ago
By default, a subvolume group called csi is created by Ceph CSI: https://github.com/ceph/ceph-csi/blob/44da7ffb4e4d8861265f3ecd421cd06dfce9f34a/internal/cephfs/volume.go#L90
We recently also added support for multiple subvolume groups: https://github.com/ceph/ceph-csi/pull/1175/files
I will retest on Monday morning with the log outputs
I am facing the same issue with the CephCluster image ceph/ceph:v15.2.4. I am consuming an external Ceph Octopus cluster from OpenShift 4.4.20. While CephBlock PVCs work well, CephFS PVCs hit this problem (setting hostNetwork true/false did not solve it):
failed to provision volume with StorageClass "rook-cephfs": rpc error: code = InvalidArgument desc = an error occurred while running (2855) ceph [-m 10.101.100.177:6789,10.101.100.175:6789,10.101.100.176:6789 --id csi-cephfs-provisioner --keyfile=***stripped*** -c /etc/ceph/ceph.conf fs get myfs --format=json]: exit status 2: Error ENOENT: filesystem 'myfs' not found
I think I just ran into the same issue with git clone --single-branch --branch v1.4.4 https://github.com/rook/rook.git
The PVC events were
Warning ProvisioningFailed 25m rook-ceph.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-7468b6bf56-8np4b_b6b48e17-c648-4784-aa62-bef29106e9b2 failed to provision volume with StorageClass "rook-cephfs": rpc error: code = Internal desc = an error (exit status 2) occurred while running ceph args: [fs subvolume create myfs csi-vol-2c5341c9-fe81-11ea-a0e2-6a1d3f228513 1073741824 --group_name csi --mode 777 -m 172.24.158.166:6789,172.24.189.217:6789,172.24.142.155:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped*** --pool_layout myfs-data0]
Warning ProvisioningFailed 24m rook-ceph.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-7468b6bf56-8np4b_b6b48e17-c648-4784-aa62-bef29106e9b2 failed to provision volume with StorageClass "rook-cephfs": rpc error: code = Internal desc = an error (exit status 2) occurred while running ceph args: [fs subvolume create myfs csi-vol-539bcd63-fe81-11ea-a0e2-6a1d3f228513 1073741824 --group_name csi --mode 777 -m 172.24.158.166:6789,172.24.189.217:6789,172.24.142.155:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped*** --pool_layout myfs-data0]
So I tracked down the csi-cephfsplugin-provisioner pod and had a look at the logs of the csi-cephfsplugin container:
E0924 16:28:10.055734 1 volume.go:179] ID: 489 Req-ID: pvc-4e13b636-aeb2-4831-9a07-8bb3fab3245b failed to create subvolume csi-vol-eec93592-fe82-11ea-a0e2-6a1d3f228513(an error (exit status 2) occurred while running ceph args: [fs subvolume create myfs csi-vol-eec93592-fe82-11ea-a0e2-6a1d3f228513 1073741824 --group_name csi --mode 777 -m 172.24.158.166:6789,172.24.189.217:6789,172.24.142.155:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped*** --pool_layout myfs-data0]) in fs myfs
E0924 16:28:10.055782 1 controllerserver.go:90] ID: 489 Req-ID: pvc-4e13b636-aeb2-4831-9a07-8bb3fab3245b failed to create volume pvc-4e13b636-aeb2-4831-9a07-8bb3fab3245b: an error (exit status 2) occurred while running ceph args: [fs subvolume create myfs csi-vol-eec93592-fe82-11ea-a0e2-6a1d3f228513 1073741824 --group_name csi --mode 777 -m 172.24.158.166:6789,172.24.189.217:6789,172.24.142.155:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped*** --pool_layout myfs-data0]
E0924 16:28:10.078985 1 utils.go:163] ID: 489 Req-ID: pvc-4e13b636-aeb2-4831-9a07-8bb3fab3245b GRPC error: rpc error: code = Internal desc = an error (exit status 2) occurred while running ceph args: [fs subvolume create myfs csi-vol-eec93592-fe82-11ea-a0e2-6a1d3f228513 1073741824 --group_name csi --mode 777 -m 172.24.158.166:6789,172.24.189.217:6789,172.24.142.155:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped*** --pool_layout myfs-data0]
That didn't help me much, but it looked like it was stuck in some way, so I killed the csi-cephfsplugin-provisioner pod, and when it started again everything was fixed up.
I just hit this with ceph-rook v1.4.5. I was pulling out my hair trying to figure out what was wrong, and the workaround in https://github.com/rook/rook/issues/4006#issuecomment-675964120 got me unstuck.
Me, too. Setting: Kubernetes 1.15, Rook v1.5.1, Ceph v15.2.6-20201119, Ceph CSI v3.1.2.
The Ceph CSI provisioner logs show
W1202 15:52:55.186209 1 driver.go:157] EnableGRPCMetrics is deprecated
E1202 15:53:22.843769 1 volume.go:109] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-72d97eab-49ab-11ea-8a6b-42ade5464898 failed to get subvolume info csi-vol-72d97eab-49ab-11ea-8a6b-42ade5464898 in fs cephfs with Error: an error (exit status 2) and stdError (Error ENOENT: subvolume group 'csi' does not exist
) occurred while running ceph args: [fs subvolume info cephfs csi-vol-72d97eab-49ab-11ea-8a6b-42ade5464898 --group_name csi -m 10.36.28.166:6789,10.36.27.73:6789,10.36.28.218:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-node --keyfile=***stripped***]. stdError: Error ENOENT: subvolume group 'csi' does not exist
E1202 15:53:22.843856 1 utils.go:163] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-72d97eab-49ab-11ea-8a6b-42ade5464898 GRPC error: rpc error: code = Internal desc = volume not found
E1202 15:55:26.525559 1 volume.go:109] ID: 8 Req-ID: 0001-0009-rook-ceph-0000000000000001-72d97eab-49ab-11ea-8a6b-42ade5464898 failed to get subvolume info csi-vol-72d97eab-49ab-11ea-8a6b-42ade5464898 in fs cephfs with Error: an error (exit status 2) and stdError (Error ENOENT: subvolume group 'csi' does not exist
) occurred while running ceph args: [fs subvolume info cephfs csi-vol-72d97eab-49ab-11ea-8a6b-42ade5464898 --group_name csi -m 10.36.28.166:6789,10.36.27.73:6789,10.36.28.218:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-node --keyfile=***stripped***]. stdError: Error ENOENT: subvolume group 'csi' does not exist
E1202 15:55:26.525658 1 utils.go:163] ID: 8 Req-ID: 0001-0009-rook-ceph-0000000000000001-72d97eab-49ab-11ea-8a6b-42ade5464898 GRPC error: rpc error: code = Internal desc = volume not found
E1202 15:57:30.183963 1 volume.go:109] ID: 12 Req-ID: 0001-0009-rook-ceph-0000000000000001-72d97eab-49ab-11ea-8a6b-42ade5464898 failed to get subvolume info csi-vol-72d97eab-49ab-11ea-8a6b-42ade5464898 in fs cephfs with Error: an error (exit status 2) and stdError (Error ENOENT: subvolume group 'csi' does not exist
) occurred while running ceph args: [fs subvolume info cephfs csi-vol-72d97eab-49ab-11ea-8a6b-42ade5464898 --group_name csi -m 10.36.28.166:6789,10.36.27.73:6789,10.36.28.218:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-node --keyfile=***stripped***]. stdError: Error ENOENT: subvolume group 'csi' does not exist
E1202 15:57:30.184021 1 utils.go:163] ID: 12 Req-ID: 0001-0009-rook-ceph-0000000000000001-72d97eab-49ab-11ea-8a6b-42ade5464898 GRPC error: rpc error: code = Internal desc = volume not found
E1202 15:59:33.816252 1 volume.go:109] ID: 16 Req-ID: 0001-0009-rook-ceph-0000000000000001-72d97eab-49ab-11ea-8a6b-42ade5464898 failed to get subvolume info csi-vol-72d97eab-49ab-11ea-8a6b-42ade5464898 in fs cephfs with Error: an error (exit status 2) and stdError (Error ENOENT: subvolume group 'csi' does not exist
) occurred while running ceph args: [fs subvolume info cephfs csi-vol-72d97eab-49ab-11ea-8a6b-42ade5464898 --group_name csi -m 10.36.28.166:6789,10.36.27.73:6789,10.36.28.218:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-node --keyfile=***stripped***]. stdError: Error ENOENT: subvolume group 'csi' does not exist
E1202 15:59:33.816321 1 utils.go:163] ID: 16 Req-ID: 0001-0009-rook-ceph-0000000000000001-72d97eab-49ab-11ea-8a6b-42ade5464898 GRPC error: rpc error: code = Internal desc = volume not found
E1202 16:01:37.485937 1 volume.go:109] ID: 20 Req-ID: 0001-0009-rook-ceph-0000000000000001-72d97eab-49ab-11ea-8a6b-42ade5464898 failed to get subvolume info csi-vol-72d97eab-49ab-11ea-8a6b-42ade5464898 in fs cephfs with Error: an error (exit status 2) and stdError (Error ENOENT: subvolume group 'csi' does not exist
) occurred while running ceph args: [fs subvolume info cephfs csi-vol-72d97eab-49ab-11ea-8a6b-42ade5464898 --group_name csi -m 10.36.28.166:6789,10.36.27.73:6789,10.36.28.218:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-node --keyfile=***stripped***]. stdError: Error ENOENT: subvolume group 'csi' does not exist
E1202 16:01:37.486001 1 utils.go:163]
and I am wondering whether Rook failed to set something up, possibly related to these errors:
2020-12-02 16:20:02.545859 I | ceph-spec: ceph-block-pool-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',): exit status 1"}] "2020-12-02T16:19:00Z" "2020-12-02T16:19:00Z" "HEALTH_OK" {%!q(uint64=762339917824) %!q(uint64=123686912000) %!q(uint64=638653005824) "2020-12-02T16:18:21Z"}}
2020-12-02 16:20:03.235925 I | ceph-spec: ceph-file-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',): exit status 1"}] "2020-12-02T16:19:00Z" "2020-12-02T16:19:00Z" "HEALTH_OK" {%!q(uint64=762339917824) %!q(uint64=123686912000) %!q(uint64=638653005824) "2020-12-02T16:18:21Z"}}
Even more interesting, the first of the two desired application Pods got the CephFS volume attached and the application started, while the other Pod is stuck in ContainerCreating:
0s Warning FailedMount pod/my-application-5f4d4d8f6b-slhnn Unable to mount volumes for pod "my-application-5f4d4d8f6b-slhnn_default(0ce97f45-bdee-483f-a458-49fb240ced2a)": timeout expired waiting for volumes to attach or mount for pod "default"/"my-application-5f4d4d8f6b-slhnn". list of unmounted volumes=[vol1 default-token-m5pnx]. list of unattached volumes=[vol1 default-token-m5pnx]
0s Warning FailedMount pod/my-application-5f4d4d8f6b-ztmz2 MountVolume.MountDevice failed for volume "pvc-d5fc19a7-8312-4e3f-a870-204156ffa39c" : rpc error: code = Internal desc = volume not found
Other thought: Regression in Ceph CSI v3.1.2?
Of course, this issue occurs in a production cluster only, not in any dev or lab cluster …
Please let me know if you think that this is a different issue.
If you follow this: https://github.com/rook/rook/issues/4006#issuecomment-593879132
Does the issue go away?
@psavva: No, unfortunately it did not help; I was still facing the problem after deleting the PVC, PV, and StorageClass.
Due to limited time on the production system and the low amount of data, I simply backed up the data (since a single Pod was working, I was able to access the data on that Pod's host), deleted the rook-cephfs StorageClass, deleted the Ceph filesystem, and re-installed everything.
Now everything works again.
By the way, while reviewing my records, I found this in the logs after kubectl delete pvc:
I1204 09:14:06.372275 1 controller.go:1453] delete "pvc-d5fc19a7-8312-4e3f-a870-204156ffa39c": started
I1204 09:14:07.908046 1 controller.go:1468] delete "pvc-d5fc19a7-8312-4e3f-a870-204156ffa39c": volume deleted
I1204 09:14:07.916878 1 controller.go:1518] delete "pvc-d5fc19a7-8312-4e3f-a870-204156ffa39c": persistentvolume deleted
E1204 09:14:07.916906 1 controller.go:1521] couldn't create key for object pvc-d5fc19a7-8312-4e3f-a870-204156ffa39c: object has no meta: object does not implement the Object interfaces
I1204 09:14:07.916932 1 controller.go:1523] delete "pvc-d5fc19a7-8312-4e3f-a870-204156ffa39c": succeeded
E1204 09:14:07.916906 1 controller.go:1521] couldn't create key for object pvc-d5fc19a7-8312-4e3f-a870-204156ffa39c: object has no meta: object does not implement the Object interfaces
This error should not cause any issue; it is a warning. From the logs it looks like the PV and the backend subvolume were deleted.
subvolume group 'csi' does not exist
@stephan2012 cephcsi will not delete the subvolume group once it has been created; I am not sure how the subvolume group got deleted. Can you check ceph fs subvolumegroup ls cephfs on the toolbox pod?
When the cephcsi pod starts, it creates the subvolume group the first time and keeps a flag in memory so that it does not try to create it again. If you hit a "subvolume group not found" error during PVC creation, restarting the provisioner pod helps (on restart, it will create the subvolume group again).
@Madhu-1 I will check, but can only do next week.
I'm hitting the same issue (I think... still digging into it) when following the example in the documentation: when I try creating a PVC, the provisioner just gets stuck in this loop:
W1215 05:22:04.418286 1 controller.go:943] Retrying syncing claim "283c0a58-1e41-4d23-b18a-58923f1c7566", failure 6
E1215 05:22:04.418323 1 controller.go:966] error syncing claim "283c0a58-1e41-4d23-b18a-58923f1c7566": failed to provision volume with StorageClass "rook-cephfs": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-283c0a58-1e41-4d23-b18a-58923f1c7566 already exists
I1215 05:22:04.422917 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"drupal", Name:"drupal-files-pvc", UID:"283c0a58-1e41-4d23-b18a-58923f1c7566", APIVersion:"v1", ResourceVersion:"4570", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "rook-cephfs": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-283c0a58-1e41-4d23-b18a-58923f1c7566 already exists
I'll do some more debugging tomorrow and try to figure out what's going on. Using the latest 1.5 branch, currently.
@Madhu-1 It turned out that I cannot access the affected system anymore this year, so I cannot check ceph fs subvolumegroup ls cephfs for the moment.
I have the same problem: Calico 3.14 on arm64, with kube-proxy in IPVS mode and dual-stack.
I think the root cause is that your filesystem was recreated. cephfs-csi stores a bool variable in memory for each cluster to mark whether a subvolume group has been created. cephfs-csi identifies a cluster via the clusterID field in the StorageClass, which Rook sets to the namespace of the CephCluster. So when the CephFilesystem is rebuilt in the same namespace and the StorageClass is recreated, the variable for that cluster is still true in cephfs-csi, and there is no attempt to recreate the subvolume group.
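The stale-cache failure mode described above can be sketched as follows. This is a minimal illustration in Go (the language Ceph CSI is written in); ensureSubVolumeGroup, createFn, and the created map are hypothetical names chosen for this sketch, not the actual ceph-csi identifiers:

```go
package main

import "fmt"

// created caches, per clusterID, whether the subvolume group has already
// been provisioned. Ceph CSI keeps similar state in process memory, so the
// cache only survives as long as the provisioner pod does.
var created = map[string]bool{}

// ensureSubVolumeGroup invokes createFn only on the first request for a
// given clusterID; every later call for that cluster is a no-op.
func ensureSubVolumeGroup(clusterID string, createFn func()) {
	if created[clusterID] {
		return // cached: the group is assumed to still exist
	}
	createFn()
	created[clusterID] = true
}

func main() {
	calls := 0
	create := func() { calls++ }

	ensureSubVolumeGroup("rook-ceph", create) // first PVC: group created
	ensureSubVolumeGroup("rook-ceph", create) // later PVCs: cache hit

	// If the CephFilesystem is now deleted and recreated out-of-band, the
	// new filesystem has no 'csi' group, but the cache still says it does,
	// so provisioning fails with ENOENT until the pod restarts and the
	// map starts out empty again.
	fmt.Println(calls) // prints 1: only one create for two PVC requests
}
```

Restarting the provisioner pod corresponds to this map being emptied, which is why the workaround mentioned earlier in the thread (killing the csi-cephfsplugin-provisioner pod) makes provisioning succeed again.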
Are you able to reproduce this consistently with logs? If so, please can you attach it here, and hopefully the rook team can have a look, and produce a fix accordingly.
In production, we normally don't expect the admin to delete and recreate the filesystem with the same name. Keeping PVC creation performance in mind, we create the subvolume group only once in the cephcsi driver. If you delete and recreate the filesystem, you need to restart the CSI driver.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
@Madhu-1 Could I please request that the case and workaround are documented? It would keep this issue from resurfacing again.
I have the same error with ceph/ceph:v15.2.13 and Calico.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
Please re-open this bug. We're facing the same issue when recreating CephFS. Recreating the filesystem can be useful during initial Ceph cluster deployment and configuration, when something goes wrong, and this issue makes that more of a headache.
In production, we normally don't expect the admin to delete and recreate the filesystem with the same name. Keeping PVC creation performance in mind, we create the subvolume group only once in the cephcsi driver. If you delete and recreate the filesystem, you need to restart the CSI driver.
Why does RBD CSI work another way then? If I remove the Ceph cluster from the cloud, the CSI RBD provisioner and CSI RBD plugin are removed as well. Then when I create a new Ceph cluster, there is no such error.
@prazumovsky To summarize, the issue is that the csi driver needs to be restarted if the filesystem is re-created, right? Is the request to document this?
Please re-open this bug. We're facing the same issue when recreating CephFS. Recreating the filesystem can be useful during initial Ceph cluster deployment and configuration, when something goes wrong, and this issue makes that more of a headache.
IMO, if the admin recreates the filesystem with the same name, I suggest we just document this. We don't want to check whether the subvolume group exists on every PVC creation, which would impact PVC creation performance.
In production, we normally don't expect the admin to delete and recreate the filesystem with the same name. Keeping PVC creation performance in mind, we create the subvolume group only once in the cephcsi driver. If you delete and recreate the filesystem, you need to restart the CSI driver.
Why does RBD CSI work another way then? If I remove the Ceph cluster from the cloud, the CSI RBD provisioner and CSI RBD plugin are removed as well. Then when I create a new Ceph cluster, there is no such error.
In RBD, we just create the RBD images. For CephFS it is a different case: we create a subvolume group, update the local cache to record that the subvolume group was created (so there is no need to retry creation), and then we create the subvolumes.
Hi,
I'm also facing this same issue when I import an external Ceph (16.2.6) into RKE2 (1.22.4) with Rook v1.8.1, OS version Ubuntu 20.04. I have also disabled ufw on all RKE2 nodes and external Ceph nodes.
root@ip-10-0-0-111:/home/ubuntu/rook/deploy/examples/csi/cephfs# kubectl -n rook-ceph logs csi-cephfsplugin-provisioner-874864dcb-rcfvl -c csi-cephfsplugin
E1230 10:46:29.027085 1 utils.go:185] ID: 61 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:46:33.036776 1 controllerserver.go:172] ID: 62 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:46:33.036828 1 utils.go:185] ID: 62 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:46:41.042477 1 controllerserver.go:172] ID: 63 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:46:41.042531 1 utils.go:185] ID: 63 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:46:57.052598 1 controllerserver.go:172] ID: 64 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:46:57.052647 1 utils.go:185] ID: 64 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:47:27.798255 1 controllerserver.go:172] ID: 66 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:47:27.798302 1 utils.go:185] ID: 66 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:47:29.059140 1 controllerserver.go:172] ID: 67 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:47:29.059188 1 utils.go:185] ID: 67 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:49:37.064327 1 controllerserver.go:172] ID: 70 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:49:37.064373 1 utils.go:185] ID: 70 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:50:04.949580 1 volume.go:163] ID: 54 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a failed to create subvolume group csi, for the vol csi-vol-62fa7f7f-695d-11ec-a830-826f0b6db424: rados: ret=-110, Connection timed out: "error calling ceph_mount"
E1230 10:50:04.949702 1 controllerserver.go:100] ID: 54 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a failed to create volume pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a: rados: ret=-110, Connection timed out: "error calling ceph_mount"
E1230 10:50:04.956644 1 utils.go:185] ID: 54 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a GRPC error: rpc error: code = Internal desc = rados: ret=-110, Connection timed out: "error calling ceph_mount"
E1230 10:53:20.792641 1 controllerserver.go:172] ID: 74 Req-ID: pvc-1b28e07a-0b3a-4b84-8398-7f6d3e65070a an operation with the given Volume ID pvc-1b28e07a-0b3a-4b84-8398-7f6d3e65070a already exists
E1230 10:53:20.792690 1 utils.go:185] ID: 74 Req-ID: pvc-1b28e07a-0b3a-4b84-8398-7f6d3e65070a GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-1b28e07a-0b3a-4b84-8398-7f6d3e65070a already exists
I can provide more details if you want.
Hi,
Related to the above issue, my observation was as follows:
I can successfully create a PVC if I create the CephFS filesystem (volume) directly on the external Ceph cluster using the command below:
ceph fs volume create <FS_NAME>
But when I use the filesystem.yaml below to do the same thing via Rook, I get the same error I described in this link: https://github.com/rook/rook/issues/6183#issuecomment-1002985100
#################################################################################################################
# Create a filesystem with settings with replication enabled for a production environment.
# A minimum of 3 OSDs on different nodes are required in this example.
#   kubectl create -f filesystem.yaml
#################################################################################################################
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs
  namespace: rook-ceph # namespace:cluster
spec:
  # The metadata pool spec. Must use replication.
  metadataPool:
    replicated:
      size: 3
      requireSafeReplicaSize: true
    parameters:
      # Inline compression mode for the data pool
      # Further reference: https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#inline-compression
      compression_mode:
        none
      # gives a hint (%) to Ceph in terms of expected consumption of the total cluster capacity of a given pool
      # for more info: https://docs.ceph.com/docs/master/rados/operations/placement-groups/#specifying-expected-pool-size
      #target_size_ratio: ".5"
  # The list of data pool specs. Can use replication or erasure coding.
  dataPools:
    - name: replicated
      failureDomain: host
      replicated:
        size: 3
        # Disallow setting pool with replica 1, this could lead to data loss without recovery.
        # Make sure you're *ABSOLUTELY CERTAIN* that is what you want
        requireSafeReplicaSize: true
      parameters:
        # Inline compression mode for the data pool
        # Further reference: https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#inline-compression
        compression_mode:
          none
        # gives a hint (%) to Ceph in terms of expected consumption of the total cluster capacity of a given pool
        # for more info: https://docs.ceph.com/docs/master/rados/operations/placement-groups/#specifying-expected-pool-size
        #target_size_ratio: ".5"
  # Whether to preserve filesystem after CephFilesystem CRD deletion
  preserveFilesystemOnDelete: true
  # The metadata service (mds) configuration
  metadataServer:
    # The number of active MDS instances
    activeCount: 1
    # Whether each active MDS instance will have an active standby with a warm metadata cache for faster failover.
    # If false, standbys will be available, but will not have a warm cache.
    activeStandby: true
    # The affinity rules to apply to the mds deployment
    placement:
      # nodeAffinity:
      #   requiredDuringSchedulingIgnoredDuringExecution:
      #     nodeSelectorTerms:
      #       - matchExpressions:
      #           - key: role
      #             operator: In
      #             values:
      #               - mds-node
      # topologySpreadConstraints:
      # tolerations:
      #   - key: mds-node
      #     operator: Exists
      # podAffinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - rook-ceph-mds
            # topologyKey: kubernetes.io/hostname will place MDS across different hosts
            topologyKey: kubernetes.io/hostname
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - rook-ceph-mds
              # topologyKey: */zone can be used to spread MDS across different AZ
              # Use <topologyKey: failure-domain.beta.kubernetes.io/zone> in k8s cluster if your cluster is v1.16 or lower
              # Use <topologyKey: topology.kubernetes.io/zone> in k8s cluster is v1.17 or upper
              topologyKey: topology.kubernetes.io/zone
    # A key/value list of annotations
    annotations:
    #  key: value
    # A key/value list of labels
    labels:
    #  key: value
    resources:
    # The requests and limits set here, allow the filesystem MDS Pod(s) to use half of one CPU core and 1 gigabyte of memory
    #  limits:
    #    cpu: "500m"
    #    memory: "1024Mi"
    #  requests:
    #    cpu: "500m"
    #    memory: "1024Mi"
    # priorityClassName: my-priority-class
  # Filesystem mirroring settings
  # mirroring:
    # enabled: true
    # list of Kubernetes Secrets containing the peer token
    # for more details see: https://docs.ceph.com/en/latest/dev/cephfs-mirroring/#bootstrap-peers
    # peers:
    #   secretNames:
    #     - secondary-cluster-peer
    # specify the schedule(s) on which snapshots should be taken
    # see the official syntax here https://docs.ceph.com/en/latest/cephfs/snap-schedule/#add-and-remove-schedules
    # snapshotSchedules:
    #   - path: /
    #     interval: 24h # daily snapshots
    #     startTime: 11:55
    # manage retention policies
    # see syntax duration here https://docs.ceph.com/en/latest/cephfs/snap-schedule/#add-and-remove-retention-policies
    # snapshotRetention:
    #   - path: /
    #     duration: "h 24"
So to solve this I tried executing the ceph fs volume create command, and my PVC got bound properly.
Is this the right way to solve it?
Also, when I look at the logs of the toolbox job, it shows something like this:
# kubectl -n rook-ceph-external logs rook-ceph-toolbox-job--1-sgk5x
Volume created successfully (no MDS daemons created)
I wanted to ask: do we need a separate MDS assigned to every filesystem we create in Ceph?
IMO, if the admin recreates the filesystem with the same name, I suggest we just document this. We don't want to check whether the subvolume group exists on every PVC creation, which would impact PVC creation performance.
Hi @Madhu-1, what will be the solution for this? Do we need to remember not to create an FS with the same name using the CR, or can we expect a fix for this?
@mayank-reynencourt If you are creating the filesystem using the Rook CRD, you should expect Rook to create the filesystem. Please check the Rook operator logs and the filesystem CR (-oyaml) for more details.
Yes, if the filesystems are not created using the Rook CRD, the admin is expected to create the filesystem manually before creating the PVC.
@Madhu-1, thanks for your reply. I'm still facing an issue where Rook can create the CephFilesystem on external Ceph using the CR, but the PVC is in Pending state (Error: Connection timed out: "error calling ceph_mount"):
root@ip-10-0-0-111:/home/ubuntu/rook/deploy/examples/csi/cephfs# kubectl -n rook-ceph logs csi-cephfsplugin-provisioner-874864dcb-rcfvl -c csi-cephfsplugin
E1230 10:46:29.027085 1 utils.go:185] ID: 61 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:46:33.036776 1 controllerserver.go:172] ID: 62 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:46:33.036828 1 utils.go:185] ID: 62 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:46:41.042477 1 controllerserver.go:172] ID: 63 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:46:41.042531 1 utils.go:185] ID: 63 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:46:57.052598 1 controllerserver.go:172] ID: 64 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:46:57.052647 1 utils.go:185] ID: 64 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:47:27.798255 1 controllerserver.go:172] ID: 66 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:47:27.798302 1 utils.go:185] ID: 66 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:47:29.059140 1 controllerserver.go:172] ID: 67 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:47:29.059188 1 utils.go:185] ID: 67 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:49:37.064327 1 controllerserver.go:172] ID: 70 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:49:37.064373 1 utils.go:185] ID: 70 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a already exists
E1230 10:50:04.949580 1 volume.go:163] ID: 54 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a failed to create subvolume group csi, for the vol csi-vol-62fa7f7f-695d-11ec-a830-826f0b6db424: rados: ret=-110, Connection timed out: "error calling ceph_mount"
E1230 10:50:04.949702 1 controllerserver.go:100] ID: 54 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a failed to create volume pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a: rados: ret=-110, Connection timed out: "error calling ceph_mount"
E1230 10:50:04.956644 1 utils.go:185] ID: 54 Req-ID: pvc-ca70d814-03de-45c7-b843-a90e4cb13a2a GRPC error: rpc error: code = Internal desc = rados: ret=-110, Connection timed out: "error calling ceph_mount"
E1230 10:53:20.792641 1 controllerserver.go:172] ID: 74 Req-ID: pvc-1b28e07a-0b3a-4b84-8398-7f6d3e65070a an operation with the given Volume ID pvc-1b28e07a-0b3a-4b84-8398-7f6d3e65070a already exists
E1230 10:53:20.792690 1 utils.go:185] ID: 74 Req-ID: pvc-1b28e07a-0b3a-4b84-8398-7f6d3e65070a GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-1b28e07a-0b3a-4b84-8398-7f6d3e65070a already exists
Any help would be appreciated.
Can you paste the cephfilesystem CR -oyaml output, and also the ceph fs ls
output from the toolbox pod?
Hi @Madhu-1,
please find below the ceph fs ls
output from the toolbox container, and here is the link for the filesystem.yaml I deployed.
[rook@rook-ceph-tools-67d7dcc778-4qcrf /]$ ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]
name: mayank, metadata pool: cephfs.mayank.meta, data pools: [cephfs.mayank.data ]
name: myfs, metadata pool: myfs-metadata, data pools: [myfs-replicated ]
# kubectl describe cephfilesystem -n rook-ceph-external
Name:         myfs
Namespace:    rook-ceph-external
Labels:       <none>
Annotations:  <none>
API Version:  ceph.rook.io/v1
Kind:         CephFilesystem
Metadata:
  Creation Timestamp:  2022-01-04T12:55:57Z
  Finalizers:
    cephfilesystem.ceph.rook.io
  Generation:  2
  Managed Fields:
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:metadataPool:
          .:
          f:parameters:
            .:
            f:compression_mode:
          f:replicated:
            .:
            f:requireSafeReplicaSize:
            f:size:
        f:metadataServer:
          .:
          f:activeCount:
          f:activeStandby:
          f:placement:
            .:
            f:podAntiAffinity:
              .:
              f:preferredDuringSchedulingIgnoredDuringExecution:
              f:requiredDuringSchedulingIgnoredDuringExecution:
        f:preserveFilesystemOnDelete:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2022-01-04T12:55:57Z
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"cephfilesystem.ceph.rook.io":
      f:spec:
        f:dataPools:
        f:metadataPool:
          f:erasureCoded:
            .:
            f:codingChunks:
            f:dataChunks:
          f:mirroring:
          f:quotas:
          f:statusCheck:
            .:
            f:mirror:
        f:metadataServer:
          f:resources:
        f:statusCheck:
          .:
          f:mirror:
    Manager:      rook
    Operation:    Update
    Time:         2022-01-04T12:55:57Z
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:phase:
    Manager:         rook
    Operation:       Update
    Subresource:     status
    Time:            2022-01-04T12:56:16Z
  Resource Version:  30454
  UID:               6f805b0d-d924-496a-9355-17ea9fc761be
Spec:
  Data Pools:
    Erasure Coded:
      Coding Chunks:  0
      Data Chunks:    0
    Failure Domain:   host
    Mirroring:
    Name:  replicated
    Parameters:
      compression_mode:  none
    Quotas:
    Replicated:
      Require Safe Replica Size:  true
      Size:                       3
    Status Check:
      Mirror:
  Metadata Pool:
    Erasure Coded:
      Coding Chunks:  0
      Data Chunks:    0
    Mirroring:
    Parameters:
      compression_mode:  none
    Quotas:
    Replicated:
      Require Safe Replica Size:  true
      Size:                       3
    Status Check:
      Mirror:
  Metadata Server:
    Active Count:    1
    Active Standby:  true
    Placement:
      Pod Anti Affinity:
        Preferred During Scheduling Ignored During Execution:
          Pod Affinity Term:
            Label Selector:
              Match Expressions:
                Key:       app
                Operator:  In
                Values:
                  rook-ceph-mds
            Topology Key:  topology.kubernetes.io/zone
          Weight:          100
        Required During Scheduling Ignored During Execution:
          Label Selector:
            Match Expressions:
              Key:       app
              Operator:  In
              Values:
                rook-ceph-mds
          Topology Key:  kubernetes.io/hostname
    Resources:
  Preserve Filesystem On Delete:  true
  Status Check:
    Mirror:
Status:
  Phase:  Ready
Events:   <none>
# kubectl get cephfilesystem myfs -n rook-ceph-external -o yaml
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"ceph.rook.io/v1","kind":"CephFilesystem","metadata":{"annotations":{},"name":"myfs","namespace":"rook-ceph-external"},"spec":{"dataPools":[{"failureDomain":"host","name":"replicated","parameters":{"compression_mode":"none"},"replicated":{"requireSafeReplicaSize":true,"size":3}}],"metadataPool":{"parameters":{"compression_mode":"none"},"replicated":{"requireSafeReplicaSize":true,"size":3}},"metadataServer":{"activeCount":1,"activeStandby":true,"annotations":null,"labels":null,"placement":{"podAntiAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"app","operator":"In","values":["rook-ceph-mds"]}]},"topologyKey":"topology.kubernetes.io/zone"},"weight":100}],"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"app","operator":"In","values":["rook-ceph-mds"]}]},"topologyKey":"kubernetes.io/hostname"}]}},"resources":null},"preserveFilesystemOnDelete":true}}
  creationTimestamp: "2022-01-04T12:55:57Z"
  finalizers:
  - cephfilesystem.ceph.rook.io
  generation: 2
  name: myfs
  namespace: rook-ceph-external
  resourceVersion: "30454"
  uid: 6f805b0d-d924-496a-9355-17ea9fc761be
spec:
  dataPools:
  - erasureCoded:
      codingChunks: 0
      dataChunks: 0
    failureDomain: host
    mirroring: {}
    name: replicated
    parameters:
      compression_mode: none
    quotas: {}
    replicated:
      requireSafeReplicaSize: true
      size: 3
    statusCheck:
      mirror: {}
  metadataPool:
    erasureCoded:
      codingChunks: 0
      dataChunks: 0
    mirroring: {}
    parameters:
      compression_mode: none
    quotas: {}
    replicated:
      requireSafeReplicaSize: true
      size: 3
    statusCheck:
      mirror: {}
  metadataServer:
    activeCount: 1
    activeStandby: true
    placement:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - rook-ceph-mds
            topologyKey: topology.kubernetes.io/zone
          weight: 100
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - rook-ceph-mds
          topologyKey: kubernetes.io/hostname
    resources: {}
  preserveFilesystemOnDelete: true
  statusCheck:
    mirror: {}
status:
  phase: Ready
@Madhu-1, the command below also hangs on the external Ceph cluster as well as from the toolbox:
ceph fs subvolumegroup ls myfs
I tried with CNI: calico and with the default CNI: canal; in both cases it is stuck.
I created 2 filesystems:
1) mayank (using the ceph CLI: # ceph fs create mayank)
2) myfs (using the CR)
Below is the output regarding both filesystems' subvolumes from the Ceph cluster:
[rook@rook-ceph-tools-67d7dcc778-4qcrf /]$ ceph fs subvolumegroup ls cephfs
[]
# filesystem created via toolbox-job using ceph cli
[root@ip-10-0-0-224 /]# ceph fs subvolumegroup ls mayank
[
{
"name": "_deleting"
},
{
"name": "csi"
}
]
[root@ip-10-0-0-114 /]# ceph fs subvolumegroup ls myfs
Error ETIMEDOUT: error calling ceph_mount
Maybe that will help.
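For what it's worth, when ceph fs subvolumegroup ls times out with "error calling ceph_mount", a few standard Ceph CLI checks (run from the toolbox pod, against a live cluster) can help narrow down whether the MDS or the client connection is at fault. This is only a diagnostic sketch; the filesystem name "myfs" is taken from this thread:

```shell
# Overall cluster state; look for laggy or unavailable MDS daemons
# and for monitor/OSD warnings that would affect mounting.
ceph -s

# Per-filesystem MDS status for the filesystem the subvolumegroup
# command is hanging on (assumes the name "myfs" from this thread).
ceph fs status myfs

# Summary of the MDS map: how many daemons are up/active/standby.
ceph mds stat
```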
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
So what should I do if I encounter this? Delete the filesystem and create a new one with a new name?
You need to create the subvolumegroup after creating the filesystem.
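Concretely, that can be done from the toolbox pod with the standard Ceph CLI. A minimal sketch, assuming the filesystem is named "myfs" and the group name is the CSI default "csi" (both as in this thread):

```shell
# Create the "csi" subvolumegroup that the CSI driver expects,
# on the filesystem named "myfs" (name assumed from this thread).
ceph fs subvolumegroup create myfs csi

# Verify the group now appears in the listing.
ceph fs subvolumegroup ls myfs
```

After this, retrying the pending PVC should let provisioning proceed, since the fs subvolume create calls no longer fail on the missing group.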
@Zvezdoreel This bug is no longer valid with the latest Rook release, as it has been fixed.
Deviation from expected behavior:
The CSI driver is not creating the subvolumegroup called csi.
Expected behavior: The CSI driver should create the subvolumegroup called csi.
How to reproduce it (minimal and precise):
Create a CephFilesystem with the following: https://github.com/rook/rook/blob/master/cluster/examples/kubernetes/ceph/filesystem.yaml
Afterwards, create a storageclass "rook-cephfs" which makes use of the filesystem called "myfs" created above. https://github.com/rook/rook/blob/master/cluster/examples/kubernetes/ceph/csi/cephfs/storageclass.yaml
Lastly, create the PVC and test deployment https://github.com/rook/rook/blob/master/cluster/examples/kubernetes/ceph/csi/cephfs/kube-registry.yaml
File(s) to submit:
All files as per links above.
Environment:
- Kernel (e.g. uname -a):
- Rook version (use rook version inside of a Rook Pod): version 1.4
- Ceph version (use ceph -v): ceph/ceph:v15.2.4
- Kubernetes version (use kubectl version): 1.8
- Storage backend status (e.g. ceph health in the Rook Ceph toolbox): Health ok