juicedata / juicefs-csi-driver

JuiceFS CSI Driver
https://github.com/juicedata/juicefs

[BUG] When a large number of pods created using juicefs sc are deleted within a similar period of time, the juicefs-csi-controller-0 will crash #883

Open · huangguoqiang opened this issue 8 months ago

huangguoqiang commented 8 months ago

What happened:

volume for claim "senseauto-prod/df-30276-batch-82015-gh22t-juicefs-pv"
I0229 02:12:42.613307 1 controller.go:562] CreateVolumeRequest {Name:pvc-b31a44e3-13e3-4738-bd85-bb66453cb7bf CapacityRange:required_bytes:322122547200 VolumeCapabilities:[mount:<fs_type:"ext4" mount_flags:"cache-size=0" > access_mode: ] Parameters:map[csi.storage.k8s.io/controller-expand-secret-name:juicefs-sc-secret csi.storage.k8s.io/controller-expand-secret-namespace:juicefs csi.storage.k8s.io/node-publish-secret-name:juicefs-sc-secret csi.storage.k8s.io/node-publish-secret-namespace:juicefs csi.storage.k8s.io/provisioner-secret-name:juicefs-sc-secret csi.storage.k8s.io/provisioner-secret-namespace:juicefs juicefs/mount-cpu-limit:5000m juicefs/mount-cpu-request:1000m juicefs/mount-image:juicedata/mount:ce-v1.1.1 juicefs/mount-memory-limit:32Gi juicefs/mount-memory-request:1Gi] Secrets:map[] VolumeContentSource: AccessibilityRequirements: XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0229 02:12:42.811373 1 request.go:565] Throttling request took 12.391808769s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-51291945-deab-4050-9dc8-02885613300b
I0229 02:12:43.011378 1 request.go:565] Throttling request took 12.591789695s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-14ee2521-2d7e-4f5f-9560-53f890f0e924
I0229 02:12:43.211369 1 request.go:565] Throttling request took 12.79177088s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-cb8bad6b-496f-47c8-a080-5894b8ea9402
I0229 02:12:43.411376 1 request.go:565] Throttling request took 12.991757135s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-defd0b4d-c1bb-448f-a38f-6baa4ee01f37
I0229 02:12:43.611363 1 request.go:565] Throttling request took 13.191731181s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-dec9d20d-e2c9-4465-9dc4-3ac6c6c2dfd0
I0229 02:12:43.811379 1 request.go:565] Throttling request took 13.391737676s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-7056fccb-dc36-4652-831c-dce277d8a148
I0229 02:12:44.011365 1 request.go:565] Throttling request took 13.591723711s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-c52611fe-88e3-4176-9110-1c397a56f161
I0229 02:12:44.211372 1 request.go:565] Throttling request took 13.791720147s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-e0cb880e-580e-43c1-a48a-6ee50919aac2
I0229 02:12:44.411375 1 request.go:565] Throttling request took 13.991711372s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-fa22183b-7814-4d3d-8208-df85d66978bc
I0229 02:12:44.611376 1 request.go:565] Throttling request took 14.191697747s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-83e71675-9edc-4a2f-aee3-be1263f72ff8
I0229 02:12:44.811372 1 request.go:565] Throttling request took 14.38727893s, request: GET:https://10.43.0.1:443/api/v1/namespaces/juicefs/secrets/juicefs-sc-secret
I0229 02:12:44.814678 1 connection.go:182] GRPC call: /csi.v1.Controller/DeleteVolume
I0229 02:12:44.814689 1 connection.go:183] GRPC request: {"secrets":"stripped","volume_id":"pvc-cdf0b1f4-6292-4b4e-b215-b3730a535af8"}
I0229 02:12:45.011381 1 request.go:565] Throttling request took 14.587279976s, request: GET:https://10.43.0.1:443/api/v1/namespaces/juicefs/secrets/juicefs-sc-secret
I0229 02:12:45.013416 1 connection.go:182] GRPC call: /csi.v1.Controller/DeleteVolume
I0229 02:12:45.013432 1 connection.go:183] GRPC request: {"secrets":"stripped","volume_id":"pvc-9111cf4e-516d-41de-9475-afd7ec7e7372"}
I0229 02:12:45.211412 1 request.go:565] Throttling request took 14.786592611s, request: GET:https://10.43.0.1:443/api/v1/namespaces/juicefs/secrets/juicefs-sc-secret
I0229 02:12:45.214167 1 connection.go:182] GRPC call: /csi.v1.Controller/DeleteVolume
I0229 02:12:45.214185 1 connection.go:183] GRPC request: {"secrets":"stripped","volume_id":"pvc-dcb330d4-b39f-4a40-ab3f-4f25858c5a30"}
I0229 02:12:45.222050 1 leaderelection.go:288] failed to renew lease juicefs/csi-juicefs-com: failed to tryAcquireOrRenew context deadline exceeded
F0229 02:12:45.222073 1 leader_election.go:169] stopped leading

juicefs-csi-controller-0_csi-provisioner.log

Environment:

JuiceFS CSI Driver version (which image tag did your CSI Driver use): v0.23.2
Kubernetes version (e.g. kubectl version): v1.21.11 +rke2r1
Object storage (cloud provider and region): ceph
Metadata engine info (version, cloud provider managed or self maintained): redis sentinel

zxh326 commented 8 months ago

It looks like an issue with csi-provisioner.

Would you be willing to try using the juicefs provisioner to see if there are any problems in this case?
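A minimal sketch of how that switch could be made, assuming the StatefulSet behind the juicefs-csi-controller-0 pod in the logs is named juicefs-csi-controller and runs in kube-system (both the namespace and the container name are assumptions; verify them in your install first):

kubectl -n kube-system edit statefulset juicefs-csi-controller
# In the editor, add the flag discussed below to the JuiceFS plugin
# container's args (container name varies by install, so check it first):
#   args:
#     - --provisioner=true
# Then watch the controller pod come back with the new argument:
kubectl -n kube-system get pods -w | grep juicefs-csi-controller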

huangguoqiang commented 8 months ago

csi-provisioner

If I set --provisioner=true, will it affect the mount pods that are already running?

zxh326 commented 8 months ago

csi-provisioner

If I set --provisioner=true, will it affect the mount pods that are already running?

The provisioner only works when creating dynamic PVs.
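In other words, the provisioner only acts when a new PVC against the JuiceFS StorageClass has to be turned into a PV; already-mounted volumes keep their existing mount pods. A minimal sketch of such a dynamically provisioned claim, assuming the StorageClass is named juicefs-sc (only its secret juicefs-sc-secret appears in this thread, so the name is an assumption):

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-juicefs-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: juicefs-sc   # assumed StorageClass name
  resources:
    requests:
      storage: 10Gi
EOF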

huangguoqiang commented 8 months ago

There is another phenomenon behind the earlier failures: PV deletion fails, and once too many deletions have failed, the csi-provisioner starts throttling its own API requests (e.g. "Throttling request took 14.786592611s"). To recover, the PVs stuck in Released status have to be deleted manually. Should the PV deletion failure be solved first? @zxh326
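A minimal sketch of that manual cleanup, listing the PVs stuck in Released status so they can be checked and removed by hand (double-check each PV belongs to the JuiceFS StorageClass before deleting it):

# List PVs whose status is Released:
kubectl get pv -o jsonpath='{range .items[?(@.status.phase=="Released")]}{.metadata.name}{"\n"}{end}'
# Delete a stuck PV once it is confirmed safe to remove:
kubectl delete pv <pv-name>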

======================
W0229 09:15:17.988118 1 controller.go:965] Retrying syncing volume "pvc-531990c9-fc9f-4401-a6e6-a06c5941e299", failure 1
E0229 09:15:17.988143 1 controller.go:983] error syncing volume "pvc-531990c9-fc9f-4401-a6e6-a06c5941e299": rpc error: code = Internal desc = Could not delVol in juicefs: waitUtilJobCompleted: job juicefs-eb886745-delvol isn't completed: 2024/02/29 09:14:38.614049 juicefs[7] : Data use s3://sh1984_juicefs_storage/adjfs/ [format.go:471]

(screenshot attached: WX20240229-185840)

Logs from the delvol (delete volume) pod:

[root@SH-IDC1-10-198-5-32 ~]# kubectl logs -n juicefs juicefs-8ff6b8e3-delvol-8s7n4 -f
2024/02/29 11:19:33.940436 juicefs[7] : Meta address: redis://:@mymaster,redis-node-0.redis-headless.juicefs.svc.cluster.local,redis-node-1.redis-headless.juicefs.svc.cluster.local,redis-node-2.redis-headless.juicefs.svc.cluster.local:26379/2 [interface.go:497]
redis: 2024/02/29 11:19:33 sentinel.go:685: sentinel: new master="mymaster" addr="redis-node-1.redis-headless.juicefs.svc.cluster.local:6379"
2024/02/29 11:19:33.953854 juicefs[7] : Ping redis latency: 913.101µs [redis.go:3593]
2024/02/29 11:19:33.954954 juicefs[7] : Data use s3://sh1984_juicefs_storage/adjfs/ [format.go:471]
2024/02/29 11:19:34.009501 juicefs[7] : Volume is formatted as {
  "Name": "adjfs",
  "UUID": "cd03947d-b64e-4af0-b5ee-bfdb4b41f097",
  "Storage": "s3",
  "Bucket": "xxxx",
  "AccessKey": "xxxx",
  "SecretKey": "removed",
  "BlockSize": 4096,
  "Compression": "none",
  "KeyEncrypted": true,
  "TrashDays": 0,
  "MetaVersion": 1
} [format.go:508]
2024/02/29 11:19:34.464557 juicefs[32] : Meta address: redis://:@mymaster,redis-node-0.redis-headless.juicefs.svc.cluster.local,redis-node-1.redis-headless.juicefs.svc.cluster.local,redis-node-2.redis-headless.juicefs.svc.cluster.local:26379/2 [interface.go:497]
redis: 2024/02/29 11:19:34 sentinel.go:685: sentinel: new master="mymaster" addr="redis-node-1.redis-headless.juicefs.svc.cluster.local:6379"
2024/02/29 11:19:34.470655 juicefs[32] : Ping redis latency: 786.09µs [redis.go:3593]
2024/02/29 11:19:34.472244 juicefs[32] : Data use s3://sh1984_juicefs_storage/adjfs/ [mount.go:605]
2024/02/29 11:19:34.472303 juicefs[32] : cache-size is 0, writeback and prefetch will be disabled [cached_store.go:540]
2024/02/29 11:19:35.475704 juicefs[32] : OK, adjfs is ready at /mnt/jfs [mount_unix.go:48]
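Since the controller error above reports waitUtilJobCompleted: job ... isn't completed, it may also help to look at the delvol Job object itself rather than only the pod log; a sketch, with the Job name inferred from the pod name above (so treat it as an assumption):

# List delete-volume jobs in the juicefs namespace:
kubectl -n juicefs get jobs | grep delvol
# Inspect events and completion status of one job:
kubectl -n juicefs describe job juicefs-8ff6b8e3-delvol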