longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] Can't open blockdev on pvc #9381

Open wirwolf opened 1 week ago

wirwolf commented 1 week ago

Describe the bug

Kubernetes cannot mount the PVC into the pod.

MountVolume.MountDevice failed for volume "pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o defaults /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/2aee5e3168ae66ffb6370f8c2327cfa99670f20799afc2d02734e18ab89fd687/globalmount
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/2aee5e3168ae66ffb6370f8c2327cfa99670f20799afc2d02734e18ab89fd687/globalmount: /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b already mounted or mount point busy.
dmesg(1) may have more information after failed mount system call.

To Reproduce

Deploy Prometheus Operator with PVCs for Grafana, Prometheus, and Alertmanager. Sometimes this error happens, and I can fix it only by removing the PVC and creating a new one.
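For context, a minimal sketch of such a deployment, assuming the kube-prometheus-stack Helm chart and a StorageClass named longhorn (the report does not specify the exact installation method or values):

# sketch only; the reporter's exact chart and values are not given in the issue
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=longhorn \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=5Gi \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.storageClassName=longhorn \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.storageClassName=longhorn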

Expected behavior

The PVC is attached to the pod normally and the pod starts.

Support bundle for troubleshooting

Environment

Additional context

In the dmesg log I see this error:

[55156.642229] /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b: Can't open blockdev
[55278.756608] /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b: Can't open blockdev
[55400.920058] /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b: Can't open blockdev
[55523.040555] /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b: Can't open blockdev
[55645.189481] /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b: Can't open blockdev
[55767.341775] /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b: Can't open blockdev
[55889.445965] /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b: Can't open blockdev
[56011.582131] /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b: Can't open blockdev
[56133.715999] /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b: Can't open blockdev
[56255.859577] /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b: Can't open blockdev
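"Can't open blockdev" usually means something else holds the kernel block device exclusively (multipathd is a common culprit). A sketch of checks that could identify the holder, using the device path from the log above:

# compare major:minor numbers to find the backing SCSI device
ls -l /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b /dev/sd*
# show any child devices (e.g. a device-mapper node) layered on top
lsblk /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b
# check whether multipathd has claimed the device
multipath -ll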

Other tests

root@worker3:~# sfdisk  -l -- /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b
Disk /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b: 5 GiB, 5368709120 bytes, 10485760 sectors
Disk model: VIRTUAL-DISK    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Also, I see this error in the UI:

failed to list snapshot: cannot get client for volume pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b on node worker3.k8s: engine is not running

PS. I tried to attach the PVC in maintenance mode, and I can do this only with these commands:

# udevadm info --name=/dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b
# udevadm trigger --action=remove /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b
# udevadm info --name=/dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b
# fsck.ext4 /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b
e2fsck 1.47.0 (5-Feb-2023)
/dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b: recovering journal
Setting free inodes count to 327660 (was 327669)
Setting free blocks count to 1267939 (was 1268642)
/dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b: clean, 20/327680 files, 42781/1310720 blocks
# mount /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b /mnt

but after detaching, the problem is not fixed.

Workaround and Mitigation

derekbit commented 1 week ago

Have you checked https://longhorn.io/docs/1.7.0/important-notes/#unable-to-attach-volumes-created-before-v152-and-v144?

wirwolf commented 1 week ago

Command output: Safe to upgrade to v1.7.0.

kubectl -n longhorn-system get engines.longhorn.io
NAME                                           DATA ENGINE   STATE     NODE      INSTANCEMANAGER                                     IMAGE                               AGE
pvc-4b323020-99a9-4ee1-8ad1-5de7673fc2d2-e-0   v1            running   worker3   instance-manager-b903b8ccc9c9ea7995838647bff301f3   longhornio/longhorn-engine:v1.7.0   21h
pvc-9110c94f-de6d-489c-9e64-bdc500dffce6-e-0   v1            running   worker3   instance-manager-b903b8ccc9c9ea7995838647bff301f3   longhornio/longhorn-engine:v1.7.0   87m
pvc-91f76d1b-d48b-402a-b97c-edb55f0ff8a6-e-0   v1            running   worker2   instance-manager-b82f0ab4097a85d8d396e69854864fbb   longhornio/longhorn-engine:v1.7.0   21h
pvc-9a0f12da-e1ea-4f5a-85c5-aed0e24c39ef-e-0   v1            running   worker3   instance-manager-b903b8ccc9c9ea7995838647bff301f3   longhornio/longhorn-engine:v1.7.0   18h

Also, I removed the PVC pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b, and after redeploying, Grafana created a new PVC pvc-9110c94f-de6d-489c-9e64-bdc500dffce6 with the same problem.

derekbit commented 1 week ago

OK.

/dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b already mounted or mount point busy. 

Please check https://longhorn.io/kb/troubleshooting-volume-with-multipath/
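For reference, the fix that article describes is to stop multipathd from claiming the devices Longhorn attaches; a minimal sketch of the blacklist from the linked page (verify the exact configuration against the article for your setup):

# /etc/multipath.conf
blacklist {
    devnode "^sd[a-z0-9]+"
}

# apply the change
systemctl restart multipathd.service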

wirwolf commented 1 week ago

I always apply that guide before installing Longhorn on a cluster. Can I try updating to 1.7.1 and check?

derekbit commented 1 week ago

I always apply that guide before installing Longhorn on a cluster.

All Longhorn nodes?

Can I try updating to 1.7.1 and check?

Sure.

derekbit commented 1 week ago

I always apply that guide before installing Longhorn on a cluster. Can I try updating to 1.7.1 and check?

Do you mean https://longhorn.io/kb/troubleshooting-volume-with-multipath/?

wirwolf commented 1 week ago

Yes, on all worker nodes I add the device blacklist before installing Longhorn.

PS. On 1.7.1, the problem was reproduced.

derekbit commented 1 week ago

Can you provide a support bundle?

wirwolf commented 1 week ago

I provided a support bundle and then removed the comment, because it contains private information about the cluster.

wirwolf commented 1 week ago

Hey, @derekbit. Do you have any news on my issue?

derekbit commented 1 week ago

I didn't receive the support bundle.

BTW, can you check which process is using the block device that produces the error ...already mounted or mount point busy.?
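For example (a sketch using standard Linux tools; the device path is the one from earlier in the thread):

# processes holding the filesystem mounted from the device
fuser -mv /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b
# open file handles on the device node itself
lsof +f -- /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b
# any existing mounts of the device
findmnt -S /dev/longhorn/pvc-7eaecbeb-ae6b-4776-97a9-c8fcaa3b7e0b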

wirwolf commented 4 days ago

@derekbit I sent you an email with the support bundle attached.