[BUG] Volumes stuck with filesystem errors, requires manual fsck

clbx commented 12 months ago

Describe the bug (🐛 if you encounter this issue)

Longhorn volumes sometimes get stuck with filesystem errors after our cluster comes back up from a nightly shutdown:

MountVolume.MountDevice failed for volume "pvc-d88f0bc7-8004-4498-98f0-15dd1739579a" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-d88f0bc7-8004-4498-98f0-15dd1739579a but could not correct them: fsck from util-linux 2.37.2 /dev/longhorn/pvc-d88f0bc7-8004-4498-98f0-15dd1739579a contains a file system with errors, check forced. /dev/longhorn/pvc-d88f0bc7-8004-4498-98f0-15dd1739579a: Inode 1704298 has an invalid extent node (blk 6849095, lblk 0) /dev/longhorn/pvc-d88f0bc7-8004-4498-98f0-15dd1739579a: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. (i.e., without -a or -p options)

Attaching the volume to a node and then running fsck.ext4 fixes the problem, but this is a large cluster and doing that manually is time-consuming.
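As a rough illustration of the manual workaround described above, the following Python sketch pulls the device path out of the mount error and builds the matching fsck.ext4 command. The helper names (`device_from_error`, `build_fsck_command`, `run_fsck`) are hypothetical and not part of Longhorn. It assumes the volume is already attached to the node where the script runs and is not mounted; the attach step itself is not covered here.

```python
import re
import subprocess
from typing import List, Optional

# Matches the device path in the mount error quoted in this issue.
DEVICE_RE = re.compile(r"found errors on device (/dev/longhorn/[\w-]+)")


def device_from_error(message: str) -> Optional[str]:
    """Return the /dev/longhorn/... path named in a mount error, if any."""
    match = DEVICE_RE.search(message)
    return match.group(1) if match else None


def build_fsck_command(device: str, repair: bool = False) -> List[str]:
    """Build an fsck.ext4 command line.

    -f forces a check, -n opens the filesystem read-only and answers "no"
    to every prompt, -y answers "yes" to every repair prompt.
    """
    return ["fsck.ext4", "-f", "-y" if repair else "-n", device]


def run_fsck(device: str, repair: bool = False) -> int:
    """Run fsck.ext4 and return its exit code (0 means no errors found)."""
    return subprocess.run(build_fsck_command(device, repair), check=False).returncode
```

Defaulting to the read-only `-n` pass keeps the sketch from changing anything unless `repair=True` is passed explicitly, since `-y` accepts every proposed fix without review.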

To Reproduce

Create volumes for a service with frequent reads and writes. The deployments where we have seen this behavior most often are Elasticsearch and PostgreSQL.

Restart all nodes

Expected behavior

Longhorn volume is mounted normally.

Support bundle for troubleshooting

I cannot provide a full support bundle, but I can provide individual logs of anything specific.

Environment

Longhorn version: 1.5.2
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE v1.23.6+rke2r2
- Number of management node in the cluster: 5
- Number of worker node in the cluster: 45
Node config
- OS type and version: Red Hat Enterprise Linux 8.8
- Kernel version: 4.18.0-477.15.1.el8_8.x86_64
- CPU per node: 16
- Memory per node: 64GB
- Disk type(e.g. SSD/NVMe/HDD): SSD
- Network bandwidth between the nodes: Unknown
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS EC2 VMs, OpenStack VMs
Number of Longhorn volumes in the cluster: 53
Impacted Longhorn resources:
- Volume names:

Additional context

This is a very old installation of Longhorn from about 2 years ago and has been updated to 1.5.2 recently.

mantissahz commented 12 months ago

Could you provide the support bundle?

cc @derekbit @shuo-wu

derekbit commented 12 months ago

In the last step to reproduce, does the "restart node" simulate an unexpected power outage?

clbx commented 12 months ago

@mantissahz I cannot provide the support bundle due to some restrictions in our environments. I may be able to provide it if there's a way to share it privately with maintainers. Since it's so large, it's difficult to redact the sensitive parts of it. If there are specific files I can grab out of it, that would make it easier. I apologize for the difficulty this introduces!

@derekbit All our EC2 nodes turn off at night as a cost-saving measure. When they come back up in the morning, the affected volumes are often those for Elasticsearch or Postgres, I assume due to the high usage of those services compared to others.

clbx commented 10 months ago

Hello, I know the inability to provide the support bundle makes it a bit more challenging, but if there's anything we can provide specifically to help diagnose this issue, it would be greatly appreciated.

clbx commented 6 months ago

We are still experiencing this issue on 1.6.1 across multiple environments in AWS and OpenStack. We install 1.2.1 and then upgrade from 1.2.1 to the newest version at installation time.

w3blogfr commented 2 months ago

Hello, I have the same issue with 1.6.2 and I can provide a support bundle: supportbundle_2dc1d722-9417-4a81-8374-46970b16b8b9_2024-10-03T11-18-15Z.zip

I have 33 clusters (VMware) with different kinds of hardware. I have the issue with a Strimzi Kafka node volume, but also with some very small volumes without a lot of I/O.

Regards.

derekbit commented 1 month ago

> @derekbit All our EC2 nodes turn off at night as a cost-saving measure. When they come back up in the morning, the affected volumes are often those for Elasticsearch or Postgres, I assume due to the high usage of those services compared to others.

@clbx Can you try to scale down the deployment before shutting down the machines and see if the issue remains?
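
A minimal sketch of that suggestion, assuming `kubectl` is on the PATH and the workloads are StatefulSets or Deployments. The workload names and namespaces in the usage note are hypothetical placeholders, and the helper names are not part of Longhorn or Kubernetes. The idea is to scale the I/O-heavy workloads to zero replicas before the nightly shutdown so the volumes are unmounted and detached cleanly.

```python
import subprocess
from typing import Iterable, List, Tuple

# (kind, name, namespace), for example ("statefulset", "elasticsearch", "logging")
Workload = Tuple[str, str, str]


def build_scale_command(kind: str, name: str, namespace: str, replicas: int) -> List[str]:
    """Build a `kubectl scale` command for one workload."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return ["kubectl", "scale", f"{kind}/{name}", f"--replicas={replicas}", "-n", namespace]


def scale_workloads(workloads: Iterable[Workload], replicas: int,
                    dry_run: bool = True) -> List[List[str]]:
    """Scale every workload; with dry_run=True only print the commands."""
    commands = [build_scale_command(k, n, ns, replicas) for k, n, ns in workloads]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands
```

For example, `scale_workloads([("statefulset", "elasticsearch", "logging")], replicas=0)` prints the command that would run before shutdown. The original replica counts need to be recorded separately so they can be restored after the nodes come back up.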

derekbit commented 1 month ago

> Hello, I have the same issue with 1.6.2 and I can provide a support bundle: supportbundle_2dc1d722-9417-4a81-8374-46970b16b8b9_2024-10-03T11-18-15Z.zip
>
> I have 33 clusters (VMware) with different kinds of hardware. I have the issue with a Strimzi Kafka node volume, but also with some very small volumes without a lot of I/O.
>
> Regards.

@w3blogfr Did this happen after an abnormal shutdown?

longhorn / longhorn