Open: clbx opened this issue 12 months ago
Could you provide the support bundle?
cc @derekbit @shuo-wu
In the last reproduction step, does "restart node" simulate an unexpected power outage?
@mantissahz I cannot provide the support bundle because of restrictions in our environments. I may be able to provide it if there is a way to share it privately with the maintainers. Since it's so large, it's difficult to redact the sensitive parts. If there are specific files I can pull out of it, that would make it easier. I apologize for the difficulty this introduces!
@derekbit All our EC2 nodes are shut down at night as a cost-saving measure. When they come back up in the morning, the affected volumes are often the Elasticsearch or Postgres ones, I assume because of the high usage of those services compared to others.
Hello, I know that not being able to provide the support bundle makes this more challenging, but if there is anything specific we can provide to help diagnose this issue, please let us know; it would be greatly appreciated.
We are still experiencing this issue on 1.6.1 across multiple environments in AWS and OpenStack. On each new installation, we install 1.2.1 and then upgrade from 1.2.1 to the newest release.
Hello, I have the same issue with 1.6.2, and I can provide a support bundle: supportbundle_2dc1d722-9417-4a81-8374-46970b16b8b9_2024-10-03T11-18-15Z.zip
I have 33 clusters (VMware) on different kinds of hardware. I have the issue with a Strimzi Kafka node volume, but also with some very small volumes that see little I/O.
Regards.
@clbx Can you try to scale down the deployment before shutting down the machines and see if the issue remains?
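A minimal sketch of the suggestion above, assuming the affected workloads are StatefulSets managed with kubectl. The workload names, replica counts, and namespace are hypothetical placeholders. The sketch only builds `kubectl scale` commands and prints them (dry run by default), so the same inventory can scale everything to zero before the nightly shutdown and restore the saved replica counts in the morning.

```python
import shlex
import subprocess
from typing import Dict, List

# Hypothetical inventory: "kind/name" -> replica count to restore after startup.
WORKLOADS: Dict[str, int] = {
    "statefulset/elasticsearch-master": 3,
    "statefulset/postgresql": 1,
}


def scale_commands(workloads: Dict[str, int], namespace: str, down: bool) -> List[List[str]]:
    """Build kubectl scale commands: 0 replicas before shutdown, saved counts after."""
    commands = []
    for workload, replicas in workloads.items():
        target = 0 if down else replicas
        commands.append(
            ["kubectl", "scale", workload, f"--replicas={target}", "-n", namespace]
        )
    return commands


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Before the nightly shutdown: stop the pods so their volumes can detach
    # before the nodes power off. "data" is a placeholder namespace.
    run(scale_commands(WORKLOADS, namespace="data", down=True))
```

After the nodes come back, calling `run(scale_commands(WORKLOADS, namespace="data", down=False), dry_run=False)` would restore the saved replica counts.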
@w3blogfr Did this happen after an abnormal shutdown?
Describe the bug (🐛 if you encounter this issue)
Longhorn volumes sometimes get stuck with filesystem errors after our cluster comes back up from a nightly shutdown:
Attaching the volume to a node and then running fsck.ext4 fixes the problem, but this is a large cluster and doing that manually is time-consuming.
To Reproduce
Create volumes for a service with frequent reads and writes; the deployments where we have most commonly seen this behavior are Elasticsearch and PostgreSQL.
Restart all nodes
Expected behavior
Longhorn volume is mounted normally.
Support bundle for troubleshooting
I cannot provide a full support bundle, but I can provide individual logs of anything specific.
Environment
Additional context
This is an old Longhorn installation, set up about two years ago, that was recently upgraded to 1.5.2.
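The manual workaround from the bug report (attach the volume to a node, then run fsck.ext4) could be scripted across many volumes. Below is a minimal Python sketch, assuming each volume is already attached to the node running the script, is not mounted by any workload, and appears at /dev/longhorn/<volume-name> (the path Longhorn's block-device frontend uses). The volume names are placeholders. The sketch defaults to a read-only check (-n) and only repairs with -p ("preen") when explicitly asked; the exit codes are decoded using the bitmask documented in the e2fsck man page.

```python
import subprocess
from typing import Dict, List

# Assumption: attached Longhorn volumes are exposed as /dev/longhorn/<volume-name>.
DEVICE_DIR = "/dev/longhorn"

# e2fsck exit status bits, as documented in the e2fsck man page.
EXIT_BITS = {
    1: "errors corrected",
    2: "errors corrected, reboot suggested",
    4: "errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "cancelled by user",
    128: "shared library error",
}


def fsck_command(volume: str, repair: bool = False) -> List[str]:
    """Build an fsck.ext4 command for an attached, unmounted volume.

    -f forces a check even if the filesystem is marked clean.
    -n opens it read-only and answers "no" to every question;
    -p automatically fixes only problems that are safe to fix.
    """
    mode = "-p" if repair else "-n"
    return ["fsck.ext4", "-f", mode, f"{DEVICE_DIR}/{volume}"]


def describe_exit(code: int) -> str:
    """Decode the e2fsck exit status bitmask into readable text."""
    if code == 0:
        return "no errors"
    return "; ".join(text for bit, text in EXIT_BITS.items() if code & bit)


def check_volumes(volumes: List[str], repair: bool = False) -> Dict[str, str]:
    """Run fsck.ext4 on each volume and collect the decoded result per volume."""
    results = {}
    for volume in volumes:
        proc = subprocess.run(fsck_command(volume, repair), capture_output=True, text=True)
        results[volume] = describe_exit(proc.returncode)
    return results
```

Running the read-only check first and reviewing which volumes report "errors left uncorrected" keeps a human in the loop before any repair is attempted.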