@janosmiko You might want to SSH into a node and see the logs. Please refer to the Debug section in the readme.
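For example, a minimal debugging sketch (assuming root SSH access to the node and the default k3s/k3s-agent service names; the IP placeholder is yours to fill in):
```
# SSH into the affected node (placeholder address)
ssh root@<node-ip>

# k3s agent logs on worker nodes (use "k3s" instead of "k3s-agent" on control-plane nodes)
journalctl -u k3s-agent --since "1 hour ago"

# kernel messages, useful for NFS/RWX mount errors
journalctl -k --since "1 hour ago"
```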
@aleksasiriski Maybe you would know something about that specific issue?
Hi @mysticaltech,
I found those logs in the journalctl of the node where the pod that needs the RWX volume is scheduled. Everything else is actually working well (e.g. a pod with an RWO volume works as expected).
@janosmiko Please check whether the nfs packages are installed. If not, make sure you are using the latest version of the node images.
See in the packer file how nfs is installed and do the same manually; if that solves it, we will have identified the problem.
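A rough sketch of doing that manually on MicroOS (assuming the package is simply missing; transactional-update creates a new snapshot, so a reboot is needed afterwards):
```
# install nfs-client into a new snapshot
transactional-update pkg install -y nfs-client

# reboot to activate the new snapshot
reboot
```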
Sure, it's installed:
```
dev-worker-1-autoscaled-medium-47604670eec959dc:/ # zypper search --installed-only nfs-client
Loading repository data...
Reading installed packages...

S  | Name       | Summary                   | Type
---+------------+---------------------------+--------
i+ | nfs-client | Support Utilities for NFS | package
```
Maybe this one is related? https://github.com/longhorn/longhorn/issues/6857
I just did a rollback to the previous version of MicroOS (it auto-upgraded at midnight) and now the issue is solved on that node. I found this bug report and I think they messed up the nfs client somehow: https://bugzilla.opensuse.org/show_bug.cgi?id=1214540
They definitely updated nfs-client in the last couple of days, from 2.6.3-39.4 to 2.6.3-39.5, possibly the issue lies there.
For anyone who faces the same issue:
```
# check your current snapshots
ls -lah /.snapshots/
...
drwxr-xr-x. 1 root root 66 Oct  8 00:13 33
...
drwxr-xr-x. 1 root root 66 Oct 11 19:08 45

# check the snapshot you want to roll back to
# (in my case 33, as that snapshot was from 3 days ago)
transactional-update rollback 33

# reboot when the task is done
reboot
```
If you want to disable the automatic system upgrades manually:
```
systemctl disable --now transactional-update.timer
```
@mysticaltech I think there's also a bug in the terraform module. I added these two lines after creating the cluster:
```
automatically_upgrade_k3s = false
automatically_upgrade_os  = false
```
And neither of them seems to work.
Same here. Disaster :-/
@janosmiko The upgrade flags are not retroactive, they take effect on the first deployment only. But see the upgrade section in the readme, you can disable it manually.
About the nfs-client, just freeze the version with zypper (via transactional-update shell).
After the version is frozen, you can let the upgrades run again.
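A rough sketch of what the freeze could look like (assuming the node already runs the working nfs-client version, e.g. after the rollback described earlier in this thread):
```
# open a shell in a new transactional snapshot
transactional-update shell

# inside that shell: lock nfs-client so future upgrades skip it
zypper addlock nfs-client
exit

# reboot to activate the snapshot containing the lock
reboot
```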
> The upgrade flags are not retroactive, they take effect on the first deployment only. But see the upgrade section in the readme, you can disable it manually.
Can these be applied to the autoscaled nodes too? E.g. I created the cluster with automated upgrades enabled, but now I want to disable them. If I change this in terraform and apply it to the cluster, will that ensure the autoscaled nodes are no longer created with automated upgrades, so that I only have to disable them manually on the already existing nodes?
Freezing the nfs-client can be a good solution for the already existing nodes, but autoscaled (newly created) nodes will be created using the new package version. :/
I reported it here: https://bugzilla.opensuse.org/show_bug.cgi?id=1216201
@janosmiko Yes, you can SSH into autoscaled nodes too. And what you could do is freeze the version at the packer level and publish the new snapshot (just apply packer again, see readme).
@janosmiko @Robert-turbo If you folks can give me the working version of the nfs-client, I will freeze it at the packer level. These kinds of packages do not need to be updated often. (Then you can just recreate the packer image, I will tell you how, just one command, so that all new nodes get a working version.)
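For reference, a hypothetical sketch of what that pin could look like inside the packer image's package-install step (the exact placement, and whether the older version is still resolvable from a repository, are assumptions on my part):
```
# hypothetical: downgrade to the known-good version and lock it
# (only works if 2.6.3-39.4 is still available in a configured repo or as a local RPM)
zypper install -y --oldpackage 'nfs-client=2.6.3-39.4'
zypper addlock nfs-client
```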
@mysticaltech
Working version:
```
S  | Name       | Type    | Version    | Arch   | Repository
---+------------+---------+------------+--------+-------------------
i+ | nfs-client | package | 2.6.3-39.4 | x86_64 | (System Packages)
```
Problematic version:
```
S  | Name       | Type    | Version    | Arch   | Repository
---+------------+---------+------------+--------+-------------------------
i+ | nfs-client | package | 2.6.3-39.5 | x86_64 | openSUSE-Tumbleweed-Oss
```
Folks, see the solution here: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/discussions/1018. A PR is also being merged right away to avoid this problem in the future.
Should be fixed in v2.8.0, but the image update is needed; please follow the steps laid out in https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/discussions/794.
(The solution was to install and freeze an older version of nfs-client; see the discussion linked above for manual fixes.)
Hi @mysticaltech, could you reopen this issue and wait for some feedback from real users, please? Just asking because it's a critical issue for those who use this in production...
And it still doesn't work. I even tested it by manually pinning only the nfs-client version on all my nodes, and the RWX Longhorn volumes are still not mounted. Also, Neil Brown mentioned in the related bug report that this package update (nfs-client 2.6.3-39.4 -> 2.6.3-39.5) doesn't contain any changes, so the issue must be in another package.
https://bugzilla.suse.com/show_bug.cgi?id=1216201#c2
Also, installing the x86_64 package on the ARM snapshots will not work. https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/pull/1026#pullrequestreview-1678861630
@mysticaltech It looks like downgrading the kernel-default package to 6.5.4-1.1 and a reboot solves the issue. You don't have to downgrade and pin the nfs-client package.
See the progress in the related bug report: https://github.com/longhorn/longhorn/issues/6857
Thanks for the details @janosmiko, you are right. Will revert the change to pin the version and wait for more feedback on this issue.
The changes to the base images pinning nfs-client were reverted in v2.8.1.
@janosmiko As this is a longhorn bug, there is nothing else we can do here, closing for now. Thanks again for all the research and the info.
It's not a Longhorn bug, but actually a bug in the Linux kernel.
For anyone who faces the same issue and wants a real (and tested) solution...
SSH to all your worker nodes and run these commands:
```
transactional-update shell
zypper install -y --oldpackage https://download.opensuse.org/history/20231008/tumbleweed/repo/oss/x86_64/kernel-default-6.5.4-1.1.x86_64.rpm
zypper addlock kernel-default
exit
touch /var/run/reboot-required
```
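After the reboot, a quick sanity check that the pin took effect could look like this (a sketch based on the commands above):
```
# the running kernel should now be the downgraded one (something like 6.5.4-1-default)
uname -r

# the package lock should be listed
zypper locks
```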
If you'd like to make sure the autoscaled nodes also get this pinned kernel, delete the previous snapshots from hcloud, then modify the packer config like this:
```
install_packages = <<-EOT
  set -ex
  echo "First reboot successful, installing needed packages..."
  transactional-update --continue pkg install -y ${local.needed_packages}
  transactional-update --continue shell <<- EOF
  setenforce 0
  rpm --import https://rpm.rancher.io/public.key
  zypper install -y https://github.com/k3s-io/k3s-selinux/releases/download/v1.4.stable.1/k3s-selinux-1.4-1.sle.noarch.rpm
  zypper addlock k3s-selinux
  zypper install -y "https://download.opensuse.org/history/20231008/tumbleweed/repo/oss/x86_64/kernel-default-6.5.4-1.1.x86_64.rpm"
  zypper addlock kernel-default
  restorecon -Rv /etc/selinux/targeted/policy
  restorecon -Rv /var/lib
  setenforce 1
  EOF
  sleep 1 && udevadm settle && reboot
EOT
```
Then rerun `packer init hcloud-microos-snapshots.pkr.hcl && packer build hcloud-microos-snapshots.pkr.hcl`, wait for the images to be built, and finally run `terraform apply` to update the cluster autoscaler config.
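To double-check the result, something like the following can be used (a sketch, assuming the hcloud CLI is configured and root SSH access to the nodes; the IP placeholder is hypothetical):
```
# list the freshly built MicroOS snapshots
hcloud image list --type snapshot

# on a newly autoscaled node, verify the kernel version and the lock
ssh root@<new-node-ip> 'uname -r && zypper locks'
```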
Description
Hi,
I'm running multiple clusters based on this solution. Today, suddenly, all the Longhorn RWX mounts stopped working in all of my clusters.
Previously I used Longhorn 1.5.1; I have since rolled back to 1.4.3, but the problem is the same.
This is all I found in the logs:
I'm using a self-installed Longhorn, so it's disabled in the kube.tf, but this is the values.yaml I'm using. This also worked yesterday, so I'd say it's not related to the issue.
Do you have any ideas or advice on how to debug this further?
Kube.tf file
Screenshots
No response
Platform
Linux