Closed sharkymcdongles closed 6 months ago
@sharkymcdongles No idea what could be happening, but I would suggest using our default cilium config instead. So remove cilium_values and try again.
This is a recurring issue I noticed over the last couple of weeks, still investigating. It's most likely something related to all of our custom networking + microos + hetzner. For the time being, disable autoupgrades.
And getting cilium to work well on Hetzner is super tricky, hence my above suggestion.
I took what is documented here:
https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/blob/master/README.md#examples
Sadly, given the Hetzner IP blacklist bs, using an egress gateway is the only way to ensure the cluster works with autoscaling, because many images are stored in ghcr and Hetzner IPs are randomly blocked there. It's also needed for things like SMTP, since many SMTP providers also block Hetzner IPs.
I will do some digging and see if I can find out why this happens. I have now disabled autoupgrades for both the nodes and k3s, yet some nodes still showed the same behavior.
So I did some digging, and it looks like it still tried to do an upgrade, leading to the NotReady nodes situation again. When I turn autoupgrades off, is the change not applied to already provisioned nodes? Do I need to remove kured? EDIT: okay, I figured out that I do need to edit the nodes myself by running: systemctl --now disable transactional-update.timer
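For anyone with more than a handful of already provisioned nodes, here is a minimal sketch of applying that same systemctl command to each node over SSH. It assumes the nodes are reachable as root via plain ssh. The host names in the usage note are hypothetical, and the helper names are mine, not part of kube-hetzner. It defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List, Sequence

# Unit name taken from the command quoted in this thread.
TIMER_UNIT = "transactional-update.timer"


def build_disable_command(host: str, user: str = "root") -> List[str]:
    """Return the ssh argv that stops and disables the timer on one node.

    Assumption: the node accepts ssh logins for `user` (root by default).
    """
    return ["ssh", f"{user}@{host}", "systemctl", "--now", "disable", TIMER_UNIT]


def disable_on_nodes(hosts: Sequence[str], dry_run: bool = True) -> List[List[str]]:
    """Build the command for every host; run it only when dry_run is False."""
    commands = [build_disable_command(host) for host in hosts]
    for cmd in commands:
        if dry_run:
            # Only show what would be executed.
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if ssh or systemctl fails.
            subprocess.run(cmd, check=True)
    return commands
```

Calling disable_on_nodes(["node-a", "node-b"]) prints the two ssh commands without touching anything. Pass dry_run=False once the printed commands look right.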
Checking the logs, I see the upgrade run, then the CPU gets locked and NetworkManager gets stuck, fully killing networking for the node since it never recovers. I then see all these CPU stuck errors:
Mar 28 04:38:08 infra-large-btl kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [NetworkManager:2151]
Mar 28 04:38:08 infra-large-btl kernel: Modules linked in: algif_hash af_alg ext4 mbcache jbd2 udp_diag inet_diag ip_set xt_CT cls_bpf sch_ingre>
Mar 28 04:38:08 infra-large-btl kernel: xhci_pci xhci_pci_renesas libata aesni_intel xhci_hcd virtio_scsi crypto_simd sd_mod cryptd t10_pi sg u>
Mar 28 04:38:08 infra-large-btl kernel: CPU: 3 PID: 2151 Comm: NetworkManager Not tainted 6.7.2-1-default #1 openSUSE Tumbleweed e152b88f51363d1>
Mar 28 04:38:08 infra-large-btl kernel: Hardware name: Hetzner vServer/Standard PC (Q35 + ICH9, 2009), BIOS 20171111 11/11/2017
Mar 28 04:38:08 infra-large-btl kernel: RIP: 0010:virtnet_send_command+0x106/0x170 [virtio_net]
Mar 28 04:38:08 infra-large-btl kernel: Code: 74 24 48 e8 fc 6b b8 c6 85 c0 78 60 48 8b 7b 08 e8 0f 4c b8 c6 84 c0 75 11 eb 22 48 8b 7b 08 e8 20>
Mar 28 04:38:08 infra-large-btl kernel: RSP: 0018:ffffbf9c40853a08 EFLAGS: 00000246
Mar 28 04:38:08 infra-large-btl kernel: RAX: 0000000000000000 RBX: ffff999ec1f229c0 RCX: 0000000000000001
I am attaching my full log file from when this happened, in case someone here can shed some light on it.
I wonder if maybe the networking can't handle autoupgrades or updates given my settings? No idea tbh, but it'd be nice if this were solved or someone knew. I will continue researching on my end, but I think more heads are better than 1.
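For anyone triaging the same symptom, the soft lockup lines quoted above can be pulled out of a saved kernel log with a short script. This is only a sketch: the regular expression is written against the exact line format shown in this thread, and the names SoftLockup and find_soft_lockups are made up for illustration.

```python
import re
from typing import Iterable, List, NamedTuple


class SoftLockup(NamedTuple):
    timestamp: str
    host: str
    cpu: int
    seconds: int
    comm: str
    pid: int


# Matches lines such as:
# Mar 28 04:38:08 infra-large-btl kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [NetworkManager:2151]
_PATTERN = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines: Iterable[str]) -> List[SoftLockup]:
    """Return one record per soft lockup line; all other lines are ignored."""
    hits = []
    for line in lines:
        match = _PATTERN.match(line)
        if match:
            hits.append(
                SoftLockup(
                    timestamp=match["ts"],
                    host=match["host"],
                    cpu=int(match["cpu"]),
                    seconds=int(match["secs"]),
                    comm=match["comm"],
                    pid=int(match["pid"]),
                )
            )
    return hits
```

Feeding it the lines of a log file, for example find_soft_lockups(open(path)) with path pointing at an exported journal, gives a quick list of which process and CPU were stuck and for how long on each node.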
It's possible. So try switching to the default networking settings: remove cilium_values and see if it works better.
@sharkymcdongles Any updates, did the suggestion work?
@mysticaltech I haven't moved back to default cilium settings yet because I am working on a new way of handling images not pulling without the egress gateway. I am currently evaluating using a squid proxy as a replacement. Since disabling autoupgrades though I have had no further issues.
Ok, great! The proxy solution sounds awesome. Don't hesitate to share in due time if you see fit.
We are narrowing down the automated upgrade issues in other threads, so will close this one for now.
Description
I notice many nodes become NotReady and don't come back without manually rebooting them from the CLI or cloud console.
Kube.tf file
Screenshots
No response
Platform
Linux