siderolabs / omni

SaaS-simple deployment of Kubernetes - on your own hardware.

[bug] Unable to gather logs via talosctl when node has problems with networking #307

Closed samip5 closed 1 month ago

samip5 commented 1 month ago

Current Behavior

I'm experiencing a networking problem: when I install my CNI, which uses eBPF, I seem to get a kernel error of some sort in the logs. The error is NOT pushed to Omni, because the node no longer seems able to connect to it, and I'm left with no way to connect to the node and gather logs.

Expected Behavior

I expected there to be a way for me to gather logs even when machines are unable to connect to Omni.

Steps To Reproduce

  1. Deploy 1.7.4 Talos with Kubernetes 1.30.1 on Raspberry Pi 4s
  2. Install Cilium CNI with eBPF enabled and IPv6 (Omni is reachable only via IPv6)
  3. Observe that the node will most likely lose networking, even though its IPv4 and IPv6 addresses still respond to ping.

Anything else?

Related to https://github.com/cilium/cilium/issues/32812

smira commented 1 month ago

We need kernel logs to understand what is going on, which node type it was, etc.

Omni requires SideroLink connection to controlplanes, not workers.

samip5 commented 1 month ago

> Omni requires SideroLink connection to controlplanes, not workers.

Could you please clarify? What if I'm running a single-node cluster, and that node can't connect to Omni because eBPF is somehow breaking networking? Should I be able to gather kernel logs directly through the Talos API, without going through Omni, even when SideroLink is in use?

smira commented 1 month ago

If you run a single node cluster, it's a controlplane and a worker, so SideroLink connection is required for Omni.

There's no way to bypass Omni for Talos API access atm, see #191.

samip5 commented 1 month ago

Okay, so I was able to more or less figure out the problem: the Raspberry Pi 4 loses IPv6 (and thus SideroLink as well) after the following kernel error:

192.168.2.223: kern:     err: [2024-05-31T14:23:53.132888381Z]: ================================================================================
192.168.2.223: kern:     err: [2024-05-31T14:23:53.142751381Z]: UBSAN: array-index-out-of-bounds in kernel/bpf/lpm_trie.c:194:14
192.168.2.223: kern:     err: [2024-05-31T14:23:53.151197381Z]: index 8 is out of range for type '__u8 [*]'
192.168.2.223: kern: warning: [2024-05-31T14:23:53.157042381Z]: CPU: 3 PID: 5785 Comm: cilium-agent Not tainted 6.6.32-talos #1
192.168.2.223: kern: warning: [2024-05-31T14:23:53.165203381Z]: Hardware name: Unknown Unknown Product/Unknown Product, BIOS 2024.01 01/01/2024
192.168.2.223: kern: warning: [2024-05-31T14:23:53.174996381Z]: Call trace:
192.168.2.223: kern: warning: [2024-05-31T14:23:53.178181381Z]:  dump_backtrace+0x9c/0x100
192.168.2.223: kern: warning: [2024-05-31T14:23:53.183065381Z]:  show_stack+0x34/0x50
192.168.2.223: kern: warning: [2024-05-31T14:23:53.187271381Z]:  dump_stack_lvl+0x78/0xd0
192.168.2.223: kern: warning: [2024-05-31T14:23:53.191805381Z]:  dump_stack+0x1c/0x30
192.168.2.223: kern: warning: [2024-05-31T14:23:53.195914381Z]:  __ubsan_handle_out_of_bounds+0xc0/0x100
192.168.2.223: kern: warning: [2024-05-31T14:23:53.201606381Z]:  longest_prefix_match.isra.0+0x200/0x258
192.168.2.223: kern: warning: [2024-05-31T14:23:53.207390381Z]:  trie_update_elem+0x160/0x3a0
192.168.2.223: kern: warning: [2024-05-31T14:23:53.212257381Z]:  bpf_map_update_value+0xcc/0x2c8
192.168.2.223: kern: warning: [2024-05-31T14:23:53.217167381Z]:  map_update_elem+0x19c/0x328
192.168.2.223: kern: warning: [2024-05-31T14:23:53.221572381Z]:  __sys_bpf+0x834/0x1bf0
192.168.2.223: kern: warning: [2024-05-31T14:23:53.225522381Z]:  __arm64_sys_bpf+0x34/0x58
192.168.2.223: kern: warning: [2024-05-31T14:23:53.229709381Z]:  invoke_syscall+0x90/0x128
192.168.2.223: kern: warning: [2024-05-31T14:23:53.233875381Z]:  el0_svc_common.constprop.0+0xec/0x118
192.168.2.223: kern: warning: [2024-05-31T14:23:53.239080381Z]:  do_el0_svc+0x34/0x50
192.168.2.223: kern: warning: [2024-05-31T14:23:53.242792381Z]:  el0_svc+0x4c/0x178
192.168.2.223: kern: warning: [2024-05-31T14:23:53.246316381Z]:  el0t_64_sync_handler+0x128/0x138
192.168.2.223: kern: warning: [2024-05-31T14:23:53.251050381Z]:  el0t_64_sync+0x1bc/0x1c0
192.168.2.223: kern:     err: [2024-05-31T14:23:53.255074381Z]: ================================================================================
192.168.2.223: kern:     err: [2024-05-31T14:23:53.264247381Z]: ================================================================================
192.168.2.223: kern:     err: [2024-05-31T14:23:53.273410381Z]: UBSAN: array-index-out-of-bounds in kernel/bpf/lpm_trie.c:194:14
192.168.2.223: kern:     err: [2024-05-31T14:23:53.281232381Z]: index 8 is out of range for type '__u8 [*]'
192.168.2.223: kern: warning: [2024-05-31T14:23:53.286866381Z]: CPU: 3 PID: 5785 Comm: cilium-agent Not tainted 6.6.32-talos #1
192.168.2.223: kern: warning: [2024-05-31T14:23:53.294678381Z]: Hardware name: Unknown Unknown Product/Unknown Product, BIOS 2024.01 01/01/2024
192.168.2.223: kern: warning: [2024-05-31T14:23:53.303935381Z]: Call trace:
192.168.2.223: kern: warning: [2024-05-31T14:23:53.306827381Z]:  dump_backtrace+0x9c/0x100
192.168.2.223: kern: warning: [2024-05-31T14:23:53.311028381Z]:  show_stack+0x34/0x50
192.168.2.223: kern: warning: [2024-05-31T14:23:53.314788381Z]:  dump_stack_lvl+0x78/0xd0
192.168.2.223: kern: warning: [2024-05-31T14:23:53.318896381Z]:  dump_stack+0x1c/0x30
192.168.2.223: kern: warning: [2024-05-31T14:23:53.322651381Z]:  __ubsan_handle_out_of_bounds+0xc0/0x100
192.168.2.223: kern: warning: [2024-05-31T14:23:53.328070381Z]:  longest_prefix_match.isra.0+0x218/0x258
192.168.2.223: kern: warning: [2024-05-31T14:23:53.333481381Z]:  trie_update_elem+0x160/0x3a0
192.168.2.223: kern: warning: [2024-05-31T14:23:53.337939381Z]:  bpf_map_update_value+0xcc/0x2c8
192.168.2.223: kern: warning: [2024-05-31T14:23:53.342667381Z]:  map_update_elem+0x19c/0x328
192.168.2.223: kern: warning: [2024-05-31T14:23:53.347052381Z]:  __sys_bpf+0x834/0x1bf0
192.168.2.223: kern: warning: [2024-05-31T14:23:53.350996381Z]:  __arm64_sys_bpf+0x34/0x58
192.168.2.223: kern: warning: [2024-05-31T14:23:53.355186381Z]:  invoke_syscall+0x90/0x128
192.168.2.223: kern: warning: [2024-05-31T14:23:53.359361381Z]:  el0_svc_common.constprop.0+0xec/0x118
192.168.2.223: kern: warning: [2024-05-31T14:23:53.364574381Z]:  do_el0_svc+0x34/0x50
192.168.2.223: kern: warning: [2024-05-31T14:23:53.368295381Z]:  el0_svc+0x4c/0x178
192.168.2.223: kern: warning: [2024-05-31T14:23:53.371829381Z]:  el0t_64_sync_handler+0x128/0x138
192.168.2.223: kern: warning: [2024-05-31T14:23:53.376571381Z]:  el0t_64_sync+0x1bc/0x1c0
192.168.2.223: kern:     err: [2024-05-31T14:23:53.380605381Z]: ================================================================================
192.168.2.223: user: warning: [2024-05-31T14:23:53.940437381Z]: [talos] machine is running and ready {"component": "controller-runtime", "controller": "runtime.MachineStatusController"}
192.168.2.223: user: warning: [2024-05-31T14:24:13.941928381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [fdae:41e4:649b:9303::1]:8091: i/o timeout\""}
192.168.2.223: user: warning: [2024-05-31T14:24:36.184696381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [fdae:41e4:649b:9303::1]:8091: i/o timeout\""}
192.168.2.223: user: warning: [2024-05-31T14:25:00.080260381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [fdae:41e4:649b:9303::1]:8091: i/o timeout\""}
192.168.2.223: user: warning: [2024-05-31T14:25:10.231736381Z]: [talos] error watching discovery service state {"component": "controller-runtime", "controller": "cluster.DiscoveryServiceController", "error": "rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"}
192.168.2.223: user: warning: [2024-05-31T14:25:23.955645381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [fdae:41e4:649b:9303::1]:8091: i/o timeout\""}
192.168.2.223: user: warning: [2024-05-31T14:25:50.988570381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [fdae:41e4:649b:9303::1]:8091: i/o timeout\""}
192.168.2.223: user: warning: [2024-05-31T14:26:20.607494381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [fdae:41e4:649b:9303::1]:8091: i/o timeout\""}
192.168.2.223: user: warning: [2024-05-31T14:26:58.718111381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:27:00.588193381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:27:00.621848381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [fdae:41e4:649b:9303::1]:8091: i/o timeout\""}
192.168.2.223: user: warning: [2024-05-31T14:27:03.723223381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:27:07.251348381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:27:10.962202381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:27:17.297776381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:27:34.179597381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:27:49.019389381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:27:51.737338381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [fdae:41e4:649b:9303::1]:8091: i/o timeout\""}
192.168.2.223: user: warning: [2024-05-31T14:28:05.198873381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:28:31.234945381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:28:43.204110381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [fdae:41e4:649b:9303::1]:8091: i/o timeout\""}
192.168.2.223: user: warning: [2024-05-31T14:29:19.870160381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:30:09.755798381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:30:28.093438381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [fdae:41e4:649b:9303::1]:8091: i/o timeout\""}
192.168.2.223: user: warning: [2024-05-31T14:31:10.691467381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:31:25.357663381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [fdae:41e4:649b:9303::1]:8091: i/o timeout\""}
192.168.2.223: user: warning: [2024-05-31T14:31:42.428429381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:32:37.466072381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [fdae:41e4:649b:9303::1]:8091: i/o timeout\""}
192.168.2.223: user: warning: [2024-05-31T14:32:44.430319381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:33:41.451463381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:34:10.896618381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [fdae:41e4:649b:9303::1]:8091: i/o timeout\""}
192.168.2.223: user: warning: [2024-05-31T14:34:48.738643381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:35:42.387931381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [fdae:41e4:649b:9303::1]:8091: i/o timeout\""}
192.168.2.223: user: warning: [2024-05-31T14:35:43.519770381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
192.168.2.223: user: warning: [2024-05-31T14:36:44.718638381Z]: [talos] controller failed {"component": "controller-runtime", "controller": "siderolink.ManagerController", "error": "error provisioning: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:4f9:c012:559::1]:8090: connect: network is unreachable\""}
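
For anyone triaging similar output, here is a minimal sketch, assuming log lines shaped like the excerpt above (`<node>: <facility>: <level>: [<timestamp>]: <message>`). It reports the timestamp of the first UBSAN report and counts the "controller failed" messages per controller, which makes the ordering (kernel error first, connection failures afterwards) easier to see. The function name `summarize` and both regular expressions are illustrative assumptions, not part of Talos or Omni.

```python
import re

# Assumed line shape, taken from the excerpt above:
# "<node>: <facility>: <level>: [<timestamp>]: <message>"
LINE_RE = re.compile(
    r"^(?P<node>\S+): (?P<facility>\w+):\s+(?P<level>\w+): "
    r"\[(?P<ts>[^\]]+)\]: (?P<msg>.*)$"
)
# Pulls the controller name out of the JSON-like tail of a "[talos]" message.
CONTROLLER_RE = re.compile(r'"controller": "(?P<name>[^"]+)"')


def summarize(lines):
    """Return (first_ubsan_ts, failures).

    first_ubsan_ts is the timestamp of the first "UBSAN:" line, or None.
    failures maps controller name -> (count, timestamp of first failure).
    Lines that do not match the assumed shape are skipped.
    """
    first_ubsan_ts = None
    failures = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts, msg = m["ts"], m["msg"]
        if first_ubsan_ts is None and msg.startswith("UBSAN:"):
            first_ubsan_ts = ts
        if "controller failed" in msg:
            c = CONTROLLER_RE.search(msg)
            if c:
                count, first_ts = failures.get(c["name"], (0, ts))
                failures[c["name"]] = (count + 1, first_ts)
    return first_ubsan_ts, failures


# Usage, assuming the log was saved to a hypothetical file "dmesg.txt":
#   with open("dmesg.txt") as f:
#       first_ubsan_ts, failures = summarize(f)
```

Run against the excerpt above, this should show the UBSAN report preceding the first `v1alpha1.EventsSinkController` and `siderolink.ManagerController` failures.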

smira commented 1 month ago

See https://github.com/siderolabs/talos/issues/8780, this is a Linux kernel/eBPF issue.

samip5 commented 1 month ago

> See https://github.com/siderolabs/talos/issues/8780, this is a Linux kernel/eBPF issue.

But it's quite confusing, as the same IPv6 breakage doesn't seem to happen on amd64, even though it prints similar messages. :/

samip5 commented 1 month ago

Closing this as it's not relevant to Omni per se.