siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev

Intermittent Loss of Private IP Address on Hetzner Cloud Nodes #8927

Open ksanfs opened 2 months ago

ksanfs commented 2 months ago

Bug Report

Description

We have observed an issue with Hetzner Cloud nodes where the private IP address assigned to the eth1 interface is sometimes lost after Hetzner performs maintenance on the nodes. This happens intermittently and degrades stability and connectivity on the affected nodes.

Steps to Reproduce:

  1. Deploy a Talos cluster on Hetzner Cloud.
  2. Wait for Hetzner to perform maintenance on one or more nodes in the cluster.
  3. After the maintenance is completed, check the status of the eth1 interface on the affected nodes (see the example commands below).
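
For example, one way to run the check from step 3 is something like this (the node IP is a placeholder):

talosctl -n <node-ip> get link eth1 -o yaml        # operational/link state of eth1
talosctl -n <node-ip> get addresses | grep eth1    # addresses still assigned to eth1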

Expected Behavior: The eth1 interface should retain its assigned private IP address after Hetzner maintenance, and the link should be in the "up" state.

Actual Behavior: After Hetzner maintenance, the eth1 interface loses its assigned private IP address. The link is configured "up" in the linkspec, but its operational state is "down" in the output of talosctl get link. This results in a loss of connectivity for the affected nodes.

talosctl get linkspec -o yaml:

---
node: [redacted]
metadata:
    namespace: network
    type: LinkSpecs.net.talos.dev
    id: eth1
    version: 4
    owner: network.LinkMergeController
    phase: running
    created: 2024-06-20T07:20:59Z
    updated: 2024-06-20T10:20:10Z
    finalizers:
        - network.LinkSpecController
spec:
    name: eth1
    logical: false
    up: true
    mtu: 0
    kind: ""
    type: netrom
    layer: default

talosctl get link -o yaml:

node: [redacted]
metadata:
    namespace: network
    type: LinkStatuses.net.talos.dev
    id: eth1
    version: 5
    owner: network.LinkStatusController
    phase: running
    created: 2024-06-20T07:20:59Z
    updated: 2024-06-20T10:20:10Z
spec:
    index: 9
    type: ether
    linkIndex: 0
    flags: UP,BROADCAST,MULTICAST
    hardwareAddr: [redacted]
    permanentAddr: [redacted]
    broadcastAddr: ff:ff:ff:ff:ff:ff
    mtu: 1450
    queueDisc: pfifo_fast
    operationalState: down
    kind: ""
    slaveKind: ""
    busPath: "0000:07:00.0"
    driver: virtio_net
    driverVersion: 1.0.0
    productID: "0x1041"
    vendorID: "0x1af4"
    product: Virtio 1.0 network device
    vendor: Red Hat, Inc.
    linkState: false
    speedMbit: 4294967295
    port: Other
    duplex: Unknown

talosctl dmesg:

[redacted]: user: warning: [2024-06-20T07:21:03.977611887Z]: [talos] service[kubelet](Running): Health check successful
[redacted]: user: warning: [2024-06-20T07:21:06.459898887Z]: [talos] service[apid](Running): Health check successful
[redacted]: user: warning: [2024-06-20T07:21:06.967624887Z]: [talos] service[etcd](Running): Health check successful
[redacted]: user: warning: [2024-06-20T07:21:06.968264887Z]: [talos] task startAllServices (1/1): done, 6.129025667s
[redacted]: user: warning: [2024-06-20T07:21:06.971274887Z]: [talos] phase startEverything (16/16): done, 6.132637445s
[redacted]: user: warning: [2024-06-20T07:21:06.973528887Z]: [talos] boot sequence: done: 6.53367108s
[redacted]: user: warning: [2024-06-20T07:21:06.978070887Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-apiserver"}
[redacted]: user: warning: [2024-06-20T07:21:06.980804887Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-controller-manager"}
[redacted]: user: warning: [2024-06-20T07:21:06.982362887Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-scheduler"}
[redacted]: user: warning: [2024-06-20T07:21:07.485296887Z]: [talos] machine is running and ready {"component": "controller-runtime", "controller": "runtime.MachineStatusController"}
[redacted]: user: warning: [2024-06-20T07:21:07.486821887Z]: [talos] removing fallback entry {"component": "controller-runtime", "controller": "runtime.DropUpgradeFallbackController"}
[redacted]: user: warning: [2024-06-20T07:21:07.500528887Z]: [talos] META: saved 0 keys
[redacted]: kern:    info: [2024-06-20T07:21:13.929806887Z]: cilium_geneve: Caught tx_queue_len zero misconfig
[redacted]: user: warning: [2024-06-20T07:21:17.443345887Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\ttimeout"}
[redacted]: user: warning: [2024-06-20T07:24:25.933652887Z]: [talos] error watching discovery service state {"component": "controller-runtime", "controller": "cluster.DiscoveryServiceController", "error": "rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"}
[redacted]: user: warning: [2024-06-20T10:20:10.355566887Z]: [talos] removed address [redacted]/32 from "eth1" {"component": "controller-runtime", "controller": "network.AddressSpecController"}
[redacted]: user: warning: [2024-06-20T10:20:10.363521887Z]: [talos] no suitable node IP found, please make sure .machine.kubelet.nodeIP filters and pod/service subnets are set up correctly {"component": "controller-runtime", "controller": "k8s.NodeIPController"}
[redacted]: user: warning: [2024-06-20T10:20:10.366614887Z]: [talos] controller failed {"component": "controller-runtime", "controller": "network.RouteSpecController", "error": "2 errors occurred:\n\t* error adding route: netlink receive: invalid argument, message {Family:2 DstLength:32 SrcLength:0 Tos:0 Table:0 Protocol:3 Scope:253 Type:1 Flags:0 Attributes:{Dst:[redacted] Src:[redacted] Gateway:<nil> OutIface:9 Priority:1024 Table:254 Mark:0 Pref:<nil> Expires:<nil> Metrics:<nil> Multipath:[]}}\n\t* error adding route: netlink receive: network is unreachable, message {Family:2 DstLength:24 SrcLength:0 Tos:0 Table:0 Protocol:3 Scope:0 Type:1 Flags:0 Attributes:{Dst:[redacted] Src:[redacted] Gateway:[redacted] OutIface:9 Priority:1024 Table:254 Mark:0 Pref:<nil> Expires:<nil> Metrics:<nil> Multipath:[]}}\n\n"}
[redacted]: user: warning: [2024-06-20T10:20:41.801463887Z]: [talos] service[etcd](Running): Health check failed: context deadline exceeded

Workaround:

Rebooting the affected nodes seems to resolve the issue temporarily, as the private IP address is reassigned to the eth1 interface after the reboot. Manually detaching the private IP in the Hetzner Cloud console and then re-attaching it also resolves the issue temporarily.
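
For reference, a rough sketch of both temporary fixes as commands (server, network, and IP names are placeholders; the hcloud CLI steps are assumed to be equivalent to detaching/attaching in the console):

# option 1: reboot the node so DHCP reacquires the private IP on eth1
talosctl -n <node-ip> reboot

# option 2: detach and re-attach the private network via the hcloud CLI
hcloud server detach-from-network <server-name> --network <network-name>
hcloud server attach-to-network <server-name> --network <network-name> --ip <private-ip>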

Additional Information:

Environment

NOBLES5E commented 2 months ago

We previously encountered this issue about a year ago, which resulted in a significant outage. As a consequence, we decided not to use Talos. Unfortunately, when we reached out to Hetzner support, they informed us that since Talos is not one of their officially supported operating systems, they were unable to assist us with the problem.

smira commented 1 month ago

It sounds weird, but if the linkState is down, as it is in the output you provided, that is equivalent to the cable being unplugged on a physical network card, so Talos stops the DHCP client in this case (as the address is not usable).

The workaround might be to assign the IP statically in the machine config, but I believe that even then, if the linkState is down, the link still won't be used to send actual packets, so whether the IP is assigned or not doesn't really matter.
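
For reference, a minimal sketch of such a static assignment (the addresses and route are placeholders for the Hetzner private network; one way to apply it is as a machine config patch):

# placeholder addresses; adjust to the private network actually in use
cat > eth1-static.yaml <<'EOF'
machine:
  network:
    interfaces:
      - interface: eth1
        dhcp: false
        addresses:
          - 10.0.0.3/32
        routes:
          - network: 10.0.0.0/16
            gateway: 10.0.0.1
EOF
# apply the patch to the affected node
talosctl -n <node-ip> patch machineconfig --patch @eth1-static.yaml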

hendrikheil commented 1 month ago

We're experiencing this as well. I can reproduce the exact same behavior, workaround, and output that @ksanfs provided earlier. Is there anything we can provide to help debug?