siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

kubelet fails to start with IPv6 #9072

Closed rothgar closed 1 month ago

rothgar commented 1 month ago

Bug Report

When bootstrapping a cluster in AWS with IPv6, the kubelet won't start.

Description

The kubelet won't start after etcd is bootstrapped.

talosctl services
NODE           SERVICE      STATE         HEALTH   LAST CHANGE   LAST EVENT
52.39.88.116   apid         Running       OK       9m9s ago      Health check successful
52.39.88.116   containerd   Running       OK       9m25s ago     Health check successful
52.39.88.116   cri          Running       OK       9m16s ago     Health check successful
52.39.88.116   dashboard    Running       ?        9m20s ago     Process Process(["/sbin/dashboard"]) started with PID 1718
52.39.88.116   etcd         Running       OK       8m58s ago     Health check successful
52.39.88.116   kubelet      Initialized   ?         ago          <none>
52.39.88.116   machined     Running       OK       9m31s ago     Health check successful
52.39.88.116   syslogd      Running       OK       9m30s ago     Health check successful
52.39.88.116   trustd       Running       OK       9m11s ago     Health check successful
52.39.88.116   udevd        Running       OK       9m29s ago     Health check successful
talosctl logs kubelet
ERROR: rpc error: code = Unknown desc = log "kubelet" was not registered

Logs

support.zip

Environment

rothgar commented 1 month ago

I created new nodes and, in the launch template, told it not to assign an IPv6 address:

"Ipv6AddressCount":0

In the AWS console, the instance doesn't show an IPv6 address but does show two IPv4 addresses (public and private). [image]

When I look at the same host's dashboard, I only see an IPv6 address. [image]

Here's the output from get links:

talosctl get links
WARNING: 34.212.174.206: server version 1.7.0 is older than client version 1.7.5
NODE             NAMESPACE   TYPE         ID         VERSION   TYPE       KIND        HW ADDR                                           OPER STATE   LINK STATE
34.212.174.206   network     LinkStatus   bond0      1         ether      bond        6e:d7:a9:b1:1a:9d                                 down         false
34.212.174.206   network     LinkStatus   dummy0     1         ether      dummy       6e:28:9a:6a:a8:61                                 down         false
34.212.174.206   network     LinkStatus   eth0       3         ether                  0a:91:a7:10:9a:25                                 up           true
34.212.174.206   network     LinkStatus   ip6tnl0    1         tunnel6    ip6tnl      00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00   down         false
34.212.174.206   network     LinkStatus   kubespan   10        nohdr      wireguard                                                     unknown      false
34.212.174.206   network     LinkStatus   lo         2         loopback               00:00:00:00:00:00                                 unknown      true
34.212.174.206   network     LinkStatus   sit0       1         sit        sit         00:00:00:00                                       down         false
34.212.174.206   network     LinkStatus   teql0      1         void                                                                     down         false
34.212.174.206   network     LinkStatus   tunl0      1         ipip       ipip        00:00:00:00                                       down         false
rothgar commented 1 month ago

I tried one more time with some additional network configuration and also adding extra args to the kubelet for node IP

    kubelet:
        image: ghcr.io/siderolabs/kubelet:v1.30.1
        defaultRuntimeSeccompProfileEnabled: true
        disableManifestsDirectory: true
        extraArgs:
            nodeIp:
                validSubnets:
                    - "2600:1f14:1ca5:bd00::/64"
                    - "2600:1f14:1ca5:bd01::/64"
                    - "2600:1f14:1ca5:bd02::/64"
                    - "10.100.0.0/22"
                    - "10.100.4.0/22"
                    - "10.100.8.0/22"

I tried with and without quotes around the subnets, and when I apply the config I get an unmarshal error:

error applying new configuration: 1 error occurred:
        * 34.221.252.154: rpc error: code = InvalidArgument desc = error decoding  to *v1alpha1.Config: yaml: unmarshal errors:
  line 17: cannot unmarshal !!map into string
frezbo commented 1 month ago

On AWS you'd need machine.kubelet.registerWithFQDN set to true.

rothgar commented 1 month ago

That's not in our current AWS guide, why does it work there?

frezbo commented 1 month ago

> That's not in our current AWS guide, why does it work there?

Oh wait, in this case the kubelet doesn't even come up yet; that's only needed if you plan to use a CCM.

smira commented 1 month ago

> I tried one more time with some additional network configuration and also adding extra args to the kubelet for node IP
>
>     kubelet:
>         image: ghcr.io/siderolabs/kubelet:v1.30.1
>         defaultRuntimeSeccompProfileEnabled: true
>         disableManifestsDirectory: true
>         extraArgs:
>             nodeIp:
>                 validSubnets:
>                     - "2600:1f14:1ca5:bd00::/64"
>                     - "2600:1f14:1ca5:bd01::/64"
>                     - "2600:1f14:1ca5:bd02::/64"
>                     - "10.100.0.0/22"
>                     - "10.100.4.0/22"
>                     - "10.100.8.0/22"

This is definitely not how it should look in the machine config; I'm not even sure what you're trying to express.

There's a kubelet --node-ip extra arg, which accepts one or two IPs.

There's a Talos way to select IPs, but it is not under extraArgs: https://www.talos.dev/v1.7/reference/configuration/v1alpha1/config/#Config.machine.kubelet.nodeIP
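
For illustration, the subnets from the quoted snippet could be moved to machine.kubelet.nodeIP.validSubnets, the field the reference above describes. This is a sketch rather than a verified configuration: the subnet values are copied from the earlier comment, and the commented-out node-ip entry uses a placeholder address. Values under extraArgs are plain strings, which would explain the "cannot unmarshal !!map into string" error above.

```yaml
machine:
    kubelet:
        image: ghcr.io/siderolabs/kubelet:v1.30.1
        defaultRuntimeSeccompProfileEnabled: true
        disableManifestsDirectory: true
        # nodeIP sits directly under kubelet, not under extraArgs.
        nodeIP:
            validSubnets:
                - "2600:1f14:1ca5:bd00::/64"
                - "2600:1f14:1ca5:bd01::/64"
                - "2600:1f14:1ca5:bd02::/64"
                - "10.100.0.0/22"
                - "10.100.4.0/22"
                - "10.100.8.0/22"
        # Alternative (pick one approach, not both): pass the kubelet flag
        # as a string value. <node-ip> is a placeholder, not a real address.
        # extraArgs:
        #     node-ip: "<node-ip>"
```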

smira commented 1 month ago

Getting back to the original question about the kubelet which doesn't start.

The root cause is that Talos can't establish a node IP. Below is a detailed analysis based on the support.zip attached to the issue.

First, Talos establishes a configuration to look for node IPs:

metadata:
    namespace: k8s
    type: NodeIPConfigs.kubernetes.talos.dev
    id: kubelet
    version: 1
    owner: k8s.NodeIPConfigController
    phase: running
    created: 2024-07-26T18:18:26Z
    updated: 2024-07-26T18:18:26Z
spec:
    validSubnets:
        - 0.0.0.0/0
    excludeSubnets:
        - 10.244.0.0/16
        - 10.96.0.0/12

You can see here that Talos selects only IPv4 addresses (validSubnets is 0.0.0.0/0, as the pod and service subnets in the machine configuration are IPv4 only) and excludes two ranges, which are the default pod and service subnets (as they can't be node IPs).

Second, we can look into the addresses available on the machine (overall):

metadata:
    namespace: network
    type: NodeAddresses.net.talos.dev
    id: current
    version: 4
    owner: network.NodeAddressController
    phase: running
    created: 2024-07-26T18:18:10Z
    updated: 2024-07-26T18:18:27Z
spec:
    addresses:
        - 10.100.5.198/22
        - 52.39.88.116/32
        - 2600:1f14:3506:9601:b57b:69de:cab:ecc9/128
        - fdb5:d7ab:7c05:8102:7e:10ff:fed5:301/64

The current list contains all addresses, including two (52.x and 2600:) which are not actually assigned to the node (this is AWS-specific).

The addresses that can actually be used are in the resource with id routed:

metadata:
    namespace: network
    type: NodeAddresses.net.talos.dev
    id: routed
    version: 3
    owner: network.NodeAddressController
    phase: running
    created: 2024-07-26T18:18:10Z
    updated: 2024-07-26T18:18:27Z
spec:
    addresses:
        - 10.100.5.198/22
        - fdb5:d7ab:7c05:8102:7e:10ff:fed5:301/64

One address is IPv6 (so it doesn't match the IPv4-only node IP filter), and the other, 10.100.5.198, falls into the default service subnet (10.96.0.0/12, which spans 10.96.0.0 through 10.111.255.255). So no addresses are available for the node IP, and Talos waits for reconfiguration or for another address to be assigned which can be used as a node address.

The IPv4 problem is documented here; the IPv6 problem is that the machine configuration doesn't include pod/service subnets for IPv6.
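
Following that explanation, one way to address both sides might be to list IPv6 ranges next to the IPv4 ones under cluster.network, and to pick an IPv4 service subnet that does not contain the node's address. The fragment below is a sketch under stated assumptions, not a tested fix: the fd00: ranges are made-up ULA prefixes, and 172.20.0.0/16 is a hypothetical replacement chosen only because it does not contain 10.100.5.198.

```yaml
cluster:
    network:
        podSubnets:
            - 10.244.0.0/16        # default IPv4 pod subnet seen in the analysis above
            - fd00:10:244::/56     # hypothetical IPv6 pod subnet
        serviceSubnets:
            - 172.20.0.0/16        # hypothetical; does not overlap the 10.100.x.x node address
            - fd00:10:96::/112     # hypothetical IPv6 service subnet
```

With IPv6 pod and service subnets present, the node IP filter described above should no longer be limited to IPv4, though that behavior should be confirmed against the NodeIPConfigs resource on a real node.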

rothgar commented 1 month ago

Thank you for the detailed explanation :+1: