siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Kubelet/etcd uses wrong IPv6 Address #9725

Open trevex opened 1 week ago

trevex commented 1 week ago

Bug Report

Description

When Talos runs in an IPv6 single-stack environment and is assigned multiple addresses via DHCPv6 and RA, the kubelet uses the wrong address. This most likely applies to dual-stack setups as well.

In our case, Talos is running in KubeVirt with the Passt network binding plugin. It first gets an address via RA and then a /128 address from DHCPv6. Only the latter has full bi-directional connectivity.

The preferred /128 address has the flag permanent while the RA address has the flag mngmtmpaddr.

The permanent address should be preferred.

Logs

Relevant excerpts from omnictl support:

AddressStatuses

# cat fd01:cafe::5054:ff:fe1f:c7bd/resources/addressstatuses.net.talos.dev.yaml
metadata:
    namespace: network
    type: AddressStatuses.net.talos.dev
    id: eth0/fd01:cafe::5054:ff:fe1f:c7bd/64
    version: 1
    owner: network.AddressStatusController
    phase: running
    created: 2024-11-14T15:52:53Z
    updated: 2024-11-14T15:52:53Z
spec:
    address: fd01:cafe::5054:ff:fe1f:c7bd/64
    linkIndex: 8
    linkName: eth0
    family: inet6
    scope: global
    flags: mngmtmpaddr
---
metadata:
    namespace: network
    type: AddressStatuses.net.talos.dev
    id: eth0/fd01:cafe::f14c:9fa1:8496:557f/128
    version: 1
    owner: network.AddressStatusController
    phase: running
    created: 2024-11-14T15:52:55Z
    updated: 2024-11-14T15:52:55Z
spec:
    address: fd01:cafe::f14c:9fa1:8496:557f/128
    linkIndex: 8
    linkName: eth0
    family: inet6
    scope: global
    flags: permanent
---
metadata:
    namespace: network
    type: AddressStatuses.net.talos.dev
    id: eth0/fe80::5054:ff:fe1f:c7bd/64
    version: 2
    owner: network.AddressStatusController
    phase: running
    created: 2024-11-14T15:52:50Z
    updated: 2024-11-14T15:52:52Z
spec:
    address: fe80::5054:ff:fe1f:c7bd/64
    linkIndex: 8
    linkName: eth0
    family: inet6
    scope: link
    flags: permanent
---
metadata:
    namespace: network
    type: AddressStatuses.net.talos.dev
    id: lo/127.0.0.1/8
    version: 1
    owner: network.AddressStatusController
    phase: running
    created: 2024-11-14T15:52:48Z
    updated: 2024-11-14T15:52:48Z
spec:
    address: 127.0.0.1/8
    linkIndex: 1
    linkName: lo
    family: inet4
    scope: host
    flags: permanent
---
metadata:
    namespace: network
    type: AddressStatuses.net.talos.dev
    id: lo/169.254.116.108/32
    version: 1
    owner: network.AddressStatusController
    phase: running
    created: 2024-11-14T15:52:55Z
    updated: 2024-11-14T15:52:55Z
spec:
    address: 169.254.116.108/32
    linkIndex: 1
    linkName: lo
    family: inet4
    scope: host
    flags: permanent
---
metadata:
    namespace: network
    type: AddressStatuses.net.talos.dev
    id: lo/::1/128
    version: 1
    owner: network.AddressStatusController
    phase: running
    created: 2024-11-14T15:52:49Z
    updated: 2024-11-14T15:52:49Z
spec:
    address: ::1/128
    linkIndex: 1
    linkName: lo
    family: inet6
    scope: host
    flags: permanent
---
metadata:
    namespace: network
    type: AddressStatuses.net.talos.dev
    id: siderolink/fdae:41e4:649b:9303:2972:b262:4fad:b458/64
    version: 1
    owner: network.AddressStatusController
    phase: running
    created: 2024-11-14T15:52:53Z
    updated: 2024-11-14T15:52:53Z
spec:
    address: fdae:41e4:649b:9303:2972:b262:4fad:b458/64
    linkIndex: 9
    linkName: siderolink
    family: inet6
    scope: global
    flags: permanent

NodeAddresses:

# cat fd01:cafe::5054:ff:fe1f:c7bd/resources/nodeaddresses.net.talos.dev.yaml
metadata:
    namespace: network
    type: NodeAddresses.net.talos.dev
    id: accumulative
    version: 4
    owner: network.NodeAddressController
    phase: running
    created: 2024-11-14T15:52:47Z
    updated: 2024-11-14T15:52:55Z
spec:
    addresses:
        - fd01:cafe::5054:ff:fe1f:c7bd/64
        - fd01:cafe::f14c:9fa1:8496:557f/128
        - fdae:41e4:649b:9303:2972:b262:4fad:b458/64
---
metadata:
    namespace: network
    type: NodeAddresses.net.talos.dev
    id: accumulative-no-k8s
    version: 2
    owner: network.NodeAddressController
    phase: running
    created: 2024-11-14T15:52:55Z
    updated: 2024-11-14T15:52:55Z
spec:
    addresses:
        - fd01:cafe::5054:ff:fe1f:c7bd/64
        - fd01:cafe::f14c:9fa1:8496:557f/128
        - fdae:41e4:649b:9303:2972:b262:4fad:b458/64
---
metadata:
    namespace: network
    type: NodeAddresses.net.talos.dev
    id: accumulative-only-k8s
    version: 1
    owner: network.NodeAddressController
    phase: running
    created: 2024-11-14T15:52:55Z
    updated: 2024-11-14T15:52:55Z
spec:
    addresses: []
---
metadata:
    namespace: network
    type: NodeAddresses.net.talos.dev
    id: current
    version: 4
    owner: network.NodeAddressController
    phase: running
    created: 2024-11-14T15:52:47Z
    updated: 2024-11-14T15:52:55Z
spec:
    addresses:
        - fd01:cafe::5054:ff:fe1f:c7bd/64
        - fd01:cafe::f14c:9fa1:8496:557f/128
        - fdae:41e4:649b:9303:2972:b262:4fad:b458/64
---
metadata:
    namespace: network
    type: NodeAddresses.net.talos.dev
    id: current-no-k8s
    version: 2
    owner: network.NodeAddressController
    phase: running
    created: 2024-11-14T15:52:55Z
    updated: 2024-11-14T15:52:55Z
spec:
    addresses:
        - fd01:cafe::5054:ff:fe1f:c7bd/64
        - fd01:cafe::f14c:9fa1:8496:557f/128
        - fdae:41e4:649b:9303:2972:b262:4fad:b458/64
---
metadata:
    namespace: network
    type: NodeAddresses.net.talos.dev
    id: current-only-k8s
    version: 1
    owner: network.NodeAddressController
    phase: running
    created: 2024-11-14T15:52:55Z
    updated: 2024-11-14T15:52:55Z
spec:
    addresses: []
---
metadata:
    namespace: network
    type: NodeAddresses.net.talos.dev
    id: default
    version: 1
    owner: network.NodeAddressController
    phase: running
    created: 2024-11-14T15:52:53Z
    updated: 2024-11-14T15:52:53Z
spec:
    addresses:
        - fd01:cafe::5054:ff:fe1f:c7bd/64
---
metadata:
    namespace: network
    type: NodeAddresses.net.talos.dev
    id: routed
    version: 3
    owner: network.NodeAddressController
    phase: running
    created: 2024-11-14T15:52:47Z
    updated: 2024-11-14T15:52:55Z
spec:
    addresses:
        - fd01:cafe::5054:ff:fe1f:c7bd/64
        - fd01:cafe::f14c:9fa1:8496:557f/128
---
metadata:
    namespace: network
    type: NodeAddresses.net.talos.dev
    id: routed-no-k8s
    version: 2
    owner: network.NodeAddressController
    phase: running
    created: 2024-11-14T15:52:55Z
    updated: 2024-11-14T15:52:55Z
spec:
    addresses:
        - fd01:cafe::5054:ff:fe1f:c7bd/64
        - fd01:cafe::f14c:9fa1:8496:557f/128
---
metadata:
    namespace: network
    type: NodeAddresses.net.talos.dev
    id: routed-only-k8s
    version: 1
    owner: network.NodeAddressController
    phase: running
    created: 2024-11-14T15:52:55Z
    updated: 2024-11-14T15:52:55Z
spec:
    addresses: []

NodeIPs:

# cat fd01:cafe::5054:ff:fe1f:c7bd/resources/nodeips.kubernetes.talos.dev.yaml
metadata:
    namespace: k8s
    type: NodeIPs.kubernetes.talos.dev
    id: kubelet
    version: 1
    owner: k8s.NodeIPController
    phase: running
    created: 2024-11-14T15:52:55Z
    updated: 2024-11-14T15:52:55Z
spec:
    addresses:
        - fd01:cafe::5054:ff:fe1f:c7bd

Environment

trevex commented 1 week ago

A potential solution could be to sort the IPs by preferred flags here: https://github.com/siderolabs/talos/blob/e26d0043e022eccf5ea9c9d9b4a57e4bff1f80cc/internal/app/machined/pkg/controllers/network/node_address.go#L154C1-L155C1

However, this would mean that addresses in NodeAddress objects are sorted by preference rather than alphabetically.

If this is a valid solution I could draft up a PR.

smira commented 1 week ago

I agree it might be better for IPv6, but you can also use https://www.talos.dev/v1.8/introduction/prodnotes/#multihoming

trevex commented 1 week ago

I am not sure how this helps here. Both addresses are from the same subnet.

KubeVirt's Passt network binding (which is currently the only fully functional IPv6 option supporting the primary pod network) announces the Pod Subnet (of the hosting cluster) as Prefix via RA and Talos will derive a SLAAC/Temp and follow it up with DHCPv6.

This means the SLAAC and DHCPv6 assigned IP are in the same subnet. I don't see a reasonable subnet filter to specify.

The SLAAC address itself is not reachable by the underlying pod network of the KubeVirt hosting cluster. Using it for etcd or kubelet will break connectivity. This is stopping Talos from scaling beyond a single node in an IPv6 KubeVirt environment as an unreachable IP will be advertised.

Is sorting the IPs alphanumerically and by preference based on flags a suitable solution (on top of the existing filtering)? If so, the changes required should be minimal and I might be able to draft up a PR.

trevex commented 1 week ago

It might be worth mentioning that the kubelet will choose the correct IP if no node IP is specified. This is the case with a kubeadm setup based on KubeVirt. From my understanding the Kubelet is using https://github.com/kubernetes/apimachinery/blob/v0.31.2/pkg/util/net/interface.go#L468 under the hood to choose the address.

smira commented 1 week ago

I understand the issue, but I'd like to make sure we have a proper solution for IPv6 from the ground up, so I don't want to rush into fixing this until we have a proper IPv6 testbed we can use to ensure proper operation going forward.

I know it doesn't sound too much fun, but the proper IP can be selected with /128 match if the IP is known beforehand.

trevex commented 1 week ago

> I know it doesn't sound too much fun, but the proper IP can be selected with /128 match if the IP is known beforehand.

Unfortunately, the VM's IP is a Pod IP, so for KubeVirt IPv6 (omni-infra-provider-kubevirt) use cases this is not an option and it is blocking adoption. I do understand the desire to find the best solution, though.

smira commented 1 week ago

I think it does make sense to prefer IPv6 addresses based on flags (not sure if we can omit mngmtmpaddr addresses from NodeAddresses completely?)

sbrivio-rh commented 1 week ago

> KubeVirt's Passt network binding (which is currently the only fully functional IPv6 option supporting the primary pod network) announces the Pod Subnet (of the hosting cluster) as Prefix via RA and Talos will derive a SLAAC/Temp and follow it up with DHCPv6.

By the way, passt does this because you can't "turn off SLAAC" while sending router advertisements (the M flag is set, but it doesn't tell a node to skip SLAAC). You can disable router advertisements with passt's --no-ra option, but then you'd be missing the route.

But passt also does this because it works with Linux, as addresses with the longest prefixes are preferred as source addresses, see __ipv6_dev_get_saddr() and ipv6_get_saddr_eval() (rule #8) in net/ipv6/addrconf.c for details.

Now, without making this as generic as the Linux kernel, I guess it would still be reasonable to pick the address with the longest matching prefix as the preferred address.

trevex commented 1 week ago

> I think it does make sense to prefer IPv6 addresses based on flags (not sure if we can omit mngmtmpaddr completely from NodeAddresses ?)

Funny enough in our bare-metal Talos setup we do not use DHCPv6 so the SLAAC address is used. A preference based on longest matching prefix sounds like a reasonable approach.

sbrivio-rh commented 1 week ago

> Funny enough in our bare-metal Talos setup we do not use DHCPv6 so the SLAAC address is used.

The main reason why passt implements a (minimalistic) DHCPv6 server is that, I've been told, having the same exact address inside and outside the guest is convenient for integration with some container-oriented service meshes that assume "host networking" (hence, addressing).

trevex commented 1 week ago

> Funny enough in our bare-metal Talos setup we do not use DHCPv6 so the SLAAC address is used.
>
> The main reason why passt implements a (minimalistic) DHCPv6 server is that, I've been told, having the same exact address inside and outside the guest is convenient for integration with some container-oriented service meshes that assume "host networking" (hence, addressing).

Yes, and it is also a necessity to run Kubernetes Clusters in KubeVirt either through CAPI or Omni/Talos.

@smira Does Talos have a "feature gate" functionality allowing us to hide the changed behaviour behind a feature gate?

smira commented 4 days ago

> Yes, and it is also a necessity to run Kubernetes Clusters in KubeVirt either through CAPI or Omni/Talos.
>
> @smira Does Talos have a "feature gate" functionality allowing us to hide the changed behaviour behind a feature gate?

Yes, we do have feature gates. If you open a PR with the proposed change, we can add a feature gate and even enable it by default for new clusters on 1.9.

trevex commented 4 days ago

Over the weekend I figured there might be a (dirty) workaround for the KubeVirt use case (it will not help for bare-metal IPv6 use cases involving DHCPv6): spoofing the MAC address of the VMs allows us to predict the SLAAC IP, so we can blacklist it. Unfortunately, blacklisting does not seem to be supported anymore. The documentation mentions the use of !, but this is not handled in code (anymore).

This will leave the node in a non-functional state:

# cat fdae:41e4:649b:9303:9cd5:e54b:8120:4adb/resources/nodeipconfigs.kubernetes.talos.dev.yaml
metadata:
    namespace: k8s
    type: NodeIPConfigs.kubernetes.talos.dev
    id: kubelet
    version: 1
    owner: k8s.NodeIPConfigController
    phase: running
    created: 2024-11-18T11:29:36Z
    updated: 2024-11-18T11:29:36Z
spec:
    validSubnets:
        - '!fd01:cafe::dcad:ff:fe00:beaf/128'
    excludeSubnets:
        - fd90:cafe::/64
        - fd95:cafe::/108

This might be either outdated documentation or another bug report.

I'll start working on a PR to establish a preference for IPv6 IPs ASAP.