siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Unable to get Talos working with IPv6 #8115

Open Heracles31 opened 8 months ago

Heracles31 commented 8 months ago

Bug Report

Talos's docs do not contain any information about how to get IPv6 working. After guessing some parameters, it still does not work, for multiple reasons. Although I also tried Cilium, everything described here uses Flannel with default settings except for the pod and service subnet CIDRs. Both are specified with IPv4 and IPv6 ranges: IPv4 is the default 10.244.0.0/16 and IPv6 is a /60. The service CIDRs are also the defaults, the usual 10.96 range plus an IPv6 /112 that I figured out from my other experiments.
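
In config terms, the cluster network patch described above looks roughly like this (the IPv6 prefixes below are placeholders for my obfuscated ones):

cluster:
  network:
    podSubnets:
      - 10.244.0.0/16       # IPv4 default
      - 2001:db8:0:10::/60  # placeholder for the real IPv6 /60
    serviceSubnets:
      - 10.96.0.0/12        # IPv4 default
      - 2001:db8:0:20::/112 # placeholder; a /112 figured out by trial and error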

Description

The setup is 9 nodes (3 control planes and 6 workers) using Talos's OVA v1.6.1. Each node has an IPv4 DHCP reservation (names/IPs also registered in DNS), DHCPv6 is available in the segment, and Router Advertisements are in managed mode (DHCPv6 but not SLAAC).

On a fresh boot, without a single bit changed in the OVA, all nodes come up with the proper IPv4 address and hostname, but no IPv6.

Machine config files were used for:

  • a VIP shared by the control planes (IPv6)
  • configuring the NTP reference
  • configuring strict ARP
  • configuring the pod and service CIDRs
  • fixing each node's IPv4 and IPv6 addresses, defining the default route (IPv4) and nameservers (one v4 and one v6)
  • pointing to the extra manifest required by VMTools (downloaded from a local static URL)
  • adding extra SANs to the certs

Once applied, all nodes received their config; none were Ready yet because the cluster was not bootstrapped.

The last step was to bootstrap the cluster, pointing to the IPv6 address of the first control plane.
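
For reference, the bootstrap call is the usual one, just pointed at an IPv6 endpoint (placeholder address below):

talosctl bootstrap --nodes 2001:db8:0:3::11 --endpoints 2001:db8:0:3::11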

Subnets too large

Trying to guess how to do it, I failed and gave up, moving to Ubuntu instead. With Ubuntu and kubeadm I kept trying; the difference is that there I received useful error messages, such as some subnets being too large.

For example, for the service subnet CIDR, Ubuntu/kubeadm complains when using a /64; it seems not to accept subnets larger than a /112.

While reading other people's debugging reports, I saw that some ended up with problems when their subnet CIDR was not large enough to be split among all of their nodes. They had a /64 subnet and the CNI was trying to give a /66 range to each node, but failed because there were not enough ranges for all the nodes (only four /66s fit in a /64, and they had about 8 nodes).

Here, I rely much more on horizontal scaling than vertical, so I have a total of 9 nodes. A /60 contains 16 /64s, so I should be OK even though, according to some documentation, the first of them is not used (not sure why...).

Internal node IPs

After deploying the cluster with settings that were accepted and did not push anything into CrashLoopBackOff, Talos was still not consistent.

For example: all nodes were configured with static IPv4 and IPv6 addresses from a machine config file applied before bootstrapping the cluster. The cluster (3 control planes and 6 workers) was then bootstrapped pointing to control plane No. 1's IPv6 address. Once the cluster was up and running, the internal addressing was messed up.

kubectl get nodes -o wide

Shows all 3 control planes with their static IPv6 address as the internal IP, while the 6 workers are shown with their IPv4. I had to restrict the kubelet's valid subnets to IPv6 to force all nodes to use their IPv6 as the internal address.
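
That restriction is a one-liner in the machine config; roughly (the prefix is a placeholder for the nodes' real IPv6 subnet):

machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 2001:db8:0:3::/64 # placeholder; the nodes' IPv6 subnet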

Validation webhook unreachable : No route to host

When I try to deploy MetalLB, it complains that the validation webhook is unreachable because there is no route to host. I received the same kind of error message with nginx ingress.

I modified the MetalLB deployment from IPv6 first / IPv4 second to IPv4 first / IPv6 second. Only then did it come online, functional thanks to IPv4. So again, this is evidence that it is specifically IPv6 that is not working.

One of my suspects is KubePrism, but I have no evidence for this. Should it be limited to IPv4, that would explain why the webhook validation paths (MetalLB or nginx) were unreachable.
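
For what it's worth, KubePrism is toggled from the machine config, so ruling it out is at least easy to try (by default it listens on port 7445):

machine:
  features:
    kubePrism:
      enabled: false # temporarily disable to rule it out; re-enable afterwards
      port: 7445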

DHCPv6

Another problem is getting a fixed IPv6 address. Whenever I reset the cluster to maintenance mode, nodes come up with a different DHCPv6 DUID, so static reservations do not work. When I try to pin that DUID in the machine config, it seems to be ignored. I suspect it is ignored because I did not put it in the proper format Talos is looking for. But again, because Talos's documentation about IPv6 is a big zero, there is no way for me to figure out what format would work (or whether the format is indeed the problem at all).

Deriving the DUID from the MAC address would allow a different ID on each node while remaining consistent across resets to maintenance mode.
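
For reference, what I tried looked roughly like this, assuming dhcpOptions.duidv6 is the right knob; the value format is exactly what I could not confirm:

machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: true
        dhcpOptions:
          ipv6: true
          duidv6: "00:03:00:01:aa:bb:cc:dd:ee:ff" # placeholder DUID-LL derived from the MAC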

There is a ton of network config in a Kubernetes cluster:

  • addresses in the DMZ where all nodes are deployed (if they are all in the same subnet, like mine)
  • the pod subnet
  • the service subnet
  • the load balancer's IPs
  • routing and NAT for letting pods reach resources outside the cluster; even IPv6 needs NAT (when it is not supposed to need it...)
  • the load balancer also needs some connectivity settings (L2 or BGP), again for both IP families
  • IPv6 uses Router Advertisements, and you are not supposed to point a route at a global IPv6 address; routes should reuse the link-local addresses (see the sketch below)
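
To illustrate that last point, a static IPv6 default route via the router's link-local address would look roughly like this in a machine config (interface name and gateway are placeholders):

machine:
  network:
    interfaces:
      - interface: eth0
        routes:
          - network: ::/0
            gateway: fe80::1 # the router's link-local address, not its global IPv6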

Because Talos uses Flannel as its default CNI, documentation should be complete at least for that one. If you wish to support another one like Cilium, how to configure it for IPv6 should be explained in detail too.

Logs

Environment

Talos version: 1.6.1

Kubernetes version: 1.29

Platform: vmware-amd64.ova deployed on ESXi 6.7U3

smira commented 8 months ago

Thanks for the detailed report, but I guess it's mostly about missing documentation for IPv6?

It's not quite clear what the other issues are, as they are missing necessary details, e.g. what the configuration was.

Heracles31 commented 8 months ago

In any case, documentation is a problem for sure, and that one is on your side.

But there is probably more than that, like KubePrism: nowhere do you mention whether it is compatible with IPv6. Is it supposed to be compatible, or is it indeed a culprit like I think it is? Same for pinning the DHCP DUID: is my format wrong (documentation only) or is the feature not working? Only you can say whether the problem is limited to documentation. Good for you if it is; it will be easier to fix. If there are technical problems on top of that, you will have to fix them AND document the cases.

But the problem is not documentation only. The fact that the DHCP DUID changes every time you reset a node is a technical fact and a problem: the consequence is that you cannot reserve a fixed IP address that survives resets, like you can for IPv4. The fact that the node does not request an IPv6 address by default can be considered another technical limitation.

As for the routes, I only provided 2 CIDRs (pods and services); everything else is from Talos. So how can I end up with "No route to host" as a consequence of poor documentation? Routing should be handled and configured by Talos, not me. Even if there is something wrong in my CIDRs (e.g. overlapping), Talos should detect and flag it instead of pushing a non-functional config. So again, I am pretty sure there is something technical here.

As for Flannel, your default CNI, its documentation says that it requires a default IPv6 route on each node. The thing is, you are not supposed to force an IPv6 route to a "regular" IPv6 address; you must use the gateway's link-local IP instead. So again, that is a hard requirement that must come from you and which I cannot compensate for. In the node dashboard, there were cases during my tests where I did not see an IPv6 gateway. Was it only a screen size limitation, or was the gateway really missing?

Maybe you can provide a set of config files that results in a working dual-stack cluster? From that, I would be able to keep working on my particular case. For now, I cannot reverse engineer Talos any further, and I have to move back to Ubuntu until Talos improves on v6. Learning and mastering Kubernetes is already something; I cannot do it while debugging and reverse engineering a black box at the same time...

Here are the machine config files I used (IPv6 prefix and DNS domain obfuscated). The configs are generated by talosctl, and the control planes / all nodes are patched with these files. I gave you a set without the kubelet valid subnets setting, so you should end up with the internal IPs mixed between control planes (v6) and workers (v4) when doing kubectl get nodes -o wide.

No 1 - VIP

No 2 - NTP

machine:
  time:
    disabled: false # Indicates if the time service is disabled for the machine.
    servers:

No 3 - CIDRs

cluster:
  network:
    podSubnets:

No 4 - SANs

machine:
  certSANs:

No 5 - Host specific (one of the host-specific configs)

machine:
  network:
    hostname: talos-c1
    interfaces:
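
(The pasted files above lost most of their contents along with their indentation, so as a rough illustration only, not my exact config, a host-specific patch like No 5 looks something like this, with placeholder addresses and interface name:)

machine:
  network:
    hostname: talos-c1
    interfaces:
      - interface: eth0
        dhcp: false
        addresses:
          - 192.168.1.11/24     # placeholder static IPv4
          - 2001:db8:0:3::11/64 # placeholder static IPv6
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1 # placeholder IPv4 default gateway
        vip:
          ip: 2001:db8:0:3::100 # placeholder IPv6 VIP shared by the control planes
    nameservers:
      - 192.168.1.2     # placeholder IPv4 resolver
      - 2001:db8:0:3::2 # placeholder IPv6 resolver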

Heracles31 commented 8 months ago

Don't know why indent was removed for the last files.. These files are properly formatted over here...

attilaolah commented 6 months ago

My experience with IPv6 is that it tends to work more or less, as long as I also have an IPv4 address on one of the interfaces. Even without an IPv4 default route.

If I remove all Ipv4 addresses (e.g. disable DHCP and have SLAAC for IPv6 address configuration), control plane nodes would fail to become ready, and I'd see errors like this:

user: warning: [2024-02-28T10:27:10.724334118Z]: [talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?fieldSelector=metadata.name%3D01&limit=500&resourceVersion=0\": dial tcp 127.0.0.1:7445: connect: connection refused"}

I tried disabling KubePrism, then the requests go to localhost:6443, which resolves to [::1]:6443, but I still get the same connection refused error. Kubelet is healthy, but the API server does not become healthy.

I'd expect to see "no route to host", not "connection refused", if there isn't even a loopback route, ~so I suspect there is a loopback IPv4 route through the loopback interface, but the API server (or KubePrism) does not bind to it? I'm not sure what's a good way to get the API server logs to figure out why it is not starting~ and indeed, the logs show that the API server won't start:

2001::41d2: 2024-02-28T10:44:30.3905092Z stderr F Error: invalid argument "" for "--advertise-address" flag: failed to parse IP: ""

Interestingly, if I start the cluster with a static IP address, even without an IPv4 default route, the node comes up, and remains healthy when I remove the IPv4 address.

attilaolah commented 6 months ago

Turns out what I need to resolve the issue is:

I'm filing a new issue about that second bit.

sanmai-NL commented 4 months ago

@Heracles31 Are you replicating a Kubernetes deployment using Talos Linux that previously had the required IPv6 functionality, or is this entirely new? In the latter case, please note what @attilaolah wrote. Kubernetes has some guidance, and a history of defects and documentation/interface limitations, around IPv6 single- and dual-stack setups. See e.g. https://kubernetes.io/docs/concepts/services-networking/dual-stack/.

attilaolah commented 4 months ago

I'll add my updates here, turns out the KubePrism issue is not really a problem, as long as we don't disable IPv6 on the loopback interface (which most folks probably don't do anyway). Eventually I got Talos with Cilium working, with IPv6-only node, pod and service subnets, with a few workarounds:

  • The default time server, until a recent change to Cloudflare, did not have AAAA records. Recently Talos switched to Cloudflare so that is now solved, the default time server is IPv6-ready.

  • Some of the containers don't have IPv6 reachability. The simple workaround here is to run a local mirror, similarly how one would do in airgapped environments, and just crane over all the dependencies.

  • Cilium L2 advertisements don't work. (The default Flannel setup might work though, I haven't tried.) That's unrelated to Talos though.

Overall it seems to be possible to have a functioning cluster with IPv6-only networking on all physical interfaces, and have IPv4 only on the loopback. I have a couple old configs over here that might point you in the right direction.

Heracles31 commented 4 months ago

I was trying to deploy it as a new install. I gave up and now run a cluster built with Kubeadm under Ubuntu.

I do not have much time to keep fighting my way through this, so I will keep my current cluster, which is working fine with dual stack...

Once the documentation is updated with specific procedures for IPv6, I may try to migrate to Talos, but for now I will stay as is.

Thanks for your input on this one,

sanmai-NL commented 4 months ago

I'll add my updates here, turns out the KubePrism issue is not really a problem, as long as we don't disable IPv6 on the loopback interface (which most folks probably don't do anyway). Eventually I got Talos with Cilium working, with IPv6-only node, pod and service subnets, with a few workarounds:

  • The default time server, until a recent change to Cloudflare, did not have AAAA records. Recently Talos switched to Cloudflare so that is now solved, the default time server is IPv6-ready.

  • Some of the containers don't have IPv6 reachability. The simple workaround here is to run a local mirror, similarly how one would do in airgapped environments, and just crane over all the dependencies.

  • Cilium L2 advertisements don't work. (The default Flannel setup might work though, I haven't tried.) That's unrelated to Talos though.

Overall it seems to be possible to have a functioning cluster with IPv6-only networking on all physical interfaces, and have IPv4 only on the loopback. I have a couple old configs over here that might point you in the right direction.

The second issue is the most pressing for me. It's quite impactful in production deployments. I think those IPv4-only container image registries should be moved away from.

smira commented 4 months ago

The second issue is the most pressing for me. It's quite impactful in production deployments. I think those IPv4-only container image registries should be moved away from.

Talos supports registry mirrors natively, so put your own IPv6 mirror in front and use it.

Or, if you are willing to set it up for the community, it would be cool as well.
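
For reference, the mirror setup is a machine config setting along these lines (the mirror URL is a placeholder):

machine:
  registries:
    mirrors:
      docker.io:
        endpoints:
          - https://mirror.example.internal # an IPv6-reachable mirror
      registry.k8s.io:
        endpoints:
          - https://mirror.example.internal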

sanmai-NL commented 4 months ago

I know, but then the deployment requires a container image registry (mirror), which is a major extra component.

attilaolah commented 4 months ago

I'll add my updates here, turns out the KubePrism issue is not really a problem, as long as we don't disable IPv6 on the loopback interface (which most folks probably don't do anyway). Eventually I got Talos with Cilium working, with IPv6-only node, pod and service subnets, with a few workarounds:

  • The default time server, until a recent change to Cloudflare, did not have AAAA records. Recently Talos switched to Cloudflare so that is now solved, the default time server is IPv6-ready.
  • Some of the containers don't have IPv6 reachability. The simple workaround here is to run a local mirror, similarly how one would do in airgapped environments, and just crane over all the dependencies.
  • Cilium L2 advertisements don't work. (The default Flannel setup might work though, I haven't tried.) That's unrelated to Talos though.

Overall it seems to be possible to have a functioning cluster with IPv6-only networking on all physical interfaces, and have IPv4 only on the loopback. I have a couple old configs over here that might point you in the right direction.

The second issue is the most pressing for me. It's quite impactful in production deployments. I think those IPv4-only container image registries should be moved away from.

Sorry, that was a typo there — meant some of the container registries don't have IPv6 reachability.

To be clear, all containers have IPv6 reachability within Talos. Sorry for the confusion here, I really didn't notice the mistake in my wording.

bernardgut commented 1 month ago

@attilaolah

Eventually I got Talos with Cilium working

Great. I did not, and I am probably not the only one. My cluster comes online in "pending" with every API attempt resulting in error dialing backend: remote error: tls: internal error. Anyway, do you mind summarizing here all the settings that need to be changed (Talos, k8s, Cilium) for this to work properly?

Much appreciated. Thanks

nazarewk commented 1 month ago

I had a similar series of issues after trying to switch to new set of machines (recreate cluster) and make them dual-stack at the same time.

I managed to get my setup to work by restricting the IPv6 serviceSubnets to a /108, inspired by:

Update: I had to also add this, so the first node doesn't take the whole /64 IPv6 range:

cluster:
  controllerManager:
    extraArgs:
      node-cidr-mask-size-ipv6: "80" # defaults to /64 - whole assigned ULA subnet
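
For completeness, the serviceSubnets restriction mentioned above lives in the same cluster config; a rough sketch (ULA prefix and IPv4 range are placeholders):

cluster:
  network:
    serviceSubnets:
      - fd00:0:0:2000::/108 # keep the IPv6 service range at /108 or smaller
      - 10.96.0.0/16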
skandragon commented 1 month ago

I had a similar series of issues after trying to switch to new set of machines (recreate cluster) and make them dual-stack at the same time.

I used your example and have things configured, but my nodes are not getting IPv6 CIDRs assigned.

These nodes have existed for a while, and I would rather not have to nuke them as they also are part of my CEPH in-cluster-cluster.

nazarewk commented 1 month ago

These nodes have existed for a while, and I would rather not have to nuke them as they also are part of my CEPH in-cluster-cluster.

I don't know exactly why, but I don't think it is easy (or possible at all) to modify cluster addressing after creation.

skandragon commented 3 weeks ago

I don't know exactly why, but I don't think it is easy (or possible at all) to modify cluster addressing after creation.

Turns out it is.

I did something basically like this:

I also had a little panic: while doing this, I was creating a split-brain cluster. The newly re-created nodes cannot communicate with the old nodes, because Flannel ends up using the newly created IPv6 tunnels for communication while the old nodes only use IPv4. The "fix" is to just press on...

After this, I now have working IPv4 + IPv6 with no data loss and minimal downtime. My Ceph cluster can tolerate a single host going down in all cases, but I did have to push the issue about halfway through, as there were too many hosts down for the cluster to remain healthy if one more was lost. Hence the hard reset on those nodes, to force them to shut down (uncleanly) and recover in the new world.

After about an hour of doing this, everything was fully up and working.

bernardgut commented 2 weeks ago

Personally, I am just trying to get IPv6 working natively with a cluster created from scratch on Omni. I followed your configs @nazarewk but apparently ~it's not possible~ I can't. The machines don't boot and end up in a loop of:

20/08/2024 21:59:04
[talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://[fdae:41e4:649b:9303::1]:10000/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&limit=500&resourceVersion=0\": read tcp [fdae:41e4:649b:9303:cb65:af18:c3fa:16a9]:45356->[fdae:41e4:649b:9303::1]:10000: read: connection reset by peer - error from a previous attempt: EOF"}
20/08/2024 21:59:18
[talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: Get \"https://127.0.0.1:10250/pods/?timeout=30s\": remote error: tls: internal error"}
20/08/2024 21:59:34
[talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: Get \"https://127.0.0.1:10250/pods/?timeout=30s\": remote error: tls: internal error"}
20/08/2024 21:59:39
[talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/v1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}

It seems the kubelet-csr-approver worked when the cluster was IPv4 single-stack, but once you turn on dual stack and bind to ::1 addresses it stops working... Any ideas? I am basically using the same config as above with the Postfinance kubelet-csr-approver. I also tried the one recommended in the Talos docs, from alex1989hu, but it wasn't working either.

I've been trying for weeks to get dual stack working.

EDIT: Additionally, I can't reach the API either:

$ k get nodes
Error from server (InternalError): an error on the server ("proxy error") has prevented the request from succeeding
smira commented 2 weeks ago

Omni doesn't impose any restrictions on the way you set up Kubernetes, including dual-stack.

You need to understand how to set up Kubernetes for dual-stack: make sure your nodes are dual-stack, that the machine node IP subnets are properly configured, and that the Kubernetes pod and service subnets are configured to be dual-stack as well.

In your case, most probably you have a broken configuration in one place or another, and the Kubernetes components don't start.

Talos has troubleshooting docs for digging into each component. You can also use omnictl support and dig into the various logs to find the root cause.

bernardgut commented 2 weeks ago

Hello @smira, I think I figured out the issue, but I don't know how to fix it. Thanks for the tip.

So when I ran omnictl support I thought I might as well check and try to save you some time. I went and checked:

metadata:
    namespace: k8s
    type: NodeIPConfigs.kubernetes.talos.dev
    id: kubelet
    version: 1
    owner: k8s.NodeIPConfigController
    phase: running
    created: 2024-08-23T09:28:58Z
    updated: 2024-08-23T09:28:58Z
spec:
    validSubnets:
        - 2a02:XXXX:2403:3::/64
        - 10.2.0.1/16
    excludeSubnets:
        - 10.209.0.0/16
        - fda0:0:0:1000::/64
        - 10.96.0.0/16
        - fda0:0:0:2000::/108
        - 2a02:XXXX:2403:3::5

Everything seems correct: the validSubnets are the nodes' bare-metal CIDRs; in excludeSubnets, the first four are the pod/service CIDRs as defined in cluster.machine.network and the last one is the VIP. OK, it seems correct. Unfortunately, I still get Error from server (InternalError): an error on the server ("proxy error") has prevented the request from succeeding (get pods). So no API.

Next I fiddled with all the settings until the cluster came up and I could query the API. In the end, the only combination that worked was putting a wrong IPv6 subnet in machine.kubelet.nodeIP.validSubnets:

metadata:
    namespace: k8s
    type: NodeIPConfigs.kubernetes.talos.dev
    id: kubelet
    version: 1
    owner: k8s.NodeIPConfigController
    phase: running
    created: 2024-08-23T08:42:51Z
    updated: 2024-08-23T08:42:51Z
spec:
    validSubnets:
        - 2a02:XXXX:2402:3::/64 #<--- THIS SUBNET DOESNT BELONG TO ME
        - 10.2.0.1/24
    excludeSubnets:
        - 10.209.0.0/16
        - fda0:0:0:1000::/64
        - 10.96.0.0/16
        - fda0:0:0:2000::/108
        - 2a02:XXXX:2403:3::5

Great, the cluster comes up and I can query the API. After manually validating the CSRs for the nodes (because for some reason the kubernetes.io/kube-apiserver-client-kubelet CSRs get auto-approved/issued, but the node CSRs stay pending indefinitely), I see that Cilium doesn't boot, so the cluster stays in a not-ready state. I then went into a loop of fixing the issues one by one as they arose:

metadata:
    namespace: network
    type: NodeAddresses.net.talos.dev
    id: accumulative
    version: 4
    owner: network.NodeAddressController
    phase: running
    created: 2024-08-23T08:41:14Z
    updated: 2024-08-23T08:41:45Z
spec:
    addresses:
        - 10.2.0.8/16
        - 2a02:XXXX:2403:3:be24:11ff:fe8a:845b/64
        - fdae:XXXX:649b:9303:cb65:af18:c3fa:16a9/64

This IPv6 subnet is not in the validSubnets above anymore; I removed it and put the "wrong" one in to be able to boot the cluster... Which brings me back to the original issue at the beginning of this post: I cannot access the API with a valid value for machine.kubelet.nodeIP.validSubnets. The API returns Error from server (InternalError): an error on the server ("proxy error") has prevented the request from succeeding (get pods).

Any ideas ?

I can send you the support.zip file on Slack/email, but I can't post it here because of the public IPv6 fields. I don't want to spam you, so let me know.

Thanks B.

smira commented 2 weeks ago

I guess there are many things you did, and many of them might be wrong.

E.g. it doesn't make sense to have both valid & exclude subnets set if you have just v4 & v6 external IPs. Simply keep the valid subnets at 0.0.0.0/0 and ::/0 so that they match all IPv4 & IPv6 addresses, and only the excluded stuff gets applied.
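
In config terms, that suggestion is roughly (the pod/service/VIP ranges are excluded automatically, as seen in the NodeIPConfigs dumps above):

machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 0.0.0.0/0 # match any IPv4
        - ::/0      # match any IPv6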

You should probably use KubePrism.

If you don't know how to manage the kubelet server certificate issuance process, just don't enable it.

Keep things simpler until you can figure out how to configure everything properly, e.g. use the default Talos Flannel instead of Cilium.

bernardgut commented 2 weeks ago

OK, I figured this out. By pure luck too. And it's so dumb:

When writing your configuration file, you need to take care with the order in which you define your podSubnets and serviceSubnets params:

THIS :

cluster:
  ...
  network:
    podSubnets:
      - fda0:0:0:1000::/64 # random ULA subnet
      - 10.209.0.0/16
    serviceSubnets:
      # WARNING: IPv6 service subnet cannot be larger than /108 (previous discussion suggested /112)
      #   see linked configs from https://github.com/siderolabs/talos/issues/8115#issuecomment-2068026656
      - fda0:0:0:2000::/108 # random ULA subnet
      - 10.96.0.0/16

will work

But THIS:

cluster:
  ...
  network:
    podSubnets:
      - 10.209.0.0/16
      - fda0:0:0:1000::/64 # random ULA subnet
    serviceSubnets:
      - 10.96.0.0/16
      # WARNING: IPv6 service subnet cannot be larger than /108 (previous discussion suggested /112)
      #   see linked configs from https://github.com/siderolabs/talos/issues/8115#issuecomment-2068026656
      - fda0:0:0:2000::/108 # random ULA subnet

will fail. The cluster will never come up and you will see (InternalError): an error on the server ("proxy error") has prevented the request from succeeding (get pods) when querying the API.

Why? kube-apiserver won't start due to a mismatch between the IP families of the service subnet and the external (public) address:

2024-08-23T09:32:25.871070123Z stderr F I0823 09:32:25.870881       1 options.go:221] external host was not specified, using 2a02:XXXX:2403:3:be24:11ff:fef1:13bf
2024-08-23T09:32:25.871623531Z stderr F E0823 09:32:25.871467       1 run.go:74] "command failed" err="service IP family \"10.96.0.0/16\" must match public address family \"2a02:XXXX:2403:3:be24:11ff:fef1:13bf\""

It doesn't seem like this is a Talos issue, but rather an upstream Kubernetes issue that was never truly fixed. So I think the fix here is to document this somewhere.

For future people who come to this thread trying to make IPv6 work with Talos and Cilium, here is the TL;DR:
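
Putting the pieces from this thread together, a cluster patch along these lines should be close (ULA prefixes are random examples; the node-cidr-mask-size-ipv6 bit comes from the earlier comment above):

cluster:
  controllerManager:
    extraArgs:
      node-cidr-mask-size-ipv6: "80" # so the first node doesn't take the whole /64
  network:
    podSubnets:
      - fda0:0:0:1000::/64 # IPv6 first
      - 10.209.0.0/16
    serviceSubnets:
      - fda0:0:0:2000::/108 # IPv6 first, no larger than /108
      - 10.96.0.0/16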

Thanks for helping me figure this out.

Cheers B.

PS: I will now try to enable KubePrism again since, as you suggested, I don't think it's a good idea to turn it off as suggested by other users above in this thread. It seems Cilium might rely on it for some features, since they explicitly mention it in their Talos docs. I will edit this post when I manage to make it work.