Closed aserraric closed 1 year ago
K3s fails to start in dual stack mode with message "IPv6 was enabled but no IPv6 address was found on node" even though the node has an IPv6 address.
Logs are great, but can you show the `ifconfig` or `ip addr` output with the actual IPv4 and IPv6 addresses on the node? Without knowing what addresses the node actually has and on which interfaces, it's hard to disagree with what the application is reporting.
Sorry, forgot about that. ip addr:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: end0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether e4:5f:01:29:43:bb brd ff:ff:ff:ff:ff:ff
inet 192.168.178.100/24 brd 192.168.178.255 scope global dynamic noprefixroute end0
valid_lft 848267sec preferred_lft 740267sec
inet6 fd00::5038:7f13:6fea:cb62/64 scope global dynamic mngtmpaddr noprefixroute
valid_lft 6796sec preferred_lft 3196sec
inet6 2003:f3:170d:8a00:c9ba:8cb4:eae1:4d91/64 scope global dynamic mngtmpaddr noprefixroute
valid_lft 6796sec preferred_lft 919sec
inet6 fe80::fcdd:1f00:9f6f:40be/64 scope link
valid_lft forever preferred_lft forever
3: wlan0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
link/ether e4:5f:01:29:43:bc brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:20:a3:f8:c1 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
How are you configuring IPv6 on your network? I suspect that all of the candidate IPv6 addresses are being eliminated because they are flagged as dynamic: `dynamic mngtmpaddr noprefixroute`. You can also see that they have a fairly short lifetime.
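As a quick illustration (a hypothetical sketch, not kube-router's actual selection code), filtering the `inet6` lines above for global-scope addresses that are *not* flagged `dynamic` comes up empty:

```python
# Sketch: emulate selecting stable global IPv6 addresses from `ip addr` output.
# The sample lines are copied from the output above; the filter is illustrative.
sample = """\
inet6 fd00::5038:7f13:6fea:cb62/64 scope global dynamic mngtmpaddr noprefixroute
inet6 2003:f3:170d:8a00:c9ba:8cb4:eae1:4d91/64 scope global dynamic mngtmpaddr noprefixroute
inet6 fe80::fcdd:1f00:9f6f:40be/64 scope link"""

def stable_global_v6(text):
    """Return inet6 addresses with global scope and no 'dynamic' flag."""
    found = []
    for line in text.splitlines():
        fields = line.split()
        if fields[:1] == ["inet6"] and "scope" in fields:
            scope = fields[fields.index("scope") + 1]
            if scope == "global" and "dynamic" not in fields:
                found.append(fields[1].split("/")[0])
    return found

print(stable_global_v6(sample))  # -> []: every global IPv6 address here is dynamic
```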
All the addresses are handled by my router. That is not something that has changed recently, though.
Can I (should I) just use the link local address (fe80...)?
Oh, I just noticed that the address you've specified in your node-ip field is a Unique Local Address. Everything under fd00::/8 is ULA and may not be valid for what you're trying to do with it.
node-ip: 192.168.178.100,fd00::5038:7f13:6fea:cb62
Try using the `2003:f3:170d:8a00:c9ba:8cb4:eae1:4d91` address instead. That's within the space that would be assigned by an ISP.
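For reference, the classification can be checked with Python's standard `ipaddress` module (fd00::/8 is the locally-assigned half of the ULA range fc00::/7, while the 2003: address falls in global unicast space):

```python
import ipaddress

# Locally-assigned Unique Local Address space
ula = ipaddress.ip_network("fd00::/8")

node_ula = ipaddress.ip_address("fd00::5038:7f13:6fea:cb62")
node_gua = ipaddress.ip_address("2003:f3:170d:8a00:c9ba:8cb4:eae1:4d91")

print(node_ula in ula)     # True  - ULA, not globally routable
print(node_gua in ula)     # False
print(node_gua.is_global)  # True  - ISP-assigned global unicast
```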
Hmm, wouldn't that be subject to change at the ISP's discretion? K3s has worked with the fd00 address for more than a year previously.
I tried it anyway, but got the same error. Flannel still tries to use the ULA, even though I specified the 2003... in config.yaml:
INFO[0008] Using dual-stack mode. The ipv6 address fd00::5038:7f13:6fea:cb62 will be used by flannel
It looks like the error in question is coming from the network policy controller, not flannel. For some reason it's blowing up because the IPv6 address hasn't been set on the Kubernetes node object. That may be because the address you were trying to use was invalid, or it may be that it just hasn't been set yet.
You might try starting k3s with `disable-network-policy: true`, and then run `kubectl get node`. See if your node has an IPv6 address listed. Once it does, you should be able to re-enable the network policy controller.
I'm a little confused why your node wouldn't have an IPv6 address set if you'd been running with dual-stack enabled for a while, though...
Okay, disabling network policy brought the node up, with either the fd00 or the 2003 address assigned (depending on how I configured it).
What is the output of `kubectl get node -o yaml`? Is there an IPv6 address listed for the node? If there is, are you able to restart with the NPC enabled?
According to the annotations it has an IPv6 address; according to the status, it does not:
apiVersion: v1
kind: Node
metadata:
annotations:
alpha.kubernetes.io/provided-node-ip: 192.168.178.100
flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"16:0b:d6:fa:ee:7f"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/backend-v6-data: '{"VNI":1,"VtepMAC":"56:c9:de:97:04:fd"}'
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 192.168.178.100
flannel.alpha.coreos.com/public-ipv6: fd00::5038:7f13:6fea:cb62
k3s.io/hostname: waveland
k3s.io/internal-ip: 192.168.178.100,fd00::5038:7f13:6fea:cb62
k3s.io/node-args: '["server","--write-kubeconfig-mode","644","--node-ip","192.168.178.100,fd00::5038:7f13:6fea:cb62","--cluster-cidr","10.42.0.0/16,2001:cafe:42:0::/56","--service-cidr","10.43.0.0/16,2001:cafe:42:1::/112","--flannel-ipv6-masq","true","--disable-network-policy","false"]'
k3s.io/node-config-hash: KPKNV6ISOCF5Y5URRUWO6C5IKSXNMN3O3IIJ2RGSKWBLTLFV5M5A====
k3s.io/node-env: '{"K3S_CONFIG_FILE":"/etc/rancher/k3s/config.yaml","K3S_DATA_DIR":"/var/lib/rancher/k3s/data/5646fe4613ed1cc8277ceffa5aed8260c68ff8a219648503438eddc2017a1962"}'
node.alpha.kubernetes.io/ttl: "0"
volumes.kubernetes.io/controller-managed-attach-detach: "true"
creationTimestamp: "2022-10-08T13:34:29Z"
finalizers:
- wrangler.cattle.io/node
labels:
beta.kubernetes.io/arch: arm64
beta.kubernetes.io/instance-type: k3s
beta.kubernetes.io/os: linux
kubernetes.io/arch: arm64
kubernetes.io/hostname: waveland
kubernetes.io/os: linux
node-role.kubernetes.io/control-plane: "true"
node-role.kubernetes.io/master: "true"
node.kubernetes.io/instance-type: k3s
name: waveland
resourceVersion: "3046721"
uid: 01c1b42a-9be3-4493-b35c-63ff3d2495d5
spec:
podCIDR: 10.42.0.0/24
podCIDRs:
- 10.42.0.0/24
- 2001:cafe:42::/64
providerID: k3s://waveland
status:
addresses:
- address: 192.168.178.100
type: InternalIP
- address: waveland
type: Hostname
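The failing check can be mimicked with a short sketch (hypothetical code, not kube-router's actual implementation): scan `status.addresses` for an InternalIP entry that parses as IPv6.

```python
import ipaddress

# status.addresses exactly as reported above - no IPv6 InternalIP present
addresses = [
    {"address": "192.168.178.100", "type": "InternalIP"},
    {"address": "waveland", "type": "Hostname"},
]

def has_ipv6_internal_ip(addrs):
    """True if any InternalIP entry is a valid IPv6 address."""
    for entry in addrs:
        if entry["type"] != "InternalIP":
            continue
        try:
            if ipaddress.ip_address(entry["address"]).version == 6:
                return True
        except ValueError:
            pass  # not a parseable IP address
    return False

print(has_ipv6_internal_ip(addresses))  # -> False: matches the NPC's complaint
```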
But now it starts up even with `disable-network-policy: false`.
I am very confused.
It definitely doesn't have an IPv6 address. The ingress is only reachable on IPv4, and I know for a fact that it used to be reachable on IPv6 as well (on the fd00... address).
Yeah, I'm confused as to why it's not adding the IPv6 address to the node status.
Do you know what version of k3s you were on previously, where it appeared to be working?
It was the previous release in the stable channel. I want to say 1.25.6
Today I restarted the machine the node is on, and k3s again wouldn't start up. So I disabled network policy again in the config, and lo and behold, the node started with an IPv6 address.
status:
addresses:
- address: 192.168.178.100
type: InternalIP
- address: fd00::5038:7f13:6fea:cb62
type: InternalIP
- address: waveland
type: Hostname
I'm not sure what that means, but I guess I will leave network policy disabled for the time being. However, I am not at all sure if it will work the next time the machine needs a reboot.
Ahh, I bet that the IPv6 address isn't assigned immediately. You're getting an IPv4 address from the DHCP server immediately, and then K3s gets started, but the IPv6 address is assigned later, so the IPv6 address you requested isn't present initially and the network policy controller gets confused.
@rbrtbnfgl @thomasferrandiz the code in question here seems to have been added or refactored during some of the kube-router 2.0.0 release prep, I don't see this fatal error happening on earlier releases. Is this something we could work with upstream to address?
I'll take a look at it. However, it seems right to give an error if the IP is not configured yet. EDIT: In the previous version there wasn't any check on the assigned Internal/External IP of the node (https://github.com/k3s-io/kube-router/blob/v1.5.1%2Bk3s/pkg/controllers/netpol/network_policy_controller.go#L753); with the latest version kube-router checks it. The issue is that somehow the IPv6 InternalIP is not being assigned on the node. I tested with various IPs configured as node-ip, both IPv4 and IPv6 (addresses that weren't configured on the node), and the InternalIPs were always assigned. Why the IPv6 address is not being set here must somehow be related to this setup.
@aserraric I suspect that you're going to need to figure out how to solve this on your side. I'm not sure if you can ask systemd to wait on the network configuration step until both address families are assigned, instead of just proceeding on to start services as soon as you have an IPv4 address, or what. But the root cause seems to be that the network policy controller now correctly checks to see that your node has an IPv6 address assigned when using dual-stack, and your node frequently does not when starting K3s during system boot.
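One way to attempt that on the systemd side (a sketch only; it assumes systemd-networkd manages end0, and wait-online completing still doesn't guarantee an IPv6 address is present) would be a drop-in ordering k3s after the wait-online unit:

```ini
# /etc/systemd/system/k3s.service.d/wait-online.conf  (hypothetical drop-in)
# Assumes systemd-networkd; NetworkManager setups would order against
# NetworkManager-wait-online.service instead.
[Unit]
After=systemd-networkd-wait-online.service
Wants=systemd-networkd-wait-online.service
```

After adding it, run `systemctl daemon-reload` and reboot to test.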
I'm not convinced that this is the reason. When I did my first round of testing, the system had been booted up for several hours and (as far as I can tell) had an IPv6 address assigned the entire time. Also, systemd is configured to keep trying to start k3s, so if your theory were true, it should eventually work.
I rather suspect that there is a race condition between the kubelet starting the node and assigning the IPs, and the network policy controller checking for the IPs. Unfortunately, the network policy controller takes down the entire node when it doesn't find an IPv6 address, so I can't tell whether the node had one or not, since the apiserver is already gone. I'll try to do some more testing.
I could see a race condition on initial startup, but the kubelet should be atomic in updating the addresses when it comes up again after a reboot. It shouldn't remove only one of the two address families, especially if they've both been manually specified on the command line.
@rbrtbnfgl I do think we should raise this with upstream. This is the second bit of odd behavior I've seen since on the v2.0 branch.
But the issue is not kube-router related. I think it is more related to the InternalIP not being configured on the node.
Can you try adding this additional config: `kubelet-arg: "--node-ip=0.0.0.0"`?
I set my config.yaml to this:
write-kubeconfig-mode: 644
node-ip: 192.168.178.100,fd00::5038:7f13:6fea:cb62
cluster-cidr: 10.42.0.0/16,2001:cafe:42:0::/56
service-cidr: 10.43.0.0/16,2001:cafe:42:1::/112
flannel-ipv6-masq: true
disable-network-policy: false
kubelet-arg: "--node-ip=0.0.0.0"
and rebooted the machine. This led to an immediate startup with both IPv4 and IPv6 addresses assigned:
apiVersion: v1
items:
- apiVersion: v1
kind: Node
metadata:
annotations:
flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"9a:ed:30:f1:3f:f4"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/backend-v6-data: '{"VNI":1,"VtepMAC":"46:fc:fd:b2:2d:47"}'
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 192.168.178.100
flannel.alpha.coreos.com/public-ipv6: fd00::5038:7f13:6fea:cb62
k3s.io/hostname: waveland
k3s.io/internal-ip: 192.168.178.100,fd00::5038:7f13:6fea:cb62
k3s.io/node-args: '["server","--write-kubeconfig-mode","644","--node-ip","192.168.178.100,fd00::5038:7f13:6fea:cb62","--cluster-cidr","10.42.0.0/16,2001:cafe:42:0::/56","--service-cidr","10.43.0.0/16,2001:cafe:42:1::/112","--flannel-ipv6-masq","true","--disable-network-policy","false","--kubelet-arg","--node-ip=0.0.0.0"]'
k3s.io/node-config-hash: XNRDVN26TUE6TR6WQKXYSZTRTSKDEIXRGVIN2ESZ6TG547QVDYRA====
k3s.io/node-env: '{"K3S_CONFIG_FILE":"/etc/rancher/k3s/config.yaml","K3S_DATA_DIR":"/var/lib/rancher/k3s/data/5646fe4613ed1cc8277ceffa5aed8260c68ff8a219648503438eddc2017a1962"}'
node.alpha.kubernetes.io/ttl: "0"
volumes.kubernetes.io/controller-managed-attach-detach: "true"
creationTimestamp: "2022-10-08T13:34:29Z"
finalizers:
- wrangler.cattle.io/node
labels:
beta.kubernetes.io/arch: arm64
beta.kubernetes.io/instance-type: k3s
beta.kubernetes.io/os: linux
kubernetes.io/arch: arm64
kubernetes.io/hostname: waveland
kubernetes.io/os: linux
node-role.kubernetes.io/control-plane: "true"
node-role.kubernetes.io/master: "true"
node.kubernetes.io/instance-type: k3s
name: waveland
resourceVersion: "3170436"
uid: 01c1b42a-9be3-4493-b35c-63ff3d2495d5
spec:
podCIDR: 10.42.0.0/24
podCIDRs:
- 10.42.0.0/24
- 2001:cafe:42::/64
providerID: k3s://waveland
status:
addresses:
- address: 192.168.178.100
type: InternalIP
- address: fd00::5038:7f13:6fea:cb62
type: InternalIP
- address: waveland
type: Hostname
We already cover this for 1.24 and 1.25 in the docs, maybe we should suggest it for all versions. https://docs.k3s.io/installation/network-options#dual-stack-ipv4--ipv6-networking
I originally set up this cluster with 1.24 and never had this issue until I updated to 1.26. Does that mean 1.26 has changed the notion of what is considered the "primary network interface"? The device in question is a Raspberry Pi 4B, and end0 is the only Ethernet interface on there. Wi-Fi is disabled.
I read that something related to this was changed in 1.26, and that it will definitely be fixed in 1.27.
Environmental Info: K3s Version: k3s version v1.26.3+k3s1 (01ea3ff2) go version go1.19.7
Node(s) CPU architecture, OS, and Version: Linux waveland 6.1.16-2-MANJARO-ARM-RPI #1 SMP PREEMPT Sat Mar 11 21:46:57 UTC 2023 aarch64 GNU/Linux
Cluster Configuration: single node cluster
Describe the bug: K3s fails to start in dual stack mode with message "IPv6 was enabled but no IPv6 address was found on node" even though the node has an IPv6 address.
Steps To Reproduce: The error first occurred after upgrading (in the stable channel) from 1.25 to 1.26. Downgrading back to 1.25 exhibits the same error, however. config.yaml:
Expected behavior: K3s starts in dual stack mode
Actual behavior: k3s does not start
Additional context / logs: