k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Flannel Dualstack crash on 1.30.3 #10726

Closed: ungarscool1 closed this issue 1 month ago

ungarscool1 commented 2 months ago

Environmental Info: K3s Version: v1.30.3+k3s1 (f6466040)

Node(s) CPU architecture, OS, and Version:

  1. Linux REDACTED-server 6.8.0-38-generic #38-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 7 15:25:01 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  2. Linux kube-1 6.8.0-38-generic #38-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 7 15:25:01 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  3. Linux kube-2 6.8.0-38-generic #38-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 7 15:25:01 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 1 server, 2 agents. Flannel runs over WireGuard (and works fine with IPv4 only).

Describe the bug:

Aug 17 23:51:10 REDACTED-server k3s[1060103]: I0817 23:51:10.876617 1060103 kube.go:636] List of node(REDACTED-server) annotations: map[string]string{"alpha.kubernetes.io/provided-node-ip":"10.6.99.1,fdfe:a65f:20d::1", "csi.volume.kubernetes.io/nodeid":"{\"driver.longhorn.io\":\"REDACTED-server\"}", "etcd.k3s.cattle.io/local-snapshots-timestamp":"2024-08-17T23:16:35Z", "etcd.k3s.cattle.io/node-address":"10.6.99.1", "etcd.k3s.cattle.io/node-name":"REDACTED-server-89decbb8", "flannel.alpha.coreos.com/backend-data":"{\"VNI\":1,\"VtepMAC\":\"f2:4b:ca:0b:ef:e2\"}", "flannel.alpha.coreos.com/backend-type":"vxlan", "flannel.alpha.coreos.com/backend-v6-data":"{\"VNI\":1,\"VtepMAC\":\"2a:24:27:a4:a2:1b\"}", "flannel.alpha.coreos.com/kube-subnet-manager":"true", "flannel.alpha.coreos.com/public-ip":"10.6.99.1", "flannel.alpha.coreos.com/public-ipv6":"fdfe:a65f:20d::1", "k3s.io/external-ip":"REDACTED-public-IPv4,REDACTED-public-IPv6", "k3s.io/hostname":"REDACTED-server", "k3s.io/internal-ip":"10.6.99.1,fdfe:a65f:20d::1", "k3s.io/node-args":"[\"server\",\"--write-kubeconfig-mode\",\"644\",\"--tls-san\",\"REDACTED-public-IPv4,REDACTED-public-IPv6\",\"--flannel-iface\",\"wg0\",\"--node-ip\",\"10.6.99.1,fdfe:a65f:20d::1\",\"--node-external-ip\",\"REDACTED-public-IPv4,REDACTED-public-IPv6\",\"--advertise-address\",\"10.6.99.1\",\"--cluster-cidr\",\"10.42.0.0/16,2001:cafe:42::/56\",\"--service-cidr\",\"10.43.0.0/16,2001:cafe:43::/112\",\"--flannel-ipv6-masq\",\"--cluster-init\"]", "k3s.io/node-config-hash":"REDACTED", "k3s.io/node-env":"{}", "node.alpha.kubernetes.io/ttl":"0", "volumes.kubernetes.io/controller-managed-attach-detach":"true"}
Aug 17 23:51:10 REDACTED-server k3s[1060103]: I0817 23:51:10.876871 1060103 vxlan.go:155] Interface flannel.1 mac address set to: f2:4b:ca:0b:ef:e2
Aug 17 23:51:10 REDACTED-server k3s[1060103]: I0817 23:51:10.878692 1060103 vxlan.go:183] Interface flannel-v6.1 mac address set to: 2a:24:27:a4:a2:1b
Aug 17 23:51:10 REDACTED-server k3s[1060103]: panic: runtime error: invalid memory address or nil pointer dereference
Aug 17 23:51:10 REDACTED-server k3s[1060103]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x598197]
Aug 17 23:51:10 REDACTED-server k3s[1060103]: goroutine 27215 [running]:
Aug 17 23:51:10 REDACTED-server k3s[1060103]: math/big.(*Int).Bytes(0x0)
Aug 17 23:51:10 REDACTED-server k3s[1060103]:         /usr/local/go/src/math/big/int.go:527 +0x17
Aug 17 23:51:10 REDACTED-server k3s[1060103]: github.com/flannel-io/flannel/pkg/ip.(*IP6).ToIP(0x0)
Aug 17 23:51:10 REDACTED-server k3s[1060103]:         /go/pkg/mod/github.com/flannel-io/flannel@v0.25.4/pkg/ip/ip6net.go:82 +0x1c
Aug 17 23:51:10 REDACTED-server k3s[1060103]: github.com/flannel-io/flannel/pkg/ip.IP6Net.ToIPNet({0x0?, 0x0?})
Aug 17 23:51:10 REDACTED-server k3s[1060103]:         /go/pkg/mod/github.com/flannel-io/flannel@v0.25.4/pkg/ip/ip6net.go:175 +0x25
Aug 17 23:51:10 REDACTED-server k3s[1060103]: github.com/flannel-io/flannel/pkg/ip.EnsureV6AddressOnLink({0x0?, 0xc0168920d8?}, {0xc01d400700?, 0x30?}, {0x71c2e30, 0xc015f86c40})
Aug 17 23:51:10 REDACTED-server k3s[1060103]:         /go/pkg/mod/github.com/flannel-io/flannel@v0.25.4/pkg/ip/iface.go:298 +0x52
Aug 17 23:51:10 REDACTED-server k3s[1060103]: github.com/flannel-io/flannel/pkg/backend/vxlan.(*vxlanDevice).ConfigureIPv6(0xc0073391f0, {0x0?, 0xc001358550?}, {0xc01d400700?, 0x10?})
Aug 17 23:51:10 REDACTED-server k3s[1060103]:         /go/pkg/mod/github.com/flannel-io/flannel@v0.25.4/pkg/backend/vxlan/device.go:153 +0x50
Aug 17 23:51:10 REDACTED-server k3s[1060103]: github.com/flannel-io/flannel/pkg/backend/vxlan.(*VXLANBackend).RegisterNetwork(0xc00f644a68, {0x71f93c0, 0xc001358550}, 0xc001358550?, 0xc01cf00b00)
Aug 17 23:51:10 REDACTED-server k3s[1060103]:         /go/pkg/mod/github.com/flannel-io/flannel@v0.25.4/pkg/backend/vxlan/vxlan.go:228 +0xd25
Aug 17 23:51:10 REDACTED-server k3s[1060103]: github.com/k3s-io/k3s/pkg/agent/flannel.flannel({0x71f93c0, 0xc001358550}, 0xc023979fd0?, {0xc007415a40, 0x34}, {0xc007a48cf0, 0x2d}, 0x1, 0xb)
Aug 17 23:51:10 REDACTED-server k3s[1060103]:         /go/src/github.com/k3s-io/k3s/pkg/agent/flannel/flannel.go:82 +0x222
Aug 17 23:51:10 REDACTED-server k3s[1060103]: github.com/k3s-io/k3s/pkg/agent/flannel.Run.func1()
Aug 17 23:51:10 REDACTED-server k3s[1060103]:         /go/src/github.com/k3s-io/k3s/pkg/agent/flannel/setup.go:78 +0x46
Aug 17 23:51:10 REDACTED-server k3s[1060103]: created by github.com/k3s-io/k3s/pkg/agent/flannel.Run in goroutine 1
Aug 17 23:51:10 REDACTED-server k3s[1060103]:         /go/src/github.com/k3s-io/k3s/pkg/agent/flannel/setup.go:77 +0x152
Aug 17 23:51:11 REDACTED-server systemd[1]: k3s.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

Steps To Reproduce: Start the server with this configuration:

/usr/local/bin/k3s \
    server \
        '--write-kubeconfig-mode' \
        '644' \
        '--tls-san' \
        'REDACTED-public-IPv4,REDACTED-public-IPv6' \
        '--flannel-iface' \
        'wg0' \
        '--node-ip' \
        '10.6.99.1,fdfe:a65f:20d::1' \
        '--node-external-ip' \
        'REDACTED-public-IPv4,REDACTED-public-IPv6' \
        '--advertise-address' \
        '10.6.99.1' \
        '--cluster-cidr' \
        '10.42.0.0/16,2001:cafe:42::/56' \
        '--service-cidr' \
        '10.43.0.0/16,2001:cafe:43::/112' \
        '--flannel-ipv6-masq' \
        '--cluster-init'

WireGuard configuration (wg show output):

interface: wg0
  public key: REDACTED
  private key: (hidden)
  listening port: 51821

peer: REDACTED
  preshared key: (hidden)
  endpoint: REDACTED:20318
  allowed ips: 10.6.99.2/32, fdfe:a65f:20d::2/128
  latest handshake: 23 seconds ago
  transfer: 673.30 MiB received, 1.56 GiB sent

peer: REDACTED
  preshared key: (hidden)
  endpoint: REDACTED:37999
  allowed ips: 10.6.99.3/32, fdfe:a65f:20d::3/128
  latest handshake: 1 minute, 48 seconds ago
  transfer: 1.63 GiB received, 1.14 GiB sent

brandond commented 2 months ago

Just to be clear, this happened when trying to add IPv6 and etcd to a cluster that was started with sqlite and only IPv4?

ungarscool1 commented 2 months ago

My cluster was started with SQLite and IPv4 only. I switched from SQLite to etcd about 2 months ago. Now I am trying to add IPv6 so I can enable Traefik on both IPv4 and IPv6. However, I just read in the documentation that I can't enable dual-stack because I started my cluster with IPv4 only. So, do I really need to destroy my cluster?

hofq commented 2 months ago

Same issue here: tried switching from IPv4 to dual-stack, running a single node.

brandond commented 2 months ago

You can try deleting the node via kubectl delete node before restarting it as dual-stack.

The issue is that Kubernetes only assigns pod CIDRs to nodes when the node resource is created. If you try to switch from single-stack to dual-stack after the nodes have already joined the cluster, it won't add an IPv6 pod CIDR.
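
In case it helps the next person, a minimal sketch of that workflow for one node (the node name kube-1 and the service names are assumptions based on this thread; agents run the k3s-agent unit, the server runs k3s):

# Remove the node object so Kubernetes assigns fresh pod CIDRs when it re-joins
kubectl delete node kube-1

# On kube-1: add the dual-stack --node-ip to the agent config (the server
# carries --cluster-cidr/--service-cidr), then restart so the node re-registers
systemctl restart k3s-agent    # plain 'k3s' on the server node

# Verify the recreated node now carries both an IPv4 and an IPv6 pod CIDR
kubectl get node kube-1 -o jsonpath='{.spec.podCIDRs}'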

hofq commented 2 months ago

Looking good! Thank you very much. Maybe we can add error handling for this?

brandond commented 2 months ago

We don't technically support changing CIDRs or other core bits of CNI config after the cluster is up, and we don't want to be in the business of deleting nodes for people... but yes, Flannel could probably be fixed to not crash.
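
For context on where that crash comes from: per the stack trace above, vxlan.(*vxlanDevice).ConfigureIPv6 passes a zero-value IP6Net into ip.EnsureV6AddressOnLink, and (*IP6).ToIP then dereferences a nil *big.Int. A self-contained sketch of the failure mode and the kind of guard that would avoid it (the types below are simplified stand-ins for flannel's pkg/ip, not its real definitions):

package main

import (
	"fmt"
	"math/big"
)

// Simplified stand-in for flannel's IP6 (assumption: it wraps math/big.Int,
// consistent with the panic inside big.(*Int).Bytes in the trace above).
type IP6 big.Int

func (ip6 *IP6) Bytes() []byte {
	// Panics with a nil pointer dereference when ip6 == nil,
	// which is exactly what the trace shows at int.go:527.
	return (*big.Int)(ip6).Bytes()
}

// Simplified stand-in for flannel's IP6Net.
type IP6Net struct {
	IP        *IP6
	PrefixLen uint
}

// Empty reports whether the subnet was never assigned, e.g. because the
// node joined the cluster single-stack and has no IPv6 pod CIDR.
func (n IP6Net) Empty() bool {
	return n.IP == nil && n.PrefixLen == 0
}

func configureIPv6(subnet IP6Net) error {
	// The guard flannel could add: bail out with an error instead of
	// crashing when the node has no v6 lease.
	if subnet.Empty() {
		return fmt.Errorf("no IPv6 pod CIDR assigned to this node; skipping IPv6 configuration")
	}
	_ = subnet.IP.Bytes() // would panic here if the guard were missing
	return nil
}

func main() {
	if err := configureIPv6(IP6Net{}); err != nil {
		fmt.Println("error:", err) // graceful error instead of SIGSEGV
	}
}

With a check like this, a node that never received an IPv6 pod CIDR would log an error and skip IPv6 setup instead of taking down the whole k3s process.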

ungarscool1 commented 2 months ago

Hi @brandond

You can try deleting the node via kubectl delete node before restarting it as dual-stack.

Even agent nodes, or just the server?

brandond commented 2 months ago

Even agent nodes, or just the server?

All the nodes that you want to be dual-stack. As I said, they need to be deleted and recreated to get new CIDRs assigned.
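
A quick way to confirm which nodes picked up both CIDRs after re-joining (custom-columns is standard kubectl; .spec.podCIDRs is the real node field):

kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDRS:.spec.podCIDRs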

ungarscool1 commented 1 month ago

Hello, I tried your solution, but I ran into an issue with Longhorn: none of my nodes came back up on their own, and the K3s server crashed from the resulting flood of events. So I rolled back, but thank you. Since Longhorn is outside the scope of this issue, I am closing it.