techno-tim / k3s-ansible

The easiest way to bootstrap a self-hosted High Availability Kubernetes cluster. A fully automated HA k3s etcd install with kube-vip, MetalLB, and more. Build. Destroy. Repeat.
https://technotim.live/posts/k3s-etcd-ansible/
Apache License 2.0

Cluster fails to survive reboot - Invalid IP assigned to node #461

Closed: Turtlez32 closed this issue 8 months ago

Turtlez32 commented 8 months ago

I have 2 clusters:

Production - 6 nodes (3 etcd, 3 worker)
Development - 3 nodes (3 etcd)

Production is cloud-init-backed Ubuntu 22.04 machines with static IPs set in cloud-init. When building the cluster the first time, it comes online and runs without issues. After a power outage or reboot, all nodes come back online but the cluster is not available.

On reboot of development my nodes get these IPs:

1 - 10.0.99.104/24
2 - 10.0.99.105/24 + 10.0.99.104/32
3 - 10.0.99.106/24

Example ip a output on the node that kills the cluster:

ip a show eno1
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:23:24:9a:99:c4 brd ff:ff:ff:ff:ff:ff
    altname enp0s25
    inet 10.0.99.105/24 brd 10.0.99.255 scope global eno1
       valid_lft forever preferred_lft forever
    inet 10.0.99.104/32 scope global eno1
       valid_lft forever preferred_lft forever
    inet6 fe80::223:24ff:fe9a:99c4/64 scope link
       valid_lft forever preferred_lft forever
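
The extra 10.0.99.104/32 matches apiserver_endpoint, so it appears to be the kube-vip VIP stuck on this node. A minimal diagnostic sketch, assuming iproute2 and the eno1 interface shown above (kube-vip may re-add the address while its pod is running):

# List the IPv4 addresses on the node's primary interface
ip -4 addr show dev eno1

# Troubleshooting step only: temporarily drop the stray /32 so the node's
# own /24 address is the only one on the interface
sudo ip addr del 10.0.99.104/32 dev eno1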

Expected Behavior

Cluster starts up without issues

Current Behavior

Cluster is unable to start; a restart is required on all non-first nodes to trigger the IP move.
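
For reference, a minimal sketch of that manual workaround, assuming the k3s server service installed by the playbook is named k3s:

# On each server node other than the first (10.0.99.105 and 10.0.99.106 here)
sudo systemctl restart k3s

# Confirm the VIP has settled on a single node afterwards
ip -4 addr show | grep "10.0.99.104/32"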

Steps to Reproduce

  1. Build a new k3s cluster based on the playbook
  2. Define a static IP address for each node (cloud-init/Netplan; a netplan sketch follows this list)
  3. Reboot the cluster nodes
  4. Check ip a (the second node gets the first node's IP with a /32 subnet)
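
A minimal netplan sketch of the static addressing described in step 2, assuming the eno1 interface from the output above; the gateway and DNS values are placeholders, not taken from this issue:

# /etc/netplan/50-cloud-init.yaml (typically rendered by cloud-init on Ubuntu)
network:
  version: 2
  ethernets:
    eno1:
      dhcp4: false
      addresses:
        - 10.0.99.105/24        # the node's own static address
      routes:
        - to: default
          via: 10.0.99.1        # placeholder gateway
      nameservers:
        addresses: [10.0.99.1]  # placeholder DNS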

Context (variables)

Operating system: Ubuntu 22.04 | Debian 12
Hardware: Lenovo Tiny m900 (Production) | Lenovo Tiny M703 (Ubuntu 22.04 Server)

Variables Used

all.yml

k3s_version: v1.29.0+k3s1
systemd_dir: /etc/systemd/system
system_timezone: "Australia/Sydney"
flannel_iface: "eth0"
apiserver_endpoint: "10.0.99.104"
k3s_node_ip: '{{ ansible_facts[flannel_iface]["ipv4"]["address"] }}'
k3s_master_taint: "{{ true if groups['node'] | default([]) | length >= 1 else false }}"

kube_vip_tag_version: "v0.6.4"
metal_lb_type: "native"
metal_lb_mode: "layer2"
metal_lb_speaker_tag_version: "v0.13.12"
metal_lb_controller_tag_version: "v0.13.12"
metal_lb_ip_range: "10.0.99.110-10.0.99.115"
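
To see which node holds the apiserver_endpoint VIP after a reboot, an ad-hoc check across the masters can help; a sketch assuming the host.ini inventory below and the same SSH access the playbook uses:

# Ask every master which IPv4 addresses it currently holds
ansible -i host.ini master -m shell -a "ip -4 addr show"

# The /32 VIP should appear on at most one node at a time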

Hosts

host.ini

[master]
10.0.99.104
10.0.99.105
10.0.99.106

[k3s_cluster:children]
master

Possible Solution

I am wondering if this is a cloud-init, Proxmox, or k3s issue. I am only seeing this issue on nodes 1/2 of my clusters. It started happening about 2 months ago when I was using Debian 12; I saw there was a cloud-init bug about IPs, switched to Ubuntu 22.04, and am seeing the same issues.
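
To help rule cloud-init in or out, a small diagnostic sketch, assuming Ubuntu's default cloud-init/netplan layout (paths may differ on Debian 12):

# Confirm cloud-init completed cleanly after the reboot
cloud-init status --long

# Inspect the network config cloud-init rendered for netplan
cat /etc/netplan/50-cloud-init.yaml

# If the node's own /24 address is correct here and only the extra /32 is
# wrong, cloud-init is probably not the culprit, since the /32 VIP is
# managed by kube-vip rather than by the OS network config.
ip -4 addr show dev eno1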