techno-tim / k3s-ansible

The easiest way to bootstrap a self-hosted High Availability Kubernetes cluster. A fully automated HA k3s etcd install with kube-vip, MetalLB, and more. Build. Destroy. Repeat.
https://technotim.live/posts/k3s-etcd-ansible/
Apache License 2.0

Proxmox Cloud-init deploy, failing at "Copy vip manifest to first master" task. #361

Closed untraceablez closed 1 year ago

untraceablez commented 1 year ago

The issue occurs when running ansible-playbook site.yml: the playbook runs up until the "Copy vip manifest to first master" task, at which point it fails.

Expected Behavior

The playbook should run all the way through and set up an HA cluster with 3 control nodes and 7 worker nodes.

Current Behavior

The Ansible playbook runs fine until the "Copy vip manifest to first master" task, then stops, reporting a failure for all 3 master nodes; all 7 worker nodes complete the playbook without issue.
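
For reference, the failing task is the one that templates the kube-vip manifest onto the first control node. A rough sketch of what a task of that shape looks like (my own hedged reconstruction, not the repo's exact source; the template name, destination path, and condition are illustrative):

- name: Copy vip manifest to first master
  ansible.builtin.template:
    src: vip.yaml.j2      # Jinja2 template rendered with apiserver_endpoint, flannel_iface, kube_vip_tag_version
    dest: /var/lib/rancher/k3s/server/manifests/vip.yaml
    owner: root
    group: root
    mode: "0644"
  when: ansible_hostname == hostvars[groups['master'][0]]['ansible_hostname']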

Steps to Reproduce

  1. Forked the repo
  2. Changed the inventory to match my local LAN IPs for the VMs
  3. Changed variables in all.yml to match my local environment (including IPs, since I'm on a 10.0.0.0/24 network)
  4. Adjusted ansible.cfg to point at my inventory location and added my private SSH key (see the sketch after this list)
  5. Removed the raspberry-pi role from the playbook
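
For step 4, the ansible.cfg change amounts to something like this (a minimal sketch; the inventory path, key path, and user are example values, not my exact ones):

[defaults]
# example values only -- substitute your own inventory path, key, and remote user
inventory = inventory/my-cluster/hosts.ini
private_key_file = ~/.ssh/id_ed25519
remote_user = ansible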

Context (variables)

Operating system:

Hypervisor: Proxmox VE 8
Ansible Controller OS: Ubuntu 22.04
Node OS: Ubuntu 22.04, based on the Jammy cloud-init image

Hardware:

CPU: Intel i9-13900
RAM: 128 GB DDR5 5200MHz
Motherboard: MSI Pro Series Z790
Storage: NVMe RAID-10 array (4 x 1TB), HDD mirror array (2 x 6TB)

Variables Used

all.yml

k3s_version: "v1.25.12+k3s1"
ansible_user: ansible
systemd_dir: "/etc/systemd/system"

flannel_iface: "eth0"

apiserver_endpoint: "10.0.0.222"

k3s_token: "NA"

extra_server_args: >-
  {{ extra_args }}
  {{ '--node-taint node-role.kubernetes.io/master=true:NoSchedule' if k3s_master_taint else '' }}
  --tls-san {{ apiserver_endpoint }}
  --disable servicelb
  --disable traefik

extra_agent_args: >-
  --flannel-iface={{ flannel_iface }}
  --node-ip={{ k3s_node_ip }}

kube_vip_tag_version: "v0.5.12"

metal_lb_speaker_tag_version: "v0.13.9"
metal_lb_controller_tag_version: "v0.13.9"

metal_lb_ip_range: "10.0.0.80-10.0.0.90"
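
As a side note on the values above: apiserver_endpoint is the virtual IP that kube-vip advertises for the k3s API server, and metal_lb_ip_range is the address pool MetalLB hands out to LoadBalancer services. The annotated excerpt below just restates my values with that reading (the comments are my own, not from the repo docs):

apiserver_endpoint: "10.0.0.222"          # kube-vip VIP for the control plane; an otherwise-unused address on the node subnet
metal_lb_ip_range: "10.0.0.80-10.0.0.90"  # MetalLB pool for LoadBalancer services; kept clear of node IPs and the VIP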

Hosts

host.ini

[k3s_cluster:children]
master
node

[master]
node01 ansible_host=10.0.0.178
node02 ansible_host=10.0.0.225
node03 ansible_host=10.0.0.47

[node]
node04 ansible_host=10.0.0.251
node05 ansible_host=10.0.0.142
node06 ansible_host=10.0.0.237
node07 ansible_host=10.0.0.137
node08 ansible_host=10.0.0.231
node09 ansible_host=10.0.0.118
node10 ansible_host=10.0.0.67
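
For context on how these groups are used: site.yml runs plays against k3s_cluster, master, and node roughly along these lines (a hedged outline from memory, not the repo's exact file):

- hosts: k3s_cluster
  become: true
  roles:
    - role: prereq        # common prerequisites on every node
    - role: download      # fetch the k3s binary
    # (the raspberry-pi role I removed normally runs in this first play too)

- hosts: master
  become: true
  roles:
    - role: k3s/master    # k3s server plus kube-vip and MetalLB manifests (where my run fails)

- hosts: node
  become: true
  roles:
    - role: k3s/node      # join the workers as k3s agents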

Possible Solution

Attached playbook error output: k3s_ansible_vip_manifest_playbook_error.txt

untraceablez commented 1 year ago

I actually resolved this by remaking the nodes from cloud-init and changing the naming scheme from node01, node02, etc. to control-node01... and worker-node01... Not sure why that made a difference, but it did! I suspect there's something else I did right this time that I'd only thought I'd done correctly before.
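
For anyone taking the same route, the renamed inventory ends up shaped roughly like this (the name-to-IP mapping here is illustrative, not my exact one):

[k3s_cluster:children]
master
node

[master]
control-node01 ansible_host=10.0.0.178
control-node02 ansible_host=10.0.0.225
control-node03 ansible_host=10.0.0.47

[node]
worker-node01 ansible_host=10.0.0.251
worker-node02 ansible_host=10.0.0.142
# ...worker-node03 through worker-node07 follow the same pattern with the remaining IPs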