techno-tim / k3s-ansible

The easiest way to bootstrap a self-hosted High Availability Kubernetes cluster. A fully automated HA k3s etcd install with kube-vip, MetalLB, and more. Build. Destroy. Repeat.
https://technotim.live/posts/k3s-etcd-ansible/
Apache License 2.0

Debian Bookworm moves cmdline.txt from /boot to /boot/firmware #457

Closed: pbolduc closed this issue 8 months ago

pbolduc commented 8 months ago

The Raspberry Pi documentation (https://www.raspberrypi.com/documentation/computers/config_txt.html) states that since Bookworm, the boot partition has been moved from /boot to /boot/firmware/. The "Activating cgroup support" task needs to pick the path based on the Raspbian / Raspberry Pi OS release.

With this broken, cgroups are not configured correctly and k3s will not start.
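
For reference, the flags that have to end up on the kernel command line are the memory cgroup ones k3s asks for on Raspberry Pi (cgroup_enable=memory cgroup_memory=1). On a Bookworm image they belong at the end of the single line in /boot/firmware/cmdline.txt, not /boot/cmdline.txt, roughly like this (other parameters abbreviated):

console=serial0,115200 ... rootwait cgroup_enable=memory cgroup_memory=1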

Expected Behavior

When installing on Bookworm, the cmdline.txt file (now under /boot/firmware/) should be updated correctly to enable cgroups.

Current Behavior

During setup, the master/control nodes cannot start the k3s server because cgroups are not configured. The task reports success, but it is modifying the wrong file.

TASK [raspberrypi : Activating cgroup support] ************************************************************************************************************************************************************
changed: [192.168.1.153]
changed: [192.168.1.151]
changed: [192.168.1.154]
changed: [192.168.1.150]
changed: [192.168.1.152]
changed: [192.168.1.155]
changed: [192.168.1.157]
changed: [192.168.1.156]
changed: [192.168.1.158]

Then the verify task times out and fails:

TASK [k3s_server : Verify that all nodes actually joined (check k3s-init.service if this fails)] **********************************************************************************************************
FAILED - RETRYING: [192.168.1.150]: Verify that all nodes actually joined (check k3s-init.service if this fails) (20 retries left).
...
FAILED - RETRYING: [192.168.1.150]: Verify that all nodes actually joined (check k3s-init.service if this fails) (1 retries left).
fatal: [192.168.1.150]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:00.273622", "end": "2024-02-25 18:10:42.404607", "msg": "non-zero return code", "rc": 1, "start": "2024-02-25 18:10:42.130985", "stderr": "The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?", "stderr_lines": ["The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}

Checking the logs with journalctl shows:

Feb 25 18:08:28 control-1 k3s[2271]: time="2024-02-25T18:08:28-08:00" level=fatal msg="failed to find memory cgroup (v2)"
Feb 25 18:08:28 control-1 systemd[1]: k3s-init.service: Main process exited, code=exited, status=1/FAILURE
Feb 25 18:08:28 control-1 systemd[1]: k3s-init.service: Failed with result 'exit-code'.
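
A quick way to confirm on the nodes themselves that the flags never reached the kernel is an ad-hoc play along these lines (just a sketch, not part of the playbook; the group name matches the inventory below):

- hosts: k3s_cluster
  gather_facts: false
  tasks:
    - name: Read the running kernel command line
      ansible.builtin.command: cat /proc/cmdline
      register: kernel_cmdline
      changed_when: false

    - name: Fail if the memory cgroup flags are missing
      ansible.builtin.assert:
        that:
          - "'cgroup_enable=memory' in kernel_cmdline.stdout"
          - "'cgroup_memory=1' in kernel_cmdline.stdout"
        fail_msg: "Flags are not active; check /boot/firmware/cmdline.txt and reboot"

If this assert fails even though the playbook ran, the lineinfile task edited a file the firmware never reads.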

Steps to Reproduce

  1. Install a fresh Bookworm-based Raspberry Pi OS on a Pi
  2. Install k3s using ansible-playbook site.yml

Context (variables)

Operating system: Raspberry Pi OS (Debian Bookworm)

Hardware:

9 Raspberry Pi 4s

Variables Used

all.yml

---
k3s_version: v1.29.1+k3s2
# this is the user that has ssh access to these machines
ansible_user: pico-k3s
systemd_dir: /etc/systemd/system

# Set your timezone
system_timezone: "America/Vancouver"

# interface which will be used for flannel
flannel_iface: "eth0"

# uncomment calico_iface to use tigera operator/calico cni instead of flannel https://docs.tigera.io/calico/latest/about
# calico_iface: "eth0"
calico_ebpf: false           # use eBPF dataplane instead of iptables
calico_tag: "v3.27.0"        # calico version tag

# uncomment cilium_iface to use cilium cni instead of flannel or calico
# ensure v4.19.57, v5.1.16, v5.2.0 or more recent kernel
#cilium_iface: "eth0"
cilium_mode: "native"        # native when nodes on same subnet or using bgp, else set routed
cilium_tag: "v1.15.1"        # cilium version tag
cilium_hubble: true          # enable hubble observability relay and ui

# if using calico or cilium, you may specify the cluster pod cidr pool
cluster_cidr: "10.52.0.0/16"

# enable cilium bgp control plane for lb services and pod cidrs. disables metallb.
cilium_bgp: true

# bgp parameters for cilium cni. only active when cilium_iface is defined and cilium_bgp is true.
cilium_bgp_my_asn: "64513"
cilium_bgp_peer_asn: "64512"
cilium_bgp_peer_address: "192.168.1.1"
cilium_bgp_lb_cidr: "192.168.10.0/24"   # cidr for cilium loadbalancer ipam

# apiserver_endpoint is virtual ip-address which will be configured on each master
apiserver_endpoint: "192.168.1.250"

# k3s_token is required so that masters can talk together securely
# this token should be alpha numeric only
k3s_token: "damn-i-should-have-changed-this-oh-well-it-is-just-a-home-lab-cluster-i-can-nuke"

# The IP on which the node is reachable in the cluster.
# Here, a sensible default is provided, you can still override
# it for each of your hosts, though.
k3s_node_ip: "{{ ansible_facts[(cilium_iface | default(calico_iface | default(flannel_iface)))]['ipv4']['address'] }}"

# Disable the taint manually by setting: k3s_master_taint = false
k3s_master_taint: "{{ true if groups['node'] | default([]) | length >= 1 else false }}"

# these arguments are recommended for servers as well as agents:
extra_args: >-
  {{ '--flannel-iface=' + flannel_iface if calico_iface is not defined and cilium_iface is not defined else '' }}
  --node-ip={{ k3s_node_ip }}

# change these to your liking, the only required are: --disable servicelb, --tls-san {{ apiserver_endpoint }}
# the contents of the if block is also required if using calico or cilium
extra_server_args: >-
  {{ extra_args }}
  {{ '--node-taint node-role.kubernetes.io/master=true:NoSchedule' if k3s_master_taint else '' }}
  {% if calico_iface is defined or cilium_iface is defined %}
  --flannel-backend=none
  --disable-network-policy
  --cluster-cidr={{ cluster_cidr | default('10.52.0.0/16') }}
  {% endif %}
  --tls-san {{ apiserver_endpoint }}
  --disable servicelb
  --disable traefik

extra_agent_args: >-
  {{ extra_args }}

# image tag for kube-vip
kube_vip_tag_version: "v0.6.4"

# tag for kube-vip-cloud-provider manifest
# kube_vip_cloud_provider_tag_version: "main"

# kube-vip ip range for load balancer
# (uncomment to use kube-vip for services instead of MetalLB)
# kube_vip_lb_ip_range: "192.168.30.80-192.168.30.90"

# metallb type frr or native
metal_lb_type: "native"

# metallb mode layer2 or bgp
metal_lb_mode: "layer2"

# bgp options
# metal_lb_bgp_my_asn: "64513"
# metal_lb_bgp_peer_asn: "64512"
# metal_lb_bgp_peer_address: "192.168.30.1"

# image tag for metal lb
metal_lb_speaker_tag_version: "v0.13.12"
metal_lb_controller_tag_version: "v0.13.12"

# metallb ip range for load balancer
metal_lb_ip_range: "192.168.30.80-192.168.30.90"

# Only enable if your nodes are proxmox LXC nodes, make sure to configure your proxmox nodes
# in your hosts.ini file.
# Please read https://gist.github.com/triangletodd/02f595cd4c0dc9aac5f7763ca2264185 before using this.
# Most notably, your containers must be privileged, and must not have nesting set to true.
# Please note this script disables most of the security of lxc containers, with the trade off being that lxc
# containers are significantly more resource efficient compared to full VMs.
# Mixing and matching VMs and lxc containers is not supported, ymmv if you want to do this.
# I would only really recommend using this if you have particularly low powered proxmox nodes where the overhead of
# VMs would use a significant portion of your available resources.
proxmox_lxc_configure: false
# the user that you would use to ssh into the host, for example if you run ssh some-user@my-proxmox-host,
# set this value to some-user
proxmox_lxc_ssh_user: root
# the unique proxmox ids for all of the containers in the cluster, both worker and master nodes
proxmox_lxc_ct_ids:
  - 200
  - 201
  - 202
  - 203
  - 204

# Only enable this if you have set up your own container registry to act as a mirror / pull-through cache
# (harbor / nexus / docker's official registry / etc).
# Can be beneficial for larger dev/test environments (for example if you're getting rate limited by docker hub),
# or air-gapped environments where your nodes don't have internet access after the initial setup
# (which is still needed for downloading the k3s binary and such).
# k3s's documentation about private registries here: https://docs.k3s.io/installation/private-registry
custom_registries: false
# The registries can be authenticated or anonymous, depending on your registry server configuration.
# If they allow anonymous access, simply remove the following bit from custom_registries_yaml
#   configs:
#     "registry.domain.com":
#       auth:
#         username: yourusername
#         password: yourpassword
# The following is an example that pulls all images used in this playbook through your private registries.
# It also allows you to pull your own images from your private registry, without having to use imagePullSecrets
# in your deployments.
# If all you need is your own images and you don't care about caching the docker/quay/ghcr.io images,
# you can just remove those from the mirrors: section.
custom_registries_yaml: |
  mirrors:
    docker.io:
      endpoint:
        - "https://registry.domain.com/v2/dockerhub"
    quay.io:
      endpoint:
        - "https://registry.domain.com/v2/quayio"
    ghcr.io:
      endpoint:
        - "https://registry.domain.com/v2/ghcrio"
    registry.domain.com:
      endpoint:
        - "https://registry.domain.com"

  configs:
    "registry.domain.com":
      auth:
        username: yourusername
        password: yourpassword

# Only enable and configure these if you access the internet through a proxy
# proxy_env:
#   HTTP_PROXY: "http://proxy.domain.local:3128"
#   HTTPS_PROXY: "http://proxy.domain.local:3128"
#   NO_PROXY: "*.domain.local,127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16"

Hosts

host.ini

[master]
192.168.1.150 ansible_ssh_private_key_file=~/.ssh/pico-ecdsa

[node]
192.168.1.151 ansible_ssh_private_key_file=~/.ssh/pico-ecdsa
192.168.1.152 ansible_ssh_private_key_file=~/.ssh/pico-ecdsa
192.168.1.153 ansible_ssh_private_key_file=~/.ssh/pico-ecdsa
192.168.1.154 ansible_ssh_private_key_file=~/.ssh/pico-ecdsa
192.168.1.155 ansible_ssh_private_key_file=~/.ssh/pico-ecdsa
192.168.1.156 ansible_ssh_private_key_file=~/.ssh/pico-ecdsa
192.168.1.157 ansible_ssh_private_key_file=~/.ssh/pico-ecdsa
192.168.1.158 ansible_ssh_private_key_file=~/.ssh/pico-ecdsa

# only required if proxmox_lxc_configure: true
# must contain all proxmox instances that have a master or worker node
# [proxmox]
# 192.168.30.43

[k3s_cluster:children]
master
node

Possible Solution

The path to cmdline.txt is specified in this file. I suspect that in main.yml, when Bookworm is detected, a variable (or something similar) needs to change the path.
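
Something along these lines could work (a sketch only; the variable name and regexp are illustrative rather than the playbook's actual code, and it assumes Ansible reports Bookworm as Debian 12):

# choose the boot path from the detected release, then reuse it in the existing task
- name: Set boot partition path (Bookworm moved it to /boot/firmware)
  ansible.builtin.set_fact:
    raspberry_pi_boot_path: "{{ '/boot/firmware' if (ansible_facts['distribution_major_version'] | int >= 12) else '/boot' }}"

- name: Activating cgroup support
  ansible.builtin.lineinfile:
    path: "{{ raspberry_pi_boot_path }}/cmdline.txt"
    regexp: '^((?!.*\bcgroup_enable=memory\b).*)$'
    line: '\1 cgroup_enable=memory cgroup_memory=1'
    backrefs: true

Checking ansible_facts['distribution_release'] == 'bookworm' would work just as well as the major-version comparison.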

yebo29 commented 8 months ago

https://github.com/techno-tim/k3s-ansible/pull/456 is meant to handle this. Awaiting feedback on a failing molecule test.

timothystewart6 commented 8 months ago

Closed by https://github.com/techno-tim/k3s-ansible/pull/456