techno-tim / k3s-ansible

The easiest way to bootstrap a self-hosted High Availability Kubernetes cluster. A fully automated HA k3s etcd install with kube-vip, MetalLB, and more. Build. Destroy. Repeat.
https://technotim.live/posts/k3s-etcd-ansible/
Apache License 2.0

Pi Cluster won't finish install #478

Closed: queso closed this issue 7 months ago

queso commented 7 months ago

I have a PXE/NFS-booted Pi cluster. Everything appears to install, but the playbook then hangs while checking whether the server nodes have joined.

Expected Behavior

Setup should finish

Current Behavior

Setup errors out:

FAILED - RETRYING: [10.0.10.103]: Verify that all nodes actually joined (check k3s-init.service if this fails) (1 retries left).
fatal: [10.0.10.103]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:00.337718", "end": "2024-03-13 09:04:56.231084", "msg": "non-zero return code", "rc": 1, "start": "2024-03-13 09:04:55.893366", "stderr": "E0313 09:04:56.214169    3672 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused\nE0313 09:04:56.216130    3672 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused\nE0313 09:04:56.217650    3672 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused\nE0313 09:04:56.219171    3672 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused\nE0313 09:04:56.220830    3672 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused\nThe connection to the server localhost:8080 was refused - did you specify the right host or port?", "stderr_lines": ["E0313 09:04:56.214169    3672 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused", "E0313 09:04:56.216130    3672 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused", "E0313 09:04:56.217650    3672 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused", "E0313 09:04:56.219171    3672 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused", "E0313 09:04:56.220830    3672 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused", "The connection to the server localhost:8080 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}

Steps to Reproduce

  1. Install Raspbian Bookworm
  2. Repeat 3x
  3. Set up the k3s-ansible playbook
  4. Run ansible-playbook site.yml -i inventory/valhalla/hosts.ini
  5. I ran the reset playbook in between setup runs (command sketched after this list).
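
For reference, the reset between attempts was along these lines (a sketch; assumes the playbook's stock reset.yml and my inventory path):

ansible-playbook reset.yml -i inventory/valhalla/hosts.ini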

Context (variables)

Operating system:

Raspbian Bookworm

Hardware:

Raspberry Pi 4 8 GB with PoE HATs

Variables Used

all.yml

---
k3s_version: v1.29.2+k3s1
# this is the user that has ssh access to these machines
ansible_user: pi
systemd_dir: /etc/systemd/system

# Set your timezone
system_timezone: "America/New_York"

# interface which will be used for flannel
flannel_iface: "eth0"

# apiserver_endpoint is virtual ip-address which will be configured on each master
apiserver_endpoint: "10.0.10.2"

# k3s_token is required so that masters can talk to each other securely
# this token should be alphanumeric only
k3s_token: "testBengalHouse"

# The IP on which the node is reachable in the cluster.
# Here, a sensible default is provided, you can still override
# it for each of your hosts, though.
k3s_node_ip: "{{ ansible_facts[flannel_iface]['ipv4']['address'] }}"

# Disable the taint manually by setting: k3s_master_taint = false
k3s_master_taint: "{{ true if groups['node'] | default([]) | length >= 1 else false }}"

# these arguments are recommended for servers as well as agents:
extra_args: >-
  {{ '--flannel-iface=' + flannel_iface + '' }}
  --node-ip={{ k3s_node_ip }}

# change these to your liking, the only required are: --disable servicelb, --tls-san {{ apiserver_endpoint }}
# the contents of the if block is also required if using calico or cilium
extra_server_args: >-
  {{ extra_args }}
  {{ '--node-taint node-role.kubernetes.io/master=true:NoSchedule' if k3s_master_taint else '' }}
  --tls-san {{ apiserver_endpoint }}
  --disable servicelb
  --disable traefik

extra_agent_args: >-
  {{ extra_args }}

# image tag for kube-vip
kube_vip_tag_version: "v0.7.2"

# tag for kube-vip-cloud-provider manifest
# kube_vip_cloud_provider_tag_version: "main"

# kube-vip ip range for load balancer
# (uncomment to use kube-vip for services instead of MetalLB)
# kube_vip_lb_ip_range: "192.168.30.80-192.168.30.90"

# metallb type frr or native
metal_lb_type: "native"

# metallb mode layer2 or bgp
metal_lb_mode: "layer2"

# bgp options
# metal_lb_bgp_my_asn: "64513"
# metal_lb_bgp_peer_asn: "64512"
# metal_lb_bgp_peer_address: "192.168.30.1"

# image tag for metal lb
metal_lb_speaker_tag_version: "v0.14.3"
metal_lb_controller_tag_version: "v0.14.3"

# metallb ip range for load balancer
metal_lb_ip_range: "10.0.10.150-10.0.10.250"

Hosts

hosts.ini

[master]
10.0.10.101
10.0.10.102
10.0.10.103

[node]
10.0.10.104

# only required if proxmox_lxc_configure: true
# must contain all proxmox instances that have a master or worker node
# [proxmox]
# 192.168.30.43

[k3s_cluster:children]
master
node

Possible Solution

I did connect in to see what get nodes would show:

pi@valhalla1:~ $ sudo k3s kubectl get nodes
No resources found

I also ran k3s check-config, and that came back clean on the box.
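
Since the Ansible error says to check k3s-init.service, this is roughly how I poked at it on a server node (a sketch; assumes the transient k3s-init.service unit this playbook creates during first bring-up):

# status of the bootstrap unit the playbook creates
sudo systemctl status k3s-init.service
# follow its logs while the "Verify that all nodes actually joined" task retries
sudo journalctl -u k3s-init.service -f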

queso commented 7 months ago

I can see during the install that it is creating namespaces:

Name:         default
Labels:       kubernetes.io/metadata.name=default
Annotations:  <none>
Status:       Active

No resource quota.

No LimitRange resource.

Name:         kube-node-lease
Labels:       kubernetes.io/metadata.name=kube-node-lease
Annotations:  <none>
Status:       Active

No resource quota.

No LimitRange resource.

Name:         kube-public
Labels:       kubernetes.io/metadata.name=kube-public
Annotations:  <none>
Status:       Active

No resource quota.

No LimitRange resource.

Name:         kube-system
Labels:       kubernetes.io/metadata.name=kube-system
Annotations:  <none>
Status:       Active

No resource quota.

No LimitRange resource.

Name:         metallb-system
Labels:       kubernetes.io/metadata.name=metallb-system
              objectset.rio.cattle.io/hash=fc1016f2d449e33945c25d61c449a1c8b3278935
              pod-security.kubernetes.io/audit=privileged
              pod-security.kubernetes.io/enforce=privileged
              pod-security.kubernetes.io/warn=privileged
Annotations:  objectset.rio.cattle.io/applied:
                H4sIAAAAAAAA/4yQzU7DMBCEXwXN2QmkSUtjiQNnJI7cN/a2NXHsyN6mqqq+O0oREiBRerTmx/PtCTS6N07ZxQCNqYJC74KFxisNnEcyDIWBhSwJQZ9AIUQhcTHk+Rm7dzaSWcrkYm...
              objectset.rio.cattle.io/id:
              objectset.rio.cattle.io/owner-gvk: k3s.cattle.io/v1, Kind=Addon
              objectset.rio.cattle.io/owner-name: metallb-crds
              objectset.rio.cattle.io/owner-namespace: kube-system
Status:       Active

No resource quota.

No LimitRange resource.
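
For reference, that dump came from describing the namespaces with the bundled kubectl, roughly:

sudo k3s kubectl get namespaces
sudo k3s kubectl describe namespaces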
queso commented 7 months ago

I managed to turn on the k3s-init logs and see a lot of this:

Mar 13 16:09:24 valhalla1 k3s[4049]: time="2024-03-13T16:09:24-04:00" level=info msg="Reconciling ETCDSnapshotFile resources"
Mar 13 16:09:24 valhalla1 k3s[4049]: time="2024-03-13T16:09:24-04:00" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
Mar 13 16:09:24 valhalla1 k3s[4049]: time="2024-03-13T16:09:24-04:00" level=error msg="Failed to record snapshots for cluster: nodes \"valhalla1\" not found"
Mar 13 16:09:24 valhalla1 k3s[4049]: time="2024-03-13T16:09:24-04:00" level=info msg="Waiting for control-plane node valhalla1 startup: nodes \"valhalla1\" not found"
Mar 13 16:09:24 valhalla1 k3s[4049]: {"level":"info","ts":"2024-03-13T16:09:24.867777-0400","caller":"traceutil/trace.go:171","msg":"trace[1138961132] transaction","detail":"{read_only:false; response_revision:1396; number_of_response:1; }","duration":"107.321117ms","start":"2024-03-13T16:09:24.760415-0400","end":"2024-03-13T16:09:24.867736-0400","steps":["trace[1138961132] 'process raft request'  (duration: 107.091305ms)"],"step_count":1}
Mar 13 16:09:25 valhalla1 k3s[4049]: W0313 16:09:25.131721    4049 handler_proxy.go:93] no RequestInfo found in the context
Mar 13 16:09:25 valhalla1 k3s[4049]: E0313 16:09:25.132333    4049 controller.go:113] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: Error, could not get list of group versions for APIService
Mar 13 16:09:25 valhalla1 k3s[4049]: I0313 16:09:25.132551    4049 controller.go:126] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
Mar 13 16:09:25 valhalla1 k3s[4049]: W0313 16:09:25.132385    4049 handler_proxy.go:93] no RequestInfo found in the context
Mar 13 16:09:25 valhalla1 k3s[4049]: E0313 16:09:25.133030    4049 controller.go:102] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
Mar 13 16:09:25 valhalla1 k3s[4049]: , Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
Mar 13 16:09:25 valhalla1 k3s[4049]: I0313 16:09:25.133559    4049 controller.go:109] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
queso commented 7 months ago

So it looks like it is related to NFS and how containerd works: the default overlayfs snapshotter wasn't working on the NFS-backed root.
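
One way to confirm that from the node (a sketch; assumes the default k3s data dir, and not necessarily the exact commands I ran):

# look for snapshotter/overlay errors in the containerd log that k3s manages
sudo grep -i overlay /var/lib/rancher/k3s/agent/containerd/containerd.log
# the k3s-init journal may show related failures from the agent side
sudo journalctl -u k3s-init.service | grep -iE 'overlayfs|snapshotter'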

I had to install fuse-overlayfs: sudo apt-get install fuse-overlayfs

and then I added this to my extra server args:

--snapshotter=fuse-overlayfs
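
For completeness, this is roughly how the flag folds into the extra_server_args block from the all.yml posted above (same variables; adjust to taste):

extra_server_args: >-
  {{ extra_args }}
  {{ '--node-taint node-role.kubernetes.io/master=true:NoSchedule' if k3s_master_taint else '' }}
  --tls-san {{ apiserver_endpoint }}
  --disable servicelb
  --disable traefik
  --snapshotter=fuse-overlayfs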