I managed to figure out where I went wrong, but now I face another issue with the RPis. Understandably, some might have regular Ubuntu on them, but mine run the ubuntu-server image, and I don't have the /boot/firmware folder.
The error is:
fatal: [s4master3-prod]: FAILED! => {"changed": false, "msg": "Destination /boot/firmware/cmdline.txt does not exist !", "rc": 257}
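For what it's worth, k3s on a Pi needs the memory cgroup enabled on the kernel command line, and the file's location varies by image: /boot/firmware/cmdline.txt on Ubuntu and newer Raspberry Pi OS releases, /boot/cmdline.txt on older ones. A quick sketch (independent of the playbook) to check which layout an install uses and whether the flags are present:

```
# Which boot layout does this image use? (location varies by image/release)
ls -l /boot/firmware/cmdline.txt /boot/cmdline.txt 2>/dev/null

# k3s needs the memory cgroup enabled via the kernel command line:
grep -H 'cgroup_memory=1 cgroup_enable=memory' \
  /boot/firmware/cmdline.txt /boot/cmdline.txt 2>/dev/null
```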
Somehow I managed to get it working. The playbook finishes, but the master and nodes die. Upon reboot, the k3s service is not even there:
s4master3@s4master3-prod:~$ sudo systemctl status k3s.service
Unit k3s.service could not be found.
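One likely explanation, since the playbook starts k3s with systemd-run (visible in the output further down): systemd-run creates a transient unit, and transient units do not survive a reboot, so the unit disappearing is expected unless a persistent service file gets installed later in the run. A few commands to tell the two apart (a sketch; paths assume the default systemd layout):

```
# A transient unit only exists while loaded; is it running right now?
systemctl list-units --all --type=service | grep k3s

# A persistent unit would have a file on disk:
ls /etc/systemd/system/k3s*.service 2>/dev/null

# Logs from the last attempt outlive the transient unit:
journalctl -u k3s --no-pager | tail -n 50
```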
Among all the errors I was able to trace, there's one that doesn't make much sense:
{"level":"warn","ts":"2023-09-02T20:26:32.751079+0100","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x4000687880/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
ERRO[0234] Failed to check local etcd status for learner management: context deadline exceeded
INFO[0234] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error
{"level":"warn","ts":"2023-09-02T20:27:12.45887+0100","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x4000687880/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\""}
{"level":"info","ts":"2023-09-02T20:27:12.459011+0100","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
INFO[0274] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error
{"level":"warn","ts":"2023-09-02T20:27:14.134933+0100","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x4000687880/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\""}
I've tried to run it again after reset.yml, and still no luck.
After another round of cleaning and re-running the playbook, I get the following errors, which I did not have before:
changed: [s4master1-prod] => {"changed": true, "cmd": ["systemd-run", "-p", "RestartSec=2", "-p", "Restart=on-failure", "--unit=k3s", "k3s", "server", "--cluster-init", "--token", "some-SUPER-DEDEUPER-secret-password", "--flannel-iface=eth0", "--node-ip=192.168.2.10", "--node-taint", "node-role.kubernetes.io/master=true:NoSchedule", "--tls-san", "192.168.2.9", "--disable", "servicelb", "#", "--disable", "traefik"], "delta": "0:00:00.072457", "end": "2023-09-02 21:41:34.325039", "msg": "", "rc": 0, "start": "2023-09-02 21:41:34.252582", "stderr": "Running as unit: k3s.service", "stderr_lines": ["Running as unit: k3s.service"], "stdout": "", "stdout_lines": []}
changed: [s4master2-prod] => {"changed": true, "cmd": ["systemd-run", "-p", "RestartSec=2", "-p", "Restart=on-failure", "--unit=k3s", "k3s", "server", "--server", "https://192.168.2.10:6443", "--token", "some-SUPER-DEDEUPER-secret-password", "--flannel-iface=eth0", "--node-ip=192.168.2.11", "--node-taint", "node-role.kubernetes.io/master=true:NoSchedule", "--tls-san", "192.168.2.9", "--disable", "servicelb", "#", "--disable", "traefik"], "delta": "0:00:00.062750", "end": "2023-09-02 21:41:34.363300", "msg": "", "rc": 0, "start": "2023-09-02 21:41:34.300550", "stderr": "Running as unit: k3s.service", "stderr_lines": ["Running as unit: k3s.service"], "stdout": "", "stdout_lines": []}
changed: [s4master3-prod] => {"changed": true, "cmd": ["systemd-run", "-p", "RestartSec=2", "-p", "Restart=on-failure", "--unit=k3s", "k3s", "server", "--server", "https://192.168.2.10:6443", "--token", "some-SUPER-DEDEUPER-secret-password", "--flannel-iface=eth0", "--node-ip=192.168.2.12", "--node-taint", "node-role.kubernetes.io/master=true:NoSchedule", "--tls-san", "192.168.2.9", "--disable", "servicelb", "#", "--disable", "traefik"], "delta": "0:00:00.065642", "end": "2023-09-02 21:41:34.394289", "msg": "", "rc": 0, "start": "2023-09-02 21:41:34.328647", "stderr": "Running as unit: k3s.service", "stderr_lines": ["Running as unit: k3s.service"], "stdout": "", "stdout_lines": []}
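One thing stands out in those commands: the argv contains a literal "#" followed by "--disable", "traefik". An inline YAML comment inside the quoted server-args string is passed through verbatim, so k3s is started with a stray "#" argument, which may well be what kills the servers. The exact command line a unit was started with can be checked like this:

```
# Inspect the exact command line the (transient) unit was started with:
systemctl show k3s -p ExecStart --no-pager

# Or look at the running process directly:
ps -o args= -C k3s
```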
Not sure how kube-vip is supposed to work, but it looks like group_vars/all is not referenced in the playbooks, just in molecule?
The deployment fails when checking if the nodes actually joined.
fatal: [s4master3-prod]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:00.412294", "end": "2023-09-02 22:45:58.487996", "msg": "non-zero return code", "rc": 1, "start": "2023-09-02 22:45:58.075702", "stderr": "E0902 22:45:58.467309 8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused\nE0902 22:45:58.468304 8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused\nE0902 22:45:58.470185 8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused\nE0902 22:45:58.472113 8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused\nE0902 22:45:58.473950 8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused\nThe connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?", "stderr_lines": ["E0902 22:45:58.467309 8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused", "E0902 22:45:58.468304 8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused", "E0902 22:45:58.470185 8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused", "E0902 22:45:58.472113 8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused", "E0902 22:45:58.473950 8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused", "The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}
fatal: [s4master1-prod]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:01.130585", "end": "2023-09-02 22:46:08.018981", "msg": "", "rc": 0, "start": "2023-09-02 22:46:06.888396", "stderr": "", "stderr_lines": [], "stdout": "s4master1-prod", "stdout_lines": ["s4master1-prod"]}
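That check is just kubectl against the node's own API server, so the "connection refused" on s4master3 means nothing is serving on its 127.0.0.1:6443, while s4master1 answers and already sees itself. The same probe can be run by hand on each master:

```
# Is anything listening on the API port?
ss -tlnp | grep 6443

# Re-run the playbook's membership check manually:
k3s kubectl get nodes -l node-role.kubernetes.io/master=true \
  -o=jsonpath='{.items[*].metadata.name}'
```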
Interrupting that check seems to put the nodes in a slightly better position; at least s4master1 is up:
kubectl get nodes
NAME             STATUS   ROLES                       AGE   VERSION
s4master1-prod   Ready    control-plane,etcd,master   26m   v1.28.1-rc2+k3s1
I had to wipe all the RPis and start with a fresh install, and now it works. Thank you!
Discussed in https://github.com/techno-tim/k3s-ansible/discussions/356
Currently I get the following error, which I don't understand how to fix, because I've already provided the vault in site.yml:
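For context on the vault part: Ansible only decrypts vaulted values when a vault password is supplied at run time; referencing the vaulted file from site.yml alone is not enough. A minimal sketch (the inventory path is the repo's documented default and may differ):

```
# Prompt for the vault password interactively:
ansible-playbook site.yml -i inventory/my-cluster/hosts.ini --ask-vault-pass

# Or read it from a file kept out of version control:
ansible-playbook site.yml -i inventory/my-cluster/hosts.ini \
  --vault-password-file ~/.vault_pass.txt
```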