techno-tim / k3s-ansible

The easiest way to bootstrap a self-hosted High Availability Kubernetes cluster. A fully automated HA k3s etcd install with kube-vip, MetalLB, and more. Build. Destroy. Repeat.
https://technotim.live/posts/k3s-etcd-ansible/
Apache License 2.0

Will there be support for setting cluster on different servers, with different users and passwords? #357

Closed ciobania closed 1 year ago

ciobania commented 1 year ago

Discussed in https://github.com/techno-tim/k3s-ansible/discussions/356

Originally posted by **ciobania** September 2, 2023

Hey, I've been following you for some time. Thanks for everything! I'm new to Ansible, and it looks pretty cool what can be achieved with it!

I've tried using this repo to set up my cluster, formed of 4 RPi 4s and some other machines I have lying around doing various things. I know the README says it only works with passwordless machines. I want to configure it such that 3 of the RPis will be masters, and the rest of the machines, including the remaining RPi, will be nodes. For obvious reasons, some of my machines have different users and different passwords, which means `group_vars/all.yml` needs to be tweaked.

I know I can modify `hosts.ini` to include `ansible_user` for each host, as well as include a vault. I have already done that, but I'm not really sure how to change the existing roles/playbooks to take advantage of the vault and the designated `ansible_user`.

My vault file looks like this:

```
hosts_passwords:
  s4master1: "server1_password"
  s4master2: "server2_password"
  s4master3: "server3_password"
  s4worker1: "server4_password"
```

My `hosts.ini` looks like this:

```
[master]
s4master1-prod ansible_host=192.168.2.10 ansible_user=s4master1 ansible_become_pass="{{ host_passwords['s4master1'] }}"
s4master2-prod ansible_host=192.168.2.11 ansible_user=s4master2 ansible_become_pass="{{ host_passwords['s4master2'] }}"
s4master3-prod ansible_host=192.168.2.12 ansible_user=s4master3 ansible_become_pass="{{ host_passwords['s4master3'] }}"

[node]
s4worker1-prod ansible_host=192.168.2.13 ansible_user=s4worker1 ansible_become_pass="{{ host_passwords['s4worker1'] }}"
#192.168.2.20 - thespis
#192.168.2.21 - anacreon
#192.168.1.12 - terminus
#192.168.1.52 - vm_discovery
#192.168.1.53 - vm_sphere

# only required if proxmox_lxc_configure: true
# must contain all proxmox instances that have a master or worker node
# [proxmox]
# 192.168.30.43

[k3s_cluster:children]
master
node
```

Can you please tell me which files I can modify to make this work with the specifics above? I don't understand how the roles work, and I'm being lazy here with this (wife and kids don't leave much time for learning, trial and error, and tinkering).
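For reference, one standard Ansible way to handle per-host credentials without touching the roles at all is vault-encrypted `host_vars` files next to the inventory. A minimal sketch, with file names assumed from the hostnames above (not part of the original post):

```
# host_vars/s4master1-prod.yml -- encrypt with: ansible-vault encrypt host_vars/s4master1-prod.yml
ansible_user: s4master1
ansible_become_pass: "server1_password"
```

Repeat per host; Ansible picks these files up automatically for the matching inventory hostname, so the playbook itself only needs `--ask-vault-pass` at run time.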

Currently I get the following error, which I don't understand how to fix, because I've already provided the vault in `site.yml`:

fatal: [s4worker1-prod]: FAILED! => {"msg": "The field 'become_pass' has an invalid value, which includes an undefined variable. The error was: 'host_passwords' is undefined. 'host_passwords' is undefined. 'host_passwords' is undefined. 'host_passwords' is undefined"}
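Two things seem to explain that particular failure (my reading of the snippets above, nothing confirmed later in the thread): the vault defines `hosts_passwords` while `hosts.ini` looks up `host_passwords`, and the vault file also has to actually be loaded for the play. Keeping the key name consistent would look like:

```
# vault.yml -- key name matches what hosts.ini references
host_passwords:
  s4master1: "server1_password"
  s4master2: "server2_password"
  s4master3: "server3_password"
  s4worker1: "server4_password"
```

loaded, if it isn't already pulled in via `vars_files` in `site.yml`, with something like `ansible-playbook site.yml -e @vault.yml --ask-vault-pass`.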
ciobania commented 1 year ago

I managed to figure out where I went wrong, but now I face another issue with the RPis. Understandably, some RPis might have regular Ubuntu on them, but mine run the ubuntu-server image, and they don't have a /boot/firmware folder.

Error is:

fatal: [s4master3-prod]: FAILED! => {"changed": false, "msg": "Destination /boot/firmware/cmdline.txt does not exist !", "rc": 257}
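For anyone hitting the same thing: on some Raspberry Pi images the file lives at `/boot/cmdline.txt` rather than `/boot/firmware/cmdline.txt`. A hedged sketch of how a task could locate it first (hypothetical, not the repo's actual task):

```
# Hypothetical guard: find cmdline.txt before editing it.
- name: Check whether /boot/firmware/cmdline.txt exists
  ansible.builtin.stat:
    path: /boot/firmware/cmdline.txt
  register: fw_cmdline

- name: Use whichever cmdline.txt the image actually has
  ansible.builtin.set_fact:
    cmdline_path: "{{ '/boot/firmware/cmdline.txt' if fw_cmdline.stat.exists else '/boot/cmdline.txt' }}"
```

Later tasks would then edit `{{ cmdline_path }}` instead of the hard-coded path.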
ciobania commented 1 year ago

Somehow I managed to get it working. The playbook finishes, but the master and nodes die. Upon reboot, the k3s service is not even there.

s4master3@s4master3-prod:~$ sudo systemctl status k3s.service
Unit k3s.service could not be found.

Among all the errors I was able to trace, there's one that doesn't make much sense:

{"level":"warn","ts":"2023-09-02T20:26:32.751079+0100","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x4000687880/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
ERRO[0234] Failed to check local etcd status for learner management: context deadline exceeded 
INFO[0234] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error 
{"level":"warn","ts":"2023-09-02T20:27:12.45887+0100","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x4000687880/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\""}
{"level":"info","ts":"2023-09-02T20:27:12.459011+0100","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
INFO[0274] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error 
{"level":"warn","ts":"2023-09-02T20:27:14.134933+0100","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x4000687880/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\""}

I've tried running it again after `reset.yml` and still no luck.

ciobania commented 1 year ago

After another round of cleaning and re-running the playbook, I get the following output, which I did not see before:

changed: [s4master1-prod] => {"changed": true, "cmd": ["systemd-run", "-p", "RestartSec=2", "-p", "Restart=on-failure", "--unit=k3s", "k3s", "server", "--cluster-init", "--token", "some-SUPER-DEDEUPER-secret-password", "--flannel-iface=eth0", "--node-ip=192.168.2.10", "--node-taint", "node-role.kubernetes.io/master=true:NoSchedule", "--tls-san", "192.168.2.9", "--disable", "servicelb", "#", "--disable", "traefik"], "delta": "0:00:00.072457", "end": "2023-09-02 21:41:34.325039", "msg": "", "rc": 0, "start": "2023-09-02 21:41:34.252582", "stderr": "Running as unit: k3s.service", "stderr_lines": ["Running as unit: k3s.service"], "stdout": "", "stdout_lines": []}
changed: [s4master2-prod] => {"changed": true, "cmd": ["systemd-run", "-p", "RestartSec=2", "-p", "Restart=on-failure", "--unit=k3s", "k3s", "server", "--server", "https://192.168.2.10:6443", "--token", "some-SUPER-DEDEUPER-secret-password", "--flannel-iface=eth0", "--node-ip=192.168.2.11", "--node-taint", "node-role.kubernetes.io/master=true:NoSchedule", "--tls-san", "192.168.2.9", "--disable", "servicelb", "#", "--disable", "traefik"], "delta": "0:00:00.062750", "end": "2023-09-02 21:41:34.363300", "msg": "", "rc": 0, "start": "2023-09-02 21:41:34.300550", "stderr": "Running as unit: k3s.service", "stderr_lines": ["Running as unit: k3s.service"], "stdout": "", "stdout_lines": []}
changed: [s4master3-prod] => {"changed": true, "cmd": ["systemd-run", "-p", "RestartSec=2", "-p", "Restart=on-failure", "--unit=k3s", "k3s", "server", "--server", "https://192.168.2.10:6443", "--token", "some-SUPER-DEDEUPER-secret-password", "--flannel-iface=eth0", "--node-ip=192.168.2.12", "--node-taint", "node-role.kubernetes.io/master=true:NoSchedule", "--tls-san", "192.168.2.9", "--disable", "servicelb", "#", "--disable", "traefik"], "delta": "0:00:00.065642", "end": "2023-09-02 21:41:34.394289", "msg": "", "rc": 0, "start": "2023-09-02 21:41:34.328647", "stderr": "Running as unit: k3s.service", "stderr_lines": ["Running as unit: k3s.service"], "stdout": "", "stdout_lines": []}
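Two observations on that output (my interpretation, not confirmed by the maintainers): `systemd-run` creates a transient `k3s` unit, so if the play dies before a persistent service file is installed, nothing survives a reboot, which would explain the earlier `Unit k3s.service could not be found`. Also, there is a literal `#` in the argument list ahead of `--disable traefik`, which suggests a comment character ended up inside the extra server arguments and is now being passed straight to k3s. Assuming the variable names from the repo's sample `group_vars/all.yml`, that value should look more like:

```
# group_vars/all.yml -- no inline '#' comments inside the value itself
extra_server_args: "--disable servicelb --disable traefik"
extra_agent_args: ""
```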
ciobania commented 1 year ago

I'm not sure how kube-vip is supposed to work, but it looks like `group_vars/all` is not referenced in the playbooks, just in molecule?

The deployment fails when checking if the nodes actually joined.

fatal: [s4master3-prod]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:00.412294", "end": "2023-09-02 22:45:58.487996", "msg": "non-zero return code", "rc": 1, "start": "2023-09-02 22:45:58.075702", "stderr": "E0902 22:45:58.467309    8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused\nE0902 22:45:58.468304    8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused\nE0902 22:45:58.470185    8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused\nE0902 22:45:58.472113    8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused\nE0902 22:45:58.473950    8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused\nThe connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?", "stderr_lines": ["E0902 22:45:58.467309    8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused", "E0902 22:45:58.468304    8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused", "E0902 22:45:58.470185    8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused", "E0902 22:45:58.472113    8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused", "E0902 22:45:58.473950    8742 memcache.go:265] couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused", "The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}
fatal: [s4master1-prod]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:01.130585", "end": "2023-09-02 22:46:08.018981", "msg": "", "rc": 0, "start": "2023-09-02 22:46:06.888396", "stderr": "", "stderr_lines": [], "stdout": "s4master1-prod", "stdout_lines": ["s4master1-prod"]}
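On the group_vars question above: the playbooks don't reference `group_vars/all.yml` explicitly because Ansible loads it automatically for every host when the file sits next to the inventory, and kube-vip advertises the `apiserver_endpoint` VIP that the `--tls-san` flag above points at. The values relevant to this join check are roughly these (key names as I recall them from the repo's sample config, values copied from the logged commands):

```
# group_vars/all.yml (relevant keys only)
apiserver_endpoint: "192.168.2.9"    # VIP kube-vip should bring up on the masters
k3s_token: "some-SUPER-DEDEUPER-secret-password"
flannel_iface: "eth0"
```

The `connection refused` on 127.0.0.1:6443 just means the k3s server on that node never came up, so the retry loop eventually gives up.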

Interrupting that check seems to leave the nodes in a slightly better state, and at least s4master1 is Ready:

kubectl get nodes

NAME             STATUS   ROLES                       AGE   VERSION
s4master1-prod   Ready    control-plane,etcd,master   26m   v1.28.1-rc2+k3s1
ciobania commented 1 year ago

I had to wipe all the RPis and start with a fresh install, and now it works. Thank you