techno-tim / k3s-ansible

The easiest way to bootstrap a self-hosted High Availability Kubernetes cluster. A fully automated HA k3s etcd install with kube-vip, MetalLB, and more. Build. Destroy. Repeat.
https://technotim.live/posts/k3s-etcd-ansible/
Apache License 2.0

Kube Service: preparing server: failed to get CA certs #234

Closed (bsodmike closed this issue 1 year ago)

bsodmike commented 1 year ago

Hi all,

I'm testing a very basic clone of this playbook with a few basics changed, and I'm seeing the error below. It seems the Jinja templating is breaking at {.items[*].metadata.name}, which is here: https://github.com/techno-tim/k3s-ansible/blob/master/roles/k3s/master/tasks/main.yml#L34

TASK [k3s/master : Verify that all nodes actually joined (check k3s-init.service if this fails)] ***
FAILED - RETRYING: [10.0.3.79]: Verify that all nodes actually joined (check k3s-init.service if this fails) (20 retries left).
...
FAILED - RETRYING: [10.0.3.81]: Verify that all nodes actually joined (check k3s-init.service if this fails) (1 retries left).
FAILED - RETRYING: [10.0.3.79]: Verify that all nodes actually joined (check k3s-init.service if this fails) (1 retries left).
fatal: [10.0.3.81]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:00.104705", "end": "2023-02-16 13:44:24.915011", "msg": "non-zero return code", "rc": 1, "start": "2023-02-16 13:44:24.810306", "stderr": "The connection to the server localhost:8080 was refused - did you specify the right host or port?", "stderr_lines": ["The connection to the server localhost:8080 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}

fatal: [10.0.3.79]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:00.101943", "end": "2023-02-16 13:44:24.929091", "msg": "", "rc": 0, "start": "2023-02-16 13:44:24.827148", "stderr": "", "stderr_lines": [], "stdout": "k3s-1.debian11.homelab.com", "stdout_lines": ["k3s-1.debian11.homelab.com"]}

I can confirm that the kube-vip instance is running and the script fails due to the issue above.

bsodmike commented 1 year ago

Dug a bit deeper and the issue is elsewhere. This is on one of the master nodes:

Feb 16 14:40:11 k3s-3.debian11.homelab.com python3[22103]: ansible-ansible.legacy.command Invoked with _raw_params=k3s kubectl get nodes -l "node-role.kubernetes.io/master=true" -o=jsonpath="{.items[*].metadata.name}" _uses_shell=False stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Feb 16 14:40:11 k3s-3.debian11.homelab.com k3s[22038]: time="2023-02-16T14:40:11+05:30" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: Get \"https://10.0.3.79:6443/cacerts\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Feb 16 14:40:11 k3s-3.debian11.homelab.com systemd[1]: k3s-init.service: Main process exited, code=exited, status=1/FAILURE
Feb 16 14:40:11 k3s-3.debian11.homelab.com systemd[1]: k3s-init.service: Failed with result 'exit-code'.

timothystewart6 commented 1 year ago

Hi, can you please fill out the issue template that was supplied when you created the issue? Thank you!

bsodmike commented 1 year ago

Expected Behavior

According to the YouTube video, at least, the additional master nodes should join the main node, which runs kube-vip.

Current Behavior

This does not happen. Instead, the 2nd and 3rd master nodes are unable to join the main (primary) master node because fetching its CA certs fails:

Feb 16 14:40:11 k3s-3.debian11.homelab.com python3[22103]: ansible-ansible.legacy.command Invoked with _raw_params=k3s kubectl get nodes -l "node-role.kubernetes.io/master=true" -o=jsonpath="{.items[*].metadata.name}" _uses_shell=False stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Feb 16 14:40:11 k3s-3.debian11.homelab.com k3s[22038]: time="2023-02-16T14:40:11+05:30" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: Get \"https://10.0.3.79:6443/cacerts\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Feb 16 14:40:11 k3s-3.debian11.homelab.com systemd[1]: k3s-init.service: Main process exited, code=exited, status=1/FAILURE
Feb 16 14:40:11 k3s-3.debian11.homelab.com systemd[1]: k3s-init.service: Failed with result 'exit-code'.
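
One way to narrow this down is to request the same URL by hand from the failing node. The following is a minimal Python sketch, not part of the playbook: probe_cacerts is a hypothetical helper name, and the default port is the one shown in the log above. Certificate verification is switched off on the assumption that the node does not yet trust the cluster CA, since the request is for the CA certificate itself.

```python
import ssl
import urllib.request


def probe_cacerts(host: str, port: int = 6443, timeout: float = 5.0) -> tuple[bool, str]:
    """Try to fetch https://<host>:<port>/cacerts and report what happened."""
    url = f"https://{host}:{port}/cacerts"
    # Verification is off on purpose: we are asking for the CA itself.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return True, f"HTTP {resp.status}, {len(resp.read())} bytes"
    except OSError as exc:
        # URLError, timeouts, refused connections and TLS errors are all
        # OSError subclasses, so one handler covers them.
        return False, f"{type(exc).__name__}: {exc}"
```

Running probe_cacerts("10.0.3.79") on k3s-3 and getting a timeout would point at the network path between the nodes (routing or a firewall) and not at the certificates themselves.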

Steps to Reproduce

Run the playbook with the default settings; this error should occur.

Context (variables)

Operating system: Debian 11

Hardware: VM: 16GB RAM / 2 vCPU / 40GB disk

Variables Used

all.yml

k3s_version: v1.24.10+k3s1
ansible_user: NA
systemd_dir: /etc/systemd/system

# interface which will be used for flannel
flannel_iface: "eth0"

# apiserver_endpoint is virtual ip-address which will be configured on each master
apiserver_endpoint: "10.0.3.85"

k3s_token: "NA"

# these arguments are recommended for servers as well as agents:
extra_args: >-
  --flannel-iface={{ flannel_iface }}
  --node-ip={{ k3s_node_ip }}

# change these to your liking, the only required are: --disable servicelb, --tls-san {{ apiserver_endpoint }}
extra_server_args: >-
  {{ extra_args }}
  {{ '--node-taint node-role.kubernetes.io/master=true:NoSchedule' if k3s_master_taint else '' }}
  --tls-san {{ apiserver_endpoint }}
  --disable servicelb
  --disable traefik
extra_agent_args: >-
  {{ extra_args }}

# image tag for kube-vip
kube_vip_tag_version: "v0.5.7"

# image tag for metal lb
metal_lb_frr_tag_version: "v7.5.1"
metal_lb_speaker_tag_version: "v0.13.7"
metal_lb_controller_tag_version: "v0.13.7"

# metallb ip range for load balancer
metal_lb_ip_range: "10.0.3.90-10.0.3.100"

Hosts

host.ini

[master]
10.0.3.79
10.0.3.80
10.0.3.81

[node]
10.0.3.82
10.0.3.83

# only required if proxmox_lxc_configure: true
# must contain all proxmox instances that have a master or worker node
# [proxmox]
# 192.168.30.43

[k3s_cluster:children]
master
node

Possible Solution

I was planning on setting up self-signed certs and seeing if that would work, but I'm just confused as to why this wasn't experienced when you made the video :). Thanks Tim!

Observations

FYI, I also noticed another error and fixed it by running /usr/local/bin/k3s kubectl create secret generic -n metallb-system memberlist --from-literal=secretkey="$(openssl rand -base64 128)". Without this, there were MetalLB errors in the logs.
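
For reference, the workaround command above can be written as a short Python sketch. It only builds the argument list and does not run anything. The helper name memberlist_secret_cmd is hypothetical, and the key comes from Python's secrets module in place of openssl rand -base64 128 (128 random bytes, base64-encoded on a single line).

```python
import base64
import secrets


def memberlist_secret_cmd(k3s_bin: str = "/usr/local/bin/k3s") -> list[str]:
    """Build the argv for creating the memberlist secret in metallb-system."""
    # 128 random bytes, base64-encoded without line wrapping.
    key = base64.b64encode(secrets.token_bytes(128)).decode("ascii")
    return [
        k3s_bin, "kubectl", "create", "secret", "generic",
        "-n", "metallb-system", "memberlist",
        f"--from-literal=secretkey={key}",
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) on a master node. Building an argument list avoids shell quoting issues with the generated key.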

bsodmike commented 1 year ago

If they do not match, create one master / server node and add additional servers outside of this playbook

Removing the 2nd/3rd master and trying this now. This got past the initial failure point:

TASK [k3s/master : Verify that all nodes actually joined (check k3s-init.service if this fails)] ***
FAILED - RETRYING: [10.0.3.79]: Verify that all nodes actually joined (check k3s-init.service if this fails) (20 retries left).
ok: [10.0.3.79]

However, it is now failing at:

TASK [k3s/node : Copy K3s service file] **************************************************
changed: [10.0.3.83]
changed: [10.0.3.82]

TASK [k3s/node : Enable and check K3s service] *******************************************

I find it strange that it is trying to fetch the CA cert (which doesn't exist anyway, as far as I'm aware) from the localhost address. Any ideas?

Feb 17 11:27:25 k3s-4.debian11.homelab.com k3s[29019]: time="2023-02-17T11:27:25+05:30" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Feb 17 11:27:27 k3s-4.debian11.homelab.com systemd[1]: Configuration file /etc/systemd/system/k3s-node.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 17 11:27:47 k3s-4.debian11.homelab.com k3s[29019]: time="2023-02-17T11:27:47+05:30" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Feb 17 11:28:09 k3s-4.debian11.homelab.com k3s[29019]: time="2023-02-17T11:28:09+05:30" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

bornav commented 1 year ago

I had the same failure point. The steps that helped me: make sure each host has a unique hostname, and make sure the hosts do not have any firewall rules blocking traffic (on all ports).
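
Both checks can be scripted. The following is a rough Python sketch under stated assumptions: the function names are hypothetical, and the port list is my reading of the K3s networking requirements for an HA embedded-etcd setup (6443 for the API server, 2379-2380 for etcd, 10250 for the kubelet), so adjust it to your environment. It tests TCP only, so it does not cover flannel VXLAN traffic on UDP 8472.

```python
import socket
from collections import Counter

# Assumption: TCP ports from the K3s networking requirements; not exhaustive.
K3S_TCP_PORTS = (6443, 2379, 2380, 10250)


def duplicate_hostnames(hostnames: dict[str, str]) -> list[str]:
    """Return hostnames reported by more than one host (input maps IP -> hostname)."""
    counts = Counter(hostnames.values())
    return sorted(name for name, n in counts.items() if n > 1)


def blocked_ports(host: str, ports=K3S_TCP_PORTS, timeout: float = 3.0) -> list[int]:
    """Return the TCP ports on host that could not be connected to.

    A port with no listener is reported too, so run this only after the
    k3s service on the target host has started.
    """
    blocked = []
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            blocked.append(port)
    return blocked
```

Run blocked_ports("10.0.3.79") from each of the other nodes. An empty list from every node suggests that a firewall is not the cause.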

bsodmike commented 1 year ago

Thanks @BornaV, let me double-check the local firewall.