techno-tim / k3s-ansible

The easiest way to bootstrap a self-hosted High Availability Kubernetes cluster. A fully automated HA k3s etcd install with kube-vip, MetalLB, and more. Build. Destroy. Repeat.
https://technotim.live/posts/k3s-etcd-ansible/
Apache License 2.0

K3s Service stuck at activating on nodes #28

Closed · Anguianolabs closed 2 years ago

Anguianolabs commented 2 years ago

k3s-node.service does not start and is stuck on activating.

Expected Behavior

k3s-node.service should be running on the nodes.

Current Behavior

```
TASK [k3s/master : Create crictl symlink] **************************************
changed: [us-ga-cluster-02]
changed: [us-ga-cluster-01]
changed: [us-ga-cluster-03]

PLAY [node] *********************************************************************

TASK [Gathering Facts] **********************************************************
ok: [us-ga-worker-01]
ok: [us-ga-worker-02]

TASK [k3s/node : Copy K3s service file] *****************************************
changed: [us-ga-worker-01]
changed: [us-ga-worker-02]

TASK [k3s/node : Enable and check K3s service] **********************************
```

```
ubuntu@us-ga-worker-01:~$ sudo systemctl status k3s-node.service
● k3s-node.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s-node.service; enabled; vendor preset: enabled)
     Active: activating (start) since Fri 2022-05-20 15:49:17 EDT; 53s ago
       Docs: https://k3s.io
    Process: 2513 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 2522 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 2527 (k3s-agent)
      Tasks: 8
     Memory: 31.2M
     CGroup: /system.slice/k3s-node.service
             └─2527 /usr/local/bin/k3s agent

May 20 15:49:17 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:49:17-04:00" level=info msg="Starting k3s agent v1.23.6+k3s1 (418c3fa8)"
May 20 15:49:17 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:49:17-04:00" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [10.20.0.46:6443]"
May 20 15:49:23 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:49:23-04:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51760->127.0.0.1:6444: read: connection reset by peer"
May 20 15:49:29 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:49:29-04:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51768->127.0.0.1:6444: read: connection reset by peer"
May 20 15:49:35 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:49:35-04:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51776->127.0.0.1:6444: read: connection reset by peer"
May 20 15:49:42 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:49:42-04:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51784->127.0.0.1:6444: read: connection reset by peer"
May 20 15:49:48 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:49:48-04:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51792->127.0.0.1:6444: read: connection reset by peer"
May 20 15:49:54 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:49:54-04:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51800->127.0.0.1:6444: read: connection reset by peer"
May 20 15:50:00 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:50:00-04:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51808->127.0.0.1:6444: read: connection reset by peer"
May 20 15:50:06 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:50:06-04:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51816->127.0.0.1:6444: read: connection reset by peer"
```
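The load balancer line above shows 127.0.0.1:6444 forwarding straight to the apiserver_endpoint VIP at 10.20.0.46:6443, so the repeated /cacerts failures point at the VIP rather than the worker itself. A minimal sanity check from the failing worker, using the address and port from that log line (standard tooling, nothing playbook-specific):

```bash
# Is the VIP answering at all?
ping -c 3 10.20.0.46

# k3s serves the cluster CA bundle on the supervisor port; a healthy
# control plane behind the VIP should return certificate data here.
curl -vk https://10.20.0.46:6443/cacerts
```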

Steps to Reproduce

  1. `ansible-playbook playbooks/site.yml -K`

Context (variables)

Operating system: Ubuntu 20.04
Hardware: Proxmox VM

Variables Used:

all.yml

```yaml
k3s_version: v1.23.6+k3s1
# this is the user that has ssh access to these machines
ansible_user: ubuntu
systemd_dir: /etc/systemd/system

# set your timezone
system_timezone: "America/New_York"

# interface which will be used for flannel
flannel_iface: "eth0"

# apiserver_endpoint is the virtual IP address which will be configured on each master
apiserver_endpoint: "10.20.0.46"

# k3s_token is required so that masters can talk together securely
# this token should be alphanumeric only
k3s_token: "xxxxxxxxxxxxxxxxx"

# change these to your liking, the only required one is --no-deploy servicelb
extra_server_args: "--no-deploy servicelb --no-deploy traefik"
extra_agent_args: ""

# image tag for kube-vip
kube_vip_tag_version: "v0.4.3"

# image tag for metal lb
metal_lb_speaker_tag_version: "v0.12.1"
metal_lb_controller_tag_version: "v0.12.1"

# metallb ip range for load balancers
metal_lb_ip_range: "10.20.0.100-10.20.0.120"
```
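One thing worth double-checking in these values: apiserver_endpoint must be an unused address on the nodes' subnet, and once the first master is up, kube-vip should attach it to a master's NIC. A minimal sketch, assuming eth0 and the VIP above (that kube-vip binds the VIP to the flannel_iface interface is an assumption about this playbook's kube-vip template):

```bash
# Before running the playbook: a free VIP should not answer.
ping -c 3 10.20.0.46

# After the first master is up: the VIP should appear on the
# flannel interface of exactly one master.
ip addr show eth0 | grep 10.20.0.46
```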

Hosts

host.ini

```ini
[master]
us-ga-cluster-01
us-ga-cluster-02
us-ga-cluster-03

[node]
us-ga-worker-01
us-ga-worker-02

[k3s_cluster:children]
master
node
```

Possible Solution

Anguianolabs commented 2 years ago

The VIP is not reachable, so I'm wondering if that's the issue.
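If the VIP never comes up, the kube-vip logs on a master are the next place to look. A hedged sketch using crictl, which the master play symlinks earlier in the run; the grep pattern and container-ID placeholder are illustrative:

```bash
# On a master: find the kube-vip container, then dump its logs.
sudo crictl ps -a | grep -i vip
sudo crictl logs <container-id>   # substitute the ID from the line above
```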

vickyingle01 commented 2 years ago

I'm facing the same issue here. What if we only have one master? Do we need to set apiserver_endpoint to the same IP as the master? Kindly clarify.

timothystewart6 commented 2 years ago

Hi. This is an issue with k3s/kube-vip. I would use the version in the template; it will always be the latest tested version:

v1.23.4+k3s1

vickyingle01 commented 2 years ago

Tried v1.23.4+k3s1, same issue. The VIP is not responding. Stuck on "TASK [k3s/node : Enable and check K3s service]"

Status on the node is still `activating`, with the same error:

```
level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:42274->127.0.0.1:6444: read: connection reset by peer"
```

Running the playbook with `-vvv`:

```
TASK [k3s/node : Enable and check K3s service] ****************************************************************************************************************************************************************
task path: /home/ubuntu/k3s-ansible/roles/k3s/node/tasks/main.yml:11
Tuesday 24 May 2022  10:35:54 +0930 (0:00:00.672)       0:00:27.029 ***********
Using module file /usr/local/lib/python3.8/dist-packages/ansible/modules/systemd.py
Pipelining is enabled.
<172.31.23.xx> ESTABLISH SSH CONNECTION FOR USER: ubuntu
<172.31.23.xx> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="ubuntu"' -o ConnectTimeout=10 -o ControlPath=/home/ubuntu/.ansible/cp/50028a4cf5 172.31.23.xx '/bin/sh -c '"'"'sudo -H -S -n  -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-yqsrpeklwobgjpkvekflzdmhxideormf ; /usr/bin/python3'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Escalation succeeded
```
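For what it's worth, the task above is only the systemd module waiting for the unit to leave `activating`, so the useful signal is on the node rather than in the Ansible output. Standard systemd tooling is enough to watch it live:

```bash
# On the stuck worker: follow the agent's journal while the task waits.
sudo journalctl -u k3s-node.service -f

# One-shot view of the unit's state and its most recent log lines.
sudo systemctl status k3s-node.service --no-pager
```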
timothystewart6 commented 2 years ago

Please see this discussion: https://github.com/techno-tim/k3s-ansible/discussions/20