k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

K3s service does not start: failed to get CA certs #9343

Closed yannicschroeer closed 9 months ago

yannicschroeer commented 9 months ago

Environmental Info:

k3s version v1.28.6+k3s1 (39a00015) go version go1.20.13

Node(s) CPU architecture, OS, and Version:

openSUSE MicroOS , Linux 6.7.2-1-default

uname -a Linux k3s-control-plane-nbg1-lso 6.7.2-1-default #1 SMP PREEMPT_DYNAMIC Fri Jan 26 11:01:28 UTC 2024 (a52bf76) x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 2 servers, 3 agents

Describe the bug:

I'm having issues spinning up a K3s cluster using kube-hetzner. I also opened an issue in their repository (https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/issues/1201), as I am not certain what the root cause is. After a fresh installation the nodes are unable to start or connect, because the CA certs are not exposed. Starting the server via

/usr/local/bin/k3s server

results in

INFO[0000] Starting k3s v1.28.6+k3s1 (39a00015) 
FATA[0000] starting kubernetes: preparing server: failed to get CA certs: Get "https://10.255.0.101:6443/cacerts": dial tcp 10.255.0.101:6443: connect: connection refused 

Curling the cacerts endpoint does not work either.

curl -vk https://10.255.0.101:6443/cacerts
*   Trying 10.255.0.101:6443...
* connect to 10.255.0.101 port 6443 from 10.254.0.101 port 42566 failed: Connection refused
* Failed to connect to 10.255.0.101 port 6443 after 3 ms: Couldn't connect to server
* Closing connection
curl: (7) Failed to connect to 10.255.0.101 port 6443 after 3 ms: Couldn't connect to server

The node is available in the network.

ping 10.255.0.101 -c 4

PING 10.255.0.101 (10.255.0.101) 56(84) bytes of data.
64 bytes from 10.255.0.101: icmp_seq=1 ttl=63 time=3.40 ms
64 bytes from 10.255.0.101: icmp_seq=2 ttl=63 time=2.94 ms
64 bytes from 10.255.0.101: icmp_seq=3 ttl=63 time=3.04 ms
64 bytes from 10.255.0.101: icmp_seq=4 ttl=63 time=2.88 ms

--- 10.255.0.101 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 2.882/3.065/3.402/0.201 ms

lsof -i shows that there are no k3s-related services running at all. The same issue occurs on all nodes, which is of course why the CA cert endpoints are not exposed anywhere. I don't know whether any steps are missing, as I am not able to trace the exact steps terraform/kube-hetzner performs here.
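
The contrast between the successful ping and the refused connection can be reproduced with a small TCP probe. This is a minimal sketch, assuming bash (for its /dev/tcp redirection) and coreutils timeout are installed; port_open and probe are hypothetical helper names, not part of k3s or kube-hetzner.

```shell
# Hypothetical helper: succeeds only if a TCP connection to host:port can be opened.
# Uses bash's /dev/tcp pseudo-device; `timeout` guards against filtered ports that hang.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Report the result in words; host and port are supplied by the caller.
probe() {
  if port_open "$1" "$2"; then
    echo "open"
  else
    echo "closed or unreachable"
  fi
}
```

Running `probe 10.255.0.101 6443` from a neighbouring node would be one way to use it. A host that answers ping but is reported as "closed or unreachable" is most likely up with nothing listening on that port, which matches the "connection refused" in the k3s log.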

Steps To Reproduce:

Create a MicroOS Snapshot

alias createkh='tmp_script=$(mktemp) && curl -sSL -o "${tmp_script}" https://raw.githubusercontent.com/kube-hetzner/terraform-hcloud-kube-hetzner/master/scripts/create.sh && chmod +x "${tmp_script}" && "${tmp_script}" && rm "${tmp_script}"' 
createkh

kube.tf

locals {
  hcloud_token = local.credentials.hetznerCloudToken
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }

  hcloud_token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token

  enable_cert_manager = false

  ssh_public_key = file("~/.ssh/ssh.pub")
  ssh_private_key = file("~/.ssh/ssh")

  control_plane_nodepools = [
    {
      name        = "control-plane-fsn1",
      server_type = "cpx11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-nbg1",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-hel1",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "agent-small",
      server_type = "cpx11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "agent-large",
      server_type = "cpx21",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "egress",
      server_type = "cx21",
      location    = "fsn1",
      labels = [
        "node.kubernetes.io/role=egress"
      ],
      taints = [
        "node.kubernetes.io/role=egress:NoSchedule"
      ],
      floating_ip = true
      count = 1
    }
  ]

  load_balancer_type     = "lb11"
  load_balancer_location = "fsn1"

  dns_servers = [
    "1.1.1.1",
    "8.8.8.8",
    "2606:4700:4700::1111",
  ]
}

provider "hcloud" {
  token = local.hcloud_token
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig
  sensitive = true
}

variable "hcloud_token" {
  sensitive = true
  default   = ""
}

Apply it.

terraform apply

Expected behavior:

The k3s service starts and the vm exposes its ca certs to the other nodes on :6443.

Actual behavior:

The creation times out after ~10 minutes because the temporary terraform script runs into a timeout.

/tmp/terraform_*.sh

#!/bin/sh
systemctl start k3s 2> /dev/null
mkdir -p /var/post_install /var/user_kustomize
timeout 360 bash <<EOF
  until systemctl status k3s > /dev/null; do
    systemctl start k3s 2> /dev/null
    echo "Waiting for the k3s server to start..."
    sleep 3
  done
EOF

Additional context / logs:

journalctl -xeu k3s.service

░░ A start job for unit k3s.service has begun execution.
░░ 
░░ The job identifier is 736118.
Feb 02 08:45:29 k3s-control-plane-nbg1-lso sh[24464]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Feb 02 08:45:29 k3s-control-plane-nbg1-lso k3s[24471]: time="2024-02-02T08:45:29Z" level=info msg="Starting k3s v1.28.6+k3s1 (39a00015)"
Feb 02 08:45:29 k3s-control-plane-nbg1-lso k3s[24471]: time="2024-02-02T08:45:29Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: Get \"https://10.255.0.101:6443/cacerts\": dial tcp 10.255.0.101:6443: connect: connection refused"
Feb 02 08:45:29 k3s-control-plane-nbg1-lso systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ An ExecStart= process belonging to unit k3s.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.
Feb 02 08:45:29 k3s-control-plane-nbg1-lso systemd[1]: k3s.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ The unit k3s.service has entered the 'failed' state with result 'exit-code'.
Feb 02 08:45:29 k3s-control-plane-nbg1-lso systemd[1]: Failed to start Lightweight Kubernetes.
░░ Subject: A start job for unit k3s.service has failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ A start job for unit k3s.service has finished with a failure.
░░ 
░░ The job identifier is 736118 and the job result is failed.

/usr/local/bin/k3s server

INFO[0000] Starting k3s v1.28.6+k3s1 (39a00015)         
FATA[0000] starting kubernetes: preparing server: failed to get CA certs: Get "https://10.255.0.101:6443/cacerts": dial tcp 10.255.0.101:6443: connect: connection refused 

Installing k3s again on a node does not work either:

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.28.6+k3s1 sh -s server

[INFO]  Using v1.28.6+k3s1 as release
[INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.28.6+k3s1/sha256sum-amd64.txt
[INFO]  Skipping binary downloaded, installed k3s matches hash
[INFO]  Finding available k3s-selinux versions
transactional-update 4.5.0 started
Options: --no-selfupdate -d run mkdir -p /var/lib/rpm-state
Separate /var detected.
2024-02-02 08:57:48 tukit 4.5.0 started
2024-02-02 08:57:48 Options: --discard -c3 open 
2024-02-02 08:57:48 Using snapshot 3 as base for new snapshot 4.
2024-02-02 08:57:48 /var/lib/overlay/3/etc
2024-02-02 08:57:48 Syncing /etc of previous snapshot 2 as base into new snapshot "/.snapshots/4/snapshot"
2024-02-02 08:57:48 SELinux is enabled.
Relabeled /var/lib/rancher/k3s from unconfined_u:object_r:var_lib_t:s0 to unconfined_u:object_r:container_var_lib_t:s0
Relabeled /var/lib/rancher/k3s/agent from unconfined_u:object_r:var_lib_t:s0 to unconfined_u:object_r:container_var_lib_t:s0
Relabeled /var/lib/rancher/k3s/agent/containerd from unconfined_u:object_r:var_lib_t:s0 to unconfined_u:object_r:container_var_lib_t:s0
Relabeled /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs from unconfined_u:object_r:var_lib_t:s0 to unconfined_u:object_r:container_var_lib_t:s0
/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots not reset as customized by admin to unconfined_u:object_r:container_file_t:s0
Relabeled /var/lib/rancher/k3s/data from unconfined_u:object_r:var_lib_t:s0 to unconfined_u:object_r:k3s_data_t:s0
ID: 4
2024-02-02 08:57:50 Transaction completed.
2024-02-02 08:57:50 tukit 4.5.0 started
2024-02-02 08:57:50 Options: --discard call 4 mkdir -p /var/lib/rpm-state 
/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots not reset as customized by admin to unconfined_u:object_r:container_file_t:s0
2024-02-02 08:57:52 Executing `mkdir -p /var/lib/rpm-state`:
2024-02-02 08:57:52 Application returned with exit status 0.
2024-02-02 08:57:52 Transaction completed.
2024-02-02 08:57:52 tukit 4.5.0 started
2024-02-02 08:57:52 Options: --discard close 4 
/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots not reset as customized by admin to unconfined_u:object_r:container_file_t:s0
2024-02-02 08:57:54 No changes to the root file system - discarding snapshot.
2024-02-02 08:57:54 Merging changes in /etc into the running system.
2024-02-02 08:57:54 Discarding snapshot 4.
2024-02-02 08:57:54 Transaction completed.
transactional-update finished
transactional-update 4.5.0 started
Options: --no-selfupdate -d run zypper --gpg-auto-import-keys install -y k3s-selinux
Separate /var detected.
2024-02-02 08:57:56 tukit 4.5.0 started
2024-02-02 08:57:56 Options: --discard -c3 open 
2024-02-02 08:57:56 Using snapshot 3 as base for new snapshot 4.
2024-02-02 08:57:56 /var/lib/overlay/3/etc
2024-02-02 08:57:56 Syncing /etc of previous snapshot 2 as base into new snapshot "/.snapshots/4/snapshot"
2024-02-02 08:57:56 SELinux is enabled.
Relabeled /var/lib/rancher/k3s from unconfined_u:object_r:var_lib_t:s0 to unconfined_u:object_r:container_var_lib_t:s0
Relabeled /var/lib/rancher/k3s/agent from unconfined_u:object_r:var_lib_t:s0 to unconfined_u:object_r:container_var_lib_t:s0
Relabeled /var/lib/rancher/k3s/agent/containerd from unconfined_u:object_r:var_lib_t:s0 to unconfined_u:object_r:container_var_lib_t:s0
Relabeled /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs from unconfined_u:object_r:var_lib_t:s0 to unconfined_u:object_r:container_var_lib_t:s0
/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots not reset as customized by admin to unconfined_u:object_r:container_file_t:s0
Relabeled /var/lib/rancher/k3s/data from unconfined_u:object_r:var_lib_t:s0 to unconfined_u:object_r:k3s_data_t:s0
ID: 4
2024-02-02 08:57:57 Transaction completed.
2024-02-02 08:57:57 tukit 4.5.0 started
2024-02-02 08:57:57 Options: --discard call 4 zypper --gpg-auto-import-keys install -y k3s-selinux 
/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots not reset as customized by admin to unconfined_u:object_r:container_file_t:s0
2024-02-02 08:57:59 Executing `zypper --gpg-auto-import-keys install -y k3s-selinux`:
Loading repository data...
Reading installed packages...
'k3s-selinux' is already installed.
There is an update candidate for 'k3s-selinux', but it is locked. Use 'zypper removelock k3s-selinux' to unlock it.
There is an update candidate for 'k3s-selinux' from vendor 'openSUSE', while the current vendor is ''. Use 'zypper install k3s-selinux-1.4.stable.1-1.3.noarch' to install this candidate.
Resolving package dependencies...
Nothing to do.
2024-02-02 08:58:00 Application returned with exit status 0.
2024-02-02 08:58:01 Transaction completed.
2024-02-02 08:58:01 tukit 4.5.0 started
2024-02-02 08:58:01 Options: --discard close 4 
/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots not reset as customized by admin to unconfined_u:object_r:container_file_t:s0
2024-02-02 08:58:02 No changes to the root file system - discarding snapshot.
2024-02-02 08:58:02 Merging changes in /etc into the running system.
2024-02-02 08:58:02 Discarding snapshot 4.
2024-02-02 08:58:03 Transaction completed.
transactional-update finished
[INFO]  Skipping /usr/local/bin/kubectl symlink to k3s, already exists
[INFO]  Skipping /usr/local/bin/crictl symlink to k3s, already exists
[INFO]  Skipping /usr/local/bin/ctr symlink to k3s, already exists
[INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[INFO]  systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
░░ A start job for unit k3s.service has begun execution.
░░ 
░░ The job identifier is 754253.
Feb 02 09:00:50 k3s-control-plane-nbg1-lso sh[28860]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Feb 02 09:00:50 k3s-control-plane-nbg1-lso k3s[28867]: time="2024-02-02T09:00:50Z" level=info msg="Starting k3s v1.28.6+k3s1 (39a00015)"
Feb 02 09:00:50 k3s-control-plane-nbg1-lso k3s[28867]: time="2024-02-02T09:00:50Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: Get \"https://10.255.0.101:6443/cacerts\": dial tcp 10.255.0.101:6443: connect: connection refused"
Feb 02 09:00:50 k3s-control-plane-nbg1-lso systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ An ExecStart= process belonging to unit k3s.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.
Feb 02 09:00:50 k3s-control-plane-nbg1-lso systemd[1]: k3s.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ The unit k3s.service has entered the 'failed' state with result 'exit-code'.
Feb 02 09:00:50 k3s-control-plane-nbg1-lso systemd[1]: Failed to start Lightweight Kubernetes.
░░ Subject: A start job for unit k3s.service has failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ A start job for unit k3s.service has finished with a failure.
░░ 
░░ The job identifier is 754253 and the job result is failed.

brandond commented 9 months ago

What host is 10.255.0.101? The message indicates that the node you're running these commands on is configured to join an existing K3s cluster that has that node as a server. If this is the first server in the cluster, you should be starting it without the address of an existing server to join in the config.

If you're going to use kube-hetzner then please refer to their docs; if you're going to try to do this by hand then look at https://docs.k3s.io/quick-start
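
One way to check whether a node is configured to join an existing server is to look for a `server:` entry in its k3s config file. The sketch below assumes the default config location /etc/rancher/k3s/config.yaml and a flat YAML file with an optional top-level `server:` key; join_target and describe_role are hypothetical helper names, and a join address could also be set elsewhere (for example via K3S_URL in the service environment file).

```shell
# Hypothetical helper: print the join target from a k3s config file, if any.
# Assumes a flat YAML file with an optional top-level "server:" key.
join_target() {
  sed -n 's/^server:[[:space:]]*//p' "$1" | tr -d "\"'"
}

# Summarise in words whether the config points at an existing server.
describe_role() {
  target="$(join_target "$1")"
  if [ -n "$target" ]; then
    echo "joins existing cluster at $target"
  else
    echo "no server address set"
  fi
}
```

For example, `describe_role /etc/rancher/k3s/config.yaml` on each control-plane node would show which of them, if any, is expected to come up first without a join address.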

yannicschroeer commented 9 months ago

> What host is 10.255.0.101? The message indicates that the node you're running these commands on is configured to join an existing K3s cluster that has that node as a server. If this is the first server in the cluster, you should be starting it without the address of an existing server to join in the config.
>
> If you're going to use kube-hetzner then please refer to their docs; if you're going to try to do this by hand then look at https://docs.k3s.io/quick-start

That's a good hint. 10.255.0.101 is also a new node. This is a completely new cluster; there is no cluster to join, and all the nodes are cross-referencing each other. I had a cluster up and running before, so it's more likely to be a kube-hetzner or terraform state issue.

I will update this issue as soon as I find a solution - for future reference. I will most likely not bother you anymore. Thank you.

yannicschroeer commented 9 months ago

Positive. It was a terraform state issue. Apparently there were some weird things going on in the state even though I had destroyed the resources with terraform destroy. I had to delete the state (stored remotely in S3 in my case) for it to work as expected.

Thanks for the hint again.
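
For anyone hitting the same symptom, a quick check for leftover state after a destroy might look like the sketch below. It only counts the resource addresses that `terraform state list` prints; leftover_count is a hypothetical helper name, and which entries were actually stale in this case is not known.

```shell
# Hypothetical helper: count resource addresses read from stdin,
# e.g. the output of `terraform state list`. Blank lines are ignored.
leftover_count() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -c . || true
}
```

Used as `terraform state list | leftover_count` after `terraform destroy`, a non-zero result would suggest the state still tracks resources that a fresh apply might reuse.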