kube-hetzner / terraform-hcloud-kube-hetzner

Optimized and Maintenance-free Kubernetes on Hetzner Cloud in one command!
MIT License

[Bug]: "waiting for the k3s server to start" #1148

Closed by janhaa 10 months ago

janhaa commented 10 months ago

Description

EDIT: This is actually a duplicate, see: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/issues/1145#issuecomment-1875459438

Unfortunately, provisioning the servers with terraform apply fails:

module.kube-hetzner.null_resource.first_control_plane: Still creating... [2m20s elapsed]
module.kube-hetzner.null_resource.first_control_plane (remote-exec): Job for k3s.service failed because the control process exited with error code.
module.kube-hetzner.null_resource.first_control_plane (remote-exec): See "systemctl status k3s.service" and "journalctl -xeu k3s.service" for details.
module.kube-hetzner.null_resource.first_control_plane (remote-exec): Waiting for the k3s server to start...
╷
│ Error: remote-exec provisioner error
│
│   with module.kube-hetzner.null_resource.first_control_plane,
│   on .terraform/modules/kube-hetzner/init.tf line 73, in resource "null_resource" "first_control_plane":
│   73:   provisioner "remote-exec" {
│
│ error executing "/tmp/terraform_2100530671.sh": Process exited with status 124
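
A note on the status code: 124 is the exit status GNU coreutils `timeout` returns when the command it wraps runs past its time limit. Together with the "Waiting for the k3s server to start..." line, this suggests the provisioning script timed out while retrying the service start, rather than hitting a separate error. The sketch below reproduces that behaviour in isolation; `wait_for` and its arguments are hypothetical names for illustration, not the module's actual script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command every 2 seconds until it succeeds,
# giving up after a limit in seconds. Requires GNU coreutils `timeout`.
wait_for() {
  local limit="$1"; shift
  # timeout runs an external program, so the retry loop is passed to a child bash.
  timeout "$limit" bash -c 'until "$@" >/dev/null 2>&1; do sleep 2; done' _ "$@"
}

# On a node, the call might look like: wait_for 120 systemctl is-active k3s
# Here `false` stands in for a service that never starts, so timeout
# kills the loop after 1 second.
status=0
wait_for 1 false || status=$?
echo "exit status: $status"
```

Seeing 124 therefore says only that the wait ran out; the actual cause has to be read from the service's journal on the node.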

Inspecting the control plane's journal yields:

Jan 04 20:10:19 k3s-control-plane-ads systemd[1]: Starting Lightweight Kubernetes...
Jan 04 20:10:19 k3s-control-plane-ads sh[1982]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Jan 04 20:10:19 k3s-control-plane-ads (k3s)[1988]: k3s.service: Failed to locate executable /usr/local/bin/k3s: Permission denied
Jan 04 20:10:19 k3s-control-plane-ads (k3s)[1988]: k3s.service: Failed at step EXEC spawning /usr/local/bin/k3s: Permission denied
Jan 04 20:10:19 k3s-control-plane-ads systemd[1]: k3s.service: Main process exited, code=exited, status=203/EXEC
Jan 04 20:10:19 k3s-control-plane-ads systemd[1]: k3s.service: Failed with result 'exit-code'.

However, the binary is owned by root:

k3s-control-plane-ads:~ # stat -c "%U %G" /usr/local/bin/k3s
root root

Manual run works fine:

k3s-control-plane-ads:~ # /usr/local/bin/k3s server
INFO[0000] Starting k3s v1.28.5+k3s1 (5b2d1271)
INFO[0000] Managed etcd cluster initializing
...

Thank you a lot for your efforts!

Kube.tf file

I only modified the nodepool settings:  

  control_plane_nodepools = [
    {
      name        = "control-plane",
      server_type = "cax11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 3
      # swap_size   = "2G" # remember to add the suffix, examples: 512M, 1G
      # zram_size   = "2G" # remember to add the suffix, examples: 512M, 1G
      # kubelet_args = ["kube-reserved=cpu=250m,memory=1500Mi,ephemeral-storage=1Gi", "system-reserved=cpu=250m,memory=300Mi"]

      # Enable automatic backups via Hetzner (default: false)
      # backups = true
    }
  ]

  agent_nodepools = [
    {
      name        = "agent-medium",
      server_type = "cax21",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 2
      # swap_size   = "2G" # remember to add the suffix, examples: 512M, 1G
      # zram_size   = "2G" # remember to add the suffix, examples: 512M, 1G
      # kubelet_args = ["kube-reserved=cpu=50m,memory=300Mi,ephemeral-storage=1Gi", "system-reserved=cpu=250m,memory=300Mi"]

      # Enable automatic backups via Hetzner (default: false)
      # backups = true
    }
  ]

Screenshots

No response

Platform

WSL

janhaa commented 10 months ago

Some digging with the help of almighty ChatGPT revealed an issue related to SELinux.

k3s-control-plane-1-myr:~ # sudo ausearch -m AVC -ts recent | grep k3s
type=AVC msg=audit(1704401173.178:542): avc:  denied  { execute } for  pid=2234 comm="(k3s)" name="k3s" dev="sda3" ino=279 scontext=system_u:system_r:init_t:s0 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=file permissive=0
type=AVC msg=audit(1704401178.471:544): avc:  denied  { execute } for  pid=2251 comm="(k3s)" name="k3s" dev="sda3" ino=279 scontext=system_u:system_r:init_t:s0 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=file permissive=0
type=AVC msg=audit(1704401183.721:546): avc:  denied  { execute } for  pid=2264 comm="(k3s)" name="k3s" dev="sda3" ino=279 scontext=system_u:system_r:init_t:s0 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=file permissive=0
...
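
The denial records above can be read field by field: `scontext` is the SELinux context of the process that was blocked, and `tcontext` is the label on the file it tried to execute. A minimal sketch that pulls out the type component of each, using one record copied from the output above (`avc_type` is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# Print the SELinux type (third colon-separated field) of the scontext= or
# tcontext= entry in a single AVC record.
avc_type() {
  local key="$1" line="$2"
  printf '%s\n' "$line" | sed -n "s/.* ${key}=\([^ ]*\).*/\1/p" | cut -d: -f3
}

# One record taken verbatim from the ausearch output.
line='type=AVC msg=audit(1704401173.178:542): avc:  denied  { execute } for  pid=2234 comm="(k3s)" name="k3s" dev="sda3" ino=279 scontext=system_u:system_r:init_t:s0 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=file permissive=0'

echo "source type: $(avc_type scontext "$line")"
echo "target type: $(avc_type tcontext "$line")"
```

For this record the source type is `init_t` (systemd) and the target type is `user_tmp_t`, so the denial concerns the file's label rather than its owner or mode bits. A manual run from a root shell presumably succeeds because that shell runs in a different, less restricted domain. `restorecon` resets the label to whatever the loaded policy assigns to that path, and `ls -Z /usr/local/bin/k3s` shows the current label.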

Running sudo restorecon -v /usr/local/bin/k3s allowed me to get past the issue on this control plane...

janhaa commented 10 months ago

After running sudo restorecon -v /usr/local/bin/k3s on all machines, the deployment works!
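
Applying the same command to every node can be scripted. The following is only a sketch, assuming root SSH access to each node (the prompts in this thread show root shells); `relabel_k3s`, the `DRY_RUN` switch, and the addresses are hypothetical and not part of the module.

```shell
#!/usr/bin/env bash
# Hypothetical helper: relabel the k3s binary on each given host over SSH.
# With DRY_RUN=1 the commands are printed instead of executed.
relabel_k3s() {
  local host
  for host in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "ssh root@${host} restorecon -v /usr/local/bin/k3s"
    else
      ssh "root@${host}" restorecon -v /usr/local/bin/k3s
    fi
  done
}

# Placeholder addresses from the documentation range; substitute real node IPs.
DRY_RUN=1 relabel_k3s 192.0.2.10 192.0.2.11
```

Running it first with `DRY_RUN=1` lists the commands without touching any node; dropping the variable executes them.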

Wayneoween commented 10 months ago

I'm observing the same issue. Fixing this once might be fine, but I presume the issue will come up again if a node is upgraded automatically?

CroutonDigital commented 10 months ago

Today 2 k3s nodes went into status NotReady, and a reboot did not help. I rolled the system back to a snapshot from 1 day ago using snapper rollback. After the restart, the k3s nodes came back to status Ready.

I then rebuilt SUSE MicroOS and tried to add a new k3s node, but it failed with the same errors:

module.kube-hetzner.null_resource.agents["2-2-bots-large"]: Still creating... [2m10s elapsed]
module.kube-hetzner.null_resource.agents["2-2-bots-large"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["2-2-bots-large"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["2-2-bots-large"]: Still creating... [2m20s elapsed]
╷
│ Error: remote-exec provisioner error
│ 
│   with module.kube-hetzner.null_resource.agents["2-2-bots-large"],
│   on .terraform/modules/kube-hetzner/agents.tf line 107, in resource "null_resource" "agents":
│  107:   provisioner "remote-exec" {
│ 
│ error executing "/tmp/terraform_1588448047.sh": Process exited with status 124

How can I add an additional node to k3s?

CroutonDigital commented 10 months ago

When I connect to the VM:

h-k3s-test-bots-large-wto:~ # journalctl -xeu k3s-agent
░░ The error number returned by this process is ERRNO.
Jan 05 07:49:17 h-k3s-test-bots-large-wto (k3s)[3475]: k3s-agent.service: Failed at step EXEC spawning /usr/local/bin/k3s: Permission denied
░░ Subject: Process /usr/local/bin/k3s could not be executed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ The process /usr/local/bin/k3s could not be executed and failed.
░░ 
░░ The error number returned by this process is ERRNO.
Jan 05 07:49:17 h-k3s-test-bots-large-wto systemd[1]: k3s-agent.service: Main process exited, code=exited, status=203/EXEC

PS: The autoscaler created 6 new VMs, but I don't see them in k3s )))))

CroutonDigital commented 10 months ago

restorecon -v /usr/local/bin/k3s helped, too

janhaa commented 10 months ago

See also for a possible workaround: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/issues/1145#issuecomment-1875459438

Silvest89 commented 10 months ago

See also for a possible workaround: #1145 (comment)

@mysticaltech What do you think of this issue and the workaround?

mysticaltech commented 10 months ago

@Silvest89 I think the workaround is safe to apply just after setup. I will introduce it right away, and I will also update the k3s SELinux package.

mysticaltech commented 10 months ago

@janhaa @CroutonDigital This is fixed in v2.11.4, please upgrade to it with terraform init -upgrade.

CroutonDigital commented 10 months ago

Thank you! All worked fine

Taronyuu commented 10 months ago

@mysticaltech I just ran into this issue while updating my cluster, remembered this issue and upgraded right away. All solved now. Just wanted to thank you for your effort 🙏🏻

jimping commented 9 months ago

I am getting the same error on the newest version, on a Mac, with a fresh install and an unchanged config (except for the hcloud token).

mysticaltech commented 9 months ago

@jimping Please open a new issue with all the details to reproduce.