kube-hetzner / terraform-hcloud-kube-hetzner


[Bug]: Stuck on "Waiting for the k3s agent to start" #1089

Closed: melalj closed this issue 10 months ago

melalj commented 10 months ago

Description

While setting up the cluster, I got stuck on "(remote-exec): Waiting for the k3s agent to start...".

This has worked before, and now I don't know whether the issue is on Hetzner's side or in the kube-hetzner Terraform module.

Here's an extract of the log:

module.kube-hetzner.null_resource.agents["0-1-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-2-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-0-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-1-node-fsn1-cx41"]: Still creating... [1m50s elapsed]
module.kube-hetzner.null_resource.agents["0-2-node-fsn1-cx41"]: Still creating... [1m50s elapsed]
module.kube-hetzner.null_resource.agents["0-0-node-fsn1-cx41"]: Still creating... [1m50s elapsed]
module.kube-hetzner.null_resource.agents["0-1-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-2-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-0-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-0-node-fsn1-cx41"]: Still creating... [2m0s elapsed]
module.kube-hetzner.null_resource.agents["0-2-node-fsn1-cx41"]: Still creating... [2m0s elapsed]
module.kube-hetzner.null_resource.agents["0-1-node-fsn1-cx41"]: Still creating... [2m0s elapsed]
module.kube-hetzner.null_resource.agents["0-1-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-2-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-0-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-1-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-2-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-0-node-fsn1-cx41"]: Still creating... [2m10s elapsed]
module.kube-hetzner.null_resource.agents["0-1-node-fsn1-cx41"]: Still creating... [2m10s elapsed]
module.kube-hetzner.null_resource.agents["0-2-node-fsn1-cx41"]: Still creating... [2m10s elapsed]
module.kube-hetzner.null_resource.agents["0-0-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-1-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-2-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-0-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-0-node-fsn1-cx41"]: Still creating... [2m20s elapsed]
module.kube-hetzner.null_resource.agents["0-2-node-fsn1-cx41"]: Still creating... [2m20s elapsed]
module.kube-hetzner.null_resource.agents["0-1-node-fsn1-cx41"]: Still creating... [2m20s elapsed]
module.kube-hetzner.null_resource.agents["0-1-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-2-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-0-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
module.kube-hetzner.null_resource.agents["0-1-node-fsn1-cx41"]: Still creating... [2m30s elapsed]
module.kube-hetzner.null_resource.agents["0-0-node-fsn1-cx41"]: Still creating... [2m30s elapsed]
module.kube-hetzner.null_resource.agents["0-2-node-fsn1-cx41"]: Still creating... [2m30s elapsed]
module.kube-hetzner.null_resource.agents["0-2-node-fsn1-cx41"] (remote-exec): Waiting for the k3s agent to start...
╷
│ Error: remote-exec provisioner error
│ 
│   with module.kube-hetzner.null_resource.agents["0-0-node-fsn1-cx41"],
│   on .terraform/modules/kube-hetzner/agents.tf line 107, in resource "null_resource" "agents":
│  107:   provisioner "remote-exec" {
│ 
│ error executing "/tmp/terraform_746244749.sh": Process exited with status 124
╵
╷
│ Error: remote-exec provisioner error
│ 
│   with module.kube-hetzner.null_resource.agents["0-2-node-fsn1-cx41"],
│   on .terraform/modules/kube-hetzner/agents.tf line 107, in resource "null_resource" "agents":
│  107:   provisioner "remote-exec" {
│ 
│ error executing "/tmp/terraform_2086607887.sh": Process exited with status 124
╵
╷
│ Error: remote-exec provisioner error
│ 
│   with module.kube-hetzner.null_resource.agents["0-1-node-fsn1-cx41"],
│   on .terraform/modules/kube-hetzner/agents.tf line 107, in resource "null_resource" "agents":
│  107:   provisioner "remote-exec" {
│ 
│ error executing "/tmp/terraform_390273080.sh": Process exited with status 124

Kube.tf file

terraform {
  required_version = ">= 1.3.5"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = "1.44.1"
    }
  }
  backend "gcs" {
    bucket = "tf-backend-xxx"
    credentials = "../_sa/sa-terraform-provider.json"
  }
}

provider "hcloud" {
  token = var.hcloud_token
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token
  source = "kube-hetzner/kube-hetzner/hcloud"
  version = "2.9.3"
  ssh_public_key = file(var.ssh_public_key)
  ssh_private_key = file(var.ssh_private_key)
  network_region = "eu-central"

  ingress_controller = "none"
  cni_plugin = "flannel"
  enable_cert_manager = false
  enable_longhorn = true

  control_plane_nodepools = [
    {
      name        = "control-fsn1-cpx11",
      server_type = "cpx11",
      location    = "fsn1",
      count       = 3,
      labels      = [],
      taints      = []
    },
  ]

  agent_nodepools = [
    {
      name        = "node-fsn1-cx41",
      server_type = "cx41",
      location    = "fsn1",
      count       = 3,
      labels      = [
        "node.longhorn.io/create-default-disk='config'",
        "node.longhorn.io/default-disks-config='[ { \"path\":\"/var/lib/longhorn\",\"allowScheduling\":true, \"storageReserved\":21474836240, \"tags\":[ \"nvme\" ]}, { \"name\":\"hcloud-volume\", \"path\":\"/var/longhorn\",\"allowScheduling\":true, \"storageReserved\":10737418120,\"tags\":[ \"ssd\" ] }]'"
      ],
      taints      = []
      longhorn_volume_size = 0
    },
  ]

  load_balancer_type     = "lb11"
  load_balancer_location = "fsn1"
  cluster_name = "my-cluster"
  enable_wireguard = true

  disable_hetzner_csi = true

  cilium_values = <<EOT
ipam:
  mode: kubernetes
k8s:
  requireIPv4PodCIDR: true
kubeProxyReplacement: true
routingMode: native
ipv4NativeRoutingCIDR: "10.0.0.0/8"
endpointRoutes:
  enabled: true
loadBalancer:
  acceleration: native
bpf:
  masquerade: true
encryption:
  enabled: true
  type: wireguard
MTU: 1450
EOT

  longhorn_values = <<EOF
defaultSettings:
  createDefaultDiskLabeledNodes: true
  kubernetesClusterAutoscalerEnabled: true # if autoscaler is active in the cluster
  defaultDataPath: /var/lib/longhorn
  # ensure pod is moved to an healthy node if current node is down:
  node-down-pod-deletion-policy: delete-both-statefulset-and-deployment-pod
persistence:
  defaultClass: true
  defaultFsType: ext4
  defaultClassReplicaCount: 3
EOF

  extra_firewall_rules = [
    # all TCP
    {
      description     = "TCP all"
      direction       = "out"
      protocol        = "tcp"
      port            = "any"
      source_ips      = []
      destination_ips = ["0.0.0.0/0", "::/0"]
    },
    # all UDP
    {
      description     = "UDP all"
      direction       = "out"
      protocol        = "udp"
      port            = "any"
      source_ips      = []
      destination_ips = ["0.0.0.0/0", "::/0"]
    }
  ]
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig
  sensitive = true
}

Screenshots

No response

Platform

Mac

melalj commented 10 months ago

When I removed the labels, it worked:

labels      = [
        "node.longhorn.io/create-default-disk='config'",
        "node.longhorn.io/default-disks-config='[ { \"path\":\"/var/lib/longhorn\",\"allowScheduling\":true, \"storageReserved\":21474836240, \"tags\":[ \"nvme\" ]}, { \"name\":\"hcloud-volume\", \"path\":\"/var/longhorn\",\"allowScheduling\":true, \"storageReserved\":10737418120,\"tags\":[ \"ssd\" ] }]'"
      ],

Maybe there should be a clearer error than just hanging?

I'm still having issues setting up Longhorn: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/discussions/957#discussioncomment-7515853
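
The likely reason this hangs: Kubernetes label values are limited to 63 characters drawn from alphanumerics, "-", "_", and ".", so the quoted JSON in node.longhorn.io/default-disks-config is not a valid label value. Assuming the module forwards nodepool labels to the k3s agent as node labels (which the start failure suggests), the agent can never register, and the provisioner eventually times out (exit status 124 is the conventional timeout exit code). A minimal sketch of a labels block that passes validation, keeping only the flag Longhorn actually reads from a label:

labels = [
  # valid: the value "config" satisfies the Kubernetes label-value rules
  "node.longhorn.io/create-default-disk=config"
]

The disk layout itself is expected by Longhorn as a node annotation (node.longhorn.io/default-disks-config), not a label, so it cannot go into this list in any case.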

JWDobken commented 1 month ago

@melalj I experienced the exact same issue; removing the labels resolved it.
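
For anyone hitting this later, one possible workaround (a sketch, not a feature of the module): keep only the create-default-disk label on the nodepool and apply the disk layout as a node annotation once the cluster is up, for example with the hashicorp/kubernetes provider. The kubeconfig path and node name below are placeholders and assumptions, not values produced by the configuration above:

provider "kubernetes" {
  # assumption: a kubeconfig file written locally for the cluster created above
  config_path = "my-cluster_kubeconfig.yaml"
}

resource "kubernetes_annotations" "longhorn_default_disks" {
  api_version = "v1"
  kind        = "Node"
  metadata {
    # assumption: hypothetical agent node name; use one resource (or for_each) per node
    name = "my-cluster-node-fsn1-cx41-abc"
  }
  annotations = {
    "node.longhorn.io/default-disks-config" = jsonencode([
      { path = "/var/lib/longhorn", allowScheduling = true, storageReserved = 21474836240, tags = ["nvme"] },
      { name = "hcloud-volume", path = "/var/longhorn", allowScheduling = true, storageReserved = 10737418120, tags = ["ssd"] }
    ])
  }
}

Longhorn only reads the annotation on nodes that also carry node.longhorn.io/create-default-disk=config, which is why that label should stay on the nodepool.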