kube-hetzner / terraform-hcloud-kube-hetzner

Optimized and Maintenance-free Kubernetes on Hetzner Cloud in one command!
MIT License

[Bug]: Agent nodes stuck in "Still creating" after upgrade from 2.14.1 to any newer version #1491

Open jonasmock opened 2 weeks ago

jonasmock commented 2 weeks ago

Description

With version 2.14.1 everything works fine, but upgrading to any newer version causes the agent nodes to loop forever on "Still creating..." during setup.

Kube.tf file

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token

  create_kubeconfig    = false
  create_kustomization = false

  source  = "kube-hetzner/kube-hetzner/hcloud"
  version = "2.14.5"

  cluster_name = "xxx"

  ssh_public_key  = var.ssh_public_key
  ssh_private_key = var.ssh_private_key

  network_region = var.network_region

  control_plane_nodepools = var.control_plane_nodepools
  agent_nodepools         = var.agent_nodepools

  load_balancer_type     = var.load_balancer_type
  load_balancer_location = var.load_balancer_location
  lb_hostname            = var.load_balancer_hostname

  dns_servers = [
    "1.1.1.1",
    "8.8.8.8",
    "2606:4700:4700::1111",
  ]

  # Ingress configuration
  ingress_controller = "nginx"

  # Storage configuration
  enable_longhorn = true
  longhorn_values = <<EOT
defaultSettings:
  defaultDataPath: /var/longhorn
  backupTarget: "${var.backup_target}"
  backupTargetCredentialSecret: "${var.backup_target_credential_secret}"
  nodeDownPodDeletionPolicy: delete-both-statefulset-and-deployment-pod
persistence:
  defaultFsType: ext4
  defaultClassReplicaCount: 3
  defaultClass: true
  EOT

  # Schedule automatic reboots e.g. for system upgrades on Sundays between 4am and 8am UTC
  kured_options = {
    "reboot-days" : "su",
    "start-time" : "4am",
    "end-time" : "8am",
    "lock-ttl" : "30m",
  }

  disable_hetzner_csi = true

  extra_firewall_rules = [
    {
      description     = "Allow SMB connections to SMB/CIFS storage"
      direction       = "out"
      protocol        = "tcp"
      port            = "445"
      source_ips      = []
      destination_ips = ["0.0.0.0/0", "::/0"]
    },
    {
      description     = "Allow Azure Event Hubs connections"
      direction       = "out"
      protocol        = "tcp"
      port            = "9093"
      source_ips      = []
      destination_ips = ["0.0.0.0/0", "::/0"]
    },
  ]

}

# Kubernetes Control Plane Nodepools
export TF_VAR_control_plane_nodepools='[
  {
    "name": "control-plane-01",
    "server_type": "cax11",
    "location": "fsn1",
    "labels": [],
    "taints": [],
    "count": 1
  },
  {
    "name": "control-plane-02",
    "server_type": "cax11",
    "location": "fsn1",
    "labels": [],
    "taints": [],
    "count": 1
  },
  {
    "name": "control-plane-03",
    "server_type": "cax11",
    "location": "nbg1",
    "labels": [],
    "taints": [],
    "count": 1
  }
]'

# Kubernetes Agent Nodepools
export TF_VAR_agent_nodepools='[
  {
    "name": "agent-pool-01",
    "server_type": "cax21",
    "location": "fsn1",
    "labels": [],
    "taints": [],
    "count": 1
  },
  {
    "name": "agent-pool-02",
    "server_type": "cax21",
    "location": "fsn1",
    "labels": [],
    "taints": [],
    "count": 1
  },
  {
    "name": "agent-pool-03",
    "server_type": "cax21",
    "location": "nbg1",
    "labels": [],
    "taints": [],
    "count": 1
  }
]'

Screenshots

(screenshot attached)

Platform

Mac

wv-opt-ai commented 1 week ago

I also ran into this and narrowed it down: disable_hetzner_csi = false is the workaround. Setting disable_hetzner_csi = true triggers the "Still creating" loop because systemctl start k3s-agent never returns, which in turn happens because the control plane nodes are stuck in a drained state with the taint node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule. I don't know the fix at the code level, though.
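For anyone skimming the thread, the workaround amounts to a one-line change in the kube.tf from the report (a sketch; all other module arguments stay exactly as posted above):

```hcl
module "kube-hetzner" {
  # ... all other arguments as in the original report ...

  # Workaround per this thread: keep the Hetzner CSI driver deployed.
  # With it disabled, the control-plane nodes reportedly stay stuck with the
  # node.cloudprovider.kubernetes.io/uninitialized taint and
  # `systemctl start k3s-agent` never returns.
  disable_hetzner_csi = false
}
```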

wv-opt-ai commented 1 week ago

By the way, I think the first commit after which this error appears is 9d6cd42c5973b6860f3cfeb2358f093aedebf511 .

jonasmock commented 1 week ago

> I also ran into this, and narrowed it down to disable_hetzner_csi = false being the workaround. Setting disable_hetzner_csi = true is triggering the "still creating", because systemctl start k3s-agent does not return, because the control plane nodes are stuck in a drained state with node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule. Don't know the solution though at the code level though.

disable_hetzner_csi = false works for me, but then I end up with two default storage classes. Not sure whether that will cause issues:

(screenshot attached)
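On the two default storage classes: recent Kubernetes versions tolerate multiple defaults (the most recently created one is used for PVCs without an explicit storageClassName), but keeping exactly one default is cleaner. If Longhorn does not need to be the default, a sketch of the adjusted longhorn_values (same heredoc as in the original report, only defaultClass changed; alternatively you could clear the storageclass.kubernetes.io/is-default-class annotation on the hcloud-volumes class by hand):

```hcl
  longhorn_values = <<EOT
defaultSettings:
  defaultDataPath: /var/longhorn
  backupTarget: "${var.backup_target}"
  backupTargetCredentialSecret: "${var.backup_target_credential_secret}"
persistence:
  defaultFsType: ext4
  defaultClassReplicaCount: 3
  defaultClass: false  # leave hcloud-volumes as the single default class
  EOT
```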