[Bug]: One control plane node stuck waiting for MicroOS

heysarver commented 2 months ago

Description

I'm trying to deploy a cluster with 3 or 5 control plane nodes; both give the same result. N-1 nodes come up successfully, but even after several terraform destroy and apply runs there is always one control plane node stuck at "Waiting for MicroOS to become available..." until Terraform times out.

Kube.tf file

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  source = "kube-hetzner/kube-hetzner/hcloud"
  version = "~> 2.14.0"
  hcloud_token = var.hcloud_token

  ssh_public_key = var.ssh_public_key
  ssh_private_key = var.ssh_private_key

  network_region = "us-east"

  initial_k3s_channel    = "v1.29"

  cluster_name = "k8s-primary"
  base_domain = "hzr.*******.net"

  control_plane_nodepools = [
    {
      name        = "control",
      server_type = "cpx41",
      location    = "ash",
      count       = 3,
      labels      = [],
      taints      = []
    }
  ]

  agent_nodepools = [
    {
      name        = "worker",
      server_type = "cpx21",
      location    = "ash",
      labels      = [],
      taints      = [],
      count       = 0
    }
  ]

  allow_scheduling_on_control_plane = true

  load_balancer_type     = "lb11"
  load_balancer_location = "ash"
  load_balancer_disable_ipv6 = true
  load_balancer_algorithm_type = "least_connections"
  load_balancer_health_check_interval = "5s"
  load_balancer_health_check_timeout = "3s"

  use_control_plane_lb = true
  control_plane_lb_type = "lb11"

  cluster_autoscaler_version   = "20240226"
  cluster_autoscaler_log_level = 4

  ingress_controller = "nginx"
  ingress_target_namespace = "ingress-nginx"
  ingress_replica_count = 3

  kured_options = {
    "reboot-days": "su",
    "start-time": "3am",
    "end-time": "8am",
    "time-zone": "America/New_York",
    "lock-ttl" : "30m",
  }

  dns_servers = [
    "8.8.8.8",
    "1.1.1.1",
  ]

  cert_manager_values = <<EOT
installCRDs: true
replicaCount: 3
webhook:
  replicaCount: 3
cainjector:
  replicaCount: 3
  EOT

  nginx_values = <<EOT
controller:
  watchIngressWithoutClass: "true"
  kind: "DaemonSet"
  config:
    "use-forwarded-headers": "true"
    "compute-full-forwarded-for": "true"
    "use-proxy-protocol": "true"
  service:
    annotations:
      "load-balancer.hetzner.cloud/name": "k8s-primary-nginx"
      "load-balancer.hetzner.cloud/use-private-ip": "false"
      "load-balancer.hetzner.cloud/disable-private-ingress": "false"
      "load-balancer.hetzner.cloud/location": "ash"
      "load-balancer.hetzner.cloud/type": "lb11"
      "load-balancer.hetzner.cloud/uses-proxyprotocol": "true"
  EOT

}

Screenshots

Failed Node: Screenshot 2024-09-20 at 12 21 57 PM

Platform

MacOS, Terraform Cloud

mysticaltech commented 2 months ago

@heysarver Please try rebooting the node with hcloud, see if it fixes it.

JWDobken commented 2 months ago

I'm experiencing the same problem!

Two things I noticed:

The affected nodes don't have a private IP.
Neither of them is connected to the private network.

Rebooting does not help, and neither does manually attaching them to the network.

heysarver commented 2 months ago

Rebooting solved it, but the underlying issue remains. I added another worker pool and got the same result: all but one node came up fine, and rebooting that node fixed it again.

mysticaltech commented 2 months ago

@heysarver Remove the kured "lock-ttl" setting. Also remove the autoscaler version (the default value is needed). Also run:

terraform init -upgrade
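
As a rough sketch of the two removals suggested above, assuming everything else in the kube.tf from the issue description stays unchanged, the affected part of the module block might look like this. It is a fragment for illustration, not a complete configuration; the omitted settings are marked with a comment.

```hcl
module "kube-hetzner" {
  source  = "kube-hetzner/kube-hetzner/hcloud"
  version = "~> 2.14.0"

  # ... all other settings unchanged from the kube.tf in the issue description ...

  # cluster_autoscaler_version is no longer set, so the module default applies.
  cluster_autoscaler_log_level = 4

  kured_options = {
    "reboot-days" : "su",
    "start-time" : "3am",
    "end-time" : "8am",
    "time-zone" : "America/New_York",
    # "lock-ttl" has been removed, as suggested above.
  }
}
```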

Plan B

Make sure the underlying image is good; rebuild it with the packer command if needed.

Debug cloud-init and whatever else could be happening on boot; ask https://claude.ai for the exact commands and give it the logs.

mysticaltech commented 2 months ago

@JWDobken please create a new issue with all the details.

JWDobken commented 2 months ago

Rebuilding the image seems to have solved my issue, thank you.

heysarver commented 2 months ago

@mysticaltech I've already started using it and have hit the limits on my new account, so I'll have to wait to try, but it sounds reasonable.

heysarver commented 3 weeks ago

I can confirm this was my issue: it was caused by lock-ttl being set to 30m in kured_options.

When I made a new cluster to confirm, I also had to manually open the firewall ports for the nginx ingress load balancer with this config. Any ideas on that or should I open a new issue?

mysticaltech commented 3 weeks ago

@heysarver Please rephrase; I don't clearly understand the issues you are still facing.

heysarver commented 2 weeks ago

@mysticaltech After creating the cluster, I have to manually add firewall rules for the nginx-ingress destination ports; otherwise all the load balancer targets are unhealthy. This causes the Terraform state to get out of sync.
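
One possible way to keep such rules inside the Terraform state instead of adding them by hand is to declare them on the module itself. The fragment below is only a sketch under assumptions: the extra_firewall_rules input and its field names follow the module's example kube.tf and should be checked against the module version in use, and the port range is a placeholder (the default Kubernetes NodePort range), since the thread does not say which ports were opened manually.

```hcl
module "kube-hetzner" {
  source  = "kube-hetzner/kube-hetzner/hcloud"
  version = "~> 2.14.0"

  # ... all other settings from the kube.tf in the issue description ...

  # ASSUMPTION: input name and fields taken from the module's example kube.tf;
  # verify them for your module version before relying on this.
  extra_firewall_rules = [
    {
      description     = "nginx ingress targets (hypothetical rule)"
      direction       = "in"
      protocol        = "tcp"
      port            = "30000-32767"         # placeholder: default NodePort range
      source_ips      = ["0.0.0.0/0", "::/0"] # consider narrowing this
      destination_ips = []                    # not used for inbound rules
    }
  ]
}
```

Declaring the rule here would let terraform plan track it, which avoids the drift described above; whether the rule should be needed at all is a separate question for the follow-up issue.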

Screenshot 2024-11-08 at 12 45 39 PM Screenshot 2024-11-08 at 12 46 22 PM

mysticaltech commented 2 weeks ago

@heysarver Please open a new issue with the full working kube.tf (minus private info) and steps to reproduce.

kube-hetzner / terraform-hcloud-kube-hetzner