kube-hetzner / terraform-hcloud-kube-hetzner

Optimized and Maintenance-free Kubernetes on Hetzner Cloud in one command!
MIT License

Upgrading Module from v1.4.x to >=v1.5.x fails #489

Closed · me-marcel closed this issue 1 year ago

me-marcel commented 1 year ago

Description

Upgrading the module from v1.4.x to >=v1.5.x fails.

Error

Error: no Load Balancer found with name kube1
│
│   with module.kube-hetzner.data.hcloud_load_balancer.cluster[0],
│   on .terraform/modules/kube-hetzner/data.tf line 20, in data "hcloud_load_balancer" "cluster":
│   20: data "hcloud_load_balancer" "cluster" {

Environment

locals {
  # Fill first and foremost your Hetzner API token, found in your project, Security, API Token, of type Read & Write.
  hcloud_token = "XXXXXXXXXXxxxxxxxxxxXXXXXXXXXXxxxxxxxxxxXXXXXXXXXXxxxxxxxxxxXXXX"
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token    = local.hcloud_token
  source          = "kube-hetzner/kube-hetzner/hcloud"
  version         = "1.4.8"
  ssh_public_key  = file("./xxx")
  ssh_private_key = file("./yyy")
  network_region  = "eu-central"
  control_plane_nodepools = [
    {
      name        = "control-plane-fsn1",
      server_type = "cpx11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 3
    }
  ]
  agent_nodepools = [
    {
      name        = "agent-fsn1",
      server_type = "cpx21",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 3
    }
  ]
  load_balancer_type                = "lb11"
  load_balancer_location            = "fsn1"
  base_domain                       = "xxx.dev"
  traefik_acme_tls                  = true
  traefik_acme_email                = "aaa@bbb.ccc"
  traefik_enabled                   = true
  metrics_server_enabled            = true
  allow_scheduling_on_control_plane = false
  automatically_upgrade_k3s         = true
  cluster_name                      = "kube1"
  use_cluster_name_in_node_name     = true
  placement_group_disable           = true
  enable_cert_manager               = true
  use_control_plane_lb              = true
}

provider "hcloud" {
  token = local.hcloud_token
}

terraform {
  required_version = ">= 1.2.0"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.33.2"
    }
  }
  backend "http" {
  }
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig
  sensitive = true
}

Reproduction

  1. Set up a cluster using module version <=v1.4.x
  2. Bump the version of the TF module to >=v1.5.x in your .tf file (see the snippet after these steps)
  3. Perform a terraform plan - it is important that your Terraform state contains a configuration that was applied with module version <=v1.4.x
  4. The error above should now occur
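For step 2, the only change is the version constraint in the module block; 1.5.0 below is just one example of a >=v1.5.x release.

module "kube-hetzner" {
  source  = "kube-hetzner/kube-hetzner/hcloud"
  version = "1.5.0" # was "1.4.8"; any >=v1.5.x release triggers the error
  # ... remaining arguments unchanged from the configuration above ...
}

Run terraform init -upgrade afterwards so the new module version is actually downloaded, then terraform plan.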

Workaround

It would be great if we could find a permanent, viable solution for this.

me-marcel commented 1 year ago

Had a look at the module. It seems v1.4.x defined the load balancer data source as follows:

data "hcloud_load_balancer" "traefik" {
  count = local.using_klipper_lb ? 0 : var.traefik_enabled == false ? 0 : 1
  name  = "${var.cluster_name}-traefik"

  depends_on = [null_resource.kustomization]
}

The newer module versions >=1.5.x do it as follows:

data "hcloud_load_balancer" "cluster" {
  count = local.using_klipper_lb ? 0 : local.ingress_controller == "none" ? 0 : 1
  name  = var.cluster_name

  depends_on = [null_resource.kustomization]
}

That's the reason a migration between the module versions isn't possible. A simple patch would take care of this though:

data "hcloud_load_balancer" "cluster" {
  count = local.using_klipper_lb ? 0 : local.ingress_controller == "none" ? 0 : 1
  name  = local.ingress_controller == "traefik" ? "${var.cluster_name}-traefik" : var.cluster_name

  depends_on = [null_resource.kustomization]
}

I'll open a pull request shortly.

captnCC commented 1 year ago

@me-marcel Just a quick question: what is your cluster's load balancer currently named?

me-marcel commented 1 year ago

@captnCC The original module version deployed the load balancers with the names <cluster>-traefik and <cluster>-control-plane. It might be worth mentioning that the option to prefix the cluster name to resource names was active on the first deployment, which took place a couple of months back.

captnCC commented 1 year ago

I took a quick look at the version history: up to 1.4.8 the load balancer was indeed called <cluster>-<ingresstype>. Starting with 1.5.0, the ingress-type suffix was dropped.

This puts us in a bad situation. For one, I don't know whether the CCM will rename the load balancer if its name is changed in Kubernetes (if not, deleting it by hand and letting the CCM recreate it could help). But besides that, the plan/apply would fail anyway, because the data source is invalid and can't resolve to an actual resource.

To allow the upgrade, we could introduce a variable to set the load balancer name manually (back to the old format).
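A rough sketch of what that could look like in the module's data.tf; the variable name existing_lb_name is hypothetical and not part of any released version:

# Hypothetical override; an empty string keeps the current (>=1.5.x) behaviour.
variable "existing_lb_name" {
  description = "Name of a pre-existing load balancer, e.g. the old <cluster>-traefik format."
  type        = string
  default     = ""
}

data "hcloud_load_balancer" "cluster" {
  count = local.using_klipper_lb ? 0 : local.ingress_controller == "none" ? 0 : 1
  name  = var.existing_lb_name != "" ? var.existing_lb_name : var.cluster_name

  depends_on = [null_resource.kustomization]
}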

As a last resort, there is the hacky solution: create a load balancer on Hetzner named kube1. That way Terraform will be satisfied and can continue the upgrade. (This will only impact the cluster IP output.)
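A minimal sketch of such a placeholder, reusing values from the configuration above; it could be applied from a separate throwaway Terraform workspace, or the load balancer could simply be created by hand in the Hetzner console instead.

# Throwaway placeholder so the module's data source can resolve the name "kube1".
# Type and location are taken from the configuration above.
resource "hcloud_load_balancer" "placeholder" {
  name               = "kube1"
  load_balancer_type = "lb11"
  location           = "fsn1"
}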

me-marcel commented 1 year ago

Good thing you were able to confirm it. I may or may not have lost some self-confidence when I came across this behaviour for the first time.

Of course this isn't nice, but I am confident we can find a proper solution that is not hacky and does not involve creating additional resources just to make the migration work, because I suspect my case is not the only one.

I actually like your suggestion of having a way to override the name, since I took a similar approach in the pull request I opened for this issue (see: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/pull/490). Of course, that change is strongly tailored to my needs, so there might be a more general way of approaching this. I'd definitely be happy to see such an option in the future.

FYI: I took the changes from the pull request and applied them to a local copy of the module, which I'm referencing for now, so I was able to upgrade. The cluster is still up and all of the resulting changes were applied successfully. So there is no pressure to fix this within the repo, but it would be nice to have such an option in the future to allow for backwards compatibility.
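For anyone in the same situation, referencing a locally patched copy looks roughly like the snippet below; the path is illustrative, and the version argument has to be dropped for a local path source.

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = local.hcloud_token
  # Local checkout containing the patched data source; the path is illustrative.
  source = "./vendor/terraform-hcloud-kube-hetzner"
  # No "version" argument here - it is only valid for registry sources.
  # ... remaining arguments unchanged from the configuration above ...
}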

mysticaltech commented 1 year ago

Replied in the PR.