kube-hetzner / terraform-hcloud-kube-hetzner

Optimized and Maintenance-free Kubernetes on Hetzner Cloud in one command!
MIT License

module.kube-hetzner.null_resource.kustomization (remote-exec): error: timed out waiting for the condition on deployments/system-upgrade-controller #429

codeagencybe closed this issue 1 year ago

codeagencybe commented 1 year ago

After removing the taints from the agent nodes, it still doesn't work to deploy a cluster. I already tried a fresh new project and am still getting problems with the kustomization.

It's getting really frustrating trying to get this thing working in a "stable" way. With every change and update/release, there is something that just doesn't want to work. It doesn't give me any trust to use this reliably in a production environment if things keep breaking this easily.

module.kube-hetzner.null_resource.kustomization (remote-exec): error: timed out waiting for the condition on deployments/system-upgrade-controller
╷
│ Error: remote-exec provisioner error
│ 
│   with module.kube-hetzner.null_resource.kustomization,
│   on .terraform/modules/kube-hetzner/init.tf line 232, in resource "null_resource" "kustomization":
│  232:   provisioner "remote-exec" {
│ 
│ error executing "/tmp/terraform_1920984306.sh": Process exited with status 1
╵
╷
│ Error: remote-exec provisioner error
│ 
│   with module.kube-hetzner.module.agents["1-0-A-CPX51-NBG1"].hcloud_server.server,
│   on .terraform/modules/kube-hetzner/modules/host/main.tf line 64, in resource "hcloud_server" "server":
│   64:   provisioner "remote-exec" {
│ 
│ timeout - last error: dial tcp 5.75.144.110:22: connect: connection refused
codeagencybe commented 1 year ago
locals {
  hcloud_token = "xxx"
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = local.hcloud_token

  # * For local dev, path to the git repo
  source = "kube-hetzner/kube-hetzner/hcloud"

  # version = "1.2.0"
  # ssh_port = 2222

  ssh_public_key = file("clusterkey.pub")
  ssh_private_key = file("clusterkey")
  # ssh_hcloud_key_label = "role=admin"
  # hcloud_ssh_key_id = ""
  network_region = "eu-central" # change to `us-east` if location is ash

  control_plane_nodepools = [
    {
      name        = "CP-FSN1",
      server_type = "cpx31",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "CP-NBG1",
      server_type = "cpx31",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "CP-HEL1",
      server_type = "cpx31",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "A-CPX51-FSN1",
      server_type = "cpx51",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "A-CPX51-NBG1",
      server_type = "cpx51",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "A-CPX51-HEL1",
      server_type = "cpx51",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  # * LB location and type, the latter will depend on how much load you want it to handle, see https://www.hetzner.com/cloud/load-balancer
  load_balancer_type     = "lb31"
  load_balancer_location = "fsn1"

  base_domain = "cluster.xxx.xxx"

  autoscaler_nodepools = [
    {
      name        = "AS-CPX51-FSN1"
      server_type = "cpx51" # must be same or better than the control_plane server type (regarding disk size)!
      location    = "fsn1"
      min_nodes   = 0
      max_nodes   = 5
    },
    {
      name        = "AS-CPX51-NBG1"
      server_type = "cpx51" # must be same or better than the control_plane server type (regarding disk size)!
      location    = "nbg1"
      min_nodes   = 0
      max_nodes   = 5
    },
    {
      name        = "AS-CPX51-HEL1"
      server_type = "cpx51" # must be same or better than the control_plane server type (regarding disk size)!
      location    = "hel1"
      min_nodes   = 0
      max_nodes   = 5
    }
  ]

  etcd_s3_backup = {
    etcd-s3-endpoint        = "s3.eu-central-2.wasabisys.com"
    etcd-s3-access-key      = "xxx"
    etcd-s3-secret-key      = "xxx"
    etcd-s3-bucket          = "k3s-etcd-snapshots"
  }

  enable_longhorn = true
  # longhorn_fstype = "xfs"
  # longhorn_replica_count = 3

  # disable_hetzner_csi = true
  # hetzner_ccm_version = ""
  # hetzner_csi_version = ""
  # kured_version = ""

  # enable_nginx = true
  # If you want to disable the Traefik ingress controller, to use the Nginx ingress controller for instance, you can set this to "false". Default is "true".
  # enable_traefik = false
  # enable_klipper_metal_lb = "true"
  # traefik_acme_tls = true
  # traefik_acme_email = "mail@example.com"
  traefik_additional_options = ["--tracing=true"]
  enable_metrics_server = true
  # allow_scheduling_on_control_plane = true

  # automatically_upgrade_k3s = false
  # automatically_upgrade_os = false
  # The default is "v1.24".
  initial_k3s_channel = "v1.24"

  # The cluster name, by default "k3s"
  cluster_name = "xxx"
  use_cluster_name_in_node_name = false

  cni_plugin = "cilium"
  # disable_network_policy = true
  # placement_group_disable = true
  block_icmp_ping_in = true
  enable_cert_manager = true
  # dns_servers = []

  use_control_plane_lb = true

  enable_rancher = true
  rancher_hostname = "rancher.xxx.xxx"
  rancher_install_channel = "stable"
  # rancher_bootstrap_password = ""
  # rancher_registration_manifest_url = "https://rancher.xyz.dev/v3/import/xxxxxxxxxxxxxxxxxxYYYYYYYYYYYYYYYYYYYzzzzzzzzzzzzzzzzzzzzz.yaml"

  # extra_kustomize_parameters={}
  # create_kubeconfig = false
  # create_kustomization = false

}

provider "hcloud" {
  token = local.hcloud_token
}

terraform {
  required_version = ">= 1.3.3"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.35.2"
    }
  }
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig_file
  sensitive = true
}
mysticaltech commented 1 year ago

@codeagencybe After all the time we spent helping you, this is what you have to say about this project?! We are a best-effort initiative! We do this out of goodwill and most of our users are very happy.

I believe I explained to you in the past how to debug. Please give it a try.

Here it is again. First check that the servers themselves are reachable over SSH and that k3s came up on them; that is the very least. Then you have to debug at the Kubernetes level: see if the pods are all executing the way they should. If not, see the status of the node; in both cases kubectl describe comes in very handy. Also kubectl logs for the pods. And last but not least, kubectl get svc for the LBs.
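
Concretely, that debug session could look something like the sketch below (the node name and the system-upgrade namespace are assumptions based on a default install, not taken from this thread):

# Check that every node registered and is Ready
kubectl get nodes -o wide

# Describe a node that looks unhealthy (the node name here is hypothetical)
kubectl describe node k3s-agent-a-cpx51-nbg1

# Find pods that are not Running
kubectl get pods -A

# Dig into the deployment from the error above
kubectl -n system-upgrade describe deployment system-upgrade-controller
kubectl -n system-upgrade logs deployment/system-upgrade-controller

# And the services backing the LBs
kubectl get svc -A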

When you do find the issue, and it comes from us, please open another issue, but with all the details that you gathered in your debugging session: starting with a clear and well-formatted kube.tf, information about the environment in which it executes, and where and why it fails according to your analysis. And if you want to help us even more, accompany it with a PR that fixes the issue, if you know how. But signaling the issue is the most important part, so that we know about it.

mysticaltech commented 1 year ago

After debugging the above, if you still cannot corner the issue, try commenting out the cni_plugin option (currently set to cilium) so that flannel is used instead. If that works, any help finding out what is wrong on the Cilium side would be greatly appreciated.
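
With the kube.tf above, that test could look like this (assuming the module falls back to flannel when cni_plugin is unset, and that changing the CNI requires recreating the cluster):

# In kube.tf, comment out the CNI selection:
#   # cni_plugin = "cilium"
# then recreate the cluster so it comes up with flannel:
terraform destroy
terraform apply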

You have to understand that we do not have a testing env yet, even though one may be in the pipeline. This is a highly dynamic environment with a lot of moving parts.

I do not know if you are familiar with Amazon EKS Blueprints, a Terraform project like ours but for EKS, by Amazon themselves. Despite the millions of dollars behind it, it works less well than ours in my experience: if you do not get the versions right, it breaks, and it is really a trial-and-error process to make it work. So I believe that a small hobbyist project like ours is doing a good enough job, and for free, out of goodwill.

codeagencybe commented 1 year ago

@mysticaltech

I think you misunderstand the situation and my post. My intention never was to express any negative connotation or lack of appreciation towards this project. I understand and appreciate all the help from you and the community, just like others do.

But you also have to understand that real frustration sometimes grows when trying to do simple things and they keep failing over and over. That's what I was expressing in my post: the frustration that a new release breaks the experience from the previous one. One version (1.6.3) finally got me started, and that felt like a big achievement (at least for me). Moving to 1.6.5 felt like getting knocked down again and starting all over. It's just not a nice experience for a beginner.

That's why I also offered that "video" experience which we discussed before. I still stand by that message and I still want to take this opportunity with you. Your project is an amazing tool, no doubt at all, but I think it really needs better docs for beginners. And for most beginners, reading tons of docs is not a great start. Sometimes it's all about getting that base up and running, with minimal bells and whistles, in a very comprehensible format, which for most beginners is a video. A simple explainer video that shows all the basic steps to get a k3s cluster running; that's all it takes to get people triggered and to move further from there. If you can't get the basics working from the docs available, people get frustrated. Simple as that.

I know this is not like EKS or GKE, where they pour millions into maintaining their tools, but still, it's not a pleasant "beginner" experience if the basics just fail.

Not all community members are expert veterans in this stuff like you and some others are. There are many that are just starting to get their feet wet in this whole Kubernetes thing, just like myself. I'm not running anything "special" or critical; I'm just learning like many others, with all the pains and gains. So I know troubleshooting is part of getting something running, but having the basics working all the time is critical for any piece of software.

I'm not a Terraform programmer, at least not yet. I recently signed up for a course and am going to do it during the Christmas holidays, but at the moment I only have the bare minimum knowledge of Terraform, so that doesn't help me yet to understand problems in those files. I'm not a Kubernetes expert either; understanding all the moving parts is why I'm here to learn from this project.

So I'm not able to contribute (yet) on either of those two parts. But I can contribute by writing documentation and giving you feedback about my beginner experience. So far, as honest as I can be, it has been a bit frustrating, because moving from version to version feels like "resetting" each time. I want to help make that better, but I don't know how yet, except by doing a co-op with you and turning everything into a video format to put in your docs for beginners.

This is also goodwill and free time that I invest in your project. If I didn't believe in it, I would have given up way sooner. But I don't want to. I was very close to my goals with v1.6.3.

So again, I'm reaching out. Let's make it better for everyone but also for beginners.

mysticaltech commented 1 year ago

Ok, no worries, a video is most welcome. But something that is important to understand: once a cluster is up and running, you do not need to "upgrade" the module version, as the cluster itself, if you leave the defaults, gets upgraded automatically, both the nodes and Kubernetes.

If you think you are not running the latest and greatest unless you upgrade the underlying Terraform project, you are mistaken. It's like using a tool that burns the Windows ISO to a CD: once you install Windows, what do you care whether the tool that burned the install CD gets a new version or not?! It's really exactly the same thing here. That's why you should always lock your module version (I will update the docs to specify that).
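
The kube.tf above already has a commented-out version line; locking could look like this (the version number is only an example of a release you know works, not a recommendation):

# In kube.tf, inside the module block:
#   source  = "kube-hetzner/kube-hetzner/hcloud"
#   version = "1.6.3"
# then re-initialize; Terraform keeps using the pinned module version:
terraform init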

So if you are trying to upgrade from one version of the module to the next, things have a high probability of breaking, as all the underlying versions change, from the CCM to Cilium (in your case) to Kubernetes/k3s. And no, we cannot test it all every time a new dependency updates. But we do our best to have the basics running. And please do not consider Cilium part of the basics (I will update the docs to reflect that)! And today, our basics do work.

Now for advanced use cases, like yours, we try to help our best.

mysticaltech commented 1 year ago

Last but not least, please remember that all the underlying dependencies are fetched dynamically at deployment time, and we use the latest stable versions by default. But if a new version of Rancher breaks, for instance, that is out of our control, and we expect users to help with the debugging. We simply cannot do it all.
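
If you want to see which versions were actually fetched for a given deployment, something like this works (standard kubectl/helm commands; the exact namespaces depend on your install):

# Image tags show which CCM, CSI, etc. versions were pulled
kubectl get deployments -A -o wide

# Chart versions for Helm-deployed components such as Rancher or Cilium
helm list -A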

codeagencybe commented 1 year ago

@mysticaltech

In my test case, I wasn't upgrading an existing cluster. I just started a new one, as I read in the 1.6.5 release notes that things got improved in the autoscaler and kustomization parts. Since I had gotten the 1.6.3 one up and running just a few days ago, I didn't consider it a "loss" to start over and just deploy a new one. But boy, that turned out differently, lol.

I know upgrading something that is running is a no-go, especially a beast like Kubernetes.

So is the issue then "confirmed" to be coming from Cilium, or from some upgrade of the CCM? I'm not that familiar with those components yet. But I read and understood that Cilium is better than the default flannel? Also, somewhere in the docs it stated that Cilium needs to be used for compatibility with the LB? I don't see it in version 1.6.5 anymore, but it was there in previous versions, so that pulled me towards Cilium as the CNI, as your docs said that for one of the options it had to be Cilium.

About Rancher, I'm still using the "stable" channel; there is nothing experimental or beta in use. My goal is to run a small production cluster with this at some point.

About the video, I will send you an email to get in touch and schedule a meeting, so we can do a 1:1 session on the format etc.

mysticaltech commented 1 year ago

@codeagencybe Nothing is confirmed, I'm just giving examples, and have explained the debug process above. Please do it, and tell me what's not working for you. When you find the core issue, you can open an issue again and we will help you fix it.

mysticaltech commented 1 year ago

Here, please do exactly that: get to the bottom of the issue, and don't just assume that it is all related to versions of Kube-Hetzner, because it's not!
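
If the kubectl-level checks from earlier in the thread come up empty, a terraform-level trace can show where the provisioner hangs (TF_LOG is standard Terraform behavior, not specific to this module):

# Re-run with debug logging and keep the output for analysis
TF_LOG=DEBUG terraform apply 2>&1 | tee tf-debug.log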