kube-hetzner / terraform-hcloud-kube-hetzner

Optimized and Maintenance-free Kubernetes on Hetzner Cloud in one command!

Autoscaler nodes are not responsive #656

Closed bulnv closed 1 year ago

bulnv commented 1 year ago

Description

Hey! I am trying to spin up the autoscaler on my cluster. The autoscaler pod is working fine, and servers are spawned when needed; they even show up as green in the Hetzner console, but they aren't able to join the cluster or let me in via SSH. Please see the screenshot below for what I was able to get from the console. What are the options? Can I log in with a password on the console?
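
For what it's worth, here is how I'm comparing what Hetzner created against what actually joined (a rough check, assuming kubectl and the hcloud CLI are configured for this cluster):

```sh
# servers created by the autoscaler show up here (and as green in the console)...
hcloud server list

# ...but the corresponding nodes never appear here
kubectl get nodes -o wide
```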

Kube.tf file

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = local.hcloud_token
  source = "kube-hetzner/kube-hetzner/hcloud"
  version = "1.9.8"
  ssh_public_key  = file("./id_ed25519_k8s.pub")
  ssh_private_key = file("./id_ed25519_k8s")
  network_region  = "eu-central" # change to `us-east` if location is ash
  load_balancer_type     = "lb11"
  load_balancer_location = "nbg1"
  cni_plugin = "calico"
  control_plane_nodepools = [
    {
      name        = "control-plane",
      server_type = "cpx31",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]
  agent_nodepools = [
    {
      name        = "agent",
      server_type = "cpx31",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 4
    }
  ]
  autoscaler_nodepools = [
    {
      name        = "autoscaled-cpx31"
      server_type = "cpx31" # must be same or better than the control_plane server type (regarding disk size)!
      location    = "nbg1"
      min_nodes   = 0
      max_nodes   = 1
    }
  ]

  enable_cert_manager = true
  # etcd_s3_backup = {
  #   etcd-s3-endpoint   = "***"
  #   etcd-s3-access-key = "k8s-backups"
  #   etcd-s3-secret-key = "***"
  #   etcd-s3-bucket     = "k8s-backups"
  # }
  automatically_upgrade_k3s = false
  automatically_upgrade_os = false
  cluster_name = format("%s-%s-k8s", local.project, local.env)
  restrict_outbound_traffic = false
  disable_network_policy = true
}

resource "kubernetes_namespace" "this" {
  for_each = { for k, v in local.namespaces: k => v}
  metadata {
    name = each.value.name
  }
}

Screenshots

Screenshot from 2023-03-15 21-22-43

Platform

linux

bulnv commented 1 year ago

The machine is sometimes reachable from the local Hetzner network, so I was able to get some logs. The k3s agent is not installed. Here is what I found in the cloud-init-output log; so far I'm posting only the errors because it's huge:

2023-03-06 15:32:41,671 - util.py[WARNING]: failed stage init-local
failed run of stage init-local
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/cloudinit/util.py", line 1608, in chownbyname
    uid = pwd.getpwnam(user).pw_uid
KeyError: "getpwnam(): name not found: 'systemd-network'"

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/cloudinit/cmd/main.py", line 767, in status_wrapper
    ret = functor(name, args)
  File "/usr/lib/python3.10/site-packages/cloudinit/cmd/main.py", line 433, in main_init
    init.apply_network_config(bring_up=bring_up_interfaces)
  File "/usr/lib/python3.10/site-packages/cloudinit/stages.py", line 939, in apply_network_config
    return self.distro.apply_network_config(
  File "/usr/lib/python3.10/site-packages/cloudinit/distros/__init__.py", line 278, in apply_network_config
    self._write_network_state(network_state, renderer)
  File "/usr/lib/python3.10/site-packages/cloudinit/distros/__init__.py", line 167, in _write_network_state
    renderer.render_network_state(network_state)
  File "/usr/lib/python3.10/site-packages/cloudinit/net/networkd.py", line 306, in render_network_state
    self.create_network_file(k, v, network_dir)
  File "/usr/lib/python3.10/site-packages/cloudinit/net/networkd.py", line 290, in create_network_file
    util.chownbyname(net_fn, net_fn_owner, net_fn_owner)
  File "/usr/lib/python3.10/site-packages/cloudinit/util.py", line 1612, in chownbyname
    raise OSError("Unknown user or group: %s" % (e)) from e
OSError: Unknown user or group: "getpwnam(): name not found: 'systemd-network'"
sed: can't read /etc/sysconfig/network/config: No such file or directory
sed: can't read /etc/sysconfig/network/dhcp: No such file or directory
sed: can't read /etc/sysconfig/network/config: No such file or directory
+ curl -sfL https://get.k3s.io
+ INSTALL_K3S_SKIP_START=true
+ INSTALL_K3S_SKIP_SELINUX_RPM=true
+ INSTALL_K3S_CHANNEL=v1.25
+ INSTALL_K3S_EXEC=agent
+ sh -
+ /sbin/semodule -v -i /usr/share/selinux/packages/k3s.pp
Attempting to install module '/usr/share/selinux/packages/k3s.pp':
Ok: return value of 0.
Committing changes:
Ok: transaction number 7.
Failed to start k3s-agent.service: Unit k3s-agent.service not found.
2023-03-15 21:38:04,372 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2023-03-15 21:38:04,373 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3.10/site-packages/cloudinit/config/cc_scripts_user.py'>) failed
Cloud-init v. 23.1-1.1 finished at Wed, 15 Mar 2023 21:38:04 +0000. Datasource DataSourceHetzner.  Up 50.61 seconds

mysticaltech commented 1 year ago

@bulnv This is caused by an old bug. Here's what to try (basically recreate the snapshot used by the autoscaler after upgrading):
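
Roughly, the flow is (a minimal sketch; the exact resource address is the one discussed below, adjust it if your module name differs):

```sh
# 1. Bump the module version in kube.tf, then pull the new module/provider versions
terraform init -upgrade

# 2. Delete only the snapshot that the autoscaler boots new nodes from
terraform destroy -target 'module.kube-hetzner.hcloud_snapshot.autoscaler_image[0]'

# 3. Re-apply so the module recreates the snapshot with the fixed cloud-init
terraform apply
```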

bulnv commented 1 year ago

@mysticaltech thank you so much for a detailed answer (as usual). Everything seems clear. Will try it out.

bulnv commented 1 year ago

@mysticaltech just tried what you mentioned and have some updates. I've successfully scaled the masters up to 3 nodes. But when I try to destroy the first node, the plan wants to destroy 29 resources, including the null resources for all the nodes, the load balancer, and all my Helm releases. So I guess this Terraform action could just blow up the cluster. So I did the following:

Looking forward to hearing any ideas from you.

bulnv commented 1 year ago

Only one thing: I am still on a relatively fresh version, 1.9.8. Does this make sense?

mysticaltech commented 1 year ago

@bulnv If you scaled successfully, that's all that matters! Just recreating the snapshot was my first advice in a similar issue.

bulnv commented 1 year ago

> @bulnv If you scaled successfully, that's all that matters! Just recreating the snapshot was my first advice in a similar issue.

No no no =)), I mean the problem persists even with a different snapshot, freshly taken from a different master!

mysticaltech commented 1 year ago

So running that terraform destroy -target module.kube-hetzner.hcloud_snapshot.autoscaler_image[0] and recreating via Terraform did not work correctly?

mysticaltech commented 1 year ago

Also, 1.9.8 is way too old, that's the reason! Please update to 1.10.3. (1.10.4 has a small issue I will fix later today).

bulnv commented 1 year ago

> So running that terraform destroy -target module.kube-hetzner.hcloud_snapshot.autoscaler_image[0] and recreating via Terraform did not work correctly?

I haven't tried it (see above) because it's going to destroy a huge part of the cluster, including the load balancer and the Helm releases. I can't take that chance.

bulnv commented 1 year ago

> Also, 1.9.8 is way too old, that's the reason! Please update to 1.10.3. (1.10.4 has a small issue I will fix later today).

OK, will try it out.

mysticaltech commented 1 year ago

> So running that terraform destroy -target module.kube-hetzner.hcloud_snapshot.autoscaler_image[0] and recreating via Terraform did not work correctly?
>
> I haven't tried it (see above) because it's going to destroy a huge part of the cluster, including the load balancer and the Helm releases. I can't take that chance.

Impossible, that will just delete the autoscaler image! Please try again, or post the plan to prove me wrong haha.

mysticaltech commented 1 year ago

But upgrade to 1.10.3 please!

mysticaltech commented 1 year ago

Change the version and then terraform init -upgrade.
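
For example (a quick sketch; the version attribute sits in the module block of your kube.tf):

```sh
# in kube.tf, inside the module "kube-hetzner" block:
#   version = "1.10.3"

terraform init -upgrade   # pulls the new module version
terraform plan            # review the changes before applying
```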

bulnv commented 1 year ago

> Change the version and then terraform init -upgrade.

Sure thing, on my way

bulnv commented 1 year ago

@mysticaltech sorry for the confusion. I've managed to leave my Helm releases untouched; here is the actual plan. Does it look safe, and is it going to remove only one master node?


Terraform will perform the following actions:

  # local_file.kubeconfig will be destroyed
  - resource "local_file" "kubeconfig" {
      - content              = (sensitive value) -> null
      - directory_permission = "0777" -> null
      - file_permission      = "0777" -> null
      - filename             = "/home/nbuashev/.kube/hetzner" -> null
      - id                   = "aa5700bc5d5073740709b19c5974be5645d206c1" -> null
    }

  # module.kube-hetzner.hcloud_snapshot.autoscaler_image[0] will be destroyed
  - resource "hcloud_snapshot" "autoscaler_image" {
      - description = "Initial snapshot used for autoscaler" -> null
      - id          = "103210203" -> null
      - labels      = {
          - "autoscaler"  = "true"
          - "cluster"     = "botbiz-production-k8s"
          - "engine"      = "k3s"
          - "provisioner" = "terraform"
        } -> null
      - server_id   = 29666577 -> null
    }

  # module.kube-hetzner.local_file.kustomization_backup[0] will be destroyed
  - resource "local_file" "kustomization_backup" {
      - content              = <<-EOT
            "apiVersion": "kustomize.config.k8s.io/v1beta1"
            "kind": "Kustomization"
            "patchesStrategicMerge":
            - |
              apiVersion: apps/v1
              kind: Deployment
              metadata:
                name: system-upgrade-controller
                namespace: system-upgrade
              spec:
                template:
                  spec:
                    containers:
                      - name: system-upgrade-controller
                        volumeMounts:
                          - name: ca-certificates
                            mountPath: /var/lib/ca-certificates
                    volumes:
                      - name: ca-certificates
                        hostPath:
                          path: /var/lib/ca-certificates
                          type: Directory
            - "kured.yaml"
            - "ccm.yaml"
            - "calico.yaml"
            "resources":
            - "https://github.com/hetznercloud/hcloud-cloud-controller-manager/releases/download/v1.14.1/ccm-networks.yaml"
            - "https://github.com/weaveworks/kured/releases/download/1.12.2/kured-1.12.2-dockerhub.yaml"
            - "https://raw.githubusercontent.com/rancher/system-upgrade-controller/master/manifests/system-upgrade-controller.yaml"
            - "hcloud-csi.yml"
            - "traefik_ingress.yaml"
            - "https://raw.githubusercontent.com/projectcalico/calico/v3.25.0/manifests/calico.yaml"
            - "cert_manager.yaml"
        EOT -> null
      - content_base64sha256 = "IfadrAeHqYq5kmPHEYPl2ja5LLhrkM+BNPqUZm3MYuE=" -> null
      - content_base64sha512 = "DHwgzdaNht95Zb2kiBKCoS0kqyKBsg/kId3CmXmL1LWhysrjFMe7ReuhCUsokJq5IbSFFcfVe787uiQ2OsvM5Q==" -> null
      - content_md5          = "f1b53f4455e272a9104a45e823ded9fe" -> null
      - content_sha1         = "bc9acc1c97b26d8f668f5134913c57e63d3eec48" -> null
      - content_sha256       = "21f69dac0787a98ab99263c71183e5da36b92cb86b90cf8134fa94666dcc62e1" -> null
      - content_sha512       = "0c7c20cdd68d86df7965bda4881282a12d24ab2281b20fe421ddc299798bd4b5a1cacae314c7bb45eba1094b28909ab921b48515c7d57bbf3bba24363acbcce5" -> null
      - directory_permission = "0777" -> null
      - file_permission      = "600" -> null
      - filename             = "botbiz-production-k8s_kustomization_backup.yaml" -> null
      - id                   = "bc9acc1c97b26d8f668f5134913c57e63d3eec48" -> null
    }

  # module.kube-hetzner.local_sensitive_file.kubeconfig[0] will be destroyed
  - resource "local_sensitive_file" "kubeconfig" {
      - content              = (sensitive value)
      - directory_permission = "0700" -> null
      - file_permission      = "600" -> null
      - filename             = "botbiz-production-k8s_kubeconfig.yaml" -> null
      - id                   = "aa5700bc5d5073740709b19c5974be5645d206c1" -> null
    }

  # module.kube-hetzner.null_resource.agents["0-0-agent"] will be destroyed
  - resource "null_resource" "agents" {
      - id       = "3012830062625593000" -> null
      - triggers = {
          - "agent_id" = "29666578"
        } -> null
    }

  # module.kube-hetzner.null_resource.agents["0-1-agent"] will be destroyed
  - resource "null_resource" "agents" {
      - id       = "5793387825081006408" -> null
      - triggers = {
          - "agent_id" = "29666576"
        } -> null
    }

  # module.kube-hetzner.null_resource.agents["0-2-agent"] will be destroyed
  - resource "null_resource" "agents" {
      - id       = "2083069504276235938" -> null
      - triggers = {
          - "agent_id" = "29915387"
        } -> null
    }

  # module.kube-hetzner.null_resource.agents["0-3-agent"] will be destroyed
  - resource "null_resource" "agents" {
      - id       = "4994720741323820131" -> null
      - triggers = {
          - "agent_id" = "29915384"
        } -> null
    }

  # module.kube-hetzner.null_resource.configure_autoscaler[0] will be destroyed
  - resource "null_resource" "configure_autoscaler" {
      - id       = "5520672443837788890" -> null
      - triggers = {
          - "template" = >>>EOT
            EOT
        } -> null
    }

  # module.kube-hetzner.null_resource.control_planes["0-0-control-plane"] will be destroyed
  - resource "null_resource" "control_planes" {
      - id       = "7282537881054730684" -> null
      - triggers = {
          - "control_plane_id" = "29666577"
        } -> null
    }

  # module.kube-hetzner.null_resource.control_planes["0-1-control-plane"] will be destroyed
  - resource "null_resource" "control_planes" {
      - id       = "7092987203276066952" -> null
      - triggers = {
          - "control_plane_id" = "30069994"
        } -> null
    }

  # module.kube-hetzner.null_resource.control_planes["0-2-control-plane"] will be destroyed
  - resource "null_resource" "control_planes" {
      - id       = "2153595407490067548" -> null
      - triggers = {
          - "control_plane_id" = "30069993"
        } -> null
    }

  # module.kube-hetzner.null_resource.first_control_plane will be destroyed
  - resource "null_resource" "first_control_plane" {
      - id = "4534639624935671890" -> null
    }

  # module.kube-hetzner.null_resource.kustomization will be destroyed
  - resource "null_resource" "kustomization" {
      - id       = "2531306948694365826" -> null
      - triggers = {
          - "helm_values_yaml" = (sensitive value)
          - "options"          = ""
          - "versions"         = <<-EOT
                v1.25.0
                N/A
                N/A
                N/A
                N/A
            EOT
        } -> null
    }

  # module.kube-hetzner.module.control_planes["0-0-control-plane"].hcloud_server.server will be destroyed
  - resource "hcloud_server" "server" {
      - allow_deprecated_images    = false -> null
      - backups                    = false -> null
      - datacenter                 = "nbg1-dc3" -> null
      - delete_protection          = false -> null
      - firewall_ids               = [
          - 764142,
        ] -> null
      - id                         = "29666577" -> null
      - ignore_remote_firewall_ids = false -> null
      - image                      = "ubuntu-20.04" -> null
      - ipv4_address               = "5.75.160.152" -> null
      - ipv6_address               = "2a01:4f8:c2c:fe9c::1" -> null
      - ipv6_network               = "2a01:4f8:c2c:fe9c::/64" -> null
      - keep_disk                  = false -> null
      - labels                     = {
          - "cluster"     = "botbiz-production-k8s"
          - "engine"      = "k3s"
          - "provisioner" = "terraform"
          - "role"        = "control_plane_node"
        } -> null
      - location                   = "nbg1" -> null
      - name                       = "botbiz-production-k8s-control-plane-jlr" -> null
      - placement_group_id         = 134623 -> null
      - rebuild_protection         = false -> null
      - rescue                     = "linux64" -> null
      - server_type                = "cpx31" -> null
      - ssh_keys                   = [
          - "10384165",
        ] -> null
      - status                     = "running" -> null
      - user_data                  = "/tzgSRzy9MJHfeGib23VfLR+aaU=" -> null
    }

  # module.kube-hetzner.module.control_planes["0-0-control-plane"].hcloud_server_network.server will be destroyed
  - resource "hcloud_server_network" "server" {
      - alias_ips   = [] -> null
      - id          = "29666577-2605336" -> null
      - ip          = "10.255.0.101" -> null
      - mac_address = "86:00:00:3b:4f:5d" -> null
      - server_id   = 29666577 -> null
      - subnet_id   = "2605336-10.255.0.0/16" -> null
    }

  # module.kube-hetzner.module.control_planes["0-0-control-plane"].null_resource.registries will be destroyed
  - resource "null_resource" "registries" {
      - id       = "2799499625627435179" -> null
      - triggers = {
          - "registries" = " "
        } -> null
    }

  # module.kube-hetzner.module.control_planes["0-0-control-plane"].random_string.identity_file will be destroyed
  - resource "random_string" "identity_file" {
      - id          = "dc1ohfonfm1ezyf9b0a3" -> null
      - length      = 20 -> null
      - lower       = true -> null
      - min_lower   = 0 -> null
      - min_numeric = 0 -> null
      - min_special = 0 -> null
      - min_upper   = 0 -> null
      - number      = true -> null
      - numeric     = true -> null
      - result      = "dc1ohfonfm1ezyf9b0a3" -> null
      - special     = false -> null
      - upper       = false -> null
    }

  # module.kube-hetzner.module.control_planes["0-0-control-plane"].random_string.server will be destroyed
  - resource "random_string" "server" {
      - id          = "jlr" -> null
      - keepers     = {
          - "name" = "botbiz-production-k8s-control-plane"
        } -> null
      - length      = 3 -> null
      - lower       = true -> null
      - min_lower   = 0 -> null
      - min_numeric = 0 -> null
      - min_special = 0 -> null
      - min_upper   = 0 -> null
      - number      = false -> null
      - numeric     = false -> null
      - result      = "jlr" -> null
      - special     = false -> null
      - upper       = false -> null
    }

Plan: 0 to add, 0 to change, 19 to destroy.

mysticaltech commented 1 year ago

@bulnv The only important thing now is the autoscaler_image. You added the Helm releases yourself; those are not from the module, so make sure they are not dependent on the autoscaler_image. There is an implicit or explicit dependency somewhere.

mysticaltech commented 1 year ago

Please just destroy module.kube-hetzner.hcloud_snapshot.autoscaler_image[0] and forget about the first control plane. Then find the implicit or explicit dependencies that make Terraform want to destroy your Helm releases. Fix that, recreate the snapshot through Terraform, and it will be fixed for good!
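
Something along these lines (a sketch using standard Terraform commands, nothing module-specific) should let you confirm the targeted destroy touches only the snapshot:

```sh
# preview exactly what a targeted destroy would remove
terraform plan -destroy -target 'module.kube-hetzner.hcloud_snapshot.autoscaler_image[0]'

# if unexpected resources (e.g. your helm_release ones) show up, look for the dependency
terraform state list | grep -i helm
terraform graph > graph.dot        # inspect edges pointing at the autoscaler_image

# once the plan only contains the snapshot, recreate it
terraform destroy -target 'module.kube-hetzner.hcloud_snapshot.autoscaler_image[0]'
terraform apply
```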

mysticaltech commented 1 year ago

@bulnv The reason why the autoscaler image is important to recreate is that the new version of the project holds the right cloud-init for it. So normally the first control plane node doesn't matter, as the new cloud-init reconfigures everything the right way so that k3s starts correctly on the autoscaled node!

mysticaltech commented 1 year ago

(If you give me the full plan for the destroy of module.kube-hetzner.hcloud_snapshot.autoscaler_image[0], I can try passing it to GPT-4 and ask it why the Helm releases are to be destroyed.)

If you have access to it yourself, the secret is to pass it in multiple chunks.

bulnv commented 1 year ago

> (If you give me the full plan for the destroy of module.kube-hetzner.hcloud_snapshot.autoscaler_image[0], I can try passing it to GPT-4 and ask it why the Helm releases are to be destroyed.)
>
> If you have access to it yourself, the secret is to pass it in multiple chunks.

Hah, that's a good one actually! Luckily I was able to destroy it myself, and apply it. Before that I had tried a hack with changing HCLOUD_IMAGE. All I've got so far: the node boots up, k3s fails, but at least it is installed. Here is the output of the k3s unit log:

 systemctl[3305]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
 k3s[3308]: time="2023-03-16T17:26:29Z" level=fatal msg="--token is required"
 systemd[1]: k3s-agent.service: Main process exited, code=exited, status=1/FAILURE
 systemd[1]: k3s-agent.service: Failed with result 'exit-code'.
 systemd[1]: Failed to start Lightweight Kubernetes.

mysticaltech commented 1 year ago

@bulnv Good progress. Now we need to figure out what's happening.

Please cd /etc/rancher/k3s/, cat the config file, and see if the token is present and if everything looks ok.

Also have a look at /var/pre_install/ and check the content of the files there.

Run ip address show and see if eth1 is present.

Post the output of journalctl -u k3s-agent.

And also the content of /var/log/cloud-init/cloud-init-user.log (if I remember the path correctly).

Somewhere in there the problem should show up; please share the output if needed.
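
All together, roughly (a sketch; exact file names may differ on your image):

```sh
# on the affected autoscaled node
cat /etc/rancher/k3s/config.yaml        # is the token present, does everything look ok?
ls -la /var/pre_install/                 # content of the pre-install files
ip address show                          # is eth1 / the private interface present?
journalctl -u k3s-agent --no-pager       # agent logs
cat /var/log/cloud-init-output.log       # cloud-init output (path may vary)
```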

bulnv commented 1 year ago

> @bulnv Good progress. Now we need to figure out what's happening.
>
> Please cd /etc/rancher/k3s/, cat the config file, and see if the token is present and if everything looks ok.
>
> Also have a look at /var/pre_install/ and check the content of the files there.
>
> Run ip address show and see if eth1 is present.
>
> Post the output of journalctl -u k3s-agent.
>
> And also the content of /var/log/cloud-init/cloud-init-user.log (if I remember the path correctly).
>
> Somewhere in there the problem should show up; please share the output if needed.

Heh! Working on it right now! Here's what I've figured out so far:

- journalctl -eu k3s-agent: I already posted it in the previous message; nothing besides this:

sh[1365]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
systemctl[1366]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
k3s[1369]: time="2023-03-16T20:42:19Z" level=fatal msg="--token is required"
systemd[1]: k3s-agent.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: k3s-agent.service: Failed with result 'exit-code'.


- cat /var/log/cloud-init-output.log 

Cloud-init v. 23.1-1.1 running 'init-local' at Mon, 06 Mar 2023 15:32:36 +0000. Up 15.54 seconds.
2023-03-06 15:32:41,671 - util.py[WARNING]: failed stage init-local
failed run of stage init-local
(same traceback as above, ending in:)
OSError: Unknown user or group: "getpwnam(): name not found: 'systemd-network'"

Cloud-init v. 23.1-1.1 running 'init' at Mon, 06 Mar 2023 15:32:42 +0000. Up 21.54 seconds.
ci-info: eth0 up, 5.75.160.152 (global), default route via 172.31.1.1 on eth0
(SSH host key generation output for root@botbiz-production-k8s-control-plane-jlr omitted)
Cloud-init v. 23.1-1.1 running 'modules:config' at Mon, 06 Mar 2023 15:32:44 +0000. Up 23.62 seconds.
Cloud-init v. 23.1-1.1 running 'modules:final' at Mon, 06 Mar 2023 15:32:44 +0000. Up 24.00 seconds.
sed: can't read /etc/sysconfig/network/config: No such file or directory
sed: can't read /etc/sysconfig/network/dhcp: No such file or directory
sed: can't read /etc/sysconfig/network/config: No such file or directory
Removed "/etc/systemd/system/multi-user.target.wants/rebootmgr.service".
Cloud-init v. 23.1-1.1 finished at Mon, 06 Mar 2023 15:32:45 +0000. Datasource DataSourceHetzner. Up 24.48 seconds

Cloud-init v. 23.1-1.1 running 'init-local' at Mon, 06 Mar 2023 15:33:51 +0000. Up 6.36 seconds.
Cloud-init v. 23.1-1.1 running 'init' at Mon, 06 Mar 2023 15:33:51 +0000. Up 7.11 seconds.
ci-info: eth0 up, 5.75.160.152 (global), default route via 172.31.1.1 on eth0
Cloud-init v. 23.1-1.1 running 'modules:config' at Mon, 06 Mar 2023 15:33:52 +0000. Up 7.71 seconds.
Cloud-init v. 23.1-1.1 running 'modules:final' at Mon, 06 Mar 2023 15:33:52 +0000. Up 8.09 seconds.
Cloud-init v. 23.1-1.1 finished at Mon, 06 Mar 2023 15:33:52 +0000. Datasource DataSourceHetzner. Up 8.16 seconds

Cloud-init v. 23.1-1.1 running 'init-local' at Thu, 16 Mar 2023 20:40:52 +0000. Up 4.58 seconds.
2023-03-16 20:40:57,704 - util.py[WARNING]: failed stage init-local
failed run of stage init-local
(same traceback as above, ending in:)
OSError: Unknown user or group: "getpwnam(): name not found: 'systemd-network'"

Cloud-init v. 23.1-1.1 running 'init' at Thu, 16 Mar 2023 20:40:58 +0000. Up 10.53 seconds.
ci-info: enp7s0 up, 10.0.0.2 (global), 86:00:00:3d:27:c0
ci-info: eth0   up, 128.140.3.5 (global), 96:00:02:02:02:03
ci-info: default routes via 10.0.0.1 on enp7s0 and via 172.31.1.1 on eth0
(SSH host key generation output for root@botbiz-production-k8s-autoscaled-cpx31-21d252b9b56c96a9 omitted)
Cloud-init v. 23.1-1.1 running 'modules:config' at Thu, 16 Mar 2023 20:40:59 +0000. Up 11.62 seconds.
Cloud-init v. 23.1-1.1 running 'modules:final' at Thu, 16 Mar 2023 20:40:59 +0000. Up 12.01 seconds.
++ id -u

The following package is going to be REMOVED: k3s-selinux

1 package to remove. After the operation, 95.0 KiB will be freed. Continue? [y/n/v/...? shows all options] (y): y
(1/1) Removing k3s-selinux-0.0~bd1f1455dirty-0.sle.noarch [...done]

2023-03-16 20:41:10 Application returned with exit status 0.
2023-03-16 20:41:11 Transaction completed.
2023-03-16 20:41:11 tukit 4.1.3 started
2023-03-16 20:41:11 Options: --discard close 4
/var/lib/kubelet/pods not reset as customized by admin to unconfined_u:object_r:container_file_t:s0
/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots not reset as customized by admin to unconfined_u:object_r:container_file_t:s0
2023-03-16 20:41:11 New default snapshot is #4 (/.snapshots/4/snapshot).
2023-03-16 20:41:11 Transaction completed.

Please reboot your machine to activate the changes and avoid data loss.
New default snapshot is #4 (/.snapshots/4/snapshot).
transactional-update finished

bulnv commented 1 year ago

My guess is that ${base64encode(k3s_config)} is not working inside the template file.

mysticaltech commented 1 year ago

It could be, but it's super weird because it works in all my tests and for others too! 🤯

Please debug some more, at this point I am dry! If I can think of something I will let you know.

bulnv commented 1 year ago

> It could be, but it's super weird because it works in all my tests and for others too! 🤯
>
> Please debug some more, at this point I am dry! If I can think of something I will let you know.

Yeah, I changed the tmpl and redeployed; the base64 looks valid, and this is strange. Looks like I'm missing something.

mysticaltech commented 1 year ago

@bulnv Try rebooting the node manually, see if it comes online!

mysticaltech commented 1 year ago

It should reboot after cloud-init; maybe in your case it's not rebooting, so try using the reboot command at the end of the user config instead of the power_state statement.

mysticaltech commented 1 year ago

I am pretty convinced that if you reboot, it will join the cluster. I'm guessing power_state did not work for that old version of MicroOS, so we can use the reboot statement instead.

mysticaltech commented 1 year ago

Also run journalctl -u install-k3s-agent please.

bulnv commented 1 year ago

> Also run journalctl -u install-k3s-agent please.

mysticaltech commented 1 year ago

Good to hear, @bulnv. A PR is most welcome; I am curious about this, it could be an edge case!

bulnv commented 1 year ago

@mysticaltech please see my changes here: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/pull/660. Yes, I agree this could be an edge case. But as far as I understood from the code, the flow is the following:

IMO the configs are copied a bit too early, and the uninstall-k3s script that runs afterwards removes them (as I checked in the uninstall script code: https://github.com/ruifigueiredo/k3s/blob/master/uninstall.sh#L21, see line 21). So I made a change to preserve the configs from deletion during k3s-uninstall. Let me know WDYT? The general idea is sketched below.
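
(Illustrative sketch only, not the exact diff from the PR: the general idea is to stash the config before the uninstall script runs and restore it afterwards.)

```sh
# sketch: the uninstall script removes /etc/rancher/k3s (see line 21), so a config
# that was copied in earlier gets wiped; back it up and restore it after the uninstall.
cp -a /etc/rancher/k3s /tmp/k3s-config-backup      # hypothetical backup step
sh /usr/local/bin/k3s-agent-uninstall.sh            # path may differ per install method
mkdir -p /etc/rancher
cp -a /tmp/k3s-config-backup /etc/rancher/k3s       # restore the config afterwards
```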

mysticaltech commented 1 year ago

@bulnv Yep, exactly, that's what's happening. But in my testing it worked, so the timing must vary depending on the machine; hence the bug went unnoticed until it showed up for you. Good catch!!

mysticaltech commented 1 year ago

@bulnv Your fix was just released in 1.10.6 🙏🚀