Autoscaler nodes are not responsive

bulnv commented 1 year ago

Description

Hey! I am trying to spin up autoscaler on my cluster. The autoscaler pod is working fine. Servers are spawning when its needed, they even show up as green in hetzner console, but they aren't able to join the cluster or let me in via ssh. Please see the screenshot below what I was able to get from console. What are the options? can I login with the password on the console?

Kube.tf file

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = local.hcloud_token
  source = "kube-hetzner/kube-hetzner/hcloud"
  version = "1.9.8"
  ssh_public_key  = file("./id_ed25519_k8s.pub")
  ssh_private_key = file("./id_ed25519_k8s")
  network_region  = "eu-central" # change to `us-east` if location is ash
  load_balancer_type     = "lb11"
  load_balancer_location = "nbg1"
  cni_plugin = "calico"
  control_plane_nodepools = [    {
      name        = "control-plane",
      server_type = "cpx31",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    } 
    ]
  agent_nodepools =  [
    {
      name        = "agent",
      server_type = "cpx31",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 4
    }
  ]
  autoscaler_nodepools = [
    {
      name        = "autoscaled-cpx31"
      server_type = "cpx31" # must be same or better than the control_plane server type (regarding disk size)!
      location    = "nbg1"
      min_nodes   = 0
      max_nodes   = 1
  }]

  enable_cert_manager = true
  # etcd_s3_backup = {
  #   etcd-s3-endpoint   = "***"
  #   etcd-s3-access-key = "k8s-backups"
  #   etcd-s3-secret-key = "***"
  #   etcd-s3-bucket     = "k8s-backups"
  # }
  automatically_upgrade_k3s = false
  automatically_upgrade_os = false
  cluster_name = format("%s-%s-k8s", local.project, local.env)
  restrict_outbound_traffic = false
  disable_network_policy = true
}

resource "kubernetes_namespace" "this" {
  for_each = { for k, v in local.namespaces: k => v}
  metadata {
    name = each.value.name
  }
}

Screenshots

Screenshot from 2023-03-15 21-22-43

Platform

linux

bulnv commented 1 year ago

the machine is available sometime from local-hetzner-network, so I was able to get some logs. k3s-agent not installed. Here what I've found from the cloud-init-output log, posting here so far only errors cause its huge

2023-03-06 15:32:41,671 - util.py[WARNING]: failed stage init-local
failed run of stage init-local
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/cloudinit/util.py", line 1608, in chownbyname
    uid = pwd.getpwnam(user).pw_uid
KeyError: "getpwnam(): name not found: 'systemd-network'"

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/cloudinit/cmd/main.py", line 767, in status_wrapper
    ret = functor(name, args)
  File "/usr/lib/python3.10/site-packages/cloudinit/cmd/main.py", line 433, in main_init
    init.apply_network_config(bring_up=bring_up_interfaces)
  File "/usr/lib/python3.10/site-packages/cloudinit/stages.py", line 939, in apply_network_config
    return self.distro.apply_network_config(
  File "/usr/lib/python3.10/site-packages/cloudinit/distros/__init__.py", line 278, in apply_network_config
    self._write_network_state(network_state, renderer)
  File "/usr/lib/python3.10/site-packages/cloudinit/distros/__init__.py", line 167, in _write_network_state
    renderer.render_network_state(network_state)
  File "/usr/lib/python3.10/site-packages/cloudinit/net/networkd.py", line 306, in render_network_state
    self.create_network_file(k, v, network_dir)
  File "/usr/lib/python3.10/site-packages/cloudinit/net/networkd.py", line 290, in create_network_file
    util.chownbyname(net_fn, net_fn_owner, net_fn_owner)
  File "/usr/lib/python3.10/site-packages/cloudinit/util.py", line 1612, in chownbyname
    raise OSError("Unknown user or group: %s" % (e)) from e
OSError: Unknown user or group: "getpwnam(): name not found: 'systemd-network'"

sed: can't read /etc/sysconfig/network/config: No such file or directory
sed: can't read /etc/sysconfig/network/dhcp: No such file or directory
sed: can't read /etc/sysconfig/network/config: No such file or directory
+ curl -sfL https://get.k3s.io
+ INSTALL_K3S_SKIP_START=true
+ INSTALL_K3S_SKIP_SELINUX_RPM=true
+ INSTALL_K3S_CHANNEL=v1.25
+ INSTALL_K3S_EXEC=agent
+ sh -
+ /sbin/semodule -v -i /usr/share/selinux/packages/k3s.pp
Attempting to install module '/usr/share/selinux/packages/k3s.pp':
Ok: return value of 0.
Committing changes:
Ok: transaction number 7.
Failed to start k3s-agent.service: Unit k3s-agent.service not found.
2023-03-15 21:38:04,372 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2023-03-15 21:38:04,373 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3.10/site-packages/cloudinit/config/cc_scripts_user.py'>) failed
Cloud-init v. 23.1-1.1 finished at Wed, 15 Mar 2023 21:38:04 +0000. Datasource DataSourceHetzner.  Up 50.61 seconds

mysticaltech commented 1 year ago

@bulnv This is caused by an old bug. Here's what to try (basically recreate the snapshot used by the autoscaler after upgrading):

Before anything, delete all failed autoscaled nodes with the hcloud cli or the UI.
Upgrade to the latest by removing the version attribute in your kube.tf if you have it and running terraform init -upgrade.
Make sure you have your control-plane in HA, with at least 3 nodes, change and apply if necessary to make it so.
Use terraform state list, to find the name of your first control plane, alternatively look in your terraform.tfstate file. Usually it starts with "0-0-control-plane-*', you have to find its node name too, the name use in k3s and hcloud.
Drain that node with kubectl drain node <first-control-plane-node-name>.
Terraform destroy that node with terraform destroy -target module.kube-hetzner.module.control_planes["0-0-control-plane-*"].hcloud_server.server
Now do the same thing for the snapshot: terraform destroy -target module.kube-hetzner.hcloud_snapshot.autoscaler_image[0]
Runterraform plan and you should see that it wants to create the first control plane node again and the snapshot, that's what we want.
Terraform apply.

bulnv commented 1 year ago

@mysticaltech thank you so much for a detailed answer (as usual). Everything seems clear. Will try it out.

bulnv commented 1 year ago

@mysticaltech just tried what you mentioned and have some updates. I've successfully scaled masters up to 3 nodes. But when I am trying to destroy the first node I've got in a plan to destroy 29 resources including null resources for all the nodes, the load balancer, and all my helm releases. So I guess this terraform action can just blow up the cluster. So I did next:

I went to hetzner console and made a snapshot of the fresh master node
got a new snapshot ID
changed HCLOUD_SNAPSHOT value in auto-scaler deployment and rolled it out
scaled new node with new autoscaler and got the same result

looking forward to hearing from you any ideas

bulnv commented 1 year ago

only one thing, I am still on a relatively fresh "1.9.8" version. Does this make sense?

mysticaltech commented 1 year ago

@bulnv If you scaled successfully, that's all that matters! Just recreating the snapshot was my first advice in a similar issue.

bulnv commented 1 year ago

@bulnv If you scaled successfully, that's all that matters! Just recreating the snapshot was my first advice in a similar issue.

Nonono =)), I mean the problem persists even with different snapshot, freshly taken from the different master!

mysticaltech commented 1 year ago

So running that terraform destroy -target module.kube-hetzner.hcloud_snapshot.autoscaler_image[0] and recreating via terraform did not work correct?

mysticaltech commented 1 year ago

Also, 1.9.8 is way too old, that's the reason! Please update to 1.10.3. (1.10.4 has a small issue I will fix later today).

bulnv commented 1 year ago

So running that terraform destroy -target module.kube-hetzner.hcloud_snapshot.autoscaler_image[0] and recreating via terraform did not work correct?

i haven't tried (check above) cause it going to destroy huge part of cluster including loadbalancer and helm releases. I have no chance to do that

bulnv commented 1 year ago

Also, 1.9.8 is way too old, that's the reason! Please update to 1.10.3. (1.10.4 has a small issue I will fix later today).

ok, will try out

mysticaltech commented 1 year ago

So running that terraform destroy -target module.kube-hetzner.hcloud_snapshot.autoscaler_image[0] and recreating via terraform did not work correct?

i haven't tried (check above) cause it going to destroy huge part of cluster including loadbalancer and helm releases. I have no chance to do that

Impossible, that will just delete the autoscaler image! Please try again, or post the plan to prove me wrong haha.

mysticaltech commented 1 year ago

But upgrade to 1.10.3 please!

mysticaltech commented 1 year ago

Change the version and then terraform init -upgrade.

bulnv commented 1 year ago

Change the version and then terraform init -upgrade.

Sure thing, on my way

bulnv commented 1 year ago

@mysticaltech sorry for confusion. i've managed to left my helms untouched, here is the actual plan. Does it looks safe and gonna remove only one master node?


Terraform will perform the following actions:

  # local_file.kubeconfig will be destroyed
  - resource "local_file" "kubeconfig" {
      - content              = (sensitive value) -> null
      - directory_permission = "0777" -> null
      - file_permission      = "0777" -> null
      - filename             = "/home/nbuashev/.kube/hetzner" -> null
      - id                   = "aa5700bc5d5073740709b19c5974be5645d206c1" -> null
    }

  # module.kube-hetzner.hcloud_snapshot.autoscaler_image[0] will be destroyed
  - resource "hcloud_snapshot" "autoscaler_image" {
      - description = "Initial snapshot used for autoscaler" -> null
      - id          = "103210203" -> null
      - labels      = {
          - "autoscaler"  = "true"
          - "cluster"     = "botbiz-production-k8s"
          - "engine"      = "k3s"
          - "provisioner" = "terraform"
        } -> null
      - server_id   = 29666577 -> null
    }

  # module.kube-hetzner.local_file.kustomization_backup[0] will be destroyed
  - resource "local_file" "kustomization_backup" {
      - content              = <<-EOT
            "apiVersion": "kustomize.config.k8s.io/v1beta1"
            "kind": "Kustomization"
            "patchesStrategicMerge":
            - |
              apiVersion: apps/v1
              kind: Deployment
              metadata:
                name: system-upgrade-controller
                namespace: system-upgrade
              spec:
                template:
                  spec:
                    containers:
                      - name: system-upgrade-controller
                        volumeMounts:
                          - name: ca-certificates
                            mountPath: /var/lib/ca-certificates
                    volumes:
                      - name: ca-certificates
                        hostPath:
                          path: /var/lib/ca-certificates
                          type: Directory
            - "kured.yaml"
            - "ccm.yaml"
            - "calico.yaml"
            "resources":
            - "https://github.com/hetznercloud/hcloud-cloud-controller-manager/releases/download/v1.14.1/ccm-networks.yaml"
            - "https://github.com/weaveworks/kured/releases/download/1.12.2/kured-1.12.2-dockerhub.yaml"
            - "https://raw.githubusercontent.com/rancher/system-upgrade-controller/master/manifests/system-upgrade-controller.yaml"
            - "hcloud-csi.yml"
            - "traefik_ingress.yaml"
            - "https://raw.githubusercontent.com/projectcalico/calico/v3.25.0/manifests/calico.yaml"
            - "cert_manager.yaml"
        EOT -> null
      - content_base64sha256 = "IfadrAeHqYq5kmPHEYPl2ja5LLhrkM+BNPqUZm3MYuE=" -> null
      - content_base64sha512 = "DHwgzdaNht95Zb2kiBKCoS0kqyKBsg/kId3CmXmL1LWhysrjFMe7ReuhCUsokJq5IbSFFcfVe787uiQ2OsvM5Q==" -> null
      - content_md5          = "f1b53f4455e272a9104a45e823ded9fe" -> null
      - content_sha1         = "bc9acc1c97b26d8f668f5134913c57e63d3eec48" -> null
      - content_sha256       = "21f69dac0787a98ab99263c71183e5da36b92cb86b90cf8134fa94666dcc62e1" -> null
      - content_sha512       = "0c7c20cdd68d86df7965bda4881282a12d24ab2281b20fe421ddc299798bd4b5a1cacae314c7bb45eba1094b28909ab921b48515c7d57bbf3bba24363acbcce5" -> null
      - directory_permission = "0777" -> null
      - file_permission      = "600" -> null
      - filename             = "botbiz-production-k8s_kustomization_backup.yaml" -> null
      - id                   = "bc9acc1c97b26d8f668f5134913c57e63d3eec48" -> null
    }

  # module.kube-hetzner.local_sensitive_file.kubeconfig[0] will be destroyed
  - resource "local_sensitive_file" "kubeconfig" {
      - content              = (sensitive value)
      - directory_permission = "0700" -> null
      - file_permission      = "600" -> null
      - filename             = "botbiz-production-k8s_kubeconfig.yaml" -> null
      - id                   = "aa5700bc5d5073740709b19c5974be5645d206c1" -> null
    }

  # module.kube-hetzner.null_resource.agents["0-0-agent"] will be destroyed
  - resource "null_resource" "agents" {
      - id       = "3012830062625593000" -> null
      - triggers = {
          - "agent_id" = "29666578"
        } -> null
    }

  # module.kube-hetzner.null_resource.agents["0-1-agent"] will be destroyed
  - resource "null_resource" "agents" {
      - id       = "5793387825081006408" -> null
      - triggers = {
          - "agent_id" = "29666576"
        } -> null
    }

  # module.kube-hetzner.null_resource.agents["0-2-agent"] will be destroyed
  - resource "null_resource" "agents" {
      - id       = "2083069504276235938" -> null
      - triggers = {
          - "agent_id" = "29915387"
        } -> null
    }

  # module.kube-hetzner.null_resource.agents["0-3-agent"] will be destroyed
  - resource "null_resource" "agents" {
      - id       = "4994720741323820131" -> null
      - triggers = {
          - "agent_id" = "29915384"
        } -> null
    }

  # module.kube-hetzner.null_resource.configure_autoscaler[0] will be destroyed
  - resource "null_resource" "configure_autoscaler" {
      - id       = "5520672443837788890" -> null
      - triggers = {
          - "template" = >>>EOT
            EOT
        } -> null
    }

  # module.kube-hetzner.null_resource.control_planes["0-0-control-plane"] will be destroyed
  - resource "null_resource" "control_planes" {
      - id       = "7282537881054730684" -> null
      - triggers = {
          - "control_plane_id" = "29666577"
        } -> null
    }

  # module.kube-hetzner.null_resource.control_planes["0-1-control-plane"] will be destroyed
  - resource "null_resource" "control_planes" {
      - id       = "7092987203276066952" -> null
      - triggers = {
          - "control_plane_id" = "30069994"
        } -> null
    }

  # module.kube-hetzner.null_resource.control_planes["0-2-control-plane"] will be destroyed
  - resource "null_resource" "control_planes" {
      - id       = "2153595407490067548" -> null
      - triggers = {
          - "control_plane_id" = "30069993"
        } -> null
    }

  # module.kube-hetzner.null_resource.first_control_plane will be destroyed
  - resource "null_resource" "first_control_plane" {
      - id = "4534639624935671890" -> null
    }

  # module.kube-hetzner.null_resource.kustomization will be destroyed
  - resource "null_resource" "kustomization" {
      - id       = "2531306948694365826" -> null
      - triggers = {
          - "helm_values_yaml" = (sensitive value)
          - "options"          = ""
          - "versions"         = <<-EOT
                v1.25.0
                N/A
                N/A
                N/A
                N/A
            EOT
        } -> null
    }

  # module.kube-hetzner.module.control_planes["0-0-control-plane"].hcloud_server.server will be destroyed
  - resource "hcloud_server" "server" {
      - allow_deprecated_images    = false -> null
      - backups                    = false -> null
      - datacenter                 = "nbg1-dc3" -> null
      - delete_protection          = false -> null
      - firewall_ids               = [
          - 764142,
        ] -> null
      - id                         = "29666577" -> null
      - ignore_remote_firewall_ids = false -> null
      - image                      = "ubuntu-20.04" -> null
      - ipv4_address               = "5.75.160.152" -> null
      - ipv6_address               = "2a01:4f8:c2c:fe9c::1" -> null
      - ipv6_network               = "2a01:4f8:c2c:fe9c::/64" -> null
      - keep_disk                  = false -> null
      - labels                     = {
          - "cluster"     = "botbiz-production-k8s"
          - "engine"      = "k3s"
          - "provisioner" = "terraform"
          - "role"        = "control_plane_node"
        } -> null
      - location                   = "nbg1" -> null
      - name                       = "botbiz-production-k8s-control-plane-jlr" -> null
      - placement_group_id         = 134623 -> null
      - rebuild_protection         = false -> null
      - rescue                     = "linux64" -> null
      - server_type                = "cpx31" -> null
      - ssh_keys                   = [
          - "10384165",
        ] -> null
      - status                     = "running" -> null
      - user_data                  = "/tzgSRzy9MJHfeGib23VfLR+aaU=" -> null
    }

  # module.kube-hetzner.module.control_planes["0-0-control-plane"].hcloud_server_network.server will be destroyed
  - resource "hcloud_server_network" "server" {
      - alias_ips   = [] -> null
      - id          = "29666577-2605336" -> null
      - ip          = "10.255.0.101" -> null
      - mac_address = "86:00:00:3b:4f:5d" -> null
      - server_id   = 29666577 -> null
      - subnet_id   = "2605336-10.255.0.0/16" -> null
    }

  # module.kube-hetzner.module.control_planes["0-0-control-plane"].null_resource.registries will be destroyed
  - resource "null_resource" "registries" {
      - id       = "2799499625627435179" -> null
      - triggers = {
          - "registries" = " "
        } -> null
    }

  # module.kube-hetzner.module.control_planes["0-0-control-plane"].random_string.identity_file will be destroyed
  - resource "random_string" "identity_file" {
      - id          = "dc1ohfonfm1ezyf9b0a3" -> null
      - length      = 20 -> null
      - lower       = true -> null
      - min_lower   = 0 -> null
      - min_numeric = 0 -> null
      - min_special = 0 -> null
      - min_upper   = 0 -> null
      - number      = true -> null
      - numeric     = true -> null
      - result      = "dc1ohfonfm1ezyf9b0a3" -> null
      - special     = false -> null
      - upper       = false -> null
    }

  # module.kube-hetzner.module.control_planes["0-0-control-plane"].random_string.server will be destroyed
  - resource "random_string" "server" {
      - id          = "jlr" -> null
      - keepers     = {
          - "name" = "botbiz-production-k8s-control-plane"
        } -> null
      - length      = 3 -> null
      - lower       = true -> null
      - min_lower   = 0 -> null
      - min_numeric = 0 -> null
      - min_special = 0 -> null
      - min_upper   = 0 -> null
      - number      = false -> null
      - numeric     = false -> null
      - result      = "jlr" -> null
      - special     = false -> null
      - upper       = false -> null
    }

Plan: 0 to add, 0 to change, 19 to destroy.```

mysticaltech commented 1 year ago

@bulnv The only thing important now is the autoscaler_image, so basically you add helm releases, those are not from the module, make sure they are not dependent on the autoscaler_image, there is an implicit or explicit dependency somewhere.

mysticaltech commented 1 year ago

Please just destroy module.kube-hetzner.hcloud_snapshot.autoscaler_image[0], forget about the first control plane. And find the implicit or explicit dependencies that create the need to destroy your helm releases. You fix this, and you recreate the snapshot through terraform and it will be fixed for good!

mysticaltech commented 1 year ago

@bulnv The reason why the autoscaler image is important to recreate is that the new version of the project holds the right cloud-init for it. So normally the first control plane node doesn't matter, as the new cloud-init reconfigures everything the right way so that k3s starts correctly on the autoscaled node!

mysticaltech commented 1 year ago

(If you give me the full plan for destroy of module.kube-hetzner.hcloud_snapshot.autoscaler_image[0], I can try passing it to GPT-4, and we ask it why the helm releases are to be destroyed).

If you have access to it yourself, the secret is to pass it in multiple chuncks.

bulnv commented 1 year ago

(If you give me the full plan for destroy of module.kube-hetzner.hcloud_snapshot.autoscaler_image[0], I can try passing it to GPT-4, and we ask it why the helm releases are to be destroyed).

If you have access to it yourself, the secret is to pass it in multiple chuncks.

hah, that's a good one actually! luckily I was able destroy it by myself. And apply it. Before that I've tried to use hack with changing HCLOUD_IMAGE. All what I've get so far: node is booting up, k3s failed, but at least installed. Here is the output of the k3s unit log

 systemctl[3305]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
 k3s[3308]: time="2023-03-16T17:26:29Z" level=fatal msg="--token is required"
 systemd[1]: k3s-agent.service: Main process exited, code=exited, status=1/FAILURE
 systemd[1]: k3s-agent.service: Failed with result 'exit-code'.
 systemd[1]: Failed to start Lightweight Kubernetes.

mysticaltech commented 1 year ago

@bulnv Good progress. Now we need to figure out what's happening.

Please cd /etc/rancher/k3s/, cat the config file and see if the token is present and if everything looks ok.

Also have a look at /var/pre_install/ see the content of the file there.

Run ip address show, see if eth1 is present.

Post the output of journalctl -u k3s-agent.

And also the content of /var/log/cloud-init/cloud-init-user.log (if I remember correctly the path).

Somewhere above the problem should show up, please share the output if needed.

bulnv commented 1 year ago

@bulnv Good progress. Now we need to figure out what's happening.

Please cd /etc/rancher/k3s/, cat the config file and see if the token is present and if everything looks ok.

Also have a look at /var/pre_install/ see the content of the file there.

Run ip address show, see if eth1 is present.

Post the output of journalctl -u k3s-agent.

And also the content of /var/log/cloud-init/cloud-init-user.log (if I remember correctly the path).

Somewhere above the problem should show up, please share the output if needed.

Heh! Working on it right now! what I've figured out so far:

/etc/rancher/k3s/ missing
/etc/rancher/k3s/config.yaml missing as well
content: ${base64encode(k3s_config)} encoding: base64 path: /etc/rancher/k3s/config.yaml
content: ${base64encode(k3s_registries)} encoding: base64 path: /etc/rancher/k3s/registries.yaml
```
maybe this parg of decoded HCLOUD_CLOUD_INIT somehow involved
```


❯ ip a
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever
inet6 ::1/128 scope host 
   valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 96:00:02:02:02:03 brd ff:ff:ff:ff:ff:ff
altname enp1s0
inet 128.140.3.5/32 scope global dynamic noprefixroute eth0
   valid_lft 85604sec preferred_lft 85604sec
inet6 fe80::5493:ec5b:9b3e:4337/64 scope link noprefixroute 
   valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UP group default qlen 1000
link/ether 86:00:00:3d:27:c0 brd ff:ff:ff:ff:ff:ff
altname enp7s0
inet 10.0.0.2/32 scope global dynamic noprefixroute eth1
   valid_lft 85604sec preferred_lft 85604sec
inet6 fe80::c3a:f0de:792d:1f79/64 scope link noprefixroute 
   valid_lft forever preferred_lft forever

- journalctl -eu k3s-agent I gave in the previous message, nothing besides that

9 sh[1365]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service 9 systemctl[1366]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory 9 k3s[1369]: time="2023-03-16T20:42:19Z" level=fatal msg="--token is required" 9 systemd[1]: k3s-agent.service: Main process exited, code=exited, status=1/FAILURE 9 systemd[1]: k3s-agent.service: Failed with result 'exit-code'. lines 1-64


- cat /var/log/cloud-init-output.log

Cloud-init v. 23.1-1.1 running 'init-local' at Mon, 06 Mar 2023 15:32:36 +0000. Up 15.54 seconds. 2023-03-06 15:32:41,671 - util.py[WARNING]: failed stage init-local failed run of stage init-local

Traceback (most recent call last): File "/usr/lib/python3.10/site-packages/cloudinit/util.py", line 1608, in chownbyname uid = pwd.getpwnam(user).pw_uid KeyError: "getpwnam(): name not found: 'systemd-network'"

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/lib/python3.10/site-packages/cloudinit/cmd/main.py", line 767, in status_wrapper ret = functor(name, args) File "/usr/lib/python3.10/site-packages/cloudinit/cmd/main.py", line 433, in main_init init.apply_network_config(bring_up=bring_up_interfaces) File "/usr/lib/python3.10/site-packages/cloudinit/stages.py", line 939, in apply_network_config return self.distro.apply_network_config( File "/usr/lib/python3.10/site-packages/cloudinit/distros/init.py", line 278, in apply_network_config self._write_network_state(network_state, renderer) File "/usr/lib/python3.10/site-packages/cloudinit/distros/init.py", line 167, in _write_network_state renderer.render_network_state(network_state) File "/usr/lib/python3.10/site-packages/cloudinit/net/networkd.py", line 306, in render_network_state self.create_network_file(k, v, network_dir) File "/usr/lib/python3.10/site-packages/cloudinit/net/networkd.py", line 290, in create_network_file util.chownbyname(net_fn, net_fn_owner, net_fn_owner) File "/usr/lib/python3.10/site-packages/cloudinit/util.py", line 1612, in chownbyname raise OSError("Unknown user or group: %s" % (e)) from e OSError: Unknown user or group: "getpwnam(): name not found: 'systemd-network'"

Cloud-init v. 23.1-1.1 running 'init' at Mon, 06 Mar 2023 15:32:42 +0000. Up 21.54 seconds. ci-info: ++++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++++ ci-info: +--------+------+------------------------------+-----------------+--------+-------------------+ ci-info:	Device	Up	Address	Mask	Scope	Hw-Address	ci-info: +--------+------+------------------------------+-----------------+--------+-------------------+ ci-info:	eth0	True	5.75.160.152	255.255.255.255	global	96:00:01:f9:ab:df	ci-info:	eth0	True	fe80::c784:df14:b615:4bfb/64	.	link	96:00:01:f9:ab:df	ci-info:	lo	True	127.0.0.1	255.0.0.0	host	.	ci-info:	lo	True	::1/128	.	host	.	ci-info: +--------+------+------------------------------+-----------------+--------+-------------------+ ci-info: +++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++ ci-info: +-------+-------------+------------+-----------------+-----------+-------+ ci-info:	Route	Destination	Gateway	Genmask	Interface	Flags	ci-info: +-------+-------------+------------+-----------------+-----------+-------+ ci-info:	0	0.0.0.0	172.31.1.1	0.0.0.0	eth0	UG	ci-info:	1	172.31.1.1	0.0.0.0	255.255.255.255	eth0	UH	ci-info: +-------+-------------+------------+-----------------+-----------+-------+ ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++ ci-info: +-------+-------------+---------+-----------+-------+ ci-info:	Route	Destination	Gateway	Interface	Flags	ci-info: +-------+-------------+---------+-----------+-------+ ci-info:	1	fe80::/64	::	eth0	U	ci-info:	3	multicast	::	eth0	U	ci-info: +-------+-------------+---------+-----------+-------+ Generating public/private rsa key pair. Your identification has been saved in /etc/ssh/ssh_host_rsa_key Your public key has been saved in /etc/ssh/ssh_host_rsa_key.pub The key fingerprint is: SHA256:/mX33L/ySw0Fv5nifffjQzGBsCQYhY+qE+SmI7Y7S00 root@botbiz-production-k8s-control-plane-jlr The key's randomart image is: +---[RSA 3072]----+	.=o o. o		o o .. +		o . +		. . . ++		oE . S ..+o		o+ . . . oo.		.o.o . o.oo+		o= o . o o.==		o+* . . +=%	+----[SHA256]-----+ Generating public/private dsa key pair. Your identification has been saved in /etc/ssh/ssh_host_dsa_key Your public key has been saved in /etc/ssh/ssh_host_dsa_key.pub The key fingerprint is: SHA256:Z2i1kCPIAJY1saWeE7wYAXySEjl6BYJv51LAKZB62cI root@botbiz-production-k8s-control-plane-jlr The key's randomart image is: +---[DSA 1024]----+	@O**..		O*+. .		+*oXo . + .		o E B . = .		= X S +		. o . o		.

+----[SHA256]-----+ Generating public/private ecdsa key pair. Your identification has been saved in /etc/ssh/ssh_host_ecdsa_key Your public key has been saved in /etc/ssh/ssh_host_ecdsa_key.pub The key fingerprint is: SHA256:GDYiQdnLkURLtEBLjJYfsdOdjqUmUqpFPvtCNTfJH1Y root@botbiz-production-k8s-control-plane-jlr The key's randomart image is: +---[ECDSA 256]---+	=BB*.		.+o**o. . E		. ====+= .		ooo*oX+o		o+o *.=S.		..ooo .		...		..		..	+----[SHA256]-----+ Generating public/private ed25519 key pair. Your identification has been saved in /etc/ssh/ssh_host_ed25519_key Your public key has been saved in /etc/ssh/ssh_host_ed25519_key.pub The key fingerprint is: SHA256:Bu2s6sozf8LH6gzDpSCZ7RoqzbG4Z/6CwPTxfL8q9F4 root@botbiz-production-k8s-control-plane-jlr The key's randomart image is: +--[ED25519 256]--+
.
. .
= . +
B o = S
o=.+ + +
o=*+..+ .E
=+X+oo+...
==+XO=oo...

+----[SHA256]-----+ Cloud-init v. 23.1-1.1 running 'modules:config' at Mon, 06 Mar 2023 15:32:44 +0000. Up 23.62 seconds. Cloud-init v. 23.1-1.1 running 'modules:final' at Mon, 06 Mar 2023 15:32:44 +0000. Up 24.00 seconds. sed: can't read /etc/sysconfig/network/config: No such file or directory sed: can't read /etc/sysconfig/network/dhcp: No such file or directory sed: can't read /etc/sysconfig/network/config: No such file or directory Removed "/etc/systemd/system/multi-user.target.wants/rebootmgr.service". Cloud-init v. 23.1-1.1 finished at Mon, 06 Mar 2023 15:32:45 +0000. Datasource DataSourceHetzner. Up 24.48 seconds Cloud-init v. 23.1-1.1 running 'init-local' at Mon, 06 Mar 2023 15:33:51 +0000. Up 6.36 seconds. Cloud-init v. 23.1-1.1 running 'init' at Mon, 06 Mar 2023 15:33:51 +0000. Up 7.11 seconds. ci-info: ++++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++++ ci-info: +--------+------+------------------------------+-----------------+--------+-------------------+ ci-info: | Device | Up | Address | Mask | Scope | Hw-Address | ci-info: +--------+------+------------------------------+-----------------+--------+-------------------+ ci-info: | eth0 | True | 5.75.160.152 | 255.255.255.255 | global | 96:00:01:f9:ab:df | ci-info: | eth0 | True | fe80::c784:df14:b615:4bfb/64 | . | link | 96:00:01:f9:ab:df | ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . | ci-info: | lo | True | ::1/128 | . | host | . | ci-info: +--------+------+------------------------------+-----------------+--------+-------------------+ ci-info: +++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++ ci-info: +-------+-------------+------------+-----------------+-----------+-------+ ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags | ci-info: +-------+-------------+------------+-----------------+-----------+-------+ ci-info: | 0 | 0.0.0.0 | 172.31.1.1 | 0.0.0.0 | eth0 | UG | ci-info: | 1 | 172.31.1.1 | 0.0.0.0 | 255.255.255.255 | eth0 | UH | ci-info: +-------+-------------+------------+-----------------+-----------+-------+ ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++ ci-info: +-------+-------------+---------+-----------+-------+ ci-info: | Route | Destination | Gateway | Interface | Flags | ci-info: +-------+-------------+---------+-----------+-------+ ci-info: | 1 | fe80::/64 | :: | eth0 | U | ci-info: | 3 | multicast | :: | eth0 | U | ci-info: +-------+-------------+---------+-----------+-------+ Cloud-init v. 23.1-1.1 running 'modules:config' at Mon, 06 Mar 2023 15:33:52 +0000. Up 7.71 seconds. Cloud-init v. 23.1-1.1 running 'modules:final' at Mon, 06 Mar 2023 15:33:52 +0000. Up 8.09 seconds. Cloud-init v. 23.1-1.1 finished at Mon, 06 Mar 2023 15:33:52 +0000. Datasource DataSourceHetzner. Up 8.16 seconds Cloud-init v. 23.1-1.1 running 'init-local' at Thu, 16 Mar 2023 20:40:52 +0000. Up 4.58 seconds. 2023-03-16 20:40:57,704 - util.py[WARNING]: failed stage init-local failed run of stage init-local

Traceback (most recent call last): File "/usr/lib/python3.10/site-packages/cloudinit/util.py", line 1608, in chownbyname uid = pwd.getpwnam(user).pw_uid KeyError: "getpwnam(): name not found: 'systemd-network'"

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/lib/python3.10/site-packages/cloudinit/cmd/main.py", line 767, in status_wrapper ret = functor(name, args) File "/usr/lib/python3.10/site-packages/cloudinit/cmd/main.py", line 433, in main_init init.apply_network_config(bring_up=bring_up_interfaces) File "/usr/lib/python3.10/site-packages/cloudinit/stages.py", line 939, in apply_network_config return self.distro.apply_network_config( File "/usr/lib/python3.10/site-packages/cloudinit/distros/init.py", line 278, in apply_network_config self._write_network_state(network_state, renderer) File "/usr/lib/python3.10/site-packages/cloudinit/distros/init.py", line 167, in _write_network_state renderer.render_network_state(network_state) File "/usr/lib/python3.10/site-packages/cloudinit/net/networkd.py", line 306, in render_network_state self.create_network_file(k, v, network_dir) File "/usr/lib/python3.10/site-packages/cloudinit/net/networkd.py", line 290, in create_network_file util.chownbyname(net_fn, net_fn_owner, net_fn_owner) File "/usr/lib/python3.10/site-packages/cloudinit/util.py", line 1612, in chownbyname raise OSError("Unknown user or group: %s" % (e)) from e OSError: Unknown user or group: "getpwnam(): name not found: 'systemd-network'"

Cloud-init v. 23.1-1.1 running 'init' at Thu, 16 Mar 2023 20:40:58 +0000. Up 10.53 seconds. ci-info: ++++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++++ ci-info: +--------+------+------------------------------+-----------------+--------+-------------------+ ci-info: | Device | Up | Address | Mask | Scope | Hw-Address | ci-info: +--------+------+------------------------------+-----------------+--------+-------------------+ ci-info: | enp7s0 | True | 10.0.0.2 | 255.255.255.255 | global | 86:00:00:3d:27:c0 | ci-info: | enp7s0 | True | fe80::58a2:41c7:301b:10b0/64 | . | link | 86:00:00:3d:27:c0 | ci-info: | eth0 | True | 128.140.3.5 | 255.255.255.255 | global | 96:00:02:02:02:03 | ci-info: | eth0 | True | fe80::9cfa:c039:f3b7:6596/64 | . | link | 96:00:02:02:02:03 | ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . | ci-info: | lo | True | ::1/128 | . | host | . | ci-info: +--------+------+------------------------------+-----------------+--------+-------------------+ ci-info: +++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++ ci-info: +-------+-------------+------------+-----------------+-----------+-------+ ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags | ci-info: +-------+-------------+------------+-----------------+-----------+-------+ ci-info: | 0 | 0.0.0.0 | 10.0.0.1 | 0.0.0.0 | enp7s0 | UG | ci-info: | 1 | 0.0.0.0 | 172.31.1.1 | 0.0.0.0 | eth0 | UG | ci-info: | 2 | 10.0.0.0 | 10.0.0.1 | 255.0.0.0 | enp7s0 | UG | ci-info: | 3 | 10.0.0.1 | 0.0.0.0 | 255.255.255.255 | enp7s0 | UH | ci-info: | 4 | 172.31.1.1 | 0.0.0.0 | 255.255.255.255 | eth0 | UH | ci-info: +-------+-------------+------------+-----------------+-----------+-------+ ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++ ci-info: +-------+-------------+---------+-----------+-------+ ci-info: | Route | Destination | Gateway | Interface | Flags | ci-info: +-------+-------------+---------+-----------+-------+ ci-info: | 1 | fe80::/64 | :: | enp7s0 | U | ci-info: | 2 | fe80::/64 | :: | eth0 | U | ci-info: | 4 | multicast | :: | enp7s0 | U | ci-info: | 5 | multicast | :: | eth0 | U | ci-info: +-------+-------------+---------+-----------+-------+ Generating public/private rsa key pair. Your identification has been saved in /etc/ssh/ssh_host_rsa_key Your public key has been saved in /etc/ssh/ssh_host_rsa_key.pub The key fingerprint is: SHA256:D1VnepUBoe8uK/89aUMStLs7RBnRRcejBp/yyTakvEQ root@botbiz-production-k8s-autoscaled-cpx31-21d252b9b56c96a9 The key's randomart image is: +---[RSA 3072]----+ | .+B+O| | o. =o| | ..= B .| | . E.% | | S o O.+ | | o +.X . | | o +.= .| | . o.o.= | | oo+++.o| +----[SHA256]-----+ Generating public/private dsa key pair. Your identification has been saved in /etc/ssh/ssh_host_dsa_key Your public key has been saved in /etc/ssh/ssh_host_dsa_key.pub The key fingerprint is: SHA256:TbbywS2yYdG+zXzJ9h5tcRazpmR4TUsslaTBWx2eP34 root@botbiz-production-k8s-autoscaled-cpx31-21d252b9b56c96a9 The key's randomart image is: +---[DSA 1024]----+ | ...++| | . .oo| | . + ooO | | B o..= | | S B..+ B+| | . B+.+o=| | . o +.=oE| | o oo| | .o| +----[SHA256]-----+ Generating public/private ecdsa key pair. Your identification has been saved in /etc/ssh/ssh_host_ecdsa_key Your public key has been saved in /etc/ssh/ssh_host_ecdsa_key.pub The key fingerprint is: SHA256:5IThG4cF8lX1ZxNt8c+MwcTdf276z+I0iwPLLqNomnU root@botbiz-production-k8s-autoscaled-cpx31-21d252b9b56c96a9 The key's randomart image is: +---[ECDSA 256]---+ | . o.o......++| | + +. B| | + +o=| | B B=| | . S ..=| | . o| | . E . o oo | | o.o o o .oo+ | | oo. .. +. .ooo=| +----[SHA256]-----+ Generating public/private ed25519 key pair. Your identification has been saved in /etc/ssh/ssh_host_ed25519_key Your public key has been saved in /etc/ssh/ssh_host_ed25519_key.pub The key fingerprint is: SHA256:qA2W9qZ4VStAwiZoKPG76ZBLqDu1i1s0GjcqwlLL+6U root@botbiz-production-k8s-autoscaled-cpx31-21d252b9b56c96a9 The key's randomart image is: +--[ED25519 256]--+ |+o | |=o+ . | |oo.o | | ... .. | |. B =...S. | |oX.O =o . | |X+...=. | |B.o.= | |*+=+E | +----[SHA256]-----+ Cloud-init v. 23.1-1.1 running 'modules:config' at Thu, 16 Mar 2023 20:40:59 +0000. Up 11.62 seconds. Cloud-init v. 23.1-1.1 running 'modules:final' at Thu, 16 Mar 2023 20:40:59 +0000. Up 12.01 seconds. ++ id -u

'[' 0 -eq 0 ']'
/usr/local/bin/k3s-killall.sh
for service in /etc/systemd/system/k3s*.service
'[' -s /etc/systemd/system/k3s.service ']' ++ basename /etc/systemd/system/k3s.service
systemctl stop k3s.service
for service in /etc/init.d/k3s*
'[' -x '/etc/init.d/k3s*' ']'
killtree
kill -9
do_unmount_and_remove /run/k3s
set +x
do_unmount_and_remove /var/lib/rancher/k3s
set +x
do_unmount_and_remove /var/lib/kubelet/pods
set +x
do_unmount_and_remove /var/lib/kubelet/plugins
set +x
do_unmount_and_remove /run/netns/cni-
set +x
grep cni-
xargs -r -t -n 1 ip netns delete
ip netns show
ip link show
grep 'master cni0'
read ignore iface ignore
ip link delete cni0 Cannot find device "cni0"
ip link delete flannel.1 Cannot find device "flannel.1"
ip link delete flannel-v6.1 Cannot find device "flannel-v6.1"
ip link delete kube-ipvs0 Cannot find device "kube-ipvs0"
ip link delete flannel-wg Cannot find device "flannel-wg"
ip link delete flannel-wg-v6 Cannot find device "flannel-wg-v6"
rm -rf /var/lib/cni/
iptables-save
grep -v KUBE-
grep -iv flannel
iptables-restore
grep -v CNI-
ip6tables-save
grep -v KUBE-
grep -iv flannel
grep -v CNI-
ip6tables-restore
command -v systemctl /usr/bin/systemctl
systemctl disable k3s Removed "/etc/systemd/system/multi-user.target.wants/k3s.service".
systemctl reset-failed k3s Failed to reset failed state of unit k3s.service: Unit k3s.service not loaded.
systemctl daemon-reload
command -v rc-update
rm -f /etc/systemd/system/k3s.service
rm -f /etc/systemd/system/k3s.service.env
trap remove_uninstall EXIT
for cmd in kubectl crictl ctr
'[' -L /usr/local/bin/kubectl ']'
rm -f /usr/local/bin/kubectl
for cmd in kubectl crictl ctr
'[' -L /usr/local/bin/crictl ']'
rm -f /usr/local/bin/crictl
for cmd in kubectl crictl ctr
'[' -L /usr/local/bin/ctr ']'
rm -f /usr/local/bin/ctr
rm -rf /etc/rancher/k3s
rm -rf /run/k3s
rm -rf /run/flannel
rm -rf /var/lib/rancher/k3s
rm -rf /var/lib/kubelet
rm -f /usr/local/bin/k3s
rm -f /usr/local/bin/k3s-killall.sh
type yum
type zypper
uninstall_cmd='zypper remove -y k3s-selinux'
'[' false '!=' true ']'
'[' -x /usr/sbin/transactional-update ']'
uninstall_cmd='transactional-update --no-selfupdate -d run zypper remove -y k3s-selinux'
transactional-update --no-selfupdate -d run zypper remove -y k3s-selinux transactional-update 4.1.3 started Options: --no-selfupdate -d run zypper remove -y k3s-selinux Separate /var detected. 2023-03-16 20:41:01 tukit 4.1.3 started 2023-03-16 20:41:01 Options: --discard -c3 open 2023-03-16 20:41:01 Using snapshot 3 as base for new snapshot 4. 2023-03-16 20:41:01 Syncing /etc of previous snapshot 2 as base into new snapshot "/.snapshots/4/snapshot" 2023-03-16 20:41:01 SELinux is enabled. /var/lib/kubelet/pods not reset as customized by admin to unconfined_u:object_r:container_file_t:s0 Relabeled /var/lib/rancher/k3s from unconfined_u:object_r:var_lib_t:s0 to unconfined_u:object_r:container_var_lib_t:s0 Relabeled /var/lib/rancher/k3s/agent from unconfined_u:object_r:var_lib_t:s0 to unconfined_u:object_r:container_var_lib_t:s0 Relabeled /var/lib/rancher/k3s/agent/containerd from unconfined_u:object_r:var_lib_t:s0 to unconfined_u:object_r:container_var_lib_t:s0 Relabeled /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs from unconfined_u:object_r:var_lib_t:s0 to unconfined_u:object_r:container_var_lib_t:s0 /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots not reset as customized by admin to unconfined_u:object_r:container_file_t:s0 Relabeled /var/lib/rancher/k3s/data from unconfined_u:object_r:var_lib_t:s0 to unconfined_u:object_r:k3s_data_t:s0 ID: 4 2023-03-16 20:41:02 Transaction completed. 2023-03-16 20:41:02 tukit 4.1.3 started 2023-03-16 20:41:02 Options: --discard call 4 zypper remove -y k3s-selinux /var/lib/kubelet/pods not reset as customized by admin to unconfined_u:object_r:container_file_t:s0 /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots not reset as customized by admin to unconfined_u:object_r:container_file_t:s0 2023-03-16 20:41:03 Executing zypper remove -y k3s-selinux: Reading installed packages... Resolving package dependencies...

The following package is going to be REMOVED: k3s-selinux

1 package to remove. After the operation, 95.0 KiB will be freed. Continue? [y/n/v/...? shows all options] (y): y (1/1) Removing k3s-selinux-0.0~bd1f1455dirty-0.sle.noarch [...done]

2023-03-16 20:41:10 Application returned with exit status 0. 2023-03-16 20:41:11 Transaction completed. 2023-03-16 20:41:11 tukit 4.1.3 started 2023-03-16 20:41:11 Options: --discard close 4 /var/lib/kubelet/pods not reset as customized by admin to unconfined_u:object_r:container_file_t:s0 /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots not reset as customized by admin to unconfined_u:object_r:container_file_t:s0 2023-03-16 20:41:11 New default snapshot is #4 (/.snapshots/4/snapshot). 2023-03-16 20:41:11 Transaction completed.

Please reboot your machine to activate the changes and avoid data loss. New default snapshot is #4 (/.snapshots/4/snapshot). transactional-update finished

rm -f /etc/zypp/repos.d/rancher-k3s-common.repo
remove_uninstall
rm -f /usr/local/bin/k3s-uninstall.sh Resize device id 1 (/dev/sda4) from 133.60GiB to max
sleep 11 ++ ip link show ++ awk '/^3:/{print $2}' ++ sed s/://g
INTERFACE=enp7s0 ++ cat /sys/class/net/enp7s0/address
MAC=86:00:00:3d:27:c0
cat Created symlink /etc/systemd/system/network-online.target.wants/install-k3s-agent.service → /etc/systemd/system/install-k3s-agent.service. Cloud-init v. 23.1-1.1 finished at Thu, 16 Mar 2023 20:41:23 +0000. Datasource DataSourceHetzner. Up 35.86 seconds Cloud-init v. 23.1-1.1 running 'init-local' at Thu, 16 Mar 2023 20:41:40 +0000. Up 5.06 seconds. Cloud-init v. 23.1-1.1 running 'init' at Thu, 16 Mar 2023 20:41:41 +0000. Up 5.84 seconds. ci-info: ++++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++++ ci-info: +--------+------+------------------------------+-----------------+--------+-------------------+ ci-info: | Device | Up | Address | Mask | Scope | Hw-Address | ci-info: +--------+------+------------------------------+-----------------+--------+-------------------+ ci-info: | eth0 | True | 128.140.3.5 | 255.255.255.255 | global | 96:00:02:02:02:03 | ci-info: | eth0 | True | fe80::5493:ec5b:9b3e:4337/64 | . | link | 96:00:02:02:02:03 | ci-info: | eth1 | True | 10.0.0.2 | 255.255.255.255 | global | 86:00:00:3d:27:c0 | ci-info: | eth1 | True | fe80::c3a:f0de:792d:1f79/64 | . | link | 86:00:00:3d:27:c0 | ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . | ci-info: | lo | True | ::1/128 | . | host | . | ci-info: +--------+------+------------------------------+-----------------+--------+-------------------+ ci-info: +++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++ ci-info: +-------+-------------+------------+-----------------+-----------+-------+ ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags | ci-info: +-------+-------------+------------+-----------------+-----------+-------+ ci-info: | 0 | 0.0.0.0 | 172.31.1.1 | 0.0.0.0 | eth0 | UG | ci-info: | 1 | 0.0.0.0 | 10.0.0.1 | 0.0.0.0 | eth1 | UG | ci-info: | 2 | 10.0.0.0 | 10.0.0.1 | 255.0.0.0 | eth1 | UG | ci-info: | 3 | 10.0.0.1 | 0.0.0.0 | 255.255.255.255 | eth1 | UH | ci-info: | 4 | 172.31.1.1 | 0.0.0.0 | 255.255.255.255 | eth0 | UH | ci-info: +-------+-------------+------------+-----------------+-----------+-------+ ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++ ci-info: +-------+-------------+---------+-----------+-------+ ci-info: | Route | Destination | Gateway | Interface | Flags | ci-info: +-------+-------------+---------+-----------+-------+ ci-info: | 1 | fe80::/64 | :: | eth0 | U | ci-info: | 2 | fe80::/64 | :: | eth1 | U | ci-info: | 4 | multicast | :: | eth0 | U | ci-info: | 5 | multicast | :: | eth1 | U | ci-info: +-------+-------------+---------+-----------+-------+ Cloud-init v. 23.1-1.1 running 'modules:config' at Thu, 16 Mar 2023 20:41:42 +0000. Up 6.40 seconds. Cloud-init v. 23.1-1.1 running 'modules:final' at Thu, 16 Mar 2023 20:41:42 +0000. Up 6.78 seconds. Cloud-init v. 23.1-1.1 finished at Thu, 16 Mar 2023 20:41:42 +0000. Datasource DataSourceHetzner. Up 6.85 seconds

bulnv commented 1 year ago

my guess is ${base64encode(k3s_config)} is not working inside of template file

mysticaltech commented 1 year ago

It could be, but it's super weird because it works for in all my tests and for others too! 🤯

Please debug some more, at this point I am dry! If I can think of something I will let you know.

bulnv commented 1 year ago

It could be, but it's super weird because it works for in all my tests and for others too! exploding_head

Please debug some more, at this point I am dry! If I can think of something I will let you know.

yeah, changed tmpl and redeployed, base64 looks valid, and this is strange. Looks like I missing something

mysticaltech commented 1 year ago

@bulnv Try rebooting the node manually, see if it comes online!

mysticaltech commented 1 year ago

It should reboot after cloud-init, maybe it your case maybe it's not rebooting, so maybe use the reboot command at the end of user-config, instead of the power state statement.

mysticaltech commented 1 year ago

I am pretty convinced, if you reboot, it will join the cluster. I'm guessing the power state did not work for that old version of MicroOS. So we can use the reboot statement instead.

mysticaltech commented 1 year ago

Also run journalctl -u install-k3s-agent please.

bulnv commented 1 year ago

Also run journalctl -u install-k3s-agent please.

reboot hasn't helped, but its working, it's rebooting like 3 times, as I can notice from the console
I think I've found the issue, but I am not that deeply aware of all the stuff in this huge module
tomorrow will send you a PR

as usual thank a bunch for help

cpx31-43ea3b9393f93f33   Ready                      <none>                      51s    v1.25.7+k3s1

mysticaltech commented 1 year ago

Good to hear @bulnv, PR is most welcome, am curious about this, it could be an edge case!

bulnv commented 1 year ago

@mysticaltech please see my changes here https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/pull/660 yes, i agree this could be some edge case. But as far as I understood from the code the flow is following:

we take a snapshot from the node with installed k3s
copy configs
run Uninstall k3s if it persist on the node
install k3s with previously copied configs.

IMO copy configs goes bit early, and following uninstall-k3s script removes them (as I checked in the uninstall script code) https://github.com/ruifigueiredo/k3s/blob/master/uninstall.sh#L21 please see line 21. So I made a change to preserve configs from deletion during k3s-uninstall. Let me know WDYT?

mysticaltech commented 1 year ago

@bulnv Yep exactly, that's what's happening. But in my testing it worked, so the timing must change depending on the machine, hence the bug went un-noticed until it showed up for you. Good catch!!

mysticaltech commented 1 year ago

@bulnv Your fix was just released in 1.10.6 🙏🚀

kube-hetzner / terraform-hcloud-kube-hetzner