vitobotta / hetzner-k3s

The easiest and fastest way to create and manage Kubernetes clusters in Hetzner Cloud using the lightweight distribution k3s by Rancher.
MIT License

hcloud-csi-node pod crashing on two master nodes after upgrading to 2.0.5 #424

Closed · domvie closed this issue 1 month ago

domvie commented 1 month ago

Hi,

first of all thanks for this great tool.

I am running into some issues after upgrading from 1.5.1 to 2.0.5 on one of my clusters. The test cluster upgrade went just fine, so I'm not sure what's going on here.

Two of my three master nodes (master1 and master2) fail to start the csi-node-driver-registrar container of the hcloud-csi-node pod. It works just fine on master3, though.

These are the container logs of csi-node-driver-registrar:

2024-08-27T12:51:31.411175882Z I0827 12:51:31.411108       1 main.go:151] "Running node-driver-registrar" mode=""
2024-08-27T12:51:41.412471736Z I0827 12:51:41.412110       1 connection.go:253] "Still connecting" address="unix:///run/csi/socket"
2024-08-27T12:52:01.412954058Z E0827 12:52:01.412281       1 main.go:176] "Error connecting to CSI driver" err="context deadline exceeded"

and these are the logs of hcloud-csi-driver container:

2024-08-27T13:18:01.490972843Z level=warn ts=2024-08-27T13:18:01.490807976Z msg="unable to connect to metadata service, are you sure this is running on a Hetzner Cloud server?"
2024-08-27T13:18:01.491464556Z level=error ts=2024-08-27T13:18:01.491356627Z msg="failed to fetch server ID from metadata service" err="Get \"http://169.254.169.254/hetzner/v1/metadata/instance-id\": dial tcp 169.254.169.254:80: connect: connection refused"

hcloud-csi-controller was also crashing after the update until I manually deleted the pod; after that it worked again.
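
For reference, this is roughly how I've been inspecting and bouncing the pods (the pod names below are placeholders, and I'm assuming the kube-system namespace, which is where hetzner-k3s installs the CSI driver):

# find the CSI pods and the nodes they run on
kubectl -n kube-system get pods -o wide | grep hcloud-csi

# container logs from one of the failing node pods
kubectl -n kube-system logs hcloud-csi-node-xxxxx -c csi-node-driver-registrar
kubectl -n kube-system logs hcloud-csi-node-xxxxx -c hcloud-csi-driver

# delete the controller pod; the deployment recreates it
kubectl -n kube-system delete pod hcloud-csi-controller-xxxxx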

k3s version: 1.30.4+k3s1
node instance types: cx21 (deprecated by now)

Any ideas what might be going on here? Restarting the nodes did not help.

Should I maybe try to change the master nodes to cx22? If so, is there a tutorial somewhere that describes how one can upgrade the master nodes?

Edit: Some more findings. The two problematic nodes, for whatever reason, cannot reach Hetzner's metadata endpoint at http://169.254.169.254/hetzner/v1/metadata. A ping to 169.254.169.254 works from all nodes, but this endpoint returns "connection refused" on the problematic ones.
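
The check is easy to reproduce with a plain curl from each node (the URL is straight from the error message above):

# "connection refused" on master1/master2, works on master3
curl -sv http://169.254.169.254/hetzner/v1/metadata/instance-id

# ICMP is fine on all nodes
ping -c 3 169.254.169.254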

vitobotta commented 1 month ago

Hi, can you please share the 1.5.1 version of the configuration and the updated one? Remember to remove the token. :)

domvie commented 1 month ago

Hi, sure! I went from this:

# 1.5.1
# hetzner_token: # <from HCLOUD_TOKEN env var>
cluster_name: cluster-1
kubeconfig_path: "./kubeconfig"
k3s_version: v1.30.0+k3s1
public_ssh_key_path: "~/.ssh/id_rsa.pub"
private_ssh_key_path: "~/.ssh/id_rsa"
use_ssh_agent: false
ssh_allowed_networks:
  - 0.0.0.0/0
api_allowed_networks:
  - 0.0.0.0/0
private_network_subnet: 10.0.0.0/16
disable_flannel: false
schedule_workloads_on_masters: false
datastore:
  mode: etcd # etcd (default) or external
  # external_datastore_endpoint: postgres://....
masters_pool:
  instance_type: cx21
  instance_count: 3
  location: nbg1 # nuremberg germany
worker_node_pools:
  - name: small-static-arm64
    instance_type: cax21
  - name: mid-static-arm64
    instance_type: cax31
    instance_count: 2
post_create_commands:
  - apt update
  - apt upgrade -y
  - apt autoremove -y

to this:

# 2.0.5
# hetzner_token: # <from HCLOUD_TOKEN env var>
cluster_name: cluster-1
kubeconfig_path: "./kubeconfig"
k3s_version: v1.30.0+k3s1
public_ssh_key_path: "~/.ssh/id_rsa.pub"
private_ssh_key_path: "~/.ssh/id_rsa"

networking:
  use_ssh_agent: false
  allowed_networks:
    ssh:
      - 0.0.0.0/0
    api:
      - 0.0.0.0/0
  public_network:
    ipv4: true
    ipv6: true
  private_network:
    enabled: true
    subnet: 10.0.0.0/16
    existing_network_name: ""
  cni:
    enabled: true
    encryption: false
    mode: flannel

datastore:
  mode: etcd # etcd (default) or external
  # external_datastore_endpoint: postgres://....

schedule_workloads_on_masters: false

include_instance_type_in_instance_name: true

masters_pool:
  instance_type: cx21
  instance_count: 3
  location: nbg1 # nuremberg germany

worker_node_pools:
  - name: small-static-arm64
    instance_type: cax21
  - name: mid-static-arm64
    instance_type: cax31
    instance_count: 2

embedded_registry_mirror:
  enabled: true

post_create_commands:
  - apt update
  - apt upgrade -y
  - apt autoremove -y

I've also added /etc/k8s-resolv.conf with nameserver 8.8.8.8 on each node. Afterwards, I deleted the Kubernetes API load balancer from Hetzner.
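
For completeness, that file is just a one-liner:

# /etc/k8s-resolv.conf
nameserver 8.8.8.8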

Everything else seems to have worked just fine, but for whatever reason connections from these two nodes are refused by Hetzner's metadata endpoint.

Finally, I might as well dump the create.log, which didn't seem to report any issues:

[Configuration] Validating configuration...
[Configuration] ...configuration seems valid.
[Private Network] Private network already exists, skipping create
[SSH key] SSH key already exists, skipping create
[Placement groups] Creating placement group cluster-1-mid-static-arm64-7...
[Placement groups] Creating placement group cluster-1-small-static-arm64-7...
[Placement groups] ...placement group cluster-1-small-static-arm64-7 created
[Placement groups] ...placement group cluster-1-mid-static-arm64-7 created
[Instance cluster-1-cx21-master1] Instance cluster-1-cx21-master1 already exists, skipping create
[Instance cluster-1-cx21-master2] Instance cluster-1-cx21-master2 already exists, skipping create
[Instance cluster-1-cx21-master3] Instance cluster-1-cx21-master3 already exists, skipping create
[Instance cluster-1-cx21-master1] Instance status: running
[Instance cluster-1-cx21-master2] Instance status: running
[Instance cluster-1-cx21-master3] Instance status: running
[Instance cluster-1-cx21-master1] Waiting for successful ssh connectivity with instance cluster-1-cx21-master1...
[Instance cluster-1-cx21-master2] Waiting for successful ssh connectivity with instance cluster-1-cx21-master2...
[Instance cluster-1-cx21-master3] Waiting for successful ssh connectivity with instance cluster-1-cx21-master3...
[Instance cluster-1-cx21-master2] ...instance cluster-1-cx21-master2 is now up.
[Instance cluster-1-cx21-master3] ...instance cluster-1-cx21-master3 is now up.
[Instance cluster-1-cx21-master1] ...instance cluster-1-cx21-master1 is now up.
[Firewall] Updating firewall...
[Firewall] ...firewall updated
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] Instance cluster-1-cax21-pool-small-static-arm64-worker1 already exists, skipping create
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] Instance cluster-1-cax31-pool-mid-static-arm64-worker2 already exists, skipping create
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] Instance cluster-1-cax31-pool-mid-static-arm64-worker1 already exists, skipping create
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] Instance status: running
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] Instance status: running
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] Instance status: running
[Instance cluster-1-cx21-master1] Cloud init finished: 34.54 - Fri, 31 May 2024 13:28:27 +0000 - v. 23.4.4-0ubuntu0~22.04.1
[Instance cluster-1-cx21-master1] [INFO]  Using v1.30.0+k3s1 as release
[Instance cluster-1-cx21-master1] [INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.30.0+k3s1/sha256sum-amd64.txt
[Instance cluster-1-cx21-master1] [INFO]  Skipping binary downloaded, installed k3s matches hash
[Instance cluster-1-cx21-master1] [INFO]  Skipping installation of SELinux RPM
[Instance cluster-1-cx21-master1] [INFO]  Skipping /usr/local/bin/kubectl symlink to k3s, already exists
[Instance cluster-1-cx21-master1] [INFO]  Skipping /usr/local/bin/crictl symlink to k3s, already exists
[Instance cluster-1-cx21-master1] [INFO]  Skipping /usr/local/bin/ctr symlink to k3s, already exists
[Instance cluster-1-cx21-master1] [INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[Instance cluster-1-cx21-master1] [INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[Instance cluster-1-cx21-master1] [INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[Instance cluster-1-cx21-master1] [INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[Instance cluster-1-cx21-master1] [INFO]  systemd: Enabling k3s unit
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] Waiting for successful ssh connectivity with instance cluster-1-cax31-pool-mid-static-arm64-worker1...
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] Waiting for successful ssh connectivity with instance cluster-1-cax21-pool-small-static-arm64-worker1...
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] Waiting for successful ssh connectivity with instance cluster-1-cax31-pool-mid-static-arm64-worker2...
[Instance cluster-1-cx21-master1] [INFO]  systemd: Starting k3s
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] ...instance cluster-1-cax31-pool-mid-static-arm64-worker1 is now up.
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] ...instance cluster-1-cax21-pool-small-static-arm64-worker1 is now up.
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] ...instance cluster-1-cax31-pool-mid-static-arm64-worker2 is now up.
[Instance cluster-1-cx21-master1] Waiting for the control plane to be ready...
[Control plane] Generating the kubeconfig file to /Users/user/hetzner_cloud/kubeconfig...
Switched to context "cluster-1-cx21-master1".
[Control plane] ...kubeconfig file generated as /Users/user/hetzner_cloud/kubeconfig.
[Instance cluster-1-cx21-master1] ...k3s deployed
[Instance cluster-1-cx21-master3] Cloud init finished: 27.15 - Fri, 31 May 2024 13:28:22 +0000 - v. 23.4.4-0ubuntu0~22.04.1
[Instance cluster-1-cx21-master2] Cloud init finished: 29.20 - Fri, 31 May 2024 13:28:22 +0000 - v. 23.4.4-0ubuntu0~22.04.1
[Instance cluster-1-cx21-master2] [INFO]  Using v1.30.0+k3s1 as release
[Instance cluster-1-cx21-master2] [INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.30.0+k3s1/sha256sum-amd64.txt
[Instance cluster-1-cx21-master3] [INFO]  Using v1.30.0+k3s1 as release
[Instance cluster-1-cx21-master3] [INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.30.0+k3s1/sha256sum-amd64.txt
[Instance cluster-1-cx21-master2] [INFO]  Skipping binary downloaded, installed k3s matches hash
[Instance cluster-1-cx21-master2] [INFO]  Skipping installation of SELinux RPM
[Instance cluster-1-cx21-master2] [INFO]  Skipping /usr/local/bin/kubectl symlink to k3s, already exists
[Instance cluster-1-cx21-master2] [INFO]  Skipping /usr/local/bin/crictl symlink to k3s, already exists
[Instance cluster-1-cx21-master2] [INFO]  Skipping /usr/local/bin/ctr symlink to k3s, already exists
[Instance cluster-1-cx21-master2] [INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[Instance cluster-1-cx21-master2] [INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[Instance cluster-1-cx21-master2] [INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[Instance cluster-1-cx21-master2] [INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[Instance cluster-1-cx21-master2] [INFO]  systemd: Enabling k3s unit
[Instance cluster-1-cx21-master3] [INFO]  Skipping binary downloaded, installed k3s matches hash
[Instance cluster-1-cx21-master3] [INFO]  Skipping installation of SELinux RPM
[Instance cluster-1-cx21-master3] [INFO]  Skipping /usr/local/bin/kubectl symlink to k3s, already exists
[Instance cluster-1-cx21-master3] [INFO]  Skipping /usr/local/bin/crictl symlink to k3s, already exists
[Instance cluster-1-cx21-master3] [INFO]  Skipping /usr/local/bin/ctr symlink to k3s, already exists
[Instance cluster-1-cx21-master3] [INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[Instance cluster-1-cx21-master3] [INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[Instance cluster-1-cx21-master3] [INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[Instance cluster-1-cx21-master3] [INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[Instance cluster-1-cx21-master3] [INFO]  systemd: Enabling k3s unit
[Instance cluster-1-cx21-master2] [INFO]  systemd: Starting k3s
[Instance cluster-1-cx21-master3] [INFO]  systemd: Starting k3s
[Instance cluster-1-cx21-master2] ...k3s deployed
[Instance cluster-1-cx21-master3] ...k3s deployed
[Control plane] Generating the kubeconfig file to /Users/user/hetzner_cloud/kubeconfig...
Switched to context "cluster-1-cx21-master1".
[Control plane] ...kubeconfig file generated as /Users/user/hetzner_cloud/kubeconfig.
[Hetzner Cloud Secret] Creating secret for Hetzner Cloud token...
[Hetzner Cloud Secret] secret/hcloud configured
[Hetzner Cloud Secret] ...secret created
[Hetzner Cloud Controller] Installing Hetzner Cloud Controller Manager...
[Hetzner Cloud Controller] serviceaccount/hcloud-cloud-controller-manager unchanged
[Hetzner Cloud Controller] clusterrolebinding.rbac.authorization.k8s.io/system:hcloud-cloud-controller-manager unchanged
[Hetzner Cloud Controller] deployment.apps/hcloud-cloud-controller-manager configured
[Hetzner Cloud Controller] Hetzner Cloud Controller Manager installed
[Hetzner CSI Driver] Installing Hetzner CSI Driver...
[Hetzner CSI Driver] serviceaccount/hcloud-csi-controller unchanged
[Hetzner CSI Driver] storageclass.storage.k8s.io/hcloud-volumes unchanged
[Hetzner CSI Driver] clusterrole.rbac.authorization.k8s.io/hcloud-csi-controller unchanged
[Hetzner CSI Driver] clusterrolebinding.rbac.authorization.k8s.io/hcloud-csi-controller unchanged
[Hetzner CSI Driver] service/hcloud-csi-controller-metrics unchanged
[Hetzner CSI Driver] service/hcloud-csi-node-metrics unchanged
[Hetzner CSI Driver] daemonset.apps/hcloud-csi-node configured
[Hetzner CSI Driver] deployment.apps/hcloud-csi-controller configured
[Hetzner CSI Driver] csidriver.storage.k8s.io/csi.hetzner.cloud unchanged
[Hetzner CSI Driver] Hetzner CSI Driver installed
[System Upgrade Controller] Installing System Upgrade Controller...
[System Upgrade Controller] namespace/system-upgrade configured
[System Upgrade Controller] customresourcedefinition.apiextensions.k8s.io/plans.upgrade.cattle.io created
[System Upgrade Controller] clusterrole.rbac.authorization.k8s.io/system-upgrade-controller created
[System Upgrade Controller] role.rbac.authorization.k8s.io/system-upgrade-controller created
[System Upgrade Controller] clusterrole.rbac.authorization.k8s.io/system-upgrade-controller-drainer created
[System Upgrade Controller] clusterrolebinding.rbac.authorization.k8s.io/system-upgrade-drainer created
[System Upgrade Controller] clusterrolebinding.rbac.authorization.k8s.io/system-upgrade created
[System Upgrade Controller] rolebinding.rbac.authorization.k8s.io/system-upgrade created
[System Upgrade Controller] namespace/system-upgrade configured
[System Upgrade Controller] serviceaccount/system-upgrade unchanged
[System Upgrade Controller] configmap/default-controller-env unchanged
[System Upgrade Controller] deployment.apps/system-upgrade-controller configured
[System Upgrade Controller] ...System Upgrade Controller installed
[Cluster Autoscaler] Installing Cluster Autoscaler...
[Cluster Autoscaler] serviceaccount/cluster-autoscaler created
[Cluster Autoscaler] clusterrole.rbac.authorization.k8s.io/cluster-autoscaler created
[Cluster Autoscaler] role.rbac.authorization.k8s.io/cluster-autoscaler created
[Cluster Autoscaler] clusterrolebinding.rbac.authorization.k8s.io/cluster-autoscaler created
[Cluster Autoscaler] rolebinding.rbac.authorization.k8s.io/cluster-autoscaler created
[Cluster Autoscaler] deployment.apps/cluster-autoscaler created
[Cluster Autoscaler] ...Cluster Autoscaler installed
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] Cloud init finished: 30.67 - Fri, 31 May 2024 13:28:32 +0000 - v. 23.4.4-0ubuntu0~22.04.1
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] Cloud init finished: 28.69 - Fri, 31 May 2024 13:28:31 +0000 - v. 23.4.4-0ubuntu0~22.04.1
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] Cloud init finished: 25.84 - Fri, 31 May 2024 13:28:27 +0000 - v. 24.1.3-0ubuntu1~22.04.1
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] [INFO]  Using v1.30.0+k3s1 as release
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] [INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.30.0+k3s1/sha256sum-arm64.txt
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] [INFO]  Using v1.30.0+k3s1 as release
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] [INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.30.0+k3s1/sha256sum-arm64.txt
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] [INFO]  Using v1.30.0+k3s1 as release
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] [INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.30.0+k3s1/sha256sum-arm64.txt
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] [INFO]  Skipping binary downloaded, installed k3s matches hash
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] [INFO]  Skipping installation of SELinux RPM
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] [INFO]  Skipping /usr/local/bin/kubectl symlink to k3s, already exists
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] [INFO]  Skipping /usr/local/bin/crictl symlink to k3s, already exists
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] [INFO]  Skipping /usr/local/bin/ctr symlink to k3s, already exists
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] [INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] [INFO]  Skipping binary downloaded, installed k3s matches hash
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] [INFO]  Creating uninstall script /usr/local/bin/k3s-agent-uninstall.sh
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] [INFO]  Skipping installation of SELinux RPM
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] [INFO]  Skipping /usr/local/bin/kubectl symlink to k3s, already exists
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] [INFO]  Skipping /usr/local/bin/crictl symlink to k3s, already exists
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] [INFO]  Skipping /usr/local/bin/ctr symlink to k3s, already exists
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] [INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] [INFO]  Creating uninstall script /usr/local/bin/k3s-agent-uninstall.sh
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] [INFO]  Skipping binary downloaded, installed k3s matches hash
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] [INFO]  Skipping installation of SELinux RPM
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] [INFO]  Skipping /usr/local/bin/kubectl symlink to k3s, already exists
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] [INFO]  Skipping /usr/local/bin/crictl symlink to k3s, already exists
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] [INFO]  Skipping /usr/local/bin/ctr symlink to k3s, already exists
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] [INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] [INFO]  Creating uninstall script /usr/local/bin/k3s-agent-uninstall.sh
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] [INFO]  env: Creating environment file /etc/systemd/system/k3s-agent.service.env
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] [INFO]  systemd: Creating service file /etc/systemd/system/k3s-agent.service
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] [INFO]  systemd: Enabling k3s-agent unit
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] [INFO]  env: Creating environment file /etc/systemd/system/k3s-agent.service.env
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] [INFO]  systemd: Creating service file /etc/systemd/system/k3s-agent.service
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] [INFO]  systemd: Enabling k3s-agent unit
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] [INFO]  env: Creating environment file /etc/systemd/system/k3s-agent.service.env
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] [INFO]  systemd: Creating service file /etc/systemd/system/k3s-agent.service
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] [INFO]  systemd: Enabling k3s-agent unit
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] [INFO]  systemd: Starting k3s-agent
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] [INFO]  systemd: Starting k3s-agent
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] [INFO]  systemd: Starting k3s-agent
[Instance cluster-1-cax31-pool-mid-static-arm64-worker1] ...k3s has been deployed to worker cluster-1-cax31-pool-mid-static-arm64-worker1.
[Instance cluster-1-cax31-pool-mid-static-arm64-worker2] ...k3s has been deployed to worker cluster-1-cax31-pool-mid-static-arm64-worker2.
[Instance cluster-1-cax21-pool-small-static-arm64-worker1] ...k3s has been deployed to worker cluster-1-cax21-pool-small-static-arm64-worker1.
[Placement groups] Deleting unused placement group cluster-1-mid-static-arm64-7...
[Placement groups] ...placement group cluster-1-mid-static-arm64-7 deleted
[Placement groups] Deleting unused placement group cluster-1-small-static-arm64-7...
[Placement groups] ...placement group cluster-1-small-static-arm64-7 deleted

Do you have recommendations on how to change the master node instance types? Would it work to just change them from cx21 to cx22? At the very least, I would like to test whether a new instance/node has the same issues.

Thank you very much for your help!

vitobotta commented 1 month ago

Hey, this is a pretty weird issue you're dealing with. Your updated config file looks fine, and the instance type itself shouldn't be causing problems, except for the fact that you can't create new instances of that type anymore. That can be a real headache for clusters created with version 1.5.1, since the instance type is baked into the instance name: you can scale an existing instance to a different type without issues, but you're stuck when it comes to creating new instances of the old type, or swapping the instance type in the config. That's why in the new version of hetzner-k3s I stopped including the instance type in the names. Honestly, when I first set up the naming scheme, I didn't think Hetzner would retire some of the SKUs like this.

Now, about your specific problem - since you've already got your 3 master nodes and probably won't be adding more, you could try scaling the troublesome masters to switch up their instance type. Just remember not to change the type in the config file, or you'll end up with a mismatch between that and your existing master names. Not sure if this will fix your issue, but it's worth a shot.
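
If you prefer the CLI over the UI, the rescale would look roughly like this with the hcloud tool (master1 as an example; cx22 is just one possible target type, and --keep-disk keeps the root disk at its current size so you can scale back down later):

hcloud server poweroff cluster-1-cx21-master1
hcloud server change-type --keep-disk cluster-1-cx21-master1 cx22
hcloud server poweron cluster-1-cx21-master1

The server keeps its name either way, so the cx21 in the name stays; that's exactly why you shouldn't touch the type in the config file.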

Another simple trick you could try is just rebooting the problematic masters. Sometimes it's just a temporary network hiccup or something that a quick restart can fix. I've gotta say, I haven't run into this metadata API problem before, and no one else has reported it either, so I'm a bit stumped on what else to suggest. But starting with these ideas seems like a good plan.

domvie commented 1 month ago

I see. Thanks a lot for the help; I will try out some of the suggestions and report back if any of them prove successful.

The problem might even be unrelated to hetzner-k3s (although I have no idea how or why, since I never really touched the masters after creation). Perhaps I'll cross-post this to the Hetzner CSI repository or contact Hetzner support.

As far as this issue goes I would say it can be closed.

vitobotta commented 1 month ago

OK, let me know how it goes :)

domvie commented 1 month ago

I'm happy to report that a manual shutdown & restart of the two master nodes via the Hetzner Cloud UI (a reboot was not enough) seems to have solved the problem. Thanks again for your help.
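
For anyone finding this later: what fixed it was a full power cycle, not a reboot. With the hcloud CLI that should be the equivalent of:

# graceful ACPI shutdown, then a cold start
hcloud server shutdown cluster-1-cx21-master1
hcloud server poweron cluster-1-cx21-master1

A plain reboot (hcloud server reboot) was not enough in my case.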

vitobotta commented 1 month ago

Nice! Thanks for the update. Glad it's sorted.