vitobotta / hetzner-k3s

The easiest and fastest way to create and manage Kubernetes clusters in Hetzner Cloud using the lightweight distribution k3s by Rancher.
MIT License

Upgrading from 1.1.5 to 2.0.8 - existing cluster #463

Closed by boris-savic 2 days ago

boris-savic commented 2 days ago

Hi,

First, thanks for the amazing work. I set up a small test cluster with hetzner-k3s 1.1.5 and tried to upgrade it with the 2.0.8 release.

I followed the steps in the 2.0.0 release notes, where I did:

Current behavior:

E1009 14:39:53.569566   60696 memcache.go:265] couldn't get current server API group list: Get "https://X.X.X.X:6443/api?timeout=32s": dial tcp X.X.X.X:6443: connect: connection refused

So right now I'm stuck in a broken state and I can't figure out what I did wrong :/

When I run the hetzner-k3s create command the process times out and the full log output is:

[Configuration] Validating configuration...
[Configuration] ...configuration seems valid.
[Private Network] Private network already exists, skipping create
[SSH key] SSH key already exists, skipping create
[Placement groups] Deleting unused placement group test-cluster-database-small-static-3...
[Placement groups] ...placement group test-cluster-database-small-static-3 deleted
[Placement groups] Creating placement group test-cluster-database-small-static-3...
[Placement groups] ...placement group test-cluster-database-small-static-3 created
[Instance test-cluster-cpx21-master1] Instance test-cluster-cpx21-master1 already exists, skipping create
[Instance test-cluster-cpx21-master1] Instance status: running
[Instance test-cluster-cpx21-master1] Waiting for successful ssh connectivity with instance test-cluster-cpx21-master1...
[Instance test-cluster-cpx21-master1] ...instance test-cluster-cpx21-master1 is now up.
[Firewall] Updating firewall...
[Firewall] ...firewall updated
[Instance test-cluster-cpx21-pool-database-small-static-worker1] Instance test-cluster-cpx21-pool-database-small-static-worker1 already exists, skipping create
[Instance test-cluster-cpx21-pool-database-small-static-worker1] Instance status: running
[Instance test-cluster-cpx21-master1] Cloud init finished: 27.90 - Wed, 31 Jul 2024 15:40:10 +0000 - v. 24.1.3-0ubuntu1~22.04.5
[Instance test-cluster-cpx21-master1] Private network IP in subnet 10.0.0.0/16 is up
[Instance test-cluster-cpx21-master1] [INFO]  Using v1.29.0+k3s1 as release
[Instance test-cluster-cpx21-master1] [INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.29.0+k3s1/sha256sum-amd64.txt
[Instance test-cluster-cpx21-master1] [INFO]  Skipping binary downloaded, installed k3s matches hash
[Instance test-cluster-cpx21-master1] [INFO]  Skipping installation of SELinux RPM
[Instance test-cluster-cpx21-master1] [INFO]  Skipping /usr/local/bin/kubectl symlink to k3s, already exists
[Instance test-cluster-cpx21-master1] [INFO]  Skipping /usr/local/bin/crictl symlink to k3s, already exists
[Instance test-cluster-cpx21-master1] [INFO]  Skipping /usr/local/bin/ctr symlink to k3s, already exists
[Instance test-cluster-cpx21-master1] [INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[Instance test-cluster-cpx21-master1] [INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[Instance test-cluster-cpx21-master1] [INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[Instance test-cluster-cpx21-master1] [INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[Instance test-cluster-cpx21-master1] [INFO]  systemd: Enabling k3s unit
[Instance test-cluster-cpx21-master1] [INFO]  No change detected so skipping service start
[Instance test-cluster-cpx21-master1] Waiting for the control plane to be ready...
[Control plane] Generating the kubeconfig file to /workspace/kubeconfig...
[Instance test-cluster-cpx21-pool-database-small-static-worker1] Waiting for successful ssh connectivity with instance test-cluster-cpx21-pool-database-small-static-worker1...
Switched to context "test-cluster-cpx21-master1".
[Control plane] ...kubeconfig file generated as /workspace/kubeconfig.
[Instance test-cluster-cpx21-pool-database-small-static-worker1] ...instance test-cluster-cpx21-pool-database-small-static-worker1 is now up.
Unhandled exception in spawn: timeout after 00:00:30 (Tasker::Timeout)
  from /usr/lib/crystal/core/channel.cr:453:10 in 'timeout'
  from /home/runner/work/hetzner-k3s/hetzner-k3s/src/kubernetes/installer.cr:124:7 in 'run'
  from /usr/lib/crystal/core/fiber.cr:143:11 in 'run'
  from ???

My config file is:

cluster_name: test-cluster
kubeconfig_path: "kubeconfig"
k3s_version: v1.29.0+k3s1

networking:
  ssh:
    port: 22
    use_agent: false
    public_key_path: "~/.ssh/id_rsa.pub"
    private_key_path: "~/.ssh/id_rsa"
  allowed_networks:
    ssh:
      - 0.0.0.0/0
    api:
      - 0.0.0.0/0
  public_network:
    ipv4: true
    ipv6: true
  private_network:
    enabled: true
    subnet: 10.0.0.0/16
    existing_network_name: ""

datastore:
  mode: etcd

schedule_workloads_on_masters: true

manifests:
   cloud_controller_manager_manifest_url: "https://github.com/hetznercloud/hcloud-cloud-controller-manager/releases/download/v1.20.0/ccm-networks.yaml"
   csi_driver_manifest_url: "https://raw.githubusercontent.com/hetznercloud/csi-driver/v2.8.0/deploy/kubernetes/hcloud-csi.yml"
   system_upgrade_controller_deployment_manifest_url: "https://github.com/rancher/system-upgrade-controller/releases/download/v0.13.4/system-upgrade-controller.yaml"
   system_upgrade_controller_crd_manifest_url: "https://github.com/rancher/system-upgrade-controller/releases/download/v0.13.4/crd.yaml"

include_instance_type_in_instance_name: true

masters_pool:
  instance_type: cpx21
  instance_count: 1
  location: fsn1

worker_node_pools:
  - name: database-small-static
    instance_type: cpx21
    instance_count: 1
    location: fsn1

vitobotta commented 2 days ago

Hi! Can you share the 1.x version of the config file?

boris-savic commented 2 days ago

cluster_name: test-cluster
kubeconfig_path: "kubeconfig"
k3s_version: v1.29.0+k3s1
public_ssh_key_path: "~/.ssh/id_rsa.pub"
private_ssh_key_path: "~/.ssh/id_rsa"
use_ssh_agent: false # set to true if your key has a passphrase or if SSH connections don't work or seem to hang without agent. See https://github.com/vitobotta/hetzner-k3s#limitations
# ssh_port: 22
ssh_allowed_networks:
  - 0.0.0.0/0 # ensure your current IP is included in the range
api_allowed_networks:
  - 0.0.0.0/0 # ensure your current IP is included in the range
private_network_subnet: 10.0.0.0/16 # ensure this doesn't overlap with other networks in the same project
schedule_workloads_on_masters: true
cloud_controller_manager_manifest_url: "https://github.com/hetznercloud/hcloud-cloud-controller-manager/releases/download/v1.20.0/ccm-networks.yaml"
csi_driver_manifest_url: "https://raw.githubusercontent.com/hetznercloud/csi-driver/v2.8.0/deploy/kubernetes/hcloud-csi.yml"
system_upgrade_controller_manifest_url: "https://github.com/rancher/system-upgrade-controller/releases/download/v0.13.4/system-upgrade-controller.yaml"
masters_pool:
  instance_type: cpx21
  instance_count: 1
  location: fsn1
worker_node_pools:
  - name: database-small-static
    instance_type: cpx21
    instance_count: 1
    location: fsn1
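
Comparing the 1.x config above with the 2.x config in the first post, the flat keys appear to have moved under `networking` and `manifests`. A minimal sketch of that renaming follows. The mapping is inferred only from the two files in this thread, and `KEY_MAP` and `migrate_config` are hypothetical names, not part of hetzner-k3s. It does not add keys that exist only in 2.x (for example `datastore` or `system_upgrade_controller_crd_manifest_url`), so treat the release notes as the authority.

```python
# Old flat 1.x key -> nested 2.x path. Inferred from the two configs posted
# in this thread only; this is an assumption, not an official migration table.
KEY_MAP = {
    "public_ssh_key_path": ("networking", "ssh", "public_key_path"),
    "private_ssh_key_path": ("networking", "ssh", "private_key_path"),
    "use_ssh_agent": ("networking", "ssh", "use_agent"),
    "ssh_port": ("networking", "ssh", "port"),
    "ssh_allowed_networks": ("networking", "allowed_networks", "ssh"),
    "api_allowed_networks": ("networking", "allowed_networks", "api"),
    "private_network_subnet": ("networking", "private_network", "subnet"),
    "cloud_controller_manager_manifest_url":
        ("manifests", "cloud_controller_manager_manifest_url"),
    "csi_driver_manifest_url": ("manifests", "csi_driver_manifest_url"),
    "system_upgrade_controller_manifest_url":
        ("manifests", "system_upgrade_controller_deployment_manifest_url"),
}


def migrate_config(old):
    """Return a new dict with known 1.x keys moved to their 2.x location.

    Keys that are not in KEY_MAP (cluster_name, masters_pool, ...) are
    copied through unchanged at the top level.
    """
    new = {}
    for key, value in old.items():
        path = KEY_MAP.get(key, (key,))
        node = new
        # Walk or create the intermediate mappings, then set the leaf.
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return new
```

The function works on an already parsed mapping, so the 1.x file would first have to be loaded with a YAML parser of your choice.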

vitobotta commented 2 days ago

How long ago did you create this cluster, and when did you last run the create command before upgrading to 2.x? Also, are you using the very latest version rather than 2.0.0?

boris-savic commented 2 days ago

I created the cluster a couple of months ago and haven't run any commands on it since.

I'm using 2.0.8. I installed it today and confirmed the version with --version.

vitobotta commented 2 days ago

Can you SSH into the master and check the status of the k3s service? Also check the logs with journalctl.

boris-savic commented 2 days ago

Inspecting the logs, it seems that something is wrong with the embedded registry:

Oct 09 21:03:24 test-cluster-cpx21-master1 k3s[515672]: time="2024-10-09T21:03:24Z" level=fatal msg="flag provided but not defined: -embedded-registry"
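
When the journal is long, the fatal entries can be filtered out mechanically. The sketch below assumes the text comes from something like `journalctl -u k3s --no-pager` on the master and that k3s keeps the `level=... msg="..."` layout seen in the line above. `fatal_messages` is a hypothetical helper name.

```python
import re

# Matches the level and quoted message of a k3s log entry such as:
#   time="..." level=fatal msg="flag provided but not defined: ..."
LOG_FIELDS = re.compile(r'level=(?P<level>\w+) msg="(?P<msg>(?:[^"\\]|\\.)*)"')


def fatal_messages(journal_text):
    """Return the msg field of every level=fatal entry, in order."""
    messages = []
    for line in journal_text.splitlines():
        match = LOG_FIELDS.search(line)
        if match and match.group("level") == "fatal":
            messages.append(match.group("msg"))
    return messages
```

A fatal message about an undefined flag usually means the installed k3s binary is older than the option the service file passes to it.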

boris-savic commented 2 days ago

Changing enabled to false for the embedded registry mirror helped:

embedded_registry_mirror:
  enabled: false

I had some more errors regarding the System Upgrade Controller, but they were resolved after rerunning the cluster create command.

Another, minor, issue: I also have some labels and taints set on the worker node (I left them out of the config file above because they didn't seem relevant), and something about them seems to break during the migration. The full worker spec is below:

worker_node_pools:
  - name: database-small-static
    instance_type: cpx21
    instance_count: 1
    location: fsn1
    # image: debian-11
    labels:
      - key: server/database
        value: "true"
    taints:
      - key: server/database
        value: true:NoSchedule

The error now seems to be:

[Instance test-cluster-cpx21-pool-database-small-static-worker1] [INFO]  No change detected so skipping service start
[Instance test-cluster-cpx21-pool-database-small-static-worker1] ...k3s has been deployed to worker test-cluster-cpx21-pool-database-small-static-worker1.
[Node labels]
Adding labels to database-small-static pool workers...
error: resource(s) were provided, but no name was specified
[Node labels] : error: resource(s) were provided, but no name was specified

vitobotta commented 2 days ago

Ah yes! I hadn't thought about the embedded registry. Your current k3s version (v1.29.0+k3s1) doesn't support it; see https://docs.k3s.io/installation/registry-mirror

So you'd have to upgrade k3s (with the hetzner-k3s upgrade command, see the docs) before you can enable the embedded registry.
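
As a rough way to check a `k3s_version` value before enabling the mirror, here is a minimal sketch. The minimum patch releases in `MIN_PATCH_BY_MINOR` are an assumption based on the version gate in the k3s registry-mirror page linked above (v1.26.13, v1.27.10, v1.28.6, v1.29.1), so verify them there before relying on this. The function names are hypothetical.

```python
import re

# Assumed first patch release per 1.x minor line that ships the embedded
# registry mirror. Taken from the k3s docs version gate; verify before use.
MIN_PATCH_BY_MINOR = {26: 13, 27: 10, 28: 6, 29: 1}


def parse_k3s_version(version):
    """Parse a tag like 'v1.29.0+k3s1' into (major, minor, patch)."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)\+k3s\d+", version)
    if match is None:
        raise ValueError(f"unrecognised k3s version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def supports_embedded_registry(version):
    """Return True if this k3s release should accept --embedded-registry."""
    major, minor, patch = parse_k3s_version(version)
    if major != 1:
        return major > 1
    if minor in MIN_PATCH_BY_MINOR:
        return patch >= MIN_PATCH_BY_MINOR[minor]
    # Minor lines newer than the table are assumed to include the feature;
    # older ones are assumed not to.
    return minor > max(MIN_PATCH_BY_MINOR)
```

With the config from this issue, `supports_embedded_registry("v1.29.0+k3s1")` is false, which matches the fatal error in the journal.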

As for the labels and taints, this has already been reported and I have fixed it. I will try to make a new release this weekend or the next, depending on how much time I have. In the meantime, since it's just a few nodes, you can label them manually as a workaround.
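
For the manual workaround, the `kubectl` commands can be derived from the pool spec. This is a minimal sketch that assumes each taint `value` already carries the effect (as in `true:NoSchedule` above) and that the node name is the one shown in the log. `manual_label_commands` is a hypothetical helper; it only builds the command strings and does not run them.

```python
import shlex


def manual_label_commands(node, labels, taints):
    """Build kubectl commands that apply the given labels and taints to a node.

    labels and taints are lists of {"key": ..., "value": ...} mappings,
    mirroring the worker pool spec in the config file.
    """
    commands = []
    for label in labels:
        commands.append(
            ["kubectl", "label", "nodes", node,
             f"{label['key']}={label['value']}", "--overwrite"]
        )
    for taint in taints:
        # The value is expected to look like "<value>:<effect>".
        commands.append(
            ["kubectl", "taint", "nodes", node,
             f"{taint['key']}={taint['value']}", "--overwrite"]
        )
    return [shlex.join(command) for command in commands]
```

Print the returned strings and review them before running them against the cluster, with `KUBECONFIG` pointing at the generated kubeconfig file.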

boris-savic commented 2 days ago

Thank you for all the help.

Perhaps you could add this information to the upgrade section of the 2.0.0 release notes :)

Closing the issue now.