vitobotta / hetzner-k3s

The easiest and fastest way to create and manage Kubernetes clusters in Hetzner Cloud using the lightweight distribution k3s by Rancher.

First feedback: Looking for help testing v2.0.0 #387

Closed · axgkl closed this 3 weeks ago

axgkl commented 1 month ago

First of all: Thanks again.

The binary works on my Fedora box out of the box; I did not need to compile it.

Nice to have:

Current minor issues:

Docs

Blocker

I'm installing from my bastion host again, at 10.4.0.2, into the private network 10.4.0.0/16.

Output:

```
Last login: Wed Jul 24 15:19:17 2024 from 84.150.82.88
root@bastion:~# ./hetzner-k3s create --config config.yaml
[Configuration] Validating configuration...
[Configuration] ...configuration seems valid.
[SSH key] Creating SSH key...
[SSH key] ...SSH key created
[Placement groups] Deleting unused placement group ax-masters...
[Placement groups] ...placement group ax-masters deleted
[Placement groups] Creating placement group ax-masters...
[Placement groups] ...placement group ax-masters created
[Instance ax-master2] Creating instance ax-master2 (attempt 1)...
[Instance ax-master3] Creating instance ax-master3 (attempt 1)...
[Instance ax-master1] Creating instance ax-master1 (attempt 1)...
[Instance ax-master2] Instance status: off
[Instance ax-master2] Powering on instance (attempt 1)
[Instance ax-master3] Instance status: off
[Instance ax-master3] Powering on instance (attempt 1)
[Instance ax-master2] Waiting for instance to be powered on...
[Instance ax-master3] Waiting for instance to be powered on...
[Instance ax-master1] Instance status: off
[Instance ax-master1] Powering on instance (attempt 1)
[Instance ax-master1] Waiting for instance to be powered on...
[Instance ax-master2] Instance status: running
[Instance ax-master2] Attaching instance to network (attempt 1)
[Instance ax-master3] Instance status: running
[Instance ax-master2] Waiting for instance to be attached to the network...
[Instance ax-master3] Attaching instance to network (attempt 1)
[Instance ax-master1] Instance status: running
[Instance ax-master3] Waiting for instance to be attached to the network...
[Instance ax-master1] Attaching instance to network (attempt 1)
[Instance ax-master1] Waiting for instance to be attached to the network...
[Instance ax-master3] Instance ax-master3 already exists, skipping create
[Instance ax-master2] Instance ax-master2 already exists, skipping create
[Instance ax-master1] Instance ax-master1 already exists, skipping create
[Instance ax-master3] Instance status: running
[Instance ax-master3] Waiting for successful ssh connectivity with instance ax-master3...
[Instance ax-master2] Instance status: running
[Instance ax-master2] Waiting for successful ssh connectivity with instance ax-master2...
[Instance ax-master1] Instance status: running
[Instance ax-master1] Waiting for successful ssh connectivity with instance ax-master1...
[Instance ax-master3] ...instance ax-master3 is now up.
[Instance ax-master2] Instance status: running
[Instance ax-master2] Waiting for successful ssh connectivity with instance ax-master2...
[Instance ax-master2] ...instance ax-master2 is now up.
[Instance ax-master1] ...instance ax-master1 is now up.
[Firewall] Updating firewall...
[Instance ax-master3] Instance status: running
[Instance ax-master3] Waiting for successful ssh connectivity with instance ax-master3...
[Firewall] ...firewall updated
[Instance ax-master2] ...instance ax-master2 is now up.
[Instance ax-master2] ...instance ax-master2 created
[Instance ax-master1] Instance status: running
[Instance ax-master1] Waiting for successful ssh connectivity with instance ax-master1...
[API Load balancer] Load balancer for API server already exists, skipping create
[Instance ax-master3] ...instance ax-master3 is now up.
[Instance ax-master3] ...instance ax-master3 created
[Instance ax-master1] ...instance ax-master1 is now up.
[Instance ax-master1] ...instance ax-master1 created
[Instance ax-master3] Awaiting cloud/instance/boot-finished...
[Instance ax-master3] Awaiting cloud/instance/boot-finished...
[Instance ax-master3] Awaiting cloud/instance/boot-finished...
[Instance ax-master3] Awaiting cloud/instance/boot-finished...
[Instance ax-master3] Awaiting cloud/instance/boot-finished...
[Instance ax-master3] Awaiting cloud/instance/boot-finished...
[Instance ax-master3] Awaiting cloud/instance/boot-finished...
[Instance ax-master3] Awaiting cloud/instance/boot-finished...
[Instance ax-master3] Awaiting cloud/instance/boot-finished...
[Instance ax-master3] Awaiting cloud/instance/boot-finished...
[Instance ax-master3] Awaiting cloud/instance/boot-finished...
[Instance ax-master3] [INFO] Using v1.26.4+k3s1 as release
[Instance ax-master3] [INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.26.4+k3s1/sha256sum-amd64.txt
[Instance ax-master3] [INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.26.4+k3s1/k3s
[Instance ax-master3] [INFO] Verifying binary download
[Instance ax-master3] [INFO] Installing k3s to /usr/local/bin/k3s
[Instance ax-master3] [INFO] Skipping installation of SELinux RPM
[Instance ax-master3] [INFO] Creating /usr/local/bin/kubectl symlink to k3s
[Instance ax-master3] [INFO] Creating /usr/local/bin/crictl symlink to k3s
[Instance ax-master3] [INFO] Creating /usr/local/bin/ctr symlink to k3s
[Instance ax-master3] [INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[Instance ax-master3] [INFO] Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[Instance ax-master3] [INFO] env: Creating environment file /etc/systemd/system/k3s.service.env
[Instance ax-master3] [INFO] systemd: Creating service file /etc/systemd/system/k3s.service
[Instance ax-master3] [INFO] systemd: Enabling k3s unit
[Instance ax-master3] [INFO] systemd: Starting k3s
[Instance ax-master3] Waiting for the control plane to be ready...
[Control plane] Saving the kubeconfig file to /root/kubeconfig...
^C
```

It would time out here. On master3 the service file looks good:

```
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service 2>/dev/null'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s \
    server \
    '--disable-cloud-controller' \
    '--disable' \
    'servicelb' \
    '--disable' \
    'traefik' \
    '--disable' \
    'metrics-server' \
    '--write-kubeconfig-mode=644' \
    '--node-name=ax-master3' \
    '--cluster-cidr=10.50.0.0/16' \
    '--service-cidr=10.60.0.0/16' \
    '--cluster-dns=10.60.0.10' \
    '--kube-controller-manager-arg=bind-address=0.0.0.0' \
    '--kube-proxy-arg=metrics-bind-address=0.0.0.0' \
    '--kube-scheduler-arg=bind-address=0.0.0.0' \
    '--kubelet-arg' \
    'cloud-provider=external' \
    '--kubelet-arg' \
    'resolv-conf=/etc/k8s-resolv.conf' \
    '--etcd-expose-metrics=true' \
    '--flannel-backend=none' \
    '--disable-network-policy' \
    '--disable-kube-proxy' \
    '--embedded-registry' \
    '--advertise-address=10.4.0.4' \
    '--node-ip=10.4.0.4' \
    '--node-external-ip=100.66.0.155' \
    '--cluster-init' \
    '--tls-san=' \
    '--tls-san=10.4.0.3' \
    '--tls-san=10.4.0.4' \
    '--tls-san=10.4.0.5' \
    '--tls-san=127.0.0.1' \
    '--tls-san=65.109.223.252' \
```

From the config:

```yaml
---
cluster_name: "ax"
kubeconfig_path: "./kubeconfig"
k3s_version: "v1.26.4+k3s1"

networking:
  ssh:
    port: 22
    use_agent: false # set to true if your key has a passphrase
    public_key_path: "~/.ssh/id_ed25519.pub"
    private_key_path: "~/.ssh/id_ed25519"
  allowed_networks:
    ssh:
      - 0.0.0.0/0
    api:
      - 0.0.0.0/0
  public_network:
    ipv4: false
    ipv6: true
  private_network:
    enabled: true
    subnet: 10.4.0.0/16
    existing_network_name: "ten_4"
  cni:
    enabled: true
    encryption: false
    mode: cilium
  cluster_cidr: 10.50.0.0/16
  service_cidr: 10.60.0.0/16
  cluster_dns: 10.60.0.10

datastore:
  mode: etcd # etcd (default) or external
  external_datastore_endpoint: postgres://....

schedule_workloads_on_masters: true

masters_pool:
  instance_type: "cx11"
  instance_count: 3
  location: "hel1"
  image: "ubuntu-22.04"

worker_node_pools:
- name: small-static
  instance_type: "cx11"
  instance_count: 0
  location: "hel1"
  image: "ubuntu-22.04"
  # labels:
  #   - key: purpose
  #     value: blah
  # taints:
  #   - key: something
  #     value: value1:NoSchedule
- name: medium-autoscaled
  instance_type: "cx11"
  instance_count: 3
  location: "hel1"
  image: "ubuntu-22.04"
  autoscaling:
    enabled: true
    min_instances: 0
    max_instances: 3

embedded_registry_mirror:
  enabled: true

additional_packages:
- ifupdown

post_create_commands:
- printf "started" > status
- timedatectl set-timezone Europe/Berlin
- echo 'ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBAgIhV5X2r7MthxFZBUZUZRo3zDbOS7k3GYXQVirne9W8fUE8QM1AUxVXOK/wL3zOSaP2CIqDT2OxfC+tWDnPu0= root@bastion' >> /root/.ssh/authorized_keys
- echo 'ssh-rsa AAAqqdSp....' >> /root/.ssh/authorized_keys
- echo "root:$(head -c 50 /dev/urandom | base64)" | chpasswd
- ip route add default via 10.4.0.1
- ip route add 169.254.0.0/16 via 172.31.1.1
- mkdir -p /etc/network/interfaces.d
- echo "auto ens10" > /etc/network/interfaces.d/ens10
- echo "iface ens10 inet dhcp" >> /etc/network/interfaces.d/ens10
- echo " post-up ip route add default via 10.4.0.1" >> /etc/network/interfaces.d/ens10
- echo " post-up ip route add 169.254.169.254 via 172.31.1.1" >> /etc/network/interfaces.d/ens10
- rm -f /etc/resolv.conf
- echo "nameserver 185.12.64.1" > /etc/resolv.conf
- echo "nameserver 185.12.64.2" >> /etc/resolv.conf
- echo "edns edns0 trust-ad" >> /etc/resolv.conf
- echo "search ." >> /etc/resolv.conf
- printf "done" > status
```

Those settings are still needed to get SSH working and internet access when the nodes have private IPs only.
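For completeness: I reach the private-IP masters from outside by hopping through the bastion. A minimal sketch (the bastion's public IP is a placeholder, key paths are my own):

```bash
# One-off: jump through the bastion to a private-IP master.
ssh -J root@BASTION_PUBLIC_IP root@10.4.0.4

# Or persist the jump host for the whole private subnet:
cat >> ~/.ssh/config <<'EOF'
Host 10.4.0.*
    User root
    IdentityFile ~/.ssh/id_ed25519
    ProxyJump root@BASTION_PUBLIC_IP
EOF
```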

On master1 and master2 nothing was downloaded; by the looks of it, the install was not even attempted.

All masters can ping each other, and all can reach the internet.

I'll let you know when I find out more about why the process does not continue.

vitobotta commented 1 month ago

Can you share the exact steps you have taken, or how you are using the bastion host, etc.? I haven't tested without public IPs yet. Tomorrow I will be traveling and will be back in a few days, so if you manage to figure out the issue when not using public IPs, that would be great; otherwise I will look into it when I am back.

vitobotta commented 1 month ago

Added a note about Helm, and added a blank line as you suggested to avoid confusion about the CIDR settings.

axgkl commented 1 month ago

OK, I'm back. The bastion is a cheap server with a public IP and 10.4.0.2 in a 10.4.0.0/16 network called 'ten-4', which I pointed your config at. The server NATs the outgoing requests from the cluster, and ten-4 has a default route set to 10.4.0.2. I.e. all three masters can ping named addresses on the internet with the given cloud-init.
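For anyone reproducing this, a minimal sketch of the NAT side on the bastion, assuming `eth0` is the public interface and `ens10` the private one (interface names are from my setup; the hcloud command reflects the route described above):

```bash
# On the bastion: forward and masquerade traffic from the private subnet.
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -s 10.4.0.0/16 -o eth0 -j MASQUERADE
iptables -A FORWARD -i ens10 -o eth0 -j ACCEPT
iptables -A FORWARD -i eth0 -o ens10 -m state --state RELATED,ESTABLISHED -j ACCEPT

# In the Hetzner network, point the default route at the bastion:
hcloud network add-route ten_4 --destination 0.0.0.0/0 --gateway 10.4.0.2
```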

Is it already a problem that it starts with master3, or is this normal?

I'll try to find the relevant place in the code; I can compile it here for more debugging.

Btw: I made myself a script which sets it all up, but I guess it's too inefficient time-wise for you to look at. Btw: I could add your SSH key to the bastion (from which calling your tool fails), if that would help.

vitobotta commented 1 month ago

I am doing some quick testing with your config now.

vitobotta commented 1 month ago

It doesn't matter which master it uses as the seed; they are created in parallel, so it just picks the first one that was created.

axgkl commented 1 month ago

OK, the first finding is this:

IF it fails, one often ends up with an empty kubeconfig.

And THAT causes subsequent create runs to crash early:

```
[Instance ax-master1] [INFO]  systemd: Starting k3s
[Instance ax-master1] Waiting for the control plane to be ready...
[Control plane] Saving the kubeconfig file to /root/kubeconfig...
^C
root@bastion:~# cat kubeconfig
root@bastion:~# ./hetzner-k3s create --config config.yaml
[Configuration] Validating configuration...
[Configuration] ...configuration seems valid.
[SSH key] SSH key already exists, skipping create
Error creating instance: Expected Array or Hash, not Nil
Instance creation for ax-master1 failed. Try rerunning the create command.
Error creating instance: Expected Array or Hash, not Nil
Instance creation for ax-master2 failed. Try rerunning the create command.
Error creating instance: Expected Array or Hash, not Nil
Instance creation for ax-master3 failed. Try rerunning the create command.
^C
root@bastion:~# rm kubeconfig
root@bastion:~# ./hetzner-k3s create --config config.yaml
[Configuration] Validating configuration...
[Configuration] ...configuration seems valid.
[SSH key] SSH key already exists, skipping create
[Instance ax-master2] Instance ax-master2 already exists, skipping create
[Instance ax-master1] Instance ax-master1 already exists, skipping create
[Instance ax-master3] Instance ax-master3 already exists, skipping create
[Instance ax-master1] Instance status: running
[Instance ax-master1] Waiting for successful ssh connectivity with instance ax-master1...
[Instance ax-master2] Instance status: running
[Instance ax-master2] Waiting for successful ssh connectivity with instance ax-master2...
[Instance ax-master3] Instance status: running
[Instance ax-master3] Waiting for successful ssh connectivity with instance ax-master3...
[Instance ax-master3] ...instance ax-master3 is now up.
[Instance ax-master1] ...instance ax-master1 is now up.
[Instance ax-master2] ...instance ax-master2 is now up.
[Firewall] Updating firewall...
[Firewall] ...firewall updated
[API Load balancer] Load balancer for API server already exists, skipping create
[Instance ax-master3] [INFO]  Using v1.26.4+k3s1 as release
[Instance ax-master3] [INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.26.4+k3s1/sha256sum-amd64.txt
[Instance ax-master3] [INFO]  Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.26.4+k3s1/k3s
[Instance ax-master3] [INFO]  Verifying binary download
[Instance ax-master3] [INFO]  Installing k3s to /usr/local/bin/k3s
[Instance ax-master3] [INFO]  Skipping installation of SELinux RPM
[Instance ax-master3] [INFO]  Creating /usr/local/bin/kubectl symlink to k3s
[Instance ax-master3] [INFO]  Creating /usr/local/bin/crictl symlink to k3s
[Instance ax-master3] [INFO]  Creating /usr/local/bin/ctr symlink to k3s
[Instance ax-master3] [INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[Instance ax-master3] [INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[Instance ax-master3] [INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[Instance ax-master3] [INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[Instance ax-master3] [INFO]  systemd: Enabling k3s unit
[Instance ax-master3] [INFO]  systemd: Starting k3s
[Instance ax-master3] Waiting for the control plane to be ready...
[Control plane] Saving the kubeconfig file to /root/kubeconfig...
```

vitobotta commented 1 month ago

I am still checking, but I think the problem may be due to the embedded registry being enabled for a k3s version that doesn't support it.

axgkl commented 1 month ago

I.e. checking that a kubeconfig is present, but removing it if it's empty, would make sense.
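Until such a check lands in the tool, a guard before re-running works; a minimal sketch:

```bash
# Remove the kubeconfig only if it exists but is empty (-s tests for
# non-zero size), so a valid kubeconfig from a previous run is kept.
[ -f ./kubeconfig ] && [ ! -s ./kubeconfig ] && rm -f ./kubeconfig
./hetzner-k3s create --config config.yaml
```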

vitobotta commented 1 month ago

Yeah, I can confirm the problem is the embedded registry: it is enabled in your config, but it's not supported by the k3s version you specify. See https://docs.k3s.io/installation/registry-mirror
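If I read those docs right, the embedded registry mirror first shipped in the January 2024 patch releases (v1.26.13+k3s1 on the 1.26 line), so v1.26.4+k3s1 predates it. A hedged pre-flight check one could run, with that minimum version as an assumption taken from the docs:

```bash
# Abort if the configured k3s version predates the embedded registry
# mirror. sort -V orders the two versions; the check passes only when
# the configured version is >= the minimum.
min="v1.26.13+k3s1"
cfg="v1.26.4+k3s1"   # from k3s_version in config.yaml
if [ "$(printf '%s\n%s\n' "$min" "$cfg" | sort -V | head -n1)" != "$min" ]; then
  echo "k3s $cfg does not support the embedded registry mirror" >&2
  exit 1
fi
```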

vitobotta commented 1 month ago

With the registry mirror disabled, the cluster works just fine for me with your config. Can you double-check?
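If it helps, you can flip the flag in place, assuming mikefarah's yq (v4) is installed; otherwise just edit config.yaml by hand:

```bash
# Turn off the embedded registry mirror and re-run the create command.
yq -i '.embedded_registry_mirror.enabled = false' config.yaml
./hetzner-k3s create --config config.yaml
```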

axgkl commented 1 month ago

OK, I set it to false. Btw: the two other nodes don't even have /etc/initialized touched.

vitobotta commented 1 month ago

The cluster is working perfectly for me with that change. Are you still having problems? I have made a note to add a config validation so that the create command aborts if you try to enable the registry mirror with an unsupported k3s version.

axgkl commented 1 month ago

YOU ROCK SO MUCH.

It ran through :sparkles:

Btw: I would never have used 1.26, but that's the version still in your config example, so I thought I'd better use that one => that should be changed. (https://github.com/vitobotta/hetzner-k3s/blob/more-refactoring/docs/Creating_a_cluster.md)

vitobotta commented 1 month ago

Yep, I will change it to the latest stable version. So, all good for you now? I am also testing autoscaling with your config, and it seems to work just fine for me.

vitobotta commented 1 month ago

There is a little issue concerning autoscaling that I have now made a note to fix. When you create the cluster, hetzner-k3s creates an SSH key with the fingerprint of the key you specify in the config file, but if a key with that fingerprint already exists, it skips that step. The problem is that when setting up the autoscaler, if a key with the same name as the cluster doesn't exist, no key will be set on the autoscaled nodes, so they will be set up with password auth only and you will receive email notifications from Hetzner with the passwords. No functionality is affected and the autoscaled nodes work just fine, but it's annoying. To fix it, I will ensure that when setting up the autoscaler I pick whichever existing key has the same fingerprint as the key specified in the config file. An alternative would be to abort the create command and ask the user to either rename the existing key after the cluster, or just delete it and let the tool recreate it.
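In the meantime, you can check whether Hetzner already knows a key with your fingerprint; a sketch using the hcloud CLI (Hetzner stores MD5 fingerprints, hence `-E md5`; the key path is an example):

```bash
# MD5 fingerprint of the local public key, without the "MD5:" prefix.
fp=$(ssh-keygen -l -E md5 -f ~/.ssh/id_ed25519.pub | awk '{print $2}' | sed 's/^MD5://')

# List the keys Hetzner knows about and look for that fingerprint.
hcloud ssh-key list -o columns=name,fingerprint | grep "$fp"
```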

axgkl commented 1 month ago

This was a massive milestone. I'll take it slowly, automating everything from my laptop, including the further steps.

Regarding the autoscaler-created nodes: I know, I already got the ping from my company admin about those mails. The chpasswd in the cloud-init was the first attempt to work around it... The solution for me is to use the bastion key, which is not added permanently in Hetzner, so you won't find the fingerprint and will create the key. And the actually important keys I get into authorized_keys via cloud-init anyway...

vitobotta commented 1 month ago

BTW, don't use Ubuntu 22.04 with Cilium, as there is a known issue (connectivity to the server is lost when restarting k3s, and you need to reboot the server). Use Ubuntu 24.04 or test with another OS.

axgkl commented 1 month ago

THAAAAANKS. Which OS do you recommend?

I recently saw a video about the guys from Reclaim the Stack; the speaker talked about you :-) They went all the way to Talos.

After CrowdStrike and Jia Tan, I've really become a bit more sensitive about fat base systems.

vitobotta commented 1 month ago

To be honest, I just use Ubuntu. It's simple and just works for the most part. It's one of the default images in Hetzner, so servers get created very quickly compared to using non-standard images. For me it works great. If you want a minimalist OS with a smaller attack surface, you can use MicroOS with hetzner-k3s. Someone was using it, and I also tested it a while ago and it worked well. But personally I like to keep things as simple as possible (that's why I went with k3s in the first place).

axgkl commented 1 month ago

Btw: I think the waiting should also be here:

[screenshot: 2024-07-24_809x217_scrot]

vitobotta commented 1 month ago

What do you mean? It's in the script already.

axgkl commented 1 month ago

The worker, not the master. My PR was only for the master :blush:

I didn't even see the worker script last week....

vitobotta commented 1 month ago

Ah, I see. I'll add it now. Thanks

vitobotta commented 1 month ago

Hey would you mind adding a docs page with instructions and your config example on how to set up a cluster with nodes without public IPs? :)

vitobotta commented 1 month ago

There is another thing I need to do: when you disable public IPs in the config, it should also disable them for the autoscaled nodes. This wasn't supported by the autoscaler before, but from what someone told me, I think it is now.

vitobotta commented 1 month ago

The embedded registry mirror is awesome! I tested scaling a deployment to 50 replicas spanning as many nodes, and the new replicas were created super fast thanks to the peer-to-peer distribution of the image. Pretty cool IMO.
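For reference, the test was along these lines (deployment name and image are placeholders, not the exact ones I used):

```bash
# Create a throwaway deployment and fan it out to 50 replicas; with the
# embedded registry mirror enabled, nodes pull the image from peers
# instead of each hitting the upstream registry.
kubectl create deployment scale-test --image=nginx:1.27
kubectl scale deployment scale-test --replicas=50
kubectl get pods -o wide --watch
```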

axgkl commented 1 month ago

> also disable them for autoscaled nodes

Oh yeah, this is so cool!

> adding a docs page

I will. If you don't mind, I'll add my from-scratch setup script, which sets up the bastion and the base network and all that, then installs hetzner-k3s on the bastion and renders the config.

It got a bit big though, nearly 500 lines... but I would label it as an opinionated suggestion/example. :blush:

Or would you prefer to keep such non-Crystal code out of this repo and add a link to another repo for it instead?

vitobotta commented 1 month ago

Eventually I will implement native support for this kind of setup directly in the tool so we won't need any scripts. These instructions etc. are more of a temporary thing, but since it's gonna take a while and several people have asked about this kind of setup every now and then, your instructions would be very useful to others :)

axgkl commented 1 month ago

> thanks to the peer-to-peer distribution of the image.

TBH, as a k8s noob, I assumed with all this crazy machinery that they'd securely pull an image into the system over the internet only once when x pods require it to start. Learning that they don't is a bit... crazy.

vitobotta commented 1 month ago

Yeah, thinking about it, it's definitely something I'd like to see built into Kubernetes itself. But it's nice that k3s installs Spegel for us and we only need to enable it.
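For context, on plain k3s the same feature is the `--embedded-registry` server flag (visible in the service file earlier in this thread) plus a mirrors entry in registries.yaml; as I understand the k3s docs, the registry list is up to you (the two below are examples):

```bash
# Tell the embedded (Spegel-backed) registry which upstream registries
# to mirror peer-to-peer, then restart k3s to pick it up.
mkdir -p /etc/rancher/k3s
cat > /etc/rancher/k3s/registries.yaml <<'EOF'
mirrors:
  docker.io:
  registry.k8s.io:
EOF
systemctl restart k3s
```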

axgkl commented 1 month ago

All right, I was also thinking about adding it all to the binary and sending you a PR. But I'm a Python guy, and I also wanted people in my company to have an easy time adding their own functions for their higher-level needs, e.g. OTel and DBs. They still do everything in Ansible, since we need a lot of on-prem setups outside k8s as well...

Give me a few days to polish it all up while you're on your business trip. Cheers, Gunther

vitobotta commented 1 month ago

Sounds good 👍 Thanks

vitobotta commented 1 month ago

and thanks a lot for helping with testing this scenario!

vitobotta commented 1 month ago

@axgkl Can we close this now? I am trying to clean up the issues list :)

vitobotta commented 3 weeks ago

I think we can close this now.