vitobotta / hetzner-k3s

The easiest and fastest way to create and manage Kubernetes clusters in Hetzner Cloud using the lightweight distribution k3s by Rancher.
MIT License

Some workarounds for priv IP clusters #379

Closed: axgkl closed this issue 4 months ago

axgkl commented 4 months ago

Hi,

since I fiddled for quite some time, maybe I can save some time for others with this.

The enable_public_net_ipv4: false switch caused a few problems here, so I'll try to list my workarounds. Others report that cluster creation works (e.g. #372) - but not for me and seemingly others - I kept getting the same errors, even with the exact same config as #372 :-/ (except the name of the private network).

Goals

Problems

a) k3s did not install: empty kubeconfig on the bastion node, failing with an STDIN error the first time the kubeconfig was applied, even over repeated retries
b) missing DNS on the cluster nodes
c) the autoscaler created the nodes but could not include them in the cluster
d) ssh rejects (max retries exceeded) when trying to ssh into the autoscaled nodes
e) the upgrade controller was broken

Solutions

e) was caused by this and was solved by running kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml after the cluster was up. I expect Rancher to fix this soon, i.e. it might work out of the box (again) by the time you read this.

All the other points are addressed by the following config:

additional_packages:
 - ifupdown

post_create_commands:
  - printf "started" > status
  - timedatectl set-timezone Europe/Berlin
  - ip route add default via 10.0.0.1           # default route via the private network gateway (towards the NAT bastion)
  - ip route add 169.254.0.0/16 via 172.31.1.1  # keep the link-local range (Hetzner metadata service) reachable
  - mkdir -p /etc/network/interfaces.d
  - echo "auto enp7s0"                                              > /etc/network/interfaces.d/enp7s0
  - echo "iface enp7s0 inet dhcp"                                  >> /etc/network/interfaces.d/enp7s0
  - echo "    post-up ip route add default via 10.0.0.1"           >> /etc/network/interfaces.d/enp7s0
  - echo "    post-up ip route add 169.254.169.254 via 172.31.1.1" >> /etc/network/interfaces.d/enp7s0
  - rm -f                            /etc/resolv.conf
  - echo 'nameserver 185.12.64.1'  > /etc/resolv.conf
  - echo 'nameserver 185.12.64.2' >> /etc/resolv.conf
  - echo 'options edns0 trust-ad' >> /etc/resolv.conf
  - echo 'search .'               >> /etc/resolv.conf
  - sed -i '1i export INSTALL_K3S_SKIP_DOWNLOAD=true' /root/.bashrc
  - wget 'https://github.com/k3s-io/k3s/releases/download/v1.29.6+k3s2/k3s'
  - chmod +x k3s
  - mv k3s /usr/local/bin/k3s
  - echo 'ssh-ed25519 AAAAC3Nz....  admin@bast' >> /root/.ssh/authorized_keys
  - echo 'ssh-rsa AAAAB3Nzx1yc...  me@mylaptop' >> /root/.ssh/authorized_keys
  - echo "root:somesupersecret" | chpasswd
  - printf "done" > status

The bastion host has 10.0.0.2 (the first host created on the private network) and a public IPv4. On it, echo 1 > /proc/sys/net/ipv4/ip_forward and iptables -t nat -A POSTROUTING -s '10.0.0.0/16' -o eth0 -j MASQUERADE are run and persisted.
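For persistence, this is roughly what I mean (a sketch assuming a Debian/Ubuntu bastion with the iptables-persistent package; adapt for other distros):

```bash
# Enable IPv4 forwarding now and across reboots
echo 1 > /proc/sys/net/ipv4/ip_forward
echo 'net.ipv4.ip_forward = 1' > /etc/sysctl.d/99-nat.conf

# Masquerade traffic from the private network out of the public interface
iptables -t nat -A POSTROUTING -s '10.0.0.0/16' -o eth0 -j MASQUERADE

# Persist the rule (assumes the iptables-persistent/netfilter-persistent package)
apt-get install -y iptables-persistent
netfilter-persistent save
```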

Discussion

Edit: see my follow-up comment below for the root cause of the install failures.


Thanks in any case for the nice installer, way easier than Terraform!

axgkl commented 4 months ago

@vitobotta

I think I found the reason for the k3s install to fail with private IPs:

  1. I compiled a new version of hetzner-k3s, using your Dockerfile.dev[1], with this as the first lines of master_install_script.sh:
while [ ! -f /var/lib/cloud/instance/boot-finished ]; do
    echo "Waiting for cloud init..."
    sleep 5
done
  2. Output, with the k3s download removed from my cloud-init above:
(...)
Waiting for successful ssh connectivity with server axc3-cx22-master3...
...server axc3-cx22-master1 is now up.
...server axc3-cx22-master3 is now up.
...server axc3-cx22-master2 is now up.
Creating load balancer for API server...done.

=== Setting up Kubernetes ===
Deploying k3s to first master axc3-cx22-master1...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] [INFO]  Using v1.29.6+k3s2 as release
[axc3-cx22-master1] [INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.29.6+k3s2/sha256sum-amd64.txt
[axc3-cx22-master1] [INFO]  Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.29.6+k3s2/k3s
[axc3-cx22-master1] [INFO]  Verifying binary download
[axc3-cx22-master1] [INFO]  Installing k3s to /usr/local/bin/k3s
(...) all working, running through on the first attempt :sparkle:

I.e. it takes nearly a minute until cloud-init is done - and before that, all repeated attempts to run your app will fail in such a setup.
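For robustness, a timeout-guarded variant of that wait loop would avoid hanging forever on images that never write the marker file (a hypothetical hardening, not what's in the PR):

```bash
# Wait for cloud-init to finish, but give up after ~2 minutes
timeout=120
while [ ! -f /var/lib/cloud/instance/boot-finished ]; do
    echo "Waiting for cloud init..."
    sleep 5
    timeout=$((timeout - 5))
    if [ "$timeout" -le 0 ]; then
        echo "Timed out waiting for cloud init, continuing anyway..."
        break
    fi
done
```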

Sent you a PR. I know you're working on a new version, but maybe it's a no-brainer.

[1]: Had to comment out the stern Go lib in the Dockerfile. It depends on a newer Go version (1.22) than the one you install into the image (1.21).

vitobotta commented 4 months ago

Hi @axgkl , thanks a lot for your investigation and for the PR! As for the part where you detail the steps to set up a cluster with private IPs only, I was going to convert it to a discussion but I will leave it as an issue for now because I might try to implement all of that directly in the tool in a future release.

As for the PR you already made, can you verify if it works with different distros before I merge it? Thanks!

axgkl commented 4 months ago

can you verify if it works with different distros

Sure thing - it would hang forever without that file.

  1. Checked all instances I have currently and all had the file, intel and arm.

  2. Created instances for all images under `api.hetzner.cloud/v1/images` and checked for the presence of the file:

 ❯ ./check.sh
Name: ubuntu-20-04
IP: 65.21.187.83
Name: lamp
IP: 37.27.183.116
Name: wordpress
IP: 95.217.217.42
Name: jitsi
IP: 37.27.42.244
Name: nextcloud
IP: 95.216.147.135
Name: docker-ce
IP: 37.27.95.225
Name: gitlab
IP: 65.109.166.111
Name: debian-11
IP: 65.109.169.151
Name: rocky-8
IP: 65.21.3.35
Name: centos-stream-9
IP: 65.109.130.114
Name: ubuntu-22-04
IP: 65.21.62.187
Name: prometheus-grafana
IP: 135.181.254.190
Name: rocky-9
IP: 95.217.131.165
Name: wireguard
IP: 65.21.57.20
Name: owncast
IP: 65.21.111.73
Name: photoprism
IP: 135.181.25.8
Name: rustdesk
IP: 37.27.176.115
Name: alma-8
IP: 65.109.173.121
Name: alma-9
IP: 37.27.185.18

Presence check follows:
65.21.187.83
-rw-r--r-- 1 root root 69 Jul 16 11:15 /var/lib/cloud/instance/boot-finished
37.27.183.116
-rw-r--r-- 1 root root 61 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
95.217.217.42
-rw-r--r-- 1 root root 61 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
37.27.42.244
-rw-r--r-- 1 root root 69 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
95.216.147.135
-rw-r--r-- 1 root root 61 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
37.27.95.225
-rw-r--r-- 1 root root 61 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
65.109.166.111
-rw-r--r-- 1 root root 69 Jul 16 11:17 /var/lib/cloud/instance/boot-finished
65.109.169.151
-rw-r--r-- 1 root root 52 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
65.21.3.35
-rw-r--r--. 1 root root 67 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
65.109.130.114
-rw-r--r--. 1 root root 57 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
65.21.62.187
-rw-r--r-- 1 root root 69 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
135.181.254.190
-rw-r--r-- 1 root root 61 Jul 16 11:17 /var/lib/cloud/instance/boot-finished
95.217.131.165
-rw-r--r--. 1 root root 62 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
65.21.57.20
-rw-r--r-- 1 root root 61 Jul 16 11:17 /var/lib/cloud/instance/boot-finished
65.21.111.73
-rw-r--r-- 1 root root 61 Jul 16 11:17 /var/lib/cloud/instance/boot-finished
135.181.25.8
-rw-r--r-- 1 root root 61 Jul 16 11:17 /var/lib/cloud/instance/boot-finished
37.27.176.115
-rw-r--r-- 1 root root 61 Jul 16 11:17 /var/lib/cloud/instance/boot-finished
65.109.173.121
-rw-r--r--. 1 root root 68 Jul 16 11:17 /var/lib/cloud/instance/boot-finished
37.27.185.18
-rw-r--r--. 1 root root 65 Jul 16 11:17 /var/lib/cloud/instance/boot-finished
Created via:

```bash
token=$(pass show HCloud/token)
images=$(curl -H "Authorization: Bearer $token" "https://api.hetzner.cloud/v1/images" | jq -r '.images[].name')
for image in $images; do
  name="$(echo $image | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9-]/-/g' | cut -c 1-63)"
  echo "$name"
  curl -X POST \
    -H "Authorization: Bearer $token" \
    -H "Content-Type: application/json" \
    -d '{
      "name": "'$name'",
      "server_type": "cx11",
      "image": "'$image'",
      "location": "hel1",
      "ssh_keys": ["hcloud_key_root"]
    }' \
    "https://api.hetzner.cloud/v1/servers" | tee "$name"
  sleep 4
done
```
  3. Lastly, on an ARM instance I put rm -f /var/lib/cloud/instance/boot-finished into the cloud-init box and created the instance manually ;-) Result: the file is present nevertheless.

So yeah, I think we should be safe - I guess even with self-made images one cannot bypass the cloud-init system; I did not find any hints of that possibility.
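For reference, the presence check was essentially a loop like the following (a hypothetical reconstruction, since I didn't paste check.sh; the jq paths match the Hetzner Cloud API):

```bash
#!/usr/bin/env bash
# For every server in the project, print its name/IP and check the cloud-init marker file
token=$(pass show HCloud/token)
curl -s -H "Authorization: Bearer $token" "https://api.hetzner.cloud/v1/servers" |
  jq -r '.servers[] | "\(.name) \(.public_net.ipv4.ip)"' |
while read -r name ip; do
  echo "Name: $name"
  echo "IP: $ip"
  ssh -o StrictHostKeyChecking=no "root@$ip" \
    'ls -l /var/lib/cloud/instance/boot-finished' || echo "missing on $ip"
done
```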

axgkl commented 4 months ago

I might try to implement all of that directly in the tool in a future release.

I really like the idea of having just one plain Linux host with a public IP in front of the cluster, with a few forwarding rules or a Caddy server, plus the NAT. Any junior can recreate it for me when it fails - in stark contrast to a Kubernetes master, especially after 5 new major k8s releases. I.e. one bastion plus 3 internal masters that also handle load is for me the best and also the cheapest setup. I don't even need the API load balancer; I keep deleting it manually. For kubectl I use a plain ssh port forward from my laptop, via that bastion node, into the cluster (see the sketch below).

=> Would be really cool if your next release allowed skipping its creation.
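A minimal sketch of that tunnel (the master IP and bastion address are placeholders from my setup, not something the tool generates):

```bash
# Forward local port 6443 to the API server of an internal master,
# going through the bastion's public IP
ssh -f -N -L 6443:10.0.0.3:6443 root@<bastion-public-ip>

# Point kubectl at the tunnel. The master's TLS cert needs 127.0.0.1
# as a SAN, or use --insecure-skip-tls-verify for a quick test.
kubectl --server=https://127.0.0.1:6443 get nodes
```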

PS: The more I look into the tool, the more I like it. Awesome language too, that Crystal. Getting static binaries plus that performance in a language that feels like scripting is insane :heart_eyes:

vitobotta commented 4 months ago

Cool, thanks a lot for checking with other images too! I'll merge the PR in a moment. Pretty handy, and I'm glad you found the problem that several people were having. I have been too busy with the day job and bug bounties, so I haven't had much time lately, and I still need to finalize v2 of the tool.

As for the API load balancer, I was already thinking of allowing a choice between the load balancer and a composite kubeconfig with the contexts for all the masters merged into one, so the user can just switch from one master to another if needed without requiring a load balancer.
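Something like this is already possible with stock kubectl (the file names below are just placeholders for illustration):

```bash
# Merge per-master kubeconfigs into one file with one context per master
KUBECONFIG=master1.yaml:master2.yaml:master3.yaml \
  kubectl config view --flatten > merged.yaml
export KUBECONFIG=$PWD/merged.yaml

# Switch masters without a load balancer
kubectl config get-contexts
kubectl config use-context master2
```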

Crystal is pretty cool, I love it. I have been working with Ruby for eons so when I needed to choose a language to be able to offer standalone binaries, Crystal was the natural choice since it's pretty similar in syntax. And yes, it's fast! I also love the way you can use channels for concurrency.

axgkl commented 4 months ago

I still need to finalize v2 so I haven't had much time lately

https://www.softwaremaxims.com/blog/not-a-supplier (the last paragraph sums it up pretty well: you don't need to do anything on this one, the tool is already a massive help. Just my 2 cents.)

Funzinator commented 4 months ago

@axgkl very nice write-up... But I guess I am missing something.

I have set up a NAT gateway and configured the network. If I create the master and SSH into it through the bastion host, I can reach the internet (most importantly github.com). But hetzner-k3s is stuck with

Waiting for successful ssh connectivity with server test-cax11-master1...

This will never work because there isn't any public IP address, and hetzner-k3s won't be able to use the jump host out of the box, will it?

(I used the latest dev version, btw)

Edit: OK, for some reason I didn't understand that I should run hetzner-k3s from within the custom network so it can reach the servers on the internal IP addresses... Ignore this question...

vitobotta commented 4 months ago

@axgkl rc1 of v2 is now available. Would you be able to help with testing? If yes, please see https://github.com/vitobotta/hetzner-k3s/discussions/385 for details on rc1. Thanks

@Funzinator if you can help too that would be great :)

axgkl commented 4 months ago

Edit: OK, for some reason I didn't understand that I should run hetzner-k3s from within the custom network so it can reach the servers on the internal IP addresses... Ignore this question...

@Funzinator Exactly. It also has the advantage of somewhat better chances that the binary works on the bastion image - I had to compile it on my Fedora here due to a problem with a library, discussed elsewhere. Currently I'm writing a script which sets it all up from scratch from your laptop: the bastion itself and the private network; it then downloads hetzner-k3s on the bastion and kicks off the process. If you are interested I'll make it available.

@vitobotta Happy to test it, I'll include that version in my from-scratch script (and thanks for investing yet more of your time in making the world a better place ;-) )

Funzinator commented 4 months ago

@axgkl I am using an Ansible playbook for my bastion host (which also serves as a NAT64 gateway), so I guess I am good for now - but I always seek inspiration from others' work, so feel free to share.

In order to build the project, I had to fix a dependency in Dockerfile.dev:

-RUN go install github.com/stern/stern@latest
+RUN go install github.com/stern/stern@9763d95

because Alpine doesn't have go1.22 in this particular version, and the pinned commit is the last release that works with go1.21. Is that the library you meant?

@vitobotta I will also give it a test run, hopefully in the next days.

vitobotta commented 4 months ago

@Funzinator the config discussed in https://github.com/vitobotta/hetzner-k3s/issues/387 works very well.

axgkl commented 4 months ago

Is that the library you meant?

@Funzinator Right, I commented that out when building the filesystem for my compile runs => many thanks for this actual fix ;-)

PS: I'll put a docs page together in the next days, since Vito asked me to.