@vitobotta
I think I found the reason for the k3s install failing with private IPs [1], with this as the first lines of `master_install_script.sh`:

```bash
while [ ! -f /var/lib/cloud/instance/boot-finished ]; do
  echo "Waiting for cloud init..."
  sleep 5
done
```
(...)

```
Waiting for successful ssh connectivity with server axc3-cx22-master3...
...server axc3-cx22-master1 is now up.
...server axc3-cx22-master3 is now up.
...server axc3-cx22-master2 is now up.
Creating load balancer for API server...done.
=== Setting up Kubernetes ===
Deploying k3s to first master axc3-cx22-master1...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] Waiting for cloud init...
[axc3-cx22-master1] [INFO] Using v1.29.6+k3s2 as release
[axc3-cx22-master1] [INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.29.6+k3s2/sha256sum-amd64.txt
[axc3-cx22-master1] [INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.29.6+k3s2/k3s
[axc3-cx22-master1] [INFO] Verifying binary download
[axc3-cx22-master1] [INFO] Installing k3s to /usr/local/bin/k3s
```
(...) all working, running through at the first attempt :sparkle:

I.e. it takes nearly a minute until cloud init is done - and before that, any attempt to run your app in such a setup will fail, no matter how often you retry.

Sent you a PR. I know you're working on a new version, but maybe it's a no-brainer.
[1]: Had to comment out the stern go lib in the Dockerfile. It depends on a newer Go version (1.22) than the one you install into the image (1.21).
Hi @axgkl , thanks a lot for your investigation and for the PR! As for the part where you detail the steps to set up a cluster with private IPs only, I was going to convert it to a discussion but I will leave it as an issue for now because I might try to implement all of that directly in the tool in a future release.
As for the PR you already made, can you verify if it works with different distros before I merge it? Thanks!
> can you verify if it works with different distros

Sure thing - it would hang forever without that file.
Checked all instances I currently have, and all had the file, Intel and ARM.
Created instances for all images under `api.hetzner.cloud/v1/images` and checked for presence of the file:
```
❯ ./check.sh
Name: ubuntu-20-04         IP: 65.21.187.83
Name: lamp                 IP: 37.27.183.116
Name: wordpress            IP: 95.217.217.42
Name: jitsi                IP: 37.27.42.244
Name: nextcloud            IP: 95.216.147.135
Name: docker-ce            IP: 37.27.95.225
Name: gitlab               IP: 65.109.166.111
Name: debian-11            IP: 65.109.169.151
Name: rocky-8              IP: 65.21.3.35
Name: centos-stream-9      IP: 65.109.130.114
Name: ubuntu-22-04         IP: 65.21.62.187
Name: prometheus-grafana   IP: 135.181.254.190
Name: rocky-9              IP: 95.217.131.165
Name: wireguard            IP: 65.21.57.20
Name: owncast              IP: 65.21.111.73
Name: photoprism           IP: 135.181.25.8
Name: rustdesk             IP: 37.27.176.115
Name: alma-8               IP: 65.109.173.121
Name: alma-9               IP: 37.27.185.18
```
Presence check follows:
```
65.21.187.83
-rw-r--r-- 1 root root 69 Jul 16 11:15 /var/lib/cloud/instance/boot-finished
37.27.183.116
-rw-r--r-- 1 root root 61 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
95.217.217.42
-rw-r--r-- 1 root root 61 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
37.27.42.244
-rw-r--r-- 1 root root 69 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
95.216.147.135
-rw-r--r-- 1 root root 61 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
37.27.95.225
-rw-r--r-- 1 root root 61 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
65.109.166.111
-rw-r--r-- 1 root root 69 Jul 16 11:17 /var/lib/cloud/instance/boot-finished
65.109.169.151
-rw-r--r-- 1 root root 52 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
65.21.3.35
-rw-r--r--. 1 root root 67 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
65.109.130.114
-rw-r--r--. 1 root root 57 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
65.21.62.187
-rw-r--r-- 1 root root 69 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
135.181.254.190
-rw-r--r-- 1 root root 61 Jul 16 11:17 /var/lib/cloud/instance/boot-finished
95.217.131.165
-rw-r--r--. 1 root root 62 Jul 16 11:16 /var/lib/cloud/instance/boot-finished
65.21.57.20
-rw-r--r-- 1 root root 61 Jul 16 11:17 /var/lib/cloud/instance/boot-finished
65.21.111.73
-rw-r--r-- 1 root root 61 Jul 16 11:17 /var/lib/cloud/instance/boot-finished
135.181.25.8
-rw-r--r-- 1 root root 61 Jul 16 11:17 /var/lib/cloud/instance/boot-finished
37.27.176.115
-rw-r--r-- 1 root root 61 Jul 16 11:17 /var/lib/cloud/instance/boot-finished
65.109.173.121
-rw-r--r--. 1 root root 68 Jul 16 11:17 /var/lib/cloud/instance/boot-finished
37.27.185.18
-rw-r--r--. 1 root root 65 Jul 16 11:17 /var/lib/cloud/instance/boot-finished
```
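(The actual check.sh wasn't posted; purely as an illustration, it could look roughly like this, assuming `HCLOUD_TOKEN` is exported and the local ssh key is authorized on the servers:)

```bash
#!/usr/bin/env bash
# Hypothetical sketch: list servers via the Hetzner Cloud API, then check
# each one for the cloud-init marker file over ssh.
curl -s -H "Authorization: Bearer $HCLOUD_TOKEN" \
  "https://api.hetzner.cloud/v1/servers" |
  jq -r '.servers[] | "\(.name) \(.public_net.ipv4.ip)"' |
  while read -r name ip; do
    printf 'Name: %s\nIP: %s\n' "$name" "$ip"
    ssh -o StrictHostKeyChecking=accept-new "root@$ip" \
      ls -l /var/lib/cloud/instance/boot-finished
  done
```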
I even put `rm -f /var/lib/cloud/instance/boot-finished` into the cloud init box and manually created the instance ;-)

Result: the file is present nevertheless. So yeah, I think we should be safe - I guess even with self-made images one cannot bypass the cloud init system; I did not find any hints of that possibility.
> I might try to implement all of that directly in the tool in a future release.
I really like the idea of having just one plain Linux host with a public IP, with a few forwarding rules or a Caddy server into the cluster, plus the NAT. Because any junior can recreate it for me when it fails - in stark contrast to a Kubernetes master, especially after 5 new major k8s releases. I.e. one bastion plus 3 internal masters that also handle load is for me the best and also the cheapest setup. I don't even need the API load balancer; I keep deleting it manually. For kubectl on my laptop I use a plain ssh port forward via that bastion node into the cluster (roughly as sketched below).

=> Would be really cool if your next release allowed not creating that.
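For illustration, the kubectl-over-bastion part amounts to something like this (the host name and the master's private IP are made up):

```bash
# Assumed: bastion.example.com is the bastion's public address and
# 10.0.0.3 is the private IP of one master. Forward the API port locally:
ssh -f -N -L 6443:10.0.0.3:6443 root@bastion.example.com

# In the kubeconfig, point the cluster's server entry at the tunnel:
#   server: https://127.0.0.1:6443
# (k3s' API server certificate normally includes 127.0.0.1 in its SANs,
#  so TLS verification still passes)
kubectl get nodes
```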
PS: The more I look into the tool, the more I like it. Awesome language also, that Crystal. Getting static binaries plus that performance in a language which feels like scripting is insane :heart_eyes:
Cool, thanks a lot for checking with other images too! I'll merge the PR in a moment. Pretty handy, and I'm glad you found the problem that several people were having. I have been too busy with the day job and bug bounties, so I haven't had much time lately, and I still need to finalize v2 of the tool.
As for the API load balancer, I was already thinking of allowing a choice between the load balancer or a composite kubeconfig with the contexts for all the masters merged into one, so the user can just switch from one master to another if needed without requiring a load balancer.
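With such a merged kubeconfig, switching masters would then be a one-liner (context names made up for the example):

```bash
export KUBECONFIG=./kubeconfig
kubectl config get-contexts                   # list the per-master contexts
kubectl config use-context mycluster-master2  # switch to another master
```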
Crystal is pretty cool, I love it. I have been working with Ruby for eons so when I needed to choose a language to be able to offer standalone binaries, Crystal was the natural choice since it's pretty similar in syntax. And yes, it's fast! I also love the way you can use channels for concurrency.
> I still need to finalize v2 so I haven't had much time lately
https://www.softwaremaxims.com/blog/not-a-supplier (the last paragraph sums it up pretty well: you don't need to do anything on this one, the tool is already a massive help. Just my 2 cents.)
@axgkl very nice write-up... But I guess I am missing something.
I have set up a NAT gateway and configured the network. If I create the master and SSH to it through the bastion host, I can reach the internet (most importantly github.com). But hetzner-k3s is stuck with

```
Waiting for successful ssh connectivity with server test-cax11-master1...
```

This will never work because there isn't any public IP address, and hetzner-k3s won't be able to use the jump host out of the box, will it?
(I used the latest dev version, btw)
Edit: OK, for some reason I didn't understand that I should run hetzner-k3s from within the custom network so it can reach the servers on the internal IP addresses... Ignore this question...
@axgkl rc1 of v2 is now available. Would you be able to help with testing? If yes, please see https://github.com/vitobotta/hetzner-k3s/discussions/385 for details on rc1. Thanks!
@Funzinator if you can help too, that would be great :)
> Edit: OK, for some reason I didn't understand that I should run hetzner-k3s from within the custom network so it can reach the servers on the internal IP addresses... Ignore this question...
@Funzinator Exactly. It also has the advantage of somewhat better chances that it works on the bastion image - I had to compile it on my Fedora here due to a problem with a library, discussed elsewhere. Currently I'm making a script which sets it all up from scratch, from your laptop: the bastion itself and the private network; it then downloads hetzner-k3s onto the bastion and kicks off the process. If you are interested I'll make it available.
@vitobotta Happy to test it, I will include that version in my from-scratch script (and thanks for investing yet more of your time making the world a better place ;-) )
@axgkl I am using an ansible-playbook for my bastion host (which also serves as a NAT64 gateway), so I guess I am good for now - but I always seek inspiration from others' work, so feel free to share.
In order to build the project, I had to fix a dependency in Dockerfile.dev:

```diff
-RUN go install github.com/stern/stern@latest
+RUN go install github.com/stern/stern@9763d95
```

because Alpine doesn't have go1.22 in this particular version, and the pinned commit is the last release that works with go1.21. Is that the library you meant?
@vitobotta I will also give it a test run, hopefully in the next days.
@Funzinator the config discussed in https://github.com/vitobotta/hetzner-k3s/issues/387 works very well.
> Is that the library you meant?
@Funzinator Right, I commented that out when building the filesystem for my compile runs => many thanks for this actual fix ;-)
PS: I'll put a docs page together in the next days, since Vito asked me to.
Hi,
since I fiddled for quite some time, maybe I can save some time for others with this.
The `enable_public_net_ipv4: false` switch did involve a few problems here, so I'll try to list my workarounds. Others report that cluster creation works (e.g. #372) - but not for me and seemingly others; I keep getting the same errors, even with the exact same config as #372 :-/ (except the name of the private network).

Goals
Problems

a) k3s did not install: empty kubeconfig on the bastion node, failing with an STDIN error at first kubeconfig apply time - even over repeated retries
b) missing DNS on the cluster nodes
c) the autoscaler did create the nodes but could not include them into the cluster
d) ssh rejects (max retries exceeded) when trying to ssh into the autoscaled nodes
e) the upgrade controller was broken
Solutions

e) was caused by this and was solved by running

```bash
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml
```

after the cluster was up. I expect Rancher to fix this soon, i.e. it might work out of the box (again) by the time you read this.

All other points were addressed with the following config:
The bastion host has 10.0.0.2 (the first host created on the private network) and has a public IPv4. On it,

```bash
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -s '10.0.0.0/16' -o eth0 -j MASQUERADE
```

are run and persisted.
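As a sketch, persisting those two bits could look like this on a Debian/Ubuntu bastion (the package and file names here are my assumption, adjust to your distro):

```bash
# Keep IP forwarding enabled across reboots:
echo 'net.ipv4.ip_forward = 1' > /etc/sysctl.d/99-nat.conf
sysctl --system

# Persist the MASQUERADE rule via the iptables-persistent package:
apt-get install -y iptables-persistent
iptables -t nat -A POSTROUTING -s '10.0.0.0/16' -o eth0 -j MASQUERADE
netfilter-persistent save
```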
Discussion

The ifupdown based network setup configures the nodes to go to the internet via the bastion host, which NATs their traffic.

The nodes' gateway is 10.0.0.1 - while the bastion host is on 10.0.0.2 - it has to be like that! (Check `tracepath -n 169.254.169.254` on the bastion host: the first hop.)

DNS is brutally set to point to Hetzner's DNS servers, away from systemd-resolved, which did not resolve queries from within containers.
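For illustration, the node-side part could amount to something like this (the interface name and Hetzner's resolver addresses are assumptions on my side):

```bash
# Assumed: enp7s0 is the private interface; 185.12.64.1/185.12.64.2 are
# Hetzner's recursive resolvers. Default route via the network gateway,
# which in turn routes the traffic to the NAT bastion:
cat > /etc/network/interfaces.d/61-private <<'EOF'
auto enp7s0
iface enp7s0 inet dhcp
    post-up ip route add default via 10.0.0.1
EOF

# Replace the systemd-resolved stub with Hetzner's DNS servers:
rm -f /etc/resolv.conf
cat > /etc/resolv.conf <<'EOF'
nameserver 185.12.64.1
nameserver 185.12.64.2
EOF
```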
I don't reboot, for install speed - but the route and DNS modifications are persistent over reboots.
There is a specific route for the Hetzner metadata endpoint at 169.254.169.254, which the autoscaler uses to find the instance name. Without that route, the autoscaler logs something like "is this a hetzner node?" (from the hcloud go lib) and does not include new nodes into the cluster.

The k3s binary is manually downloaded via wget and put into place (in the matching version). Then, in bashrc, the script gets INSTALL_K3S_SKIP_DOWNLOAD set to true. Reason: I kept getting failing downloads when triggered from the script - I'm sure it's not the fault of the script (it works perfectly with public IPs), so it must be due to this network config. With this it works reliably the second time you run it, after the first STDIN failure. Edit: see my following comment for the reason for this.
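A sketch of those two workarounds (the gateway and device for the metadata route are assumptions - use whatever tracepath shows as the first hop; the k3s version is the one from the logs above):

```bash
# Assumed: the metadata endpoint is reached via the private gateway
# 10.0.0.1 on interface enp7s0:
ip route add 169.254.169.254 via 10.0.0.1 dev enp7s0

# Pre-fetch the k3s binary in the matching version, then tell the k3s
# install script to skip its own download:
wget -qO /usr/local/bin/k3s \
  https://github.com/k3s-io/k3s/releases/download/v1.29.6+k3s2/k3s
chmod +x /usr/local/bin/k3s
echo 'export INSTALL_K3S_SKIP_DOWNLOAD=true' >> ~/.bashrc
```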
Thanks in any case for the nice installer, way easier than Terraform!