samcday / home-cluster

Cloud cluster #549

Closed by samcday 2 weeks ago

samcday commented 3 weeks ago

I can dare to dream that someday I'll have a residential internet connection able to squirt more than 5KB/s upstream at any given time. For now, I'm stuck with the shrimpy cable uplink that I have.

Whilst I can easily expose stuff publicly with both Cloudflare Tunnels and Tailscale Funnel, workloads that are essentially public-only, like Mastodon, would be better served running in a cloud environment with a proper 1 Gbit/s uplink and some scale headroom for the Slashdot effect.

Previously I dabbled with spanning my cluster across multiple environments. That is, I spun up workers in hcloud+OCI and connected them directly to my home cluster control plane. With Cilium on top of the tailnet L2 mesh stuff, this worked somewhat. It's way too easy to inadvertently slam my residential uplink though, if a workload in the cloud decides it wants to pull a bunch of data from a service running here at home.

This time around, I'll instead run a separate cloud cluster. The key trick will be to host the control plane on my home cluster, though.

Back at Hetzner I'd started crafting some Helm charts to build and manage k8s control planes. Sadly that stuff is probably rotting away in their internal GitLab somewhere, so I'll need to rebuild those (in the open, this time!).

To keep the maintenance low(er) I'll re-use the root etcd. I already switched it over to RBAC auth last night and set up some infra for this.
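One way to share a single etcd between control planes is to give each apiserver its own key prefix and an etcd user that can only touch that prefix. The sketch below just builds the `etcdctl` invocations for that. The cluster name `cloud`, the role/user naming scheme, and the prefix layout are all hypothetical, and nothing here is confirmed to match the actual setup in this repo.

```python
def etcd_tenant_commands(cluster: str) -> list[list[str]]:
    """Build etcdctl commands that confine one control plane to its own prefix.

    Assumption: one role and one user per cluster, named after the cluster.
    """
    prefix = f"/{cluster}/"
    role = f"{cluster}-apiserver"  # hypothetical naming scheme
    user = role
    return [
        ["etcdctl", "role", "add", role],
        # Read/write access limited to keys under the cluster's prefix.
        ["etcdctl", "role", "grant-permission", "--prefix=true", role, "readwrite", prefix],
        # Prompts for a password when run interactively.
        ["etcdctl", "user", "add", user],
        ["etcdctl", "user", "grant-role", user, role],
    ]


def apiserver_etcd_flags(cluster: str) -> list[str]:
    """kube-apiserver flag that moves its keys from /registry to the cluster prefix."""
    return [f"--etcd-prefix=/{cluster}"]


if __name__ == "__main__":
    for argv in etcd_tenant_commands("cloud"):
        print(" ".join(argv))
    print(" ".join(apiserver_etcd_flags("cloud")))
```

The commands would need to be run as an etcd user with the root role, and auth has to be enabled on the etcd cluster for the restriction to mean anything.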

DoD:

samcday commented 2 weeks ago

I've got a Helm chart that brings up a basic control plane now.

The metrics are being scraped and labelled in such a way that the new cluster already pops up in the cluster dropdown of the official Grafana mixin dashboards, and the Prometheus rules are already firing alerts on those metrics.

I am sure the alerts work because I already got a few alerts about apiserver client certificates expiring soon - I'd been experimenting with super short expiration (24h) certs, but I've pushed it out to 2 weeks now to hopefully stay clear of the warnings.
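A rough sketch of the arithmetic behind those warnings. It assumes the expiry alert warns when a client certificate has less than 7 days left and goes critical below 24 hours; those thresholds are an assumption about the alert rules in use, not something stated in this issue.

```python
from datetime import timedelta

# Assumed thresholds, not taken from this issue.
WARNING_THRESHOLD = timedelta(days=7)
CRITICAL_THRESHOLD = timedelta(hours=24)


def alert_level(remaining: timedelta) -> str:
    """Classify a certificate by how much validity it has left."""
    if remaining < CRITICAL_THRESHOLD:
        return "critical"
    if remaining < WARNING_THRESHOLD:
        return "warning"
    return "ok"


def quiet_window(lifetime: timedelta) -> timedelta:
    """How long after issuance a cert stays clear of the warning threshold."""
    return max(lifetime - WARNING_THRESHOLD, timedelta(0))


if __name__ == "__main__":
    for lifetime in (timedelta(hours=24), timedelta(weeks=2)):
        print(lifetime, alert_level(lifetime), quiet_window(lifetime))
```

Under these assumptions a 24h cert is inside the warning window from the moment it is issued, while a 2-week cert stays quiet for its first 7 days, so it only avoids the alert if it is rotated before then.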

samcday commented 2 weeks ago

https://redlib.samcday.com is running from the cloud cluster now.

The cluster autoscales compute and storage in hcloud. Since the controllers run in home-cluster, in theory I could scale to zero. My current scale unit is cax11, though, and those are very cheap, so I won't bother with this.

Ingress is through the usual nginx + Cloudflare Tunnels setup. The tunnel is provisioned via tofu+tf-controller. I opted for this setup instead of hcloud resources so that I can keep the nodes 100% firewalled and avoid extra load balancer costs. Also, anything that runs in the cloud environment might as well be fronted by Cloudflare too.

Yeah, firewalling. 100%, even SSH. Only the WireGuard UDP port is permitted. The first thing these nodes do in cloud-config is install Tailscale and log in. The control plane is only accessible via the tailnet.
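As an illustration of the bootstrap step, here is a minimal sketch that renders a cloud-config user-data document whose first commands install Tailscale and join the tailnet. The function name, the hostname, and the auth key value are hypothetical placeholders, and using the upstream install script is an assumption; the real cloud-config for this cluster may look quite different.

```python
import json


def render_cloud_config(auth_key: str, hostname: str) -> str:
    """Render cloud-config user-data that installs Tailscale and logs in.

    Commands are emitted as JSON strings, which are also valid YAML scalars.
    """
    runcmd = [
        # Upstream install script (assumed install method).
        "curl -fsSL https://tailscale.com/install.sh | sh",
        # Join the tailnet non-interactively with a pre-generated auth key.
        f"tailscale up --auth-key={auth_key} --hostname={hostname}",
    ]
    lines = ["#cloud-config", "runcmd:"]
    lines += [f"  - {json.dumps(cmd)}" for cmd in runcmd]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(render_cloud_config("tskey-auth-EXAMPLE", "cloud-worker-1"))
```

Note that the auth key ends up in the instance user-data in plain text, so a short-lived or single-use key is the safer choice with this approach.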

I'll steadily move over a couple of other workloads when I can be bothered or the need arises. I'll close this out as done.