Resiliency to upstream WAN failure

My trash residential cable provider is having issues yet again today - Internet has been down for nearly an hour so far. Their status page acknowledges the issue (yay) and indicates that a status update will be provided at 3pm, which is 4 hours from now (not yay!).

The cluster was unfortunately caught in the middle of rolling out an image update. Since I don't have Harbor (#140) yet to keep a hot cache of in-use images, I don't really expect such cluster operations to work without internet access.

One obvious issue is DNS. Currently, I access a mixture of tailnet-private/public-but-personal services, such as Miniflux and Navidrome, from .samcday.com addresses. These are public Cloudflare records managed by external-dns.

A possible solution here is to make the home-cluster router the upstream for most of my at-home devices, and then configure its dnsmasq to split-DNS and be authoritative for the appropriate .samcday.com hostnames. In that case, navidrome.samcday.com, whilst publicly available via Cloudflare tunnel in the default public DNS (so I can share album links), could/should/would resolve to the RFC1918 IPs of the cluster nodes.
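A rough sketch of what generating those split-DNS overrides could look like. It emits dnsmasq `address=/<domain>/<ip>` lines and refuses any target outside the RFC1918 ranges. The IP addresses below are placeholders I made up for illustration, not the real node or LB addresses, and the helper names are hypothetical.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC1918.
RFC1918 = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(ip: str) -> bool:
    """Return True only for IPv4 addresses inside an RFC1918 range."""
    addr = ipaddress.ip_address(ip)
    return addr.version == 4 and any(addr in net for net in RFC1918)


def dnsmasq_overrides(records: dict[str, str]) -> list[str]:
    """Build dnsmasq `address=` lines that pin hostnames to LAN IPs."""
    lines = []
    for host, ip in sorted(records.items()):
        if not is_rfc1918(ip):
            raise ValueError(f"{host}: {ip} is not an RFC1918 address")
        lines.append(f"address=/{host}/{ip}")
    return lines


if __name__ == "__main__":
    # Placeholder addresses, purely illustrative.
    overrides = {
        "navidrome.samcday.com": "192.168.1.10",
        "miniflux.samcday.com": "192.168.1.10",
    }
    print("\n".join(dnsmasq_overrides(overrides)))
```

Note that dnsmasq's `address=/domain/ip` form also answers for subdomains of the given name, which should be harmless here. Everything not listed keeps resolving via the normal upstream, so the public Cloudflare records stay untouched for anyone outside the house.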

DoD:

[x] Cluster functions correctly without internet access.
[x] Workloads can be rescheduled without internet access (implies #140 is done)
[x] Cluster services are accessible from the same network without Internet access.
[x] Cluster router uses mwan3 to automatically failover from primary upstream to secondary (LTE stick / tethered phone)
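To make the failover item concrete, here is a toy model of the decision I want from mwan3: among the uplinks currently passing their health checks, prefer the one with the lowest metric. This is not mwan3 config or its actual implementation, just the selection logic as I understand it. The `Uplink` type, the interface names, and the metrics are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Uplink:
    name: str     # hypothetical interface name
    metric: int   # lower metric means higher preference
    online: bool  # result of the most recent health check


def pick_uplink(uplinks: list[Uplink]) -> Optional[Uplink]:
    """Pick the preferred uplink among those currently online."""
    candidates = [u for u in uplinks if u.online]
    if not candidates:
        return None  # every uplink is down
    return min(candidates, key=lambda u: u.metric)


if __name__ == "__main__":
    # Cable is down, so traffic should fail over to the LTE stick.
    links = [
        Uplink("wan_cable", metric=1, online=False),
        Uplink("wan_lte", metric=2, online=True),
    ]
    chosen = pick_uplink(links)
    print(chosen.name if chosen else "no uplink available")
```

Once the primary passes its health checks again, the same rule moves traffic back to it, since it carries the lower metric.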

With all the work I've done on Harbor + Squid, a lot more bytes that are originally acquired from the internet are instead served from disk caches that reside in the cluster.

The cluster is still nowhere near the point where it operates satisfactorily in a no-internet/airgapped mode. That's not really my intent/focus (for now) with this project.

Vodafone Kabel Deutschland, with whom my contract is ending imminently, fumbled my internet access for a solid 2.5 days. It went down Saturday 9am, and did not come back up until 2pm Monday.

Despite that, I ended up keeping the cluster running okay enough via internet tethered from my phone. Turns out Telekom was running some kind of insane deal that gives everyone unlimited LTE data until mid-July because sportsball or something. Anyway. I availed myself of a couple of dozen GBs via LTE during that time.

If that happened again, I think there'd be a lot less data usage. All the image rebasing is now pulled through the Harbor/Squid caches.

I also rolled out Cilium L2 LB, and that's how just about all traffic in the cluster (web and the k8s API at the least) has been flowing for a few days now.

samcday / home-cluster

Resiliency to upstream WAN failure #489