This Terraform module creates a Kubernetes Cluster on Hetzner Cloud infrastructure running Ubuntu 22.04. The module aims to be simple to use while providing an out-of-the-box secure and maintainable setup. Thanks to Ubuntu's LTS version we get up to 5 years of peace and quiet before having to upgrade the cluster's operating system!
Terraform module published at: https://registry.terraform.io/modules/identiops/k3s/hcloud
What changed in the latest version? See CHANGELOG.md.
Required tools:
- terraform for provisioning the infrastructure (the generated scripts also rely on terraform output).
- bash for executing the generated scripts.
- jq for executing the generated scripts.
- kubectl for interacting with the Kubernetes cluster.
- ssh for connecting to cluster nodes.

Provide the Hetzner Cloud API tokens, either via environment variables or in terraform.tfvars:
# Either, enter your Hetzner Cloud API Token (it will be hidden)
read -sp "Hetzner Cloud API Token: " TF_VAR_hcloud_token
export TF_VAR_hcloud_token
read -sp "Hetzner Cloud API read only Token: " TF_VAR_hcloud_token_read_only
export TF_VAR_hcloud_token_read_only
# Or store the token in terraform.tfvars
touch terraform.tfvars
chmod 600 terraform.tfvars
cat >terraform.tfvars <<EOF
hcloud_token = "XYZ"
hcloud_token_read_only = "ABC"
EOF
Download the example configuration examples/1Region_3ControlPlane_3Worker_Nodes/main.tf:
curl -LO https://github.com/identiops/terraform-hcloud-k3s/raw/main/examples/1Region_3ControlPlane_3Worker_Nodes/main.tf
Adjust the settings in main.tf, e.g.:
- cluster_name
- default_location
- k3s_version
- ssh_keys (to create a new ssh key run: ssh-keygen -t ed25519)
- node_pools, e.g. the location of each pool.
- control_plane_k3s_additional_options, e.g. to tune etcd for higher latency between locations: --etcd-arg=heartbeat-interval=120 --etcd-arg=election-timeout=1200. Measurements between Falkenstein, Nuremberg and Helsinki: I measured a latency of 0.7ms (within the Nuremberg region), 3ms (Nuremberg -> Falkenstein), and 24ms (Nuremberg -> Helsinki).
- network_zone, which should be adjusted to match the selected locations.

Initialize and apply the configuration:
terraform init
terraform apply
Check the installation progress and set up local access to the cluster:
./ssh-node cluster
cloud-init status
kubectl cluster-info
./setkubeconfig
./ssh-node gateway
kubectl get nodes
Enjoy your new cluster! 🚀
Start using your favorite Kubernetes tools to interact with the cluster. One of the first steps usually involves deploying an ingress controller since this configuration doesn't ship one.
In addition, a few convenience scripts were created to help with maintenance:
- setkubeconfig: retrieves and stores the Kubernetes configuration locally.
- unsetkubeconfig: removes the cluster from the local Kubernetes configuration.
- ls-nodes: lists the nodes that are part of the cluster for access via ssh-node and scp-node.
- ssh-node: SSH wrapper for connecting to cluster nodes.
- scp-node: SCP wrapper for copying files to and from cluster nodes.
- .ssh/config: SSH configuration for connecting to cluster nodes.
- .ansible/hosts: Ansible hosts configuration for executing commands on multiple nodes in parallel.

This module automatically generates an Ansible inventory in the file .ansible/hosts. It can be leveraged to interact with the nodes and node pools of the cluster.
Example: Execute a command on all control plane nodes
ANSIBLE_INVENTORY="$PWD/.ansible/hosts" ansible all_control_plane_nodes -a "kubectl cluster-info"
Since this module doesn't ship an ingress controller, one of the first configurations applied to the cluster is usually an ingress controller. Any of the widely used controllers (e.g. ingress-nginx or Traefik) is a good starting point.
The ingress controller, like the rest of the cluster, is not directly exposed to the Internet. Therefore, it is necessary to add a load balancer that is directly exposed to the Internet and has access to the local network of the cluster. The load balancer is added to the cluster simply by adding annotations to the ingress controller's service. Hetzner's Cloud Controller Manager will use the annotations to deploy and configure the load balancer.
The following annotations should be used:
- load-balancer.hetzner.cloud/name: "ingress-lb"
- Set the protocol to tcp - the ingress controller will take care of the HTTP connection: load-balancer.hetzner.cloud/protocol: "tcp"
- load-balancer.hetzner.cloud/location: "nbg1"
- Set use-private-ip to true: load-balancer.hetzner.cloud/use-private-ip: "true"
- load-balancer.hetzner.cloud/type: "lb11"
Furthermore, for domain names to work, it is required to point DNS records to the IP address of the load balancer. external-dns is a helpful tool that can automate this task from within the cluster. For this to work well with Ingress resources, the ingress controller needs to publish its service information on the Ingress resources, so that the load balancer address appears in the Ingress status.
The number of nodes in a node pool can be increased at any point. Just increase the count and apply the new configuration via terraform apply. After a few minutes the additional nodes will appear in the cluster. In the same way, node pools can be added to the configuration without any precaution.
Removing nodes requires the following steps:
- Determine the nodes that will be removed: the nodes with the highest index of a pool are removed first. For example, if the count of node pool system is decreased from 3 to 2, node cluster-system-02 will be removed and nodes cluster-system-01 and cluster-system-00 will remain.
- Drain the node that will be removed: kubectl drain cluster-system-02 (to revert this step, run kubectl uncordon cluster-system-02).
- Decrease the count of the node pool and apply the configuration: terraform apply
- Remove the node from Kubernetes: kubectl delete node cluster-system-02
Nodes reboot automatically when they receive updates that require a reboot. The kured service triggers reboots of the nodes one by one. Reboots can be disabled system-wide by annotating the kured DaemonSet, see https://kured.dev/docs/operation/.
WARNING: untested!
An in-place operating system upgrade, e.g. from Ubuntu 22.04 to 24.04, is not recommended. Instead, the corresponding nodes should be replaced!
To replace the gateway:
- Update the default_image setting. Attention: before changing the default image, make sure that each node pool has its own appropriate image setting.
- Apply the configuration: terraform apply
The gateway will reappear within a few minutes. This will disrupt the Internet access of the cluster's nodes for tasks like fetching package updates. However, it will not affect the services that are provided via load balancers!
After redeploying the gateway, ssh connections to it will fail because a new cryptographic host key has been generated for the node. Delete the deprecated key from the .ssh/known_hosts file (e.g. with ssh-keygen -R), open a new ssh connection and accept the new public key.
Nodes should not be updated manually via apt-get, but be replaced. For control plane nodes, it is recommended to create a backup of the etcd datastore on an external S3 storage, see k3s Cluster Datastore.
Start the replacement with the node pool that carries the cluster_can_init setting:
- Create a temporary control plane node pool without the cluster_can_init setting.
- The node pool with the cluster_can_init setting must be deleted and replaced in one application of the configuration.
- Make sure the cluster_init_action.init and cluster_init_action.reset settings are disabled.
- Drain the nodes of the pool: kubectl drain node-xyz
- Delete the drained nodes from Kubernetes: kubectl delete node node-xyz
- Apply the configuration: terraform apply
- Once the node pool with the cluster_can_init setting is up and running again, the temporary control plane node pool can be deleted.

Perform these steps for all remaining node pools:
- Update the pool's image setting to the new version.
- Drain the nodes of the pool: kubectl drain node-xyz
- Delete the drained nodes from Kubernetes: kubectl delete node node-xyz
- Apply the configuration: terraform apply
- Update the image variable in the configuration for future nodes to be deployed with the correct image.

The pre-installed components can be upgraded individually via helm:

Cilium:
helm repo add cilium https://helm.cilium.io/
helm repo update
helm upgrade --reuse-values cilium cilium/cilium -n kube-system --version '<NEW_VERSION>'
values.yaml:
# Documentation: https://artifacthub.io/packages/helm/cilium/cilium
# WARNING: needs to be in line with the cluster configuration
routingMode: native
ipv4NativeRoutingCIDR: 10.0.0.0/8
ipam:
  operator:
    clusterPoolIPv4PodCIDRList: 10.244.0.0/16
k8sServiceHost: 10.0.1.1
k8sServicePort: "6443"
operator:
  replicas: 2
Hetzner Cloud Controller Manager:
helm repo add hcloud https://charts.hetzner.cloud
helm repo update
helm upgrade --reuse-values hcloud-ccm hcloud/hcloud-cloud-controller-manager -n kube-system --version '<NEW_VERSION>'
values.yaml:
# Documentation: https://github.com/hetznercloud/hcloud-cloud-controller-manager/tree/main/chart
# WARNING: needs to be in line with the cluster configuration
networking:
  enabled: true
  clusterCIDR: 10.244.0.0/16
additionalTolerations:
  # INFO: this taint occurred but isn't covered by default and caused the
  # whole cluster to not start properly
  - key: node.kubernetes.io/not-ready
    effect: NoSchedule
Hetzner CSI driver:
helm repo add hcloud https://charts.hetzner.cloud
helm repo update
helm upgrade --reuse-values hcloud-csi hcloud/hcloud-csi -n kube-system --version '<NEW_VERSION>'
values.yaml:
# Documentation: https://github.com/hetznercloud/csi-driver/tree/main/chart
storageClasses:
  - name: hcloud-volumes
    defaultStorageClass: true
    retainPolicy: Retain
kured:
helm repo add kubereboot https://kubereboot.github.io/charts
helm repo update
helm upgrade --reuse-values kured kubereboot/kured -n kube-system --version '<NEW_VERSION>'
values.yaml:
# Documentation: https://artifacthub.io/packages/helm/kured/kured
configuration:
  timeZone: Europe/Berlin
  startTime: 1am
  endTime: 5am
  rebootDays:
    - mo
    - tu
    - we
    - th
    - fr
    - sa
    - su
tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
metrics-server:
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo update
helm upgrade --reuse-values metrics-server metrics-server/metrics-server -n kube-system --version '<NEW_VERSION>'
values.yaml:
# Documentation: https://artifacthub.io/packages/helm/metrics-server/metrics-server
System Upgrade Controller:
helm repo add rancher https://charts.rancher.io
helm repo update
helm upgrade --reuse-values system-upgrade-controller rancher/system-upgrade-controller -n cattle-system --version '<NEW_VERSION>'
values.yaml:
# Documentation: https://github.com/rancher/system-upgrade-controller
# Documentation: https://github.com/rancher/charts/tree/dev-v2.9/charts/system-upgrade-controller
global:
  cattle:
    psp:
      enabled: false
After applying the Terraform plan you'll see several output variables like the load balancer's, control plane's, and node pools' IP addresses.
terraform destroy
Be sure to clean up any CSI-created Volumes and CCM-created Load Balancers that you no longer require.
Ensure the gateway is set up correctly: ./ssh-node gateway
iptables -L -t nat
# Expected output:
# Chain PREROUTING (policy ACCEPT)
# target prot opt source destination
#
# Chain INPUT (policy ACCEPT)
# target prot opt source destination
#
# Chain OUTPUT (policy ACCEPT)
# target prot opt source destination
#
# Chain POSTROUTING (policy ACCEPT)
# target prot opt source destination
# MASQUERADE all -- 10.0.1.0/24 anywhere
ufw status
# Expected output:
# Status: active
#
# To Action From
# -- ------ ----
# 22,6443/tcp ALLOW Anywhere
# 22,6443/tcp (v6) ALLOW Anywhere (v6)
#
# Anywhere on eth0 ALLOW FWD Anywhere on ens10
# Anywhere (v6) on eth0 ALLOW FWD Anywhere (v6) on ens10
ufw status
date
echo $LANG
# Retrieve status
cloud-init status
# Verify configuration
cloud-init schema --system
# Collect logs for inspection
cloud-init collect-logs
tar xvzf cloud-init.tar.gz
# Inspect cloud-init.log for error messages
# Quickly find runcmd
find /var/lib/cloud/instances -name runcmd
sh -ex PATH_TO_RUNCMD
Ensure the cluster is set up correctly: ./ssh-node cluster
ip r s
# Expected output:
# default via 10.0.0.1 dev ens10 proto static onlink <-- this is the important line
# 10.0.0.0/8 via 10.0.0.1 dev ens10 proto dhcp src 10.0.1.2 metric 1024
# 10.0.0.1 dev ens10 proto dhcp scope link src 10.0.1.2 metric 1024
# 169.254.169.254 via 10.0.0.1 dev ens10 proto dhcp src 10.0.1.2 metric 1024
ping 1.1.1.1
# Expected output:
# PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
# 64 bytes from 1.1.1.1: icmp_seq=1 ttl=53 time=4.60 ms
# 64 bytes from 1.1.1.1: icmp_seq=2 ttl=53 time=6.82 ms
# ...
host k3s.io
# Expected output:
# k3s.io has address 185.199.108.153
# k3s.io has address 185.199.110.153
# k3s.io has address 185.199.111.153
# k3s.io has address 185.199.109.153
# ...
k3s kubectl get nodes
# Expected output:
# a list of all cluster nodes, each with STATUS "Ready"
This command only works after installing the cilium cli.
cilium status
# Expected output:
# /¯¯\
# /¯¯\__/¯¯\ Cilium: OK
# \__/¯¯\__/ Operator: OK
# /¯¯\__/¯¯\ Envoy DaemonSet: disabled (using embedded mode)
# \__/¯¯\__/ Hubble Relay: disabled
# \__/ ClusterMesh: disabled
#
# Deployment cilium-operator Desired: 1, Ready: 1/1, Available: 1/1
# DaemonSet cilium Desired: 3, Ready: 3/3, Available: 3/3
# Containers: cilium Running: 3
# cilium-operator Running: 1
# Cluster Pods: 9/9 managed by Cilium
# Helm chart version: 1.14.5
# Image versions cilium quay.io/cilium/cilium:v1.14.5@sha256:d3b287029755b6a47dee01420e2ea469469f1b174a2089c10af7e5e9289ef05b: 3
# cilium-operator quay.io/cilium/operator-generic:v1.14.5@sha256:303f9076bdc73b3fc32aaedee64a14f6f44c8bb08ee9e3956d443021103ebe7a: 1
This command only works out of the box on the first node of the control plane
node pool with the cluster_can_init
setting.
k3s check-config
# Expected output:
# ...
# STATUS: pass
systemctl status k3s.service
journalctl -u k3s.service