kube-hetzner / terraform-hcloud-kube-hetzner

Optimized and Maintenance-free Kubernetes on Hetzner Cloud in one command!
MIT License

Autoscaler not scaling #537

Closed AlexProgrammerDE closed 1 year ago

AlexProgrammerDE commented 1 year ago

Hello! I've been trying out this project and haven't managed to get autoscaling working. When my pods try to deploy, I'm told that only two nodes are available: one is the control plane and the other is the first agent. My config uses these values:

  control_plane_nodepools = [
    {
      name        = "control-plane-nbg1",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "agent-large",
      server_type = "cx51",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]
  # Add custom control plane configuration options here.
  # E.g to enable monitoring for etcd, proxy etc:
  control_planes_custom_config = {
    etcd-expose-metrics         = true,
    kube-controller-manager-arg = "bind-address=0.0.0.0",
    kube-proxy-arg              = "metrics-bind-address=0.0.0.0",
    kube-scheduler-arg          = "bind-address=0.0.0.0",
  }

  # * LB location and type, the latter will depend on how much load you want it to handle, see https://www.hetzner.com/cloud/load-balancer
  load_balancer_type     = "lb11"
  load_balancer_location = "nbg1"

  # Cluster Autoscaler
  # Providing at least one map for the array enables the cluster autoscaler feature, default is disabled
  # Please note that the autoscaler should not be used with initial_k3s_channel < "v1.25". So ideally lock it to "v1.25".
  # * Example below:
  autoscaler_nodepools = [
    {
      name        = "autoscaler"
      server_type = "cpx11" # must be same or better than the control_plane server type (regarding disk size)!
      location    = "nbg1"
      min_nodes   = 1
      max_nodes   = 5
    }
  ]

These are the currently deployed servers: [image]

It appears that the server for autoscaling is missing. I am using the latest version, 1.8.1. The Terraform config applies without any errors:

module.kube-hetzner.data.github_release.hetzner_csi[0]: Reading...
module.kube-hetzner.data.github_release.hetzner_ccm[0]: Reading...
module.kube-hetzner.data.github_release.kured[0]: Reading...
module.kube-hetzner.random_password.k3s_token: Refreshing state... [id=none]
module.kube-hetzner.random_password.rancher_bootstrap[0]: Refreshing state... [id=none]
module.kube-hetzner.data.hcloud_servers.autoscaled_nodes["autoscaler"]: Reading...
module.kube-hetzner.hcloud_placement_group.agent[0]: Refreshing state... [id=118111]
module.kube-hetzner.hcloud_ssh_key.k3s[0]: Refreshing state... [id=9851167]
module.kube-hetzner.data.hcloud_ssh_keys.keys_by_selector[0]: Reading...
module.kube-hetzner.hcloud_network.k3s: Refreshing state... [id=2442188]
module.kube-hetzner.hcloud_placement_group.control_plane[0]: Refreshing state... [id=118112]
module.kube-hetzner.hcloud_firewall.k3s: Refreshing state... [id=693380]
module.kube-hetzner.data.hcloud_ssh_keys.keys_by_selector[0]: Read complete after 0s [id=xxx]
module.kube-hetzner.data.hcloud_servers.autoscaled_nodes["autoscaler"]: Read complete after 0s [id=da39a3ee5e6b4b0d3255bfef95601890afd80709]
module.kube-hetzner.data.github_release.hetzner_csi[0]: Read complete after 0s [id=89711362]
module.kube-hetzner.data.github_release.kured[0]: Read complete after 0s [id=86062001]
module.kube-hetzner.data.github_release.hetzner_ccm[0]: Read complete after 1s [id=79020566]
module.kube-hetzner.hcloud_network_subnet.control_plane[0]: Refreshing state... [id=2442188-10.255.0.0/16]
module.kube-hetzner.hcloud_network_subnet.agent[0]: Refreshing state... [id=2442188-10.0.0.0/16]
module.kube-hetzner.module.agents["0-0-agent-large"].random_string.identity_file: Refreshing state... [id=c1ayyaic8t1bu9ehqsrk]
module.kube-hetzner.module.agents["0-0-agent-large"].random_string.server: Refreshing state... [id=abo]
module.kube-hetzner.module.control_planes["0-0-control-plane-nbg1"].random_string.server: Refreshing state... [id=zew]
module.kube-hetzner.module.control_planes["0-0-control-plane-nbg1"].random_string.identity_file: Refreshing state... [id=nigvg3zcta1ujdo5ttpp]
module.kube-hetzner.module.agents["0-0-agent-large"].data.cloudinit_config.config: Reading...
module.kube-hetzner.module.agents["0-0-agent-large"].data.cloudinit_config.config: Read complete after 0s [id=2073267425]
module.kube-hetzner.module.agents["0-0-agent-large"].hcloud_server.server: Refreshing state... [id=27972432]
module.kube-hetzner.module.control_planes["0-0-control-plane-nbg1"].data.cloudinit_config.config: Reading...
module.kube-hetzner.module.control_planes["0-0-control-plane-nbg1"].data.cloudinit_config.config: Read complete after 0s [id=280194929]
module.kube-hetzner.module.control_planes["0-0-control-plane-nbg1"].hcloud_server.server: Refreshing state... [id=27972433]
module.kube-hetzner.module.agents["0-0-agent-large"].null_resource.registries: Refreshing state... [id=7315229217177627871]
module.kube-hetzner.module.agents["0-0-agent-large"].hcloud_server_network.server: Refreshing state... [id=27972432-2442188]
module.kube-hetzner.module.agents["0-0-agent-large"].hcloud_rdns.server[0]: Refreshing state... [id=s-27972432-x.x.x.x]
module.kube-hetzner.module.control_planes["0-0-control-plane-nbg1"].hcloud_rdns.server[0]: Refreshing state... [id=s-27972433-x.x.x.x]
module.kube-hetzner.module.control_planes["0-0-control-plane-nbg1"].hcloud_server_network.server: Refreshing state... [id=27972433-2442188]
module.kube-hetzner.module.control_planes["0-0-control-plane-nbg1"].null_resource.registries: Refreshing state... [id=6423485608408018346]
module.kube-hetzner.null_resource.first_control_plane: Refreshing state... [id=7070802316778084546]
module.kube-hetzner.hcloud_snapshot.autoscaler_image[0]: Refreshing state... [id=97394711]
module.kube-hetzner.data.cloudinit_config.autoscaler-config[0]: Reading...
module.kube-hetzner.null_resource.kustomization: Refreshing state... [id=2641344090826294402]
module.kube-hetzner.null_resource.agents["0-0-agent-large"]: Refreshing state... [id=9218326134874302650]
module.kube-hetzner.null_resource.control_planes["0-0-control-plane-nbg1"]: Refreshing state... [id=8516690779649704765]
module.kube-hetzner.data.cloudinit_config.autoscaler-config[0]: Read complete after 0s [id=3791662826]
module.kube-hetzner.data.hcloud_load_balancer.cluster[0]: Reading...
module.kube-hetzner.data.remote_file.kustomization_backup: Reading...
module.kube-hetzner.null_resource.destroy_cluster_loadbalancer: Refreshing state... [id=4664514048567351556]
module.kube-hetzner.data.remote_file.kubeconfig: Reading...
module.kube-hetzner.data.hcloud_load_balancer.cluster[0]: Read complete after 0s [name=horizon]
module.kube-hetzner.null_resource.configure_autoscaler[0]: Refreshing state... [id=7199351183629787419]
module.kube-hetzner.data.remote_file.kustomization_backup: Read complete after 1s [id=x.x.x.x:22:/var/post_install/kustomization.yaml]
module.kube-hetzner.data.remote_file.kubeconfig: Read complete after 1s [id=x.x.x.x:22:/etc/rancher/k3s/k3s.yaml]
module.kube-hetzner.local_sensitive_file.kubeconfig[0]: Refreshing state... [id=1765eebe4cd1e614b91891f952a5fc68a31003d3]
module.kube-hetzner.local_file.kustomization_backup[0]: Refreshing state... [id=62c320e17ab5f99388d5b1b2c5a17cfa7564e67d]

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Outputs:

kubeconfig = <sensitive>

These are the logs of cluster-autoscaler: [image]

And here is the error for the pods that can't start: [image]

It appears that the autoscaler server was not deployed, so no new nodes are started. I am using a HorizontalPodAutoscaler, but setting replicas manually on the deployment didn't work either. I know that when I first tried this project, there was a machine with "autoscaler" in its name. With the newest versions, that machine just doesn't exist. There are no errors, and Kubernetes finds no way of deploying more nodes. What could be the issue?

captnCC commented 1 year ago

@mysticaltech Both are cpx11 instances, so it should work in theory 🤔

The limited logs give me the impression that the autoscaler thinks there is no way to scale up the cluster.

AlexProgrammerDE commented 1 year ago

The entire autoscaler machine isn't even deployed to hcloud. I understood the autoscaler machine as a type of manager for deploying more nodes to the cluster? It doesn't appear to exist. The snapshot exists though.

mysticaltech commented 1 year ago

Yes @captnCC, when I realized that, I deleted my comment.

@AlexProgrammerDE Are you actually defining workload requirements in your pods and/or deployments? Because this is a must. See https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/pull/353#issue-1411712748

FYI @ifeulner.

AlexProgrammerDE commented 1 year ago

@mysticaltech yes, here is the Kubernetes manifest I made:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: multipaper
  labels:
    app: multipaper
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  revisionHistoryLimit: 0
  selector:
    matchLabels:
        app: multipaper
  template:
    metadata:
      labels:
          app: multipaper
    spec:
      imagePullSecrets:
        - name: regcred
      volumes:
        - name: mp-files
          nfs:
            server: 10.255.0.2
            path: /nfs/multipaper
            readOnly: false
      terminationGracePeriodSeconds: 10
      containers:
        - name: multipaper
          image: alexprogrammerde/multipaper-kubernetes:multipaper
          tty: true
          imagePullPolicy: Always
          resources:
            requests:
              memory: "16Gi"
              cpu: "4"
            limits:
              memory: "16Gi"
              cpu: "8"
          volumeMounts:
            - name: mp-files
              mountPath: /files
          ports:
            - name: game-port
              containerPort: 25565
          readinessProbe:
            exec:
              command:
                - mcstatus
                - 127.0.0.1
                - ping
            initialDelaySeconds: 30
            periodSeconds: 30
          livenessProbe:
            exec:
              command:
                - mcstatus
                - 127.0.0.1
                - ping
            initialDelaySeconds: 30
            periodSeconds: 30
          env:
            - name: KN_POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: KN_POD_IP
              value: 0.0.0.0
            - name: MASTER_ADDRESS
              value: "10.255.0.2:35353"
            - name: USE_RAM
              value: "14G"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: multipaper-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: multipaper
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50

AlexProgrammerDE commented 1 year ago

I have another pod running that uses about 80% of the resources on a node. The autoscaler just doesn't seem to create a new node to host the replicas.

mysticaltech commented 1 year ago

@ifeulner @captnCC If you guys have any ideas on what might be happening it would be awesome, as I do not have much experience with the autoscaler feature myself 🙏

AlexProgrammerDE commented 1 year ago

Here is my full kube.tf file for reproducing:

kube.tf

````tf
module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = local.hcloud_token

  # Then fill or edit the below values. Only the first values starting with a * are obligatory; the rest can remain with their default values, or you
  # could adapt them to your needs.

  # * For local dev, path to the git repo
  # source = "../../kube-hetzner/"
  # If you want to use the latest master branch
  # source = "github.com/kube-hetzner/terraform-hcloud-kube-hetzner"
  # For normal use, this is the path to the terraform registry
  source = "kube-hetzner/kube-hetzner/hcloud"
  # you can optionally specify a version number
  # version = "1.2.0"

  # Note that some values, notably "location" and "public_key" have no effect after initializing the cluster.
  # This is to keep Terraform from re-provisioning all nodes at once, which would lose data. If you want to update
  # those, you should instead change the value here and manually re-provision each node. Grep for "lifecycle".

  # Customize the SSH port (by default 22)
  # ssh_port = 2222

  # * Your ssh public key
  ssh_public_key = file("/home/alex/.ssh/id_horizon.pub")
  # * Your private key must be "ssh_private_key = null" when you want to use ssh-agent for a Yubikey-like device authentification or an SSH key-pair with a passphrase.
  # For more details on SSH see https://github.com/kube-hetzner/kube-hetzner/blob/master/docs/ssh.md
  ssh_private_key = file("/home/alex/.ssh/id_horizon")
  # You can add additional SSH public Keys to grant other team members root access to your cluster nodes.
  # ssh_additional_public_keys = []

  # You can also add additional SSH public Keys which are saved in the hetzner cloud by a label.
  # See https://docs.hetzner.cloud/#label-selector
  ssh_hcloud_key_label = "role=admin"

  # If you want to use an ssh key that is already registered within hetzner cloud, you can pass its id.
  # If no id is passed, a new ssh key will be registered within hetzner cloud.
  # It is important that exactly this key is passed via `ssh_public_key` & `ssh_private_key` vars.
  # hcloud_ssh_key_id = ""

  # These can be customized, or left with the default values
  # * For Hetzner locations see https://docs.hetzner.com/general/others/data-centers-and-connection/
  network_region = "eu-central" # change to `us-east` if location is ash

  # For the control planes, at least three nodes are the minimum for HA. Otherwise, you need to turn off the automatic upgrade (see ReadMe).
  # As per Rancher docs, it must always be an odd number, never even! See https://rancher.com/docs/k3s/latest/en/installation/ha-embedded/
  # For instance, one is ok (non-HA), two is not ok, and three is ok (becomes HA). It does not matter if they are in the same nodepool or not! So they can be in different locations and of various types.

  # Of course, you can choose any number of nodepools you want, with the location you want. The only constraint on the location is that you need to stay in the same network region, Europe, or the US.
  # For the server type, the minimum instance supported is cpx11 (just a few cents more than cx11); see https://www.hetzner.com/cloud.

  # IMPORTANT: Before you create your cluster, you can do anything you want with the nodepools, but you need at least one of each control plane and agent.
  # Once the cluster is up and running, you can change nodepool count and even set it to 0 (in the case of the first control-plane nodepool, the minimum is 1),
  # you can also rename it (if the count is 0), but do not remove a nodepool from the list.

  # The only nodepools that are safe to remove from the list when you edit it are at the end of the lists. That is due to how subnets and IPs get allocated (FILO).
  # You can, however, freely add other nodepools at the end of each list if you want! The maximum number of nodepools you can create combined for both lists is 255.
  # Also, before decreasing the count of any nodepools to 0, it's essential to drain and cordon the nodes in question. Otherwise, it will leave your cluster in a bad state.

  # Before initializing the cluster, you can change all parameters and add or remove any nodepools. You need at least one nodepool of each kind, control plane, and agent.
  # The nodepool names are entirely arbitrary, you can choose whatever you want, but no special characters or underscore, and they must be unique; only alphanumeric characters and dashes are allowed.

  # If you want to have a single node cluster, have one control plane nodepools with a count of 1, and one agent nodepool with a count of 0.

  # Please note that changing labels and taints after the first run will have no effect. If needed, you will need to do that through Kubernetes directly.

  # * Example below:

  control_plane_nodepools = [
    {
      name        = "control-plane-nbg1",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "agent-large",
      server_type = "cx51",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]
  # Add custom control plane configuration options here.
  # E.g to enable monitoring for etcd, proxy etc:
  control_planes_custom_config = {
    etcd-expose-metrics         = true,
    kube-controller-manager-arg = "bind-address=0.0.0.0",
    kube-proxy-arg              = "metrics-bind-address=0.0.0.0",
    kube-scheduler-arg          = "bind-address=0.0.0.0",
  }

  # * LB location and type, the latter will depend on how much load you want it to handle, see https://www.hetzner.com/cloud/load-balancer
  load_balancer_type     = "lb11"
  load_balancer_location = "nbg1"

  ### The following values are entirely optional (and can be removed from this if unused)

  # You can refine a base domain name to be use in this form of nodename.base_domain for setting the reserve dns inside Hetzner
  base_domain = "horizon.example.com" # I replaced this

  # Cluster Autoscaler
  # Providing at least one map for the array enables the cluster autoscaler feature, default is disabled
  # Please note that the autoscaler should not be used with initial_k3s_channel < "v1.25". So ideally lock it to "v1.25".
  # * Example below:
  autoscaler_nodepools = [
    {
      name        = "autoscaler"
      server_type = "cpx11" # must be same or better than the control_plane server type (regarding disk size)!
      location    = "nbg1"
      min_nodes   = 1
      max_nodes   = 5
    }
  ]

  # Enable etcd snapshot backups to S3 storage.
  # Just provide a map with the needed settings (according to your S3 storage provider) and backups to S3 will
  # be enabled (with the default settings for etcd snapshots).
  # For proper context, please have a look at https://docs.k3s.io/backup-restore.
  # etcd_s3_backup = {
  #   etcd-s3-endpoint = "xxxx.r2.cloudflarestorage.com"
  #   etcd-s3-access-key = ""
  #   etcd-s3-secret-key = ""
  #   etcd-s3-bucket = "k3s-etcd-snapshots"
  # }

  # To use local storage on the nodes, you can enable Longhorn, default is "false".
  # See a full recap on how to configure agent nodepools for longhorn here https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/discussions/373#discussioncomment-3983159
  # enable_longhorn = true

  # By default, longhorn is pulled from https://charts.rancher.io which assures compatibility with rancher.
  # If you need a newer version of longhorn you can set this variable to https://charts.longhorn.io.
  # longhorn_repository = "https://charts.rancher.io"

  # The namespace for longhorn deployment, default is "longhorn-system"
  # longhorn_namespace = "longhorn-system"

  # The file system type for Longhorn, if enabled (ext4 is the default, otherwise you can choose xfs)
  # longhorn_fstype = "xfs"

  # how many replica volumes should longhorn create (default is 3)
  # longhorn_replica_count = 1

  # When you enable Longhorn, you can go with the default settings and just modify the above two variables OR you can add a longhorn_values variable
  # with all needed helm values, see towards the end of the file in the advanced section.
  # If that file is present, the system will use it during the deploy, if not it will use the default values with the two variable above that can be customized.
  # After the cluster is deployed, you can always use HelmChartConfig definition to tweak the configuration.

  # Also, you choose to create a hetzner volume to be used with Longhorn. By default, it will use the nodes own storage space, BUT if you an attribute of
  # longhorn_volume_size (⚠️ not a variable, just a possible agent nodepool attribute) with a value of 10 to 10000 GB to your agent nodepool definition, it will create and use the volume in question.
  # See the agent nodepool section for an example of how to do that.

  # To disable Hetzner CSI storage, you can set the following to true, default is "false".
  # disable_hetzner_csi = true

  # If you want to use a specific Hetzner CCM and CSI version, set them below; otherwise, leave them as-is for the latest versions.
  # hetzner_ccm_version = ""
  # hetzner_csi_version = ""

  # If you want to specify the Kured version, set it below - otherwise it'll use the latest version available.
  # kured_version = ""

  # If you want to enable the Nginx ingress controller (https://kubernetes.github.io/ingress-nginx/) instead of Traefik, you can set this to "true". Default is "false".
  # FOR THIS TO NOT BE IGNORED, you also need to set "enable_traefik = false".
  # By the default we load an optimal Nginx ingress controller config for Hetzner, however you may need to tweak it to your needs, so to do,
  # we allow you to add a nginx_ingress_values, see towards the end of this file in the advanced section.
  # After the cluster is deployed, you can always use HelmChartConfig definition to tweak the configuration.
  # enable_nginx = true

  # If you want to disable the Traefik ingress controller, to use the Nginx ingress controller for instance, you can can set this to "false". Default is "true".
  # enable_traefik = false

  # Use the klipper LB (similar to metalLB), instead of the default Hetzner one, that has an advantage of dropping the cost of the setup.
  # Automatically "true" in the case of single node cluster (as it does not make sense to use the Hetzner LB in that situation).
  # It can work with any ingress controller that you choose to deploy.
  # Please note that because the klipper lb points to all nodes, we automatically allow scheduling on the control plane when it is active.
  # enable_klipper_metal_lb = "true"

  # We give you the possibility to use letsencrypt directly with Traefik because it's an easy setup, however it's not optimal,
  # as the free version of Traefik causes a little bit of downtime when when the certificates get renewed. For proper SSL management,
  # we instead recommend you to use cert-manager, that you can easily deploy with helm; see https://cert-manager.io/.
  # traefik_acme_tls = true
  # traefik_acme_email = "mail@example.com"

  # If you want to configure additional Arguments for traefik, enter them here as a list and in the form of traefik CLI arguments; see https://doc.traefik.io/traefik/reference/static-configuration/cli/
  # They are the options that go into the additionalArguments section of the Traefik helm values file.
  # Example: traefik_additional_options = ["--log.level=DEBUG", "--tracing=true"]
  # traefik_additional_options = []

  # If you want to disable the metric server, you can! Default is "true".
  enable_metrics_server = true

  # If you want to allow non-control-plane workloads to run on the control-plane nodes, set "true" below. The default is "false".
  # True by default for single node clusters, and when enable_klipper_metal_lb is true. In those cases, the value below will be ignored.
  # allow_scheduling_on_control_plane = true

  # If you want to disable the automatic upgrade of k3s, you can set this to false.
  # Ideally, keep it on, to always have the latest and greatest Kubernetes version, but lock the initial_k3s_channel to a kube major version,
  # of your choice, like v1.24 or v1.25. That way you get the best of both worlds without the breaking changes risk.
  # The default is "true" (If you are in HA i.e. at least 3 control plane nodes & 2 agents, just keep it, it works great!)
  automatically_upgrade_k3s = false

  # For non-HA clusters i.e. when the number of control-plane nodes is < 3, you have to turn it off.
  # Ideally, for production use, always use an HA setup with at least 3 control-plane nodes and 2 agents, and keep this on for max security.
  # The default is "true" (in HA it works wonderfully well, with automatically roll-back to the previous snapshot in case of an issue).
  automatically_upgrade_os = false

  # If you need more control over kured and the reboot behaviour, you can pass additional options to kured.
  # For example limiting reboots to certain timeframes. For all options see: https://kured.dev/docs/configuration/
  # The default options are: `--reboot-command=/usr/bin/systemctl reboot --pre-reboot-node-labels=kured=rebooting --post-reboot-node-labels=kured=done --period=5m`
  # Defaults can be overridden by using the same key.
  # kured_options = {
  #   "reboot-days": "su"
  #   "start-time": "9am"
  #   "end-time": "5pm"
  # }

  # Allows you to specify either stable, latest, testing or supported minor versions (defaults to stable)
  # see https://rancher.com/docs/k3s/latest/en/upgrades/basic/ and https://update.k3s.io/v1-release/channels
  # ⚠️ If you are going to use Rancher addons for instance, it's always a good idea to fix the kube version to latest - 0.01,
  # at the time of writing the latest is v1.25, so setting the value below to "v1.24" will insure maximum compatibility with Rancher, Longhorn and so on!
  # The default is "v1.24".
  # initial_k3s_channel = "stable"

  # The cluster name, by default "k3s"
  cluster_name = "horizon"

  # Whether to use the cluster name in the node name, in the form of {cluster_name}-{nodepool_name}, the default is "true".
  # use_cluster_name_in_node_name = false

  # Extra k3s registries. This is useful if you have private registries and you
  # want to pull images without additional secrets.
  # registries.yaml file docs: https://docs.k3s.io/installation/private-registry
  /* k3s_registries = <<-EOT
    mirrors:
      hub.my_registry.com:
        endpoint:
          - "hub.my_registry.com"
    configs:
      hub.my_registry.com:
        auth:
          username: username
          password: password
  EOT */

  # Adding extra firewall rules, like opening a port
  # More info on the format here https://registry.terraform.io/providers/hetznercloud/hcloud/latest/docs/resources/firewall
  # extra_firewall_rules = [
  #   # For Postgres
  #   {
  #     direction = "in"
  #     protocol = "tcp"
  #     port = "5432"
  #     source_ips = ["0.0.0.0/0", "::/0"]
  #     destination_ips = [] # Won't be used for this rule
  #   },
  #   # To Allow ArgoCD access to resources via SSH
  #   {
  #     direction = "out"
  #     protocol = "tcp"
  #     port = "22"
  #     source_ips = [] # Won't be used for this rule
  #     destination_ips = ["0.0.0.0/0", "::/0"]
  #   }
  # ]

  # If you want to configure a different CNI for k3s, use this flag
  # possible values: flannel (Default), calico, and cilium
  # As for Cilium, we allow infinite configurations via helm values, please check the CNI section of the readme over at https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/#cni.
  # Also, see the cilium_values at towards the end of this file, in the advanced section.
  # cni_plugin = "cilium"

  # If you want to disable the k3s default network policy controller, use this flag!
  # Both Calico and Ciliun cni_plugin values override this value to true automatically, the default is "false".
  # disable_network_policy = true

  # If you want to disable the automatic use of placement group "spread". See https://docs.hetzner.com/cloud/placement-groups/overview/
  # That may be useful if you need to deploy more than 500 nodes! The default is "false".
  # placement_group_disable = true

  # By default, we allow ICMP ping in to the nodes, to check for liveness for instance. If you do not want to allow that, you can. Just set this flag to true (false by default).
  # block_icmp_ping_in = true

  # You can enable cert-manager (installed by Helm behind the scenes) with the following flag, the default is "false".
  # enable_cert_manager = true

  # By default we use a known good mirror to download the needed OpenSUSE, but for instance, it may not be available in US-East location, in which case,
  # you can find a working mirror for you at https://download.opensuse.org/tumbleweed/appliances/openSUSE-MicroOS.x86_64-OpenStack-Cloud.qcow2.mirrorlist,
  # Or use the value below to automatically select an optimal mirror (by default we have it fixed to one german mirror that we know works for most locations)
  # opensuse_microos_mirror_link = "https://download.opensuse.org/tumbleweed/appliances/openSUSE-MicroOS.x86_64-OpenStack-Cloud.qcow2"

  # IP Addresses to use for the DNS Servers, set to an empty list to use the ones provided by Hetzner, defaults to ["1.1.1.1", " 1.0.0.1", "8.8.8.8"].
  # For rancher installs, best to leave it as default.
  # dns_servers = []

  # When this is enabled, rather than the first node, all external traffic will be routed via a control-plane loadbalancer, allowing for high availability.
  # The default is false.
  # use_control_plane_lb = true

  # Let's say you are not using the control plane LB solution above, and still want to have one hostname point to all your control-plane nodes.
  # You could create multiple A records of to let's say cp.cluster.my.org pointing to all of your control-plane nodes ips.
  # In which case, you need to define that hostname in the k3s TLS-SANs config to allow connection through it. It can be hostnames or IP addresses.
  # additional_tls_sans = ["cp.cluster.my.org"]

  # Oftentimes, you need to communicate to the cluster from inside the cluster itself, in which case it is important to set this value, as it will configure the hostname
  # at the load balancer level, and will save you from many slows downs when initiating communications from inside. Later on, you can point your DNS to the IP given
  # to the LB. And if you have other services pointing to it, you are also free to create CNAMES to point to it, or whatever you see fit.
  # If set, it will apply to either ingress controllers, Traefik or Ingress-Nginx.
  # lb_hostname = ""

  # You can enable Rancher (installed by Helm behind the scenes) with the following flag, the default is "false".
  # When Rancher is enabled, it automatically installs cert-manager too, and it uses rancher's own self-signed certificates.
  # See for options https://rancher.com/docs/rancher/v2.0-v2.4/en/installation/resources/advanced/helm2/helm-rancher/#choose-your-ssl-configuration
  # The easiest thing is to leave everything as is (using the default rancher self-signed certificate) and put Cloudflare in front of it.
  # As for the number of replicas, by default it is set to the numbe of control plane nodes.
  # You can customized all of the above by adding a rancher_values variable see at the end of this file in the advanced section.
  # After the cluster is deployed, you can always use HelmChartConfig definition to tweak the configuration.
  # IMPORTANT: Rancher's install is quite memory intensive, you will require at least 4GB if RAM, meaning cx21 server type (for your control plane).
  # ALSO, in order for Rancher to successfully deploy, you have to set the "rancher_hostname".
  # enable_rancher = true

  # If using Rancher you can set the Rancher hostname, it must be unique hostname even if you do not use it.
  # If not pointing the DNS, you can just port-forward locally via kubectl to get access to the dashboard.
  # If you already set the lb_hostname above and are using a Hetzner LB, you do not need to set this one, as it will be used by default.
  # But if you set this one explicitly, it will have preference over the lb_hostname in rancher settings.
  # rancher_hostname = "rancher.xyz.dev"

  # When Rancher is deployed, by default is uses the "latest" channel. But this can be customized.
  # The allowed values are "stable" or "latest".
  # rancher_install_channel = "stable"

  # Finally, you can specify a bootstrap-password for your rancher instance. Minimum 48 characters long!
  # If you leave empty, one will be generated for you.
  # (Can be used by another rancher2 provider to continue setup of rancher outside this module.)
  # rancher_bootstrap_password = ""

  # Separate from the above Rancher config (only use one or the other). You can import this cluster directly on an
  # an already active Rancher install. By clicking "import cluster" choosing "generic", giving it a name and pasting
  # the cluster registration url below. However, you can also ignore that and apply the url via kubectl as instructed
  # by Rancher in the wizard, and that would register your cluster too.
  # More information about the registration can be found here https://rancher.com/docs/rancher/v2.6/en/cluster-provisioning/registered-clusters/
  # rancher_registration_manifest_url = "https://rancher.xyz.dev/v3/import/xxxxxxxxxxxxxxxxxxYYYYYYYYYYYYYYYYYYYzzzzzzzzzzzzzzzzzzzzz.yaml"

  # Extra values that will be passed to the `extra-manifests/kustomization.yaml.tpl` if its present.
  # extra_kustomize_parameters={}

  extra_packages_to_install=["nfs-utils", "nfs-client"]

  # It is best practice to turn this off, but for backwards compatibility it is set to "true" by default.
  # See https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/issues/349
  # When "false". The kubeconfig file can instead be created by executing: "terraform output --raw kubeconfig > cluster_kubeconfig.yaml"
  # Always be careful to not commit this file!
  # create_kubeconfig = false

  # Don't create the kustomize backup. This can be helpful for automation.
  # create_kustomization = false

  ### ADVANCED - Custom helm values for packages above (search _values if you want to located where those are mentioned upper in this file)
  # ⚠️ Inside the _values variable below are examples, up to you to find out the best helm values possible, we do not provide support for customized helm values.
  # Please understand that the indentation is very important, inside the EOTs, as those are proper yaml helm values.
  # We advise you to use the default values, and only change them if you know what you are doing!

  # Cilium, all Cilium helm values can be found at https://github.com/cilium/cilium/blob/master/install/kubernetes/cilium/values.yaml
  # The following is an example, please note that the current indentation inside the EOT is important.
  /* cilium_values = <
````

captnCC commented 1 year ago

> The entire autoscaler machine isn't even deployed to hcloud. I understood the autoscaler machine as a type of manager for deploying more nodes to the cluster? It doesn't appear to exist. The snapshot exists though.

This is a misconception: the autoscaler is just a service running inside Kubernetes that uses hcloud to provision additional nodes. There is no additional server running for the autoscaler. The server you saw in hcloud must have been an actual scaled node.

Looking at your deployment manifest: you are requesting 16 GB of memory per container, but the nodes only have 2 or 4 GB each, so the autoscaler cannot provision machines with the needed resources!

AlexProgrammerDE commented 1 year ago

Ahh, that makes sense. I got confused about what the machine type was supposed to mean in the autoscaler nodepool setting. I will try again later with a better node type in the autoscaler nodepool.

aleksasiriski commented 1 year ago

> The entire autoscaler machine isn't even deployed to hcloud. I understood the autoscaler machine as a type of manager for deploying more nodes to the cluster? It doesn't appear to exist. The snapshot exists though.
>
> This is a misconception: the autoscaler is just a service running inside Kubernetes that uses hcloud to provision additional nodes. There is no additional server running for the autoscaler. The server you saw in hcloud must have been an actual scaled node.
>
> Looking at your deployment manifest: you are requesting 16 GB of memory per container, but the nodes only have 2 or 4 GB each, so the autoscaler cannot provision machines with the needed resources!

This must be it. @AlexProgrammerDE, make an autoscaling pool with Hetzner servers that have the resources you request in the Deployment YAML, or lower your memory request to something below the memory of your chosen server.

I think this is just a simple mistake of using cpx11 in autoscaling pool, and the wanted value there is cx51.
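To make the mismatch concrete, here is a rough shell sketch of the fit check: a pod can only land on a freshly scaled node if its memory request is below that node's RAM. The helper names `server_memory_gb` and `fits_on` are hypothetical, and the RAM figures are assumptions based on Hetzner's price list at the time (cpx11 = 2 GB, cx21 = 4 GB, cx51 = 32 GB), so verify them before relying on this.

```shell
# RAM per server type in GB. These figures are assumptions taken from
# Hetzner's price list at the time of this thread; check the current list.
server_memory_gb() {
  case "$1" in
    cpx11) echo 2 ;;
    cx21)  echo 4 ;;
    cx51)  echo 32 ;;
    *)     echo "unknown server type: $1" >&2; return 2 ;;
  esac
}

# Succeeds only if a pod requesting <request_gb> could fit on <server_type>.
# Strictly "less than", because the OS and kubelet reserve part of the RAM.
# usage: fits_on <server_type> <request_gb>
fits_on() {
  node_gb=$(server_memory_gb "$1") || return 2
  [ "$2" -lt "$node_gb" ]
}
```

With these numbers, `fits_on cpx11 16` fails and `fits_on cx51 16` succeeds, which is why a 16 GB request needs a larger `server_type` in `autoscaler_nodepools`.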

ifeulner commented 1 year ago

> I think this is just a simple mistake of using cpx11 in autoscaling pool, and the wanted value there is cx51.

Yes, if you have such huge memory requirements, the underlying nodes must of course fulfill them. BTW, you usually don't need an explicit HorizontalPodAutoscaler deployment; that should work automatically if you have correct limits / requests assigned to your deployments.

AlexProgrammerDE commented 1 year ago

Alright, good news! Changing the node type seems to have made it scale! [image]

BUT a new issue just appeared. The autoscaled nodes don't join the cluster, so the pods don't start on them. This error shows up on the machine at boot: [image] Should I open a new GitHub issue?

aleksasiriski commented 1 year ago

Did you run `terraform destroy` before `terraform apply`?

mysticaltech commented 1 year ago

@aleksasiriski To recreate the snapshot huh, good point!

AlexProgrammerDE commented 1 year ago

Yep, it is a clean install. I destroyed everything first and then upgraded to 1.8.2.

AlexProgrammerDE commented 1 year ago

It seems there are scripts missing in the autoscaler image?

mysticaltech commented 1 year ago

@AlexProgrammerDE Please SSH into that node (see Readme), get the k3s-agent logs with `journalctl -u k3s-agent`, and if possible paste them here or attach them as a file.

We need to understand why the k3s service is failing to start!
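For anyone following along, the log collection can be wrapped in a small helper. This is only a sketch: the function name `fetch_k3s_agent_logs`, the default key path, and the example IP are placeholders, and it assumes root SSH access with the key that was passed to the module.

```shell
# Hypothetical helper: print the k3s-agent journal of a node over SSH.
# Assumes root login with the private key given to the module.
# usage: fetch_k3s_agent_logs <node-ip> [private-key-path]
fetch_k3s_agent_logs() {
  node_ip=$1
  key=${2:-$HOME/.ssh/id_rsa}   # placeholder default; pass your own key path
  ssh -i "$key" "root@$node_ip" 'journalctl -u k3s-agent --no-pager'
}

# Example (203.0.113.10 is a placeholder address):
#   fetch_k3s_agent_logs 203.0.113.10 > k3s-agent.log
```

Redirecting the output to a file makes it easy to attach the log to the issue instead of pasting a screenshot.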

mysticaltech commented 1 year ago

Also, please paste your updated kube.tf here again.

AlexProgrammerDE commented 1 year ago

[image] Perhaps it doesn't exist in the image? I will try again later anyway.

mysticaltech commented 1 year ago

Try "k3s" then!

AlexProgrammerDE commented 1 year ago

Here: [image] Anything else I should try?

mysticaltech commented 1 year ago

OK, in that case, please post all of the info we need to reproduce: kube.tf, autoscaler config, workload, etc.

We will have a look.

mysticaltech commented 1 year ago

If you want, you can dive into the cloud-init logic of the `autoscaler-agents.tf` file in the project and try to figure out why k3s is not running on that node.

There is an error somewhere, we need to find it!

AlexProgrammerDE commented 1 year ago

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = local.hcloud_token

  # Then fill or edit the below values. Only the first values starting with a * are obligatory; the rest can remain with their default values, or you
  # could adapt them to your needs.

  # * For local dev, path to the git repo
  # source = "../../kube-hetzner/"
  # If you want to use the latest master branch
  # source = "github.com/kube-hetzner/terraform-hcloud-kube-hetzner"
  # For normal use, this is the path to the terraform registry
  source = "kube-hetzner/kube-hetzner/hcloud"

  # you can optionally specify a version number
  # version = "1.2.0"

  # Note that some values, notably "location" and "public_key" have no effect after initializing the cluster.
  # This is to keep Terraform from re-provisioning all nodes at once, which would lose data. If you want to update
  # those, you should instead change the value here and manually re-provision each node. Grep for "lifecycle".

  # Customize the SSH port (by default 22)
  # ssh_port = 2222

  # * Your ssh public key
  ssh_public_key = file("/home/alex/.ssh/id_horizon.pub")
  # * Your private key must be "ssh_private_key = null" when you want to use ssh-agent for a Yubikey-like device authentication or an SSH key-pair with a passphrase.
  # For more details on SSH see https://github.com/kube-hetzner/kube-hetzner/blob/master/docs/ssh.md
  ssh_private_key = file("/home/alex/.ssh/id_horizon")
  # You can add additional SSH public Keys to grant other team members root access to your cluster nodes.
  # ssh_additional_public_keys = []

  # You can also add additional SSH public Keys which are saved in the hetzner cloud by a label.
  # See https://docs.hetzner.cloud/#label-selector
  ssh_hcloud_key_label = "role=admin"

  # If you want to use an ssh key that is already registered within hetzner cloud, you can pass its id.
  # If no id is passed, a new ssh key will be registered within hetzner cloud.
  # It is important that exactly this key is passed via `ssh_public_key` & `ssh_private_key` vars.
  # hcloud_ssh_key_id = ""

  # These can be customized, or left with the default values
  # * For Hetzner locations see https://docs.hetzner.com/general/others/data-centers-and-connection/
  network_region = "eu-central" # change to `us-east` if location is ash

  # For the control planes, at least three nodes are the minimum for HA. Otherwise, you need to turn off the automatic upgrade (see ReadMe).
  # As per Rancher docs, it must always be an odd number, never even! See https://rancher.com/docs/k3s/latest/en/installation/ha-embedded/
  # For instance, one is ok (non-HA), two is not ok, and three is ok (becomes HA). It does not matter if they are in the same nodepool or not! So they can be in different locations and of various types.

  # Of course, you can choose any number of nodepools you want, with the location you want. The only constraint on the location is that you need to stay in the same network region, Europe, or the US.
  # For the server type, the minimum instance supported is cpx11 (just a few cents more than cx11); see https://www.hetzner.com/cloud.

  # IMPORTANT: Before you create your cluster, you can do anything you want with the nodepools, but you need at least one of each control plane and agent.
  # Once the cluster is up and running, you can change nodepool count and even set it to 0 (in the case of the first control-plane nodepool, the minimum is 1),
  # you can also rename it (if the count is 0), but do not remove a nodepool from the list.

  # The only nodepools that are safe to remove from the list when you edit it are at the end of the lists. That is due to how subnets and IPs get allocated (FILO).
  # You can, however, freely add other nodepools at the end of each list if you want! The maximum number of nodepools you can create combined for both lists is 255.
  # Also, before decreasing the count of any nodepools to 0, it's essential to drain and cordon the nodes in question. Otherwise, it will leave your cluster in a bad state.

  # Before initializing the cluster, you can change all parameters and add or remove any nodepools. You need at least one nodepool of each kind, control plane, and agent.
  # The nodepool names are entirely arbitrary, you can choose whatever you want, but no special characters or underscore, and they must be unique; only alphanumeric characters and dashes are allowed.

  # If you want to have a single node cluster, have one control plane nodepool with a count of 1, and one agent nodepool with a count of 0.

  # Please note that changing labels and taints after the first run will have no effect. If needed, you will need to do that through Kubernetes directly.

  # * Example below:

  control_plane_nodepools = [
    {
      name        = "control-plane-nbg1",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "agent",
      server_type = "cx51",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count = 1
    }
  ]
  # Add custom control plane configuration options here.
  # E.g to enable monitoring for etcd, proxy etc:
  control_planes_custom_config = {
   etcd-expose-metrics = true,
   kube-controller-manager-arg = "bind-address=0.0.0.0",
   kube-proxy-arg ="metrics-bind-address=0.0.0.0",
   kube-scheduler-arg = "bind-address=0.0.0.0",
  }

  # * LB location and type, the latter will depend on how much load you want it to handle, see https://www.hetzner.com/cloud/load-balancer
  load_balancer_type     = "lb11"
  load_balancer_location = "nbg1"

  ### The following values are entirely optional (and can be removed from this if unused)

  # You can define a base domain name to be used in the form nodename.base_domain for setting the reverse DNS inside Hetzner
  base_domain = "horizon.example.com" # Removed by me

  # Cluster Autoscaler
  # Providing at least one map for the array enables the cluster autoscaler feature, default is disabled
  # Please note that the autoscaler should not be used with initial_k3s_channel < "v1.25". So ideally lock it to "v1.25".
  # * Example below:
  autoscaler_nodepools = [
    {
      name        = "autoscaler"
      server_type = "cx51" # must be same or better than the control_plane server type (regarding disk size)!
      location    = "nbg1"
      min_nodes   = 1
      max_nodes   = 15
    }
  ]

  # Enable etcd snapshot backups to S3 storage.
  # Just provide a map with the needed settings (according to your S3 storage provider) and backups to S3 will
  # be enabled (with the default settings for etcd snapshots).
  # For proper context, please have a look at https://docs.k3s.io/backup-restore.
  # etcd_s3_backup = {
  #   etcd-s3-endpoint        = "xxxx.r2.cloudflarestorage.com"
  #   etcd-s3-access-key      = "<access-key>"
  #   etcd-s3-secret-key      = "<secret-key>"
  #   etcd-s3-bucket          = "k3s-etcd-snapshots"
  # }

  # To use local storage on the nodes, you can enable Longhorn, default is "false".
  # See a full recap on how to configure agent nodepools for longhorn here https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/discussions/373#discussioncomment-3983159
  # enable_longhorn = true

  # By default, longhorn is pulled from https://charts.rancher.io which assures compatibility with rancher.
  # If you need a newer version of longhorn you can set this variable to https://charts.longhorn.io. 
  # longhorn_repository = "https://charts.rancher.io"

  # The namespace for longhorn deployment, default is "longhorn-system"
  # longhorn_namespace = "longhorn-system"

  # The file system type for Longhorn, if enabled (ext4 is the default, otherwise you can choose xfs)
  # longhorn_fstype = "xfs"

  # how many replica volumes should longhorn create (default is 3)
  # longhorn_replica_count = 1

  # When you enable Longhorn, you can go with the default settings and just modify the above two variables OR you can add a longhorn_values variable
  # with all needed helm values, see towards the end of the file in the advanced section.
  # If that file is present, the system will use it during the deploy, if not it will use the default values with the two variable above that can be customized.
  # After the cluster is deployed, you can always use HelmChartConfig definition to tweak the configuration.

  # Also, you can choose to create a Hetzner volume to be used with Longhorn. By default, it will use the node's own storage space, BUT if you add an attribute of
  # longhorn_volume_size (⚠️ not a variable, just a possible agent nodepool attribute) with a value of 10 to 10000 GB to your agent nodepool definition, it will create and use the volume in question.
  # See the agent nodepool section for an example of how to do that.

  # To disable Hetzner CSI storage, you can set the following to true, default is "false".
  # disable_hetzner_csi = true

  # If you want to use a specific Hetzner CCM and CSI version, set them below; otherwise, leave them as-is for the latest versions.
  # hetzner_ccm_version = ""
  # hetzner_csi_version = ""

  # If you want to specify the Kured version, set it below - otherwise it'll use the latest version available.
  # kured_version = ""

  # If you want to enable the Nginx ingress controller (https://kubernetes.github.io/ingress-nginx/) instead of Traefik, you can set this to "true". Default is "false".
  # FOR THIS TO NOT BE IGNORED, you also need to set "enable_traefik = false".
  # By default we load an optimal Nginx ingress controller config for Hetzner; however, you may need to tweak it to your needs. To do so,
  # we allow you to add a nginx_ingress_values, see towards the end of this file in the advanced section.
  # After the cluster is deployed, you can always use HelmChartConfig definition to tweak the configuration.
  # enable_nginx = true

  # If you want to disable the Traefik ingress controller, to use the Nginx ingress controller for instance, you can set this to "false". Default is "true".
  # enable_traefik = false

  # Use the klipper LB (similar to metalLB), instead of the default Hetzner one, that has an advantage of dropping the cost of the setup.
  # Automatically "true" in the case of single node cluster (as it does not make sense to use the Hetzner LB in that situation).
  # It can work with any ingress controller that you choose to deploy.
  # Please note that because the klipper lb points to all nodes, we automatically allow scheduling on the control plane when it is active.
  # enable_klipper_metal_lb = "true"

  # We give you the possibility to use letsencrypt directly with Traefik because it's an easy setup, however it's not optimal,
  # as the free version of Traefik causes a little bit of downtime when the certificates get renewed. For proper SSL management,
  # we instead recommend you to use cert-manager, that you can easily deploy with helm; see https://cert-manager.io/.
  # traefik_acme_tls = true
  # traefik_acme_email = "mail@example.com"

  # If you want to configure additional Arguments for traefik, enter them here as a list and in the form of traefik CLI arguments; see https://doc.traefik.io/traefik/reference/static-configuration/cli/
  # They are the options that go into the additionalArguments section of the Traefik helm values file.
  # Example: traefik_additional_options = ["--log.level=DEBUG", "--tracing=true"]
  # traefik_additional_options = []

  # If you want to disable the metric server, you can! Default is "true".
  enable_metrics_server = true

  # If you want to allow non-control-plane workloads to run on the control-plane nodes, set "true" below. The default is "false".
  # True by default for single node clusters, and when enable_klipper_metal_lb is true. In those cases, the value below will be ignored.
  # allow_scheduling_on_control_plane = true

  # If you want to disable the automatic upgrade of k3s, you can set this to false.
  # Ideally, keep it on, to always have the latest and greatest Kubernetes version, but lock the initial_k3s_channel to a kube major version,
  # of your choice, like v1.24 or v1.25. That way you get the best of both worlds without the breaking changes risk.
  # The default is "true" (If you are in HA i.e. at least 3 control plane nodes & 2 agents, just keep it, it works great!)
  automatically_upgrade_k3s = false

  # For non-HA clusters i.e. when the number of control-plane nodes is < 3, you have to turn it off.
  # Ideally, for production use, always use an HA setup with at least 3 control-plane nodes and 2 agents, and keep this on for max security.
  # The default is "true" (in HA it works wonderfully well, with automatically roll-back to the previous snapshot in case of an issue).
  automatically_upgrade_os = false

  # If you need more control over kured and the reboot behaviour, you can pass additional options to kured.
  # For example limiting reboots to certain timeframes. For all options see: https://kured.dev/docs/configuration/
  # The default options are: `--reboot-command=/usr/bin/systemctl reboot --pre-reboot-node-labels=kured=rebooting --post-reboot-node-labels=kured=done --period=5m`
  # Defaults can be overridden by using the same key.
  # kured_options = {
  #   "reboot-days": "su"
  #   "start-time": "9am"
  #   "end-time": "5pm"
  # }

  # Allows you to specify either stable, latest, testing or supported minor versions (defaults to stable)
  # see https://rancher.com/docs/k3s/latest/en/upgrades/basic/ and https://update.k3s.io/v1-release/channels
  # ⚠️ If you are going to use Rancher addons for instance, it's always a good idea to fix the kube version to latest - 0.01,
  # at the time of writing the latest is v1.25, so setting the value below to "v1.24" will ensure maximum compatibility with Rancher, Longhorn and so on!
  # The default is "v1.24".
  # initial_k3s_channel = "stable"

  # The cluster name, by default "k3s"
  cluster_name = "horizon"

  # Whether to use the cluster name in the node name, in the form of {cluster_name}-{nodepool_name}, the default is "true".
  # use_cluster_name_in_node_name = false

  # Extra k3s registries. This is useful if you have private registries and you
  # want to pull images without additional secrets.
  # registries.yaml file docs: https://docs.k3s.io/installation/private-registry
  /* k3s_registries = <<-EOT
    mirrors:
      hub.my_registry.com:
        endpoint:
          - "hub.my_registry.com"
    configs:
      hub.my_registry.com:
        auth:
          username: username
          password: password
  EOT */

  # Adding extra firewall rules, like opening a port
  # More info on the format here https://registry.terraform.io/providers/hetznercloud/hcloud/latest/docs/resources/firewall
  # extra_firewall_rules = [
  #   # For Postgres
  #   {
  #     direction       = "in"
  #     protocol        = "tcp"
  #     port            = "5432"
  #     source_ips      = ["0.0.0.0/0", "::/0"]
  #     destination_ips = [] # Won't be used for this rule
  #   },
  #   # To Allow ArgoCD access to resources via SSH
  #   {
  #     direction       = "out"
  #     protocol        = "tcp"
  #     port            = "22"
  #     source_ips      = [] # Won't be used for this rule
  #     destination_ips = ["0.0.0.0/0", "::/0"]
  #   }
  # ]

  # If you want to configure a different CNI for k3s, use this flag
  # possible values: flannel (Default), calico, and cilium
  # As for Cilium, we allow infinite configurations via helm values, please check the CNI section of the readme over at https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/#cni.
  # Also, see the cilium_values towards the end of this file, in the advanced section.
  # cni_plugin = "cilium"

  # If you want to disable the k3s default network policy controller, use this flag!
  # Both Calico and Cilium cni_plugin values override this value to true automatically, the default is "false".
  # disable_network_policy = true

  # If you want to disable the automatic use of placement group "spread". See https://docs.hetzner.com/cloud/placement-groups/overview/
  # That may be useful if you need to deploy more than 500 nodes! The default is "false".
  # placement_group_disable = true

  # By default, we allow ICMP ping in to the nodes, to check for liveness for instance. If you do not want to allow that, you can. Just set this flag to true (false by default).
  # block_icmp_ping_in = true

  # You can enable cert-manager (installed by Helm behind the scenes) with the following flag, the default is "false".
  # enable_cert_manager = true

  # By default we use a known good mirror to download the needed OpenSUSE, but for instance, it may not be available in US-East location, in which case,
  # you can find a working mirror for you at https://download.opensuse.org/tumbleweed/appliances/openSUSE-MicroOS.x86_64-OpenStack-Cloud.qcow2.mirrorlist,
  # Or use the value below to automatically select an optimal mirror (by default we have it fixed to one german mirror that we know works for most locations)
  # opensuse_microos_mirror_link = "https://download.opensuse.org/tumbleweed/appliances/openSUSE-MicroOS.x86_64-OpenStack-Cloud.qcow2"

  # IP Addresses to use for the DNS Servers, set to an empty list to use the ones provided by Hetzner, defaults to ["1.1.1.1", "1.0.0.1", "8.8.8.8"].
  # For rancher installs, best to leave it as default.
  # dns_servers = []

  # When this is enabled, rather than the first node, all external traffic will be routed via a control-plane loadbalancer, allowing for high availability.
  # The default is false.
  # use_control_plane_lb = true

  # Let's say you are not using the control plane LB solution above, and still want to have one hostname point to all your control-plane nodes.
  # You could create multiple A records for, let's say, cp.cluster.my.org pointing to all of your control-plane node IPs.
  # In which case, you need to define that hostname in the k3s TLS-SANs config to allow connection through it. It can be hostnames or IP addresses.
  # additional_tls_sans = ["cp.cluster.my.org"]

  # Oftentimes, you need to communicate to the cluster from inside the cluster itself, in which case it is important to set this value, as it will configure the hostname
  # at the load balancer level, and will save you from many slowdowns when initiating communications from inside. Later on, you can point your DNS to the IP given
  # to the LB. And if you have other services pointing to it, you are also free to create CNAMES to point to it, or whatever you see fit.
  # If set, it will apply to either ingress controllers, Traefik or Ingress-Nginx.
  # lb_hostname = ""

  # You can enable Rancher (installed by Helm behind the scenes) with the following flag, the default is "false".
  # When Rancher is enabled, it automatically installs cert-manager too, and it uses rancher's own self-signed certificates.
  # See for options https://rancher.com/docs/rancher/v2.0-v2.4/en/installation/resources/advanced/helm2/helm-rancher/#choose-your-ssl-configuration
  # The easiest thing is to leave everything as is (using the default rancher self-signed certificate) and put Cloudflare in front of it.
  # As for the number of replicas, by default it is set to the number of control plane nodes.
  # You can customize all of the above by adding a rancher_values variable, see the advanced section at the end of this file.
  # After the cluster is deployed, you can always use HelmChartConfig definition to tweak the configuration.
  # IMPORTANT: Rancher's install is quite memory intensive, you will require at least 4GB of RAM, meaning cx21 server type (for your control plane).
  # ALSO, in order for Rancher to successfully deploy, you have to set the "rancher_hostname".
  # enable_rancher = true

  # If using Rancher you can set the Rancher hostname, it must be unique hostname even if you do not use it.
  # If not pointing the DNS, you can just port-forward locally via kubectl to get access to the dashboard.
  # If you already set the lb_hostname above and are using a Hetzner LB, you do not need to set this one, as it will be used by default.
  # But if you set this one explicitly, it will have preference over the lb_hostname in rancher settings.
  # rancher_hostname = "rancher.xyz.dev"

  # When Rancher is deployed, by default it uses the "latest" channel. But this can be customized.
  # The allowed values are "stable" or "latest".
  # rancher_install_channel = "stable"

  # Finally, you can specify a bootstrap-password for your rancher instance. Minimum 48 characters long!
  # If you leave empty, one will be generated for you.
  # (Can be used by another rancher2 provider to continue setup of rancher outside this module.)
  # rancher_bootstrap_password = ""

  # Separate from the above Rancher config (only use one or the other). You can import this cluster directly on
  # an already active Rancher install by clicking "import cluster", choosing "generic", giving it a name and pasting
  # the cluster registration url below. However, you can also ignore that and apply the url via kubectl as instructed
  # by Rancher in the wizard, and that would register your cluster too.
  # More information about the registration can be found here https://rancher.com/docs/rancher/v2.6/en/cluster-provisioning/registered-clusters/
  # rancher_registration_manifest_url = "https://rancher.xyz.dev/v3/import/xxxxxxxxxxxxxxxxxxYYYYYYYYYYYYYYYYYYYzzzzzzzzzzzzzzzzzzzzz.yaml"

  # Extra values that will be passed to the `extra-manifests/kustomization.yaml.tpl` if it's present.
  # extra_kustomize_parameters={}

  extra_packages_to_install=["nfs-utils", "nfs-client"]

  # It is best practice to turn this off, but for backwards compatibility it is set to "true" by default.
  # See https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/issues/349
  # When "false". The kubeconfig file can instead be created by executing: "terraform output --raw kubeconfig > cluster_kubeconfig.yaml"
  # Always be careful to not commit this file!
  # create_kubeconfig = false

  # Don't create the kustomize backup. This can be helpful for automation.
  # create_kustomization = false

  ### ADVANCED - Custom helm values for packages above (search _values if you want to locate where those are mentioned earlier in this file)
  # ⚠️ Inside the _values variable below are examples, up to you to find out the best helm values possible, we do not provide support for customized helm values.
  # Please understand that the indentation is very important, inside the EOTs, as those are proper yaml helm values.
  # We advise you to use the default values, and only change them if you know what you are doing!

  # Cilium, all Cilium helm values can be found at https://github.com/cilium/cilium/blob/master/install/kubernetes/cilium/values.yaml
  # The following is an example, please note that the current indentation inside the EOT is important.
  /*   cilium_values = <<EOT
ipam:
  mode: kubernetes
devices: "eth1"
k8s:
  requireIPv4PodCIDR: true
kubeProxyReplacement: strict
  EOT */

  # Cert manager, all cert-manager helm values can be found at https://github.com/cert-manager/cert-manager/blob/master/deploy/charts/cert-manager/values.yaml
  # The following is an example, please note that the current indentation inside the EOT is important.
  /*   cert_manager_values = <<EOT
installCRDs: true
replicaCount: 3
webhook:
  replicaCount: 3
cainjector:
  replicaCount: 3
  EOT */

  # Longhorn, all Longhorn helm values can be found at https://github.com/longhorn/longhorn/blob/master/chart/values.yaml
  # The following is an example, please note that the current indentation inside the EOT is important.
  /*   longhorn_values = <<EOT
defaultSettings:
  defaultDataPath: /var/longhorn
persistence:
  defaultFsType: ext4
  defaultClassReplicaCount: 3
  defaultClass: true
  EOT */

  # Nginx, all Nginx helm values can be found at https://github.com/kubernetes/ingress-nginx/blob/main/charts/ingress-nginx/values.yaml
  # You can also have a look at https://kubernetes.github.io/ingress-nginx/, to understand how it works, and all the options at your disposal.
  # The following is an example, please note that the current indentation inside the EOT is important.
  /*   nginx_ingress_values = <<EOT
controller:
  watchIngressWithoutClass: "true"
  kind: "DaemonSet"
  config:
    "use-forwarded-headers": "true"
    "compute-full-forwarded-for": "true"
    "use-proxy-protocol": "true"
  service:
    annotations:
      "load-balancer.hetzner.cloud/name": "k3s"
      "load-balancer.hetzner.cloud/use-private-ip": "true"
      "load-balancer.hetzner.cloud/disable-private-ingress": "true"
      "load-balancer.hetzner.cloud/location": "nbg1"
      "load-balancer.hetzner.cloud/type": "lb11"
      "load-balancer.hetzner.cloud/uses-proxyprotocol": "true"
  EOT */

  # Rancher, all Rancher helm values can be found at https://rancher.com/docs/rancher/v2.5/en/installation/install-rancher-on-k8s/chart-options/
  # The following is an example, please note that the current indentation inside the EOT is important.
  /*   rancher_values = <<EOT
ingress:
  tls:
    source: "rancher"
hostname: "rancher.example.com"
replicas: 1
bootstrapPassword: "supermario"
  EOT */

}

There it is! I sent my kubernetes deployment configs previously. Will have some more time later to look into it.

mysticaltech commented 1 year ago

Awesome! Will have a look ASAP. I can't tonight, but sometime tomorrow or over the weekend. If anyone else feels like debugging, please jump in, folks.

aleksasiriski commented 1 year ago

I'm testing it now, with this config:

locals {
  hcloud_token = ""
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token         = local.hcloud_token
  source               = "kube-hetzner/kube-hetzner/hcloud"
  ssh_public_key       = file("~/.ssh/id_rsa.pub")
  ssh_private_key      = file("~/.ssh/id_rsa")
  network_region       = "eu-central"
  control_plane_nodepools = [
    {
      name        = "control-plane-nbg1",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]
  agent_nodepools = [
    {
      name        = "agent",
      server_type = "cx21",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]
  control_planes_custom_config = {
    etcd-expose-metrics         = true,
    kube-controller-manager-arg = "bind-address=0.0.0.0",
    kube-proxy-arg              = "metrics-bind-address=0.0.0.0",
    kube-scheduler-arg          = "bind-address=0.0.0.0",
  }
  load_balancer_type     = "lb11"
  load_balancer_location = "nbg1"
  base_domain            = "test.example.com"
  autoscaler_nodepools = [
    {
      name        = "autoscaler"
      server_type = "cx21"
      location    = "nbg1"
      min_nodes   = 1
      max_nodes   = 3
    }
  ]
  enable_metrics_server     = true
  automatically_upgrade_k3s = false
  automatically_upgrade_os  = false
  cluster_name              = "test"
  extra_packages_to_install = ["nfs-utils", "nfs-client"]
}

provider "hcloud" {
  token = local.hcloud_token
}

terraform {
  required_version = ">= 1.3.3"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.35.2"
    }
  }
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig
  sensitive = true
}

ifeulner commented 1 year ago

Please try the above fix. With that it should work correctly again, even after the node reboots.

aleksasiriski commented 1 year ago

Please try the above fix. With that it should work correctly again, even after the node reboots.

I'll try to recreate the problem and then test this fix

ifeulner commented 1 year ago

Thanks, but after checking the error message again, I think there may be another issue. I can't test it right now as I'm on a business trip.

aleksasiriski commented 1 year ago

I get these private IPs for autoscaled nodes: [image]

They also aren't added to the cluster... SSH-ing into them: [image]

mysticaltech commented 1 year ago

@aleksasiriski Maybe you could try the following:

aleksasiriski commented 1 year ago

@aleksasiriski Maybe you could try the following:

* See if the services are loaded with `systemctl status k3s`.

* Check `/var/rancher/k3s` and see if it's populated.

* Try to grep the whole journal for k3s with `journalctl | grep k3s`, or inspect the journal on its own to see all logs since boot.

* Try to inspect the logic in `autoscaler-agent.tf` too.

Unit k3s.service could not be found.

ls: cannot access '/var/rancher/k3s': No such file or directory

Jan 26 23:29:10 tmina-agent-autoscaled-ram-fsn1-36df4331a9770722 cloud-init[1364]: /bin/sh: /root/install-k3s-agent.sh: No such file or directory
Jan 26 23:29:10 tmina-agent-autoscaled-ram-fsn1-36df4331a9770722 cloud-init[1364]: Failed to start k3s-agent.service: Unit k3s-agent.service not found.
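
The checks above can be bundled into a small script to run on an autoscaled node. This is only a sketch: the function names (`dir_state`, `unit_state`, `report`) and the `K3S_DATA_DIR` variable are made up for illustration, the default path is the one quoted in this thread, and many k3s installs keep their data under `/var/lib/rancher/k3s` instead, so override it as needed.

```shell
#!/bin/sh
# Hypothetical helper, not part of kube-hetzner: summarise whether k3s
# looks installed on this node.

# Path quoted in the thread; override with e.g. K3S_DATA_DIR=/var/lib/rancher/k3s
K3S_DATA_DIR="${K3S_DATA_DIR:-/var/rancher/k3s}"

# Print "present" if the directory exists, "missing" otherwise.
dir_state() {
  if [ -d "$1" ]; then
    echo "present"
  else
    echo "missing"
  fi
}

# Print "loaded" if systemd can find the unit file, "not-found" if it cannot,
# and "unknown" when systemctl itself is unavailable.
unit_state() {
  if ! command -v systemctl >/dev/null 2>&1; then
    echo "unknown"
    return 0
  fi
  if systemctl cat "$1" >/dev/null 2>&1; then
    echo "loaded"
  else
    echo "not-found"
  fi
}

# One line per check, mirroring the bullet list above.
report() {
  echo "k3s.service: $(unit_state k3s.service)"
  echo "k3s-agent.service: $(unit_state k3s-agent.service)"
  echo "$K3S_DATA_DIR: $(dir_state "$K3S_DATA_DIR")"
}
```

Calling `report` after sourcing the script prints the three states. On a node in the state shown above, both units would be expected to come back as `not-found` and the directory as `missing`.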

aleksasiriski commented 1 year ago

I'll look into the logic tomorrow!

mysticaltech commented 1 year ago

Thanks @aleksasiriski! I think putting the install script in /root is not good. Ideally, it should live somewhere under /var, which is writable, for instance /var/kube-hetzner.

mysticaltech commented 1 year ago

I will push the fix above to the PR right away. And then anyone that wants can test it again.

mysticaltech commented 1 year ago

Also, a tip for future debugging: in cases like this it's really useful to look at the console for that node in the Hetzner Cloud UI, as it outputs everything that is happening. It's like a virtual monitor plugged into the server.

mysticaltech commented 1 year ago

@aleksasiriski If you are still on, you can try again if you want. Otherwise @AlexProgrammerDE please do try it (you point the source to the local copy of the PR).

mysticaltech commented 1 year ago

This is fixed, folks. It was an insidious bug.

The k3s_registries variable was defaulting to "", but cloud-init was seeing it as None and erroring with "None is not a string". So I replaced the default with " ", while doing some other cleanups and small fixes.

mysticaltech commented 1 year ago

What gave it away was proceeding by elimination and viewing the output of `journalctl -u cloud-init-local.service -u cloud-init.service -u cloud-config.service -u cloud-final.service`, which revealed a problem with write_files.
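
For anyone repeating this, here is a rough sketch of how that journal output could be narrowed down to the suspicious lines. The `cloud_init_errors` name and the pattern list are my own assumptions based on the messages seen in this thread, not anything shipped with cloud-init or kube-hetzner.

```shell
#!/bin/sh
# Hypothetical filter: keep only log lines that look like cloud-init failures.
# The pattern list is a guess built from the errors quoted in this thread.
cloud_init_errors() {
  grep -E -i 'write_files|traceback|failed|no such file|not a string'
}

# Intended use on an affected node (--no-pager keeps the output pipeable):
#   journalctl --no-pager \
#     -u cloud-init-local.service -u cloud-init.service \
#     -u cloud-config.service -u cloud-final.service | cloud_init_errors
```

The filter reads from stdin, so it also works on a saved log file, which helps when a node gets deleted by the autoscaler before you can look at it live.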

AlexProgrammerDE commented 1 year ago

Alright, thanks for the quick help! I'll give it another try when I'm back at my computer.

AlexProgrammerDE commented 1 year ago

Works wonderfully now! Keep up the good work!