kube-hetzner / terraform-hcloud-kube-hetzner

LB is not created and waits to get an IP #192

Closed cruex-de closed 2 years ago

cruex-de commented 2 years ago

I have now destroyed the cluster a few times, changed the locations and the size of the LBs, and disabled all Traefik settings. Unfortunately, still without success: the same error message always comes up, and the LB is not created at Hetzner either.

Version: v1.1.8

null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization: Still creating... [3m0s elapsed]
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization: Still creating... [3m10s elapsed]
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization: Still creating... [3m20s elapsed]
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
╷
│ Error: remote-exec provisioner error
│
│   with null_resource.kustomization,
│   on init.tf line 196, in resource "null_resource" "kustomization":
│  196:   provisioner "remote-exec" {
│
│ error executing "/tmp/terraform_1856610513.sh": Process exited with status 124

terraform.tfvars:

network_region = "eu-central"
load_balancer_type = "lb21"
load_balancer_location = "nbg1"
traefik_acme_tls = true
traefik_acme_email = "admin@mail.com"
traefik_enabled = true
traefik_additional_options = []

mysticaltech commented 2 years ago

Hey @cruex-de, please try removing these (just delete the lines, then destroy and redeploy):

traefik_acme_tls = true
traefik_acme_email = "admin@mail.com"
traefik_enabled = true
traefik_additional_options = []

For TLS, it's best to use cert-manager anyway!
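
A minimal sketch of that change in terraform.tfvars, assuming the module version in use exposes the enable_cert_manager flag that appears in the example tfvars quoted later in this thread. The Traefik ACME lines are simply left out:

```hcl
# terraform.tfvars (sketch, not a complete file)
# traefik_acme_tls and traefik_acme_email are intentionally not set.

# Let the module install cert-manager instead
# (the example tfvars describe the default as "false").
enable_cert_manager = true
```

Issuers and certificates would still have to be configured in cert-manager itself; that part is not covered here.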

cruex-de commented 2 years ago

traefik_enabled = true

The default is true, so should I set it to false?

mysticaltech commented 2 years ago

No, no need (it's already true behind the scenes), unless you do not want Traefik.

cruex-de commented 2 years ago

Still the same error:

null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization: Still creating... [3m20s elapsed]
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
╷
│ Error: remote-exec provisioner error
│
│   with null_resource.kustomization,
│   on init.tf line 196, in resource "null_resource" "kustomization":
│  196:   provisioner "remote-exec" {
│
│ error executing "/tmp/terraform_1934208973.sh": Process exited with status 124

mysticaltech commented 2 years ago

Ok, just out of curiosity, can you share most of your variables?

And also try with another location, like fsn1. Last but not least, make sure to use the latest version.

cruex-de commented 2 years ago

Ok, just out of curiosity, can you share most of your variables?

# Only the first values starting with a * are obligatory; the rest can remain with their default values, or you
# could adapt them to your needs.
#
# Note that some values, notably "location" and "public_key" have no effect after initializing the cluster.
# This is to keep Terraform from re-provisioning all nodes at once, which would lose data. If you want to update
# those, you should instead change the value here and manually re-provision each node. Grep for "lifecycle".

# * Your Hetzner project API token
hcloud_token = "XXXXXXXXXXXXXXXXXXXXX"
# * Your public key
public_key = "/home/xxx/.ssh/id_ed25519.pub"
# * Your private key. It must be "private_key = null" when you want to use ssh-agent, for Yubikey-like device authentication or an SSH key pair with a passphrase.
# For more details on SSH see https://github.com/kube-hetzner/kube-hetzner/blob/master/docs/ssh.md
private_key = "/home/xxx/.ssh/id_ed25519"

# These can be customized, or left with the default values
# * For Hetzner locations see https://docs.hetzner.com/general/others/data-centers-and-connection/
network_region = "eu-central" # change to `us-east` if location is ash

# For the control planes, at least three nodes are the minimum for HA. Otherwise, you need to turn off the automatic upgrade (see ReadMe).
# As per Rancher docs, it must always be an odd number, never even! See https://rancher.com/docs/k3s/latest/en/installation/ha-embedded/
# For instance, one is ok (non-HA), two is not ok, and three is ok (becomes HA). It does not matter if they are in the same nodepool or not! So they can be in different locations and of various types.

# Of course, you can choose any number of nodepools you want, with the location you want. The only constraint on the location is that you need to stay in the same network region, Europe, or the US.
# For the server type, the minimum instance supported is cpx11 (just a few cents more than cx11); see https://www.hetzner.com/cloud.

# IMPORTANT: Before you create your cluster, you can do anything you want with the nodepools, but you need at least one of each control plane and agent.
# Once the cluster is up and running, you can change nodepool count and even set it to 0 (in the case of the first control-plane nodepool, the minimum is 1),
# you can also rename it (if the count is 0), but do not remove a nodepool from the list.

# The only nodepools that are safe to remove from the list when you edit it are at the end of the lists. That is due to how subnets and IPs get allocated (FILO).
# You can, however, freely add other nodepools at the end of each list if you want! The maximum number of nodepools you can create combined for both lists is 255.
# Also, before decreasing the count of any nodepools to 0, it's essential to drain and cordon the nodes in question. Otherwise, it will leave your cluster in a bad state.

# Before initializing the cluster, you can change all parameters and add or remove any nodepools. You need at least one nodepool of each kind, control plane, and agent.
# The nodepool names are entirely arbitrary, you can choose whatever you want, but no special characters or underscore, and they must be unique; only alphanumeric characters and dashes are allowed.

# If you want to have a single node cluster, have one control plane nodepool with a count of 1, and one agent nodepool with a count of 0.

# * Example below:

control_plane_nodepools = [
  {
    name        = "control-fsn1",
    server_type = "cpx21",
    location    = "fsn1",
    labels      = [],
    taints      = [],
    count       = 1
  },
  {
    name        = "control-nbg1",
    server_type = "cpx21",
    location    = "nbg1",
    labels      = [],
    taints      = [],
    count       = 1
  },
  {
    name        = "control-hel1",
    server_type = "cpx21",
    location    = "hel1",
    labels      = [],
    taints      = [],
    count       = 1
  }
]

agent_nodepools = [
  {
    name        = "node-fsn1-cpx11",
    server_type = "cpx11",
    location    = "fsn1",
    labels = [
      "node.kubernetes.io/server-usage=storage"
    ],
    taints = [
      "server-usage=storage:NoSchedule"
    ],
    count       = 1
  },
  {
    name        = "node-nbg1-cpx11",
    server_type = "cpx11",
    location    = "nbg1",
    labels = [
      "node.kubernetes.io/server-usage=storage"
    ],
    taints = [
      "server-usage=storage:NoSchedule"
    ],
    count       = 1
  },
  {
    name        = "node-hel1-cpx11",
    server_type = "cpx11",
    location    = "hel1",
    labels = [
      "node.kubernetes.io/server-usage=storage"
    ],
    taints = [
      "server-usage=storage:NoSchedule"
    ],
    count       = 1
  },
  {
    name        = "node-fsn1-cpx21",
    server_type = "cpx21",
    location    = "fsn1",
    labels = [
      "node.kubernetes.io/server-usage=storage"
    ],
    taints = [
      "server-usage=storage:NoSchedule"
    ],
    count       = 1
  },
  {
    name        = "node-nbg1-cpx21",
    server_type = "cpx21",
    location    = "nbg1",
    labels = [
      "node.kubernetes.io/server-usage=storage"
    ],
    taints = [
      "server-usage=storage:NoSchedule"
    ],
    count       = 1
  },
  {
    name        = "node-hel1-cpx21",
    server_type = "cpx21",
    location    = "hel1",
    labels = [
      "node.kubernetes.io/server-usage=storage"
    ],
    taints = [
      "server-usage=storage:NoSchedule"
    ],
    count       = 1
  },
  {
    name        = "node-fsn1-cpx31",
    server_type = "cpx31",
    location    = "fsn1",
    labels = [
      "node.kubernetes.io/server-usage=storage"
    ],
    taints = [
      "server-usage=storage:NoSchedule"
    ],
    count       = 0
  },
  {
    name        = "node-nbg1-cpx31",
    server_type = "cpx31",
    location    = "nbg1",
    labels = [
      "node.kubernetes.io/server-usage=storage"
    ],
    taints = [
      "server-usage=storage:NoSchedule"
    ],
    count       = 0
  },
  {
    name        = "node-hel1-cpx31",
    server_type = "cpx31",
    location    = "hel1",
    labels = [
      "node.kubernetes.io/server-usage=storage"
    ],
    taints = [
      "server-usage=storage:NoSchedule"
    ],
    count       = 0
  },
  {
    name        = "node-fsn1-cpx41",
    server_type = "cpx41",
    location    = "fsn1",
    labels = [
      "node.kubernetes.io/server-usage=storage"
    ],
    taints = [
      "server-usage=storage:NoSchedule"
    ],
    count       = 0
  },
  {
    name        = "node-nbg1-cpx41",
    server_type = "cpx41",
    location    = "nbg1",
    labels = [
      "node.kubernetes.io/server-usage=storage"
    ],
    taints = [
      "server-usage=storage:NoSchedule"
    ],
    count       = 0
  },
  {
    name        = "node-hel1-cpx41",
    server_type = "cpx41",
    location    = "hel1",
    labels = [
      "node.kubernetes.io/server-usage=storage"
    ],
    taints = [
      "server-usage=storage:NoSchedule"
    ],
    count       = 0
  },
  {
    name        = "node-fsn1-cpx51",
    server_type = "cpx51",
    location    = "fsn1",
    labels = [
      "node.kubernetes.io/server-usage=storage"
    ],
    taints = [
      "server-usage=storage:NoSchedule"
    ],
    count       = 0
  },
  {
    name        = "node-nbg1-cpx51",
    server_type = "cpx51",
    location    = "nbg1",
    labels = [
      "node.kubernetes.io/server-usage=storage"
    ],
    taints = [
      "server-usage=storage:NoSchedule"
    ],
    count       = 0
  },
  {
    name        = "node-hel1-cpx51",
    server_type = "cpx51",
    location    = "hel1",
    labels = [
      "node.kubernetes.io/server-usage=storage"
    ],
    taints = [
      "server-usage=storage:NoSchedule"
    ],
    count       = 0
  }
]

# * LB location and type, the latter will depend on how much load you want it to handle, see https://www.hetzner.com/cloud/load-balancer
load_balancer_type     = "lb21"
load_balancer_location = "nbg1"

### The following values are entirely optional

# To use local storage on the nodes, you can enable Longhorn, default is "false".
# enable_longhorn = true

# To disable Hetzner CSI storage, you can set the following to true, default is "false".
# disable_hetzner_csi = true

# If you want to use a specific Hetzner CCM and CSI version, set them below; otherwise, leave them as-is for the latest versions.
# hetzner_ccm_version = ""
# hetzner_csi_version = ""

# If you want to specify the Kured version, set it below - otherwise it'll use the latest version available
# kured_version = ""

# We give you the possibility to use letsencrypt directly with Traefik because it's an easy setup, however it's not optimal,
# as the free version of Traefik causes a little bit of downtime when the certificates get renewed. For proper SSL management,
# we instead recommend you to use cert-manager, that you can easily deploy with helm; see https://cert-manager.io/.
# traefik_acme_tls = true
# traefik_acme_email = "admin@mail.com"

# If you want to disable the Traefik ingress controller, you can set this to "false". Default is "true".
# traefik_enabled = false

# If you want to disable the metric server, you can! Default is "true".
# metrics_server_enabled = false

# If you want to allow non-control-plane workloads to run on the control-plane nodes, set "true" below. The default is "false".
# True by default for single node clusters.
# allow_scheduling_on_control_plane = true

# If you want to disable the automatic upgrade of k3s, you can set this to false. The default is "true".
# automatically_upgrade_k3s = false

# Allows you to specify either stable, latest, testing or supported minor versions (defaults to stable)
# see https://rancher.com/docs/k3s/latest/en/upgrades/basic/ and https://update.k3s.io/v1-release/channels
# initial_k3s_channel = "latest"

# The cluster name, by default "k3s"
cluster_name = "kreios"

# Whether to use the cluster name in the node name, in the form of {cluster_name}-{nodepool_name}, the default is "true".
# use_cluster_name_in_node_name = false

# Adding extra firewall rules, like opening a port
# More info on the format here https://registry.terraform.io/providers/hetznercloud/hcloud/latest/docs/resources/firewall
# extra_firewall_rules = [
#   # For Postgres
#   {
#     direction       = "in"
#     protocol        = "tcp"
#     port            = "5432"
#     source_ips      = ["0.0.0.0/0", "::/0"]
#     destination_ips = [] # Won't be used for this rule 
#   },
#   # To Allow ArgoCD access to resources via SSH
#   {
#     direction       = "out"
#     protocol        = "tcp"
#     port            = "22"
#     source_ips      = [] # Won't be used for this rule 
#     destination_ips = ["0.0.0.0/0", "::/0"]
#   }
# ]

# If you want to configure additional Arguments for traefik, enter them here as a list and in the form of traefik CLI arguments; see https://doc.traefik.io/traefik/reference/static-configuration/cli/
# Example: traefik_additional_options = ["--log.level=DEBUG", "--tracing=true"]
# traefik_additional_options = []

# Use the klipper LB, instead of the default Hetzner one, that has an advantage of dropping the cost of the setup,
# but you would need to point your DNS to every schedulable IPs in your cluster (usually agents). The default is "false".
# Automatically "true" in the case of single node cluster.
# use_klipper_lb = "true"

# If you want to configure a different CNI for k3s, use this flag
# possible values: flannel (Default), calico
# Cilium or other would be easy to add, you can mirror how Calico was added. PRs are welcome!
# CAVEATS: Calico is not supported when not using the Hetzner LB (like when use_klipper_lb is set to true or when using a single node cluster),
# because of the following issue https://github.com/k3s-io/klipper-lb/issues/6.
# cni_plugin = "calico"

# If you want to disable the k3s default network policy controller, use this flag!
# Calico overrides this value to true automatically, the default is "false".
# disable_network_policy = true

# If you want to disable the automatic use of placement group "spread". See https://docs.hetzner.com/cloud/placement-groups/overview/
# That may be useful if you need to deploy more than 500 nodes! The default is "false".
# placement_group_disable = true

# You can enable cert-manager (installed by Helm behind the scenes) with the following flag, the default is "false".
enable_cert_manager = true

# You can enable Rancher (installed by Helm behind the scenes) with the following flag, the default is "false".
# When Rancher is enabled, it automatically installs cert-manager too, and it uses rancher's own certificates.
# As for the number of replicas, it is set to the number of control plane nodes.
# IMPORTANT: Rancher's install is quite memory intensive, you will require at least 4GB of RAM, meaning cx21 server type (for your control plane).
# You can customize all of the above by creating and applying a HelmChartConfig to pass the helm chart values of your choice.
# See https://rancher.com/docs/k3s/latest/en/helm/ 
# and https://rancher.com/docs/rancher/v2.6/en/installation/install-rancher-on-k8s/chart-options/
enable_rancher = true

# When Rancher is deployed, by default it uses the "stable" channel. But this can be customized.
# The allowed values are "stable", "latest", and "alpha".
# rancher_install_channel = "latest"

# Set your Rancher hostname, the default is "rancher.example.com".
# It is a required value when using rancher, but up to you to point the DNS to it or not. 
# You can also not point the DNS, and just port-forward locally via kubectl to get access to the dashboard.
rancher_hostname = "rancher.domain.com"

# Separate from the above Rancher config (only use one or the other). You can import this cluster directly on
# an already active Rancher install by clicking "import cluster", choosing "generic", giving it a name and pasting
# the cluster registration url below. However, you can also ignore that and apply the url via kubectl as instructed
# by Rancher in the wizard, and that would register your cluster too.
# More information about the registration can be found here https://rancher.com/docs/rancher/v2.6/en/cluster-provisioning/registered-clusters/
# rancher_registration_manifest_url = "https://rancher.domain.com/v3/import/xxxxxxxxxxxxxxxxxxYYYYYYYYYYYYYYYYYYYzzzzzzzzzzzzzzzzzzzzz.yaml"

And also try with another location, like fsn1

I had already tried that.

last but not least, make sure to use the latest version

I am already using the latest version: v1.1.8

EDIT: I disabled Rancher; no change, same error.

mysticaltech commented 2 years ago

The way I see it, your agent nodepools are all tainted with NoSchedule; that could be the reason!
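
If the NoSchedule taints are indeed the cause, one way to test it is to give the cluster at least one agent nodepool without taints before redeploying. A minimal sketch for terraform.tfvars, reusing the nodepool format from the config above; the pool name "agent-fsn1-cpx21" is a made-up example, not something the module requires:

```hcl
# terraform.tfvars (sketch, not a complete file)
agent_nodepools = [
  {
    # Hypothetical general-purpose pool: no taints, so ordinary
    # workloads have at least one agent they can be scheduled on.
    name        = "agent-fsn1-cpx21",
    server_type = "cpx21",
    location    = "fsn1",
    labels      = [],
    taints      = [],
    count       = 1
  },
  {
    # One of the original storage pools, unchanged: only pods that
    # tolerate the "server-usage=storage" taint can run here.
    name        = "node-fsn1-cpx11",
    server_type = "cpx11",
    location    = "fsn1",
    labels = [
      "node.kubernetes.io/server-usage=storage"
    ],
    taints = [
      "server-usage=storage:NoSchedule"
    ],
    count       = 1
  }
]

# Possible alternative, taken from the example tfvars (default "false"):
# allow_scheduling_on_control_plane = true
```

Since the example tfvars warn against reordering or removing nodepools on a running cluster, a change like this is best made before the cluster is (re)created.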

mysticaltech commented 2 years ago

Try my variables, then rebuild step by step towards yours!

hcloud_token = "xxxx"
public_key   = "/home/karim/.ssh/id_ed25519.pub"
private_key  = "/home/karim/.ssh/id_ed25519"

network_region = "eu-central"

control_plane_nodepools = [
  {
    name        = "control-plane",
    server_type = "cx21",
    location    = "fsn1",
    labels      = [],
    taints      = [],
    count       = 3
  }
]

agent_nodepools = [
  {
    name        = "agent",
    server_type = "cx21",
    location    = "fsn1",
    labels      = [],
    taints      = [],
    count       = 2
  }
]

load_balancer_type     = "lb11"
load_balancer_location = "fsn1"

cluster_name = "test12"

extra_firewall_rules = [
  # For Postgres
  {
    direction       = "in"
    protocol        = "tcp"
    port            = "5432"
    source_ips      = ["0.0.0.0/0", "::/0"]
    destination_ips = [] # Won't be used for this rule 
  },
  # To Allow ArgoCD access to resources via SSH
  {
    direction       = "out"
    protocol        = "tcp"
    port            = "22"
    source_ips      = [] # Won't be used for this rule 
    destination_ips = ["0.0.0.0/0", "::/0"]
  }
]

enable_rancher          = false
rancher_install_channel = "latest"
rancher_hostname        = "rancher.domain.dev"

cruex-de commented 2 years ago

Thanks, now it's working 🥳