kube-hetzner / terraform-hcloud-kube-hetzner

Optimized and Maintenance-free Kubernetes on Hetzner Cloud in one command!
MIT License

[Bug]: Creation on new cluster stuck on configuring agent node #1289

Closed. mateuszlewko closed this issue 8 months ago.

mateuszlewko commented 8 months ago

Description

I'm launching a new cluster on the latest version of this module. I'm using ed25519 SSH keys without a passphrase. Creation of a completely new cluster seems to be stuck (for > 40 min) on the configuration of a single agent node (the server is present in the Hetzner UI).

This is the last excerpt from the logs:

module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"] (remote-exec): + /sbin/semodule -v -i /usr/share/selinux/packages/k3s.pp
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"] (remote-exec): Attempting to install module '/usr/share/selinux/packages/k3s.pp':
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"] (remote-exec): Ok: return value of 0.
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"] (remote-exec): Committing changes:
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"]: Still creating... [20s elapsed]
module.kube-hetzner.null_resource.kustomization: Still creating... [20s elapsed]
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"] (remote-exec): Ok: transaction number 9.
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"] (remote-exec): + restorecon -v /usr/local/bin/k3s
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"]: Provisioning with 'remote-exec'...
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"] (remote-exec): Connecting to remote host via SSH...
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"] (remote-exec):   Host: <redacted>
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"] (remote-exec):   User: root
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"] (remote-exec):   Password: false
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"] (remote-exec):   Private key: true
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"] (remote-exec):   Certificate: false
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"] (remote-exec):   SSH Agent: true
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"] (remote-exec):   Checking Host Key: false
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"] (remote-exec):   Target Platform: unix
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"] (remote-exec): Connected!
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"]: Still creating... [30s elapsed]
module.kube-hetzner.null_resource.kustomization: Still creating... [30s elapsed]
module.kube-hetzner.null_resource.kustomization: Still creating... [40s elapsed]
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"]: Still creating... [40s elapsed]
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"]: Still creating... [50s elapsed]
module.kube-hetzner.null_resource.kustomization: Still creating... [50s elapsed]
module.kube-hetzner.null_resource.kustomization (remote-exec): E0319 16:46:54.121958    9462 reflector.go:147] k8s.io/client-go@v1.29.2-k3s1/tools/cache/reflector.go:229: Failed to watch *unstructured.Unstructured: the server is currently unable to handle the request
....
module.kube-hetzner.null_resource.kustomization: Still creating... [3m20s elapsed]
module.kube-hetzner.null_resource.kustomization: Still creating... [3m30s elapsed]
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"]: Still creating... [3m30s elapsed]
module.kube-hetzner.null_resource.kustomization (remote-exec): W0319 16:50:59.933389    9462 reflector.go:539] k8s.io/client-go@v1.29.2-k3s1/tools/cache/reflector.go:229: failed to list *unstructured.Unstructured: Get "https://127.0.0.1:6443/apis/apps/v1/namespaces/system-upgrade/deployments?fieldSelector=metadata.name%3Dsystem-upgrade-controller&resourceVersion=5821": dial tcp 127.0.0.1:6443: connect: connection refused
module.kube-hetzner.null_resource.kustomization (remote-exec): E0319 16:50:59.933790    9462 reflector.go:147] k8s.io/client-go@v1.29.2-k3s1/tools/cache/reflector.go:229: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: Get "https://127.0.0.1:6443/apis/apps/v1/namespaces/system-upgrade/deployments?fieldSelector=metadata.name%3Dsystem-upgrade-controller&resourceVersion=5821": dial tcp 127.0.0.1:6443: connect: connection refused
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"]: Still creating... [5m10s elapsed]
module.kube-hetzner.null_resource.kustomization: Still creating... [5m10s elapsed]
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"]: Still creating... [5m20s elapsed]
module.kube-hetzner.null_resource.kustomization: Still creating... [5m20s elapsed]
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"]: Still creating... [5m30s elapsed]
module.kube-hetzner.null_resource.kustomization: Still creating... [5m30s elapsed]
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"]: Still creating... [5m40s elapsed]
module.kube-hetzner.null_resource.kustomization: Still creating... [5m40s elapsed]
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"]: Still creating... [5m50s elapsed]
module.kube-hetzner.null_resource.kustomization: Still creating... [5m50s elapsed]
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"]: Still creating... [6m0s elapsed]
module.kube-hetzner.null_resource.kustomization: Still creating... [6m0s elapsed]
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"]: Still creating... [6m10s elapsed]
module.kube-hetzner.null_resource.kustomization: Still creating... [6m10s elapsed]
module.kube-hetzner.null_resource.kustomization (remote-exec): error: timed out waiting for the condition on deployments/system-upgrade-controller
module.kube-hetzner.null_resource.agents["0-0-agent-cax21-fsn1"]: Still creating... [6m20s elapsed]
...
... still creating for 40 min

Kube.tf file

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = data.doppler_secrets.this.map.HETZNER_PROJECT_API_TOKEN

  source  = "kube-hetzner/kube-hetzner/hcloud"
  version = "2.13.4"

  # For details on SSH see https://github.com/kube-hetzner/kube-hetzner/blob/master/docs/ssh.md
  ssh_public_key  = file(...)
  ssh_private_key = file(...)

  # For Hetzner locations see https://docs.hetzner.com/general/others/data-centers-and-connection/
  network_region = "eu-central" # change to `us-east` if location is ash

  control_plane_nodepools = [
    {
      name        = "control-plane-fsn1",
      server_type = "cax11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-nbg1",
      server_type = "cax11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-hel1",
      server_type = "cax11",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name            = "agent-cax21-fsn1",
      server_type     = "cax21",
      location        = "fsn1",
      labels          = [],
      taints          = [],
      count           = 1
      placement_group = "default"
    },
  ]

  # Add custom control plane configuration options here.
  # E.g. to enable monitoring for etcd, kube-proxy, etc.:
  # control_planes_custom_config = {
  #  etcd-expose-metrics = true,
  #  kube-controller-manager-arg = "bind-address=0.0.0.0",
  #  kube-proxy-arg ="metrics-bind-address=0.0.0.0",
  #  kube-scheduler-arg = "bind-address=0.0.0.0",
  # }

  enable_wireguard = true

  # https://www.hetzner.com/cloud/load-balancer
  load_balancer_type     = "lb11"
  load_balancer_location = "fsn1"

  # To use local storage on the nodes, you can enable Longhorn, default is "false".
  # See a full recap on how to configure agent nodepools for longhorn here https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/discussions/373#discussioncomment-3983159
  # Also see Longhorn best practices here https://gist.github.com/ifeulner/d311b2868f6c00e649f33a72166c2e5b
  enable_longhorn = true

  # How many replica volumes Longhorn should create (default is 3).
  # longhorn_replica_count = 1

  # When you enable Longhorn, you can go with the default settings and just modify the two variables above, OR you can add a longhorn_values variable
  # with all needed helm values; see towards the end of this file in the advanced section.
  # If that variable is present, the system will use it during the deploy; if not, it will use the default values together with the two customizable variables above.
  # After the cluster is deployed, you can always use HelmChartConfig definition to tweak the configuration.

  # Also, you can choose to use a Hetzner volume with Longhorn. By default, it will use the nodes' own storage space, but if you add an attribute of
  # longhorn_volume_size (⚠️ not a variable, just a possible agent nodepool attribute) with a value between 10 and 10000 GB to your agent nodepool definition, it will create and use the volume in question.
  # See the agent nodepool section for an example of how to do that.

  # traefik_additional_options = ["--log.level=DEBUG", "--tracing=true"]

  # By default the traefik image tag is an empty string, which uses the latest image tag.
  # The default is "".
  # traefik_image_tag = "v3.0.0-beta5"

  # By default traefik is configured to redirect http traffic to https; you can set this to "false" to disable the redirection.
  # The default is true.
  # traefik_redirect_to_https = false

  # Enable or disable Horizontal Pod Autoscaler for traefik.
  # The default is true.
  # traefik_autoscaling = false

  # Enable or disable pod disruption budget for traefik. Values are maxUnavailable: 33% and minAvailable: 1.
  # The default is true.
  # traefik_pod_disruption_budget = false

  # Enable or disable default resource requests and limits for traefik. Values requested are 100m & 50Mi and limits 300m & 150Mi.
  # The default is true.
  # traefik_resource_limits = false

  # If you want to configure additional trusted IPs for traefik, enter them here as a list of IPs (strings).
  # Example for Cloudflare:
  traefik_additional_trusted_ips = [
    "173.245.48.0/20",
    "103.21.244.0/22",
    "103.22.200.0/22",
    "103.31.4.0/22",
    "141.101.64.0/18",
    "108.162.192.0/18",
    "190.93.240.0/20",
    "188.114.96.0/20",
    "197.234.240.0/22",
    "198.41.128.0/17",
    "162.158.0.0/15",
    "104.16.0.0/13",
    "104.24.0.0/14",
    "172.64.0.0/13",
    "131.0.72.0/22",
    "2400:cb00::/32",
    "2606:4700::/32",
    "2803:f800::/32",
    "2405:b500::/32",
    "2405:8100::/32",
    "2a06:98c0::/29",
    "2c0f:f248::/32"
  ]

  # If you want to enable the k3s built-in local-storage controller set this to "true". Default is "false".
  # enable_local_storage = false

  # For all options see: https://kured.dev/docs/configuration/
  kured_options = {
    "reboot-days" : "sa",
    "start-time" : "8am",
    "end-time" : "2pm",
    "time-zone" : "Local",
    "lock-release-delay" : "1h",
    "drain-grace-period" : 180,
  }

  # Additional environment variables for the host OS on which k3s runs. See for example https://docs.k3s.io/advanced#configuring-an-http-proxy .
  # additional_k3s_environment = {
  #   "CONTAINERD_HTTP_PROXY" : "http://your.proxy:port",
  #   "CONTAINERD_HTTPS_PROXY" : "http://your.proxy:port",
  #   "NO_PROXY" : "127.0.0.0/8,10.0.0.0/8,",
  # }

  # Additional commands to execute on the host OS before the k3s install, for example fetching and installing certs.
  # preinstall_exec = [
  #   "curl https://somewhere.over.the.rainbow/ca.crt > /root/ca.crt",
  #   "trust anchor --store /root/ca.crt",
  # ]

  # Additional flags to pass to the k3s server command (the control plane).
  # k3s_exec_server_args = "--kube-apiserver-arg enable-admission-plugins=PodTolerationRestriction,PodNodeSelector"

  # Additional flags to pass to the k3s agent command (all agent nodes, including autoscaler nodepools).
  # k3s_exec_agent_args = "--kubelet-arg kube-reserved=cpu=100m,memory=200Mi,ephemeral-storage=1Gi"

  # The vars below are passed to the k3s config.yaml, so they persist across reboots.
  # k3s_global_kubelet_args = ["kube-reserved=cpu=100m,ephemeral-storage=1Gi", "system-reserved=cpu=memory=200Mi", "image-gc-high-threshold=50", "image-gc-low-threshold=40"]
  # k3s_control_plane_kubelet_args = []
  # k3s_agent_kubelet_args = []
  # k3s_autoscaler_kubelet_args = []

  # If you want to allow all outbound traffic you can set this to "false". Default is "true".
  # restrict_outbound_traffic = false

  # Allow access to the Kube API from the specified networks. The default is ["0.0.0.0/0", "::/0"].
  # Allowed values: null (disable Kube API rule entirely) or a list of allowed networks with CIDR notation.
  # For maximum security, it's best to disable it completely by setting it to null. However, in that case, to get access to the kube api,
  # you would have to connect to any control plane node via SSH, as you can run kubectl from within these.
  # Please be advised that this setting has no effect on the load balancer when the use_control_plane_lb variable is set to true. This is
  # because firewall rules cannot be applied to load balancers yet. 
  # firewall_kube_api_source = null

  # Allow SSH access from the specified networks. Default: ["0.0.0.0/0", "::/0"]
  # Allowed values: null (disable SSH rule entirely) or a list of allowed networks with CIDR notation.
  # Ideally you would set your IP there. And if it changes after cluster deploy, you can always update this variable and apply again.
  # firewall_ssh_source = ["1.2.3.4/32"]

  # Adding extra firewall rules, like opening a port
  # More info on the format here https://registry.terraform.io/providers/hetznercloud/hcloud/latest/docs/resources/firewall
  # extra_firewall_rules = [
  #   {
  #     description = "For Postgres"
  #     direction       = "in"
  #     protocol        = "tcp"
  #     port            = "5432"
  #     source_ips      = ["0.0.0.0/0", "::/0"]
  #     destination_ips = [] # Won't be used for this rule
  #   },
  #   {
  #     description = "To Allow ArgoCD access to resources via SSH"
  #     direction       = "out"
  #     protocol        = "tcp"
  #     port            = "22"
  #     source_ips      = [] # Won't be used for this rule
  #     destination_ips = ["0.0.0.0/0", "::/0"]
  #   }
  # ]

  # If you want to configure a different CNI for k3s, use this flag
  # possible values: flannel (Default), calico, and cilium
  # As for Cilium, we allow infinite configurations via helm values, please check the CNI section of the readme over at https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/#cni.
  # Also, see the cilium_values at towards the end of this file, in the advanced section.
  # ⚠️ Depending on your setup, sometimes you need your control-planes to have more than
  # 2GB of RAM if you are going to use Cilium, otherwise the pods will not start.
  # cni_plugin = "cilium"

  # You can choose the version of Cilium that you want. By default we keep the version up to date and configure Cilium with compatible settings according to the version.
  # cilium_version = "v1.14.0"

  # Set native-routing mode ("native") or tunneling mode ("tunnel"). Default: tunnel
  # cilium_routing_mode = "native"

  # Used when Cilium is configured in native routing mode. The CNI assumes that the underlying network stack will forward packets to this destination without the need to apply SNAT. Default: value of "cluster_ipv4_cidr"
  # cilium_ipv4_native_routing_cidr = "10.0.0.0/8"

  # Enables egress gateway to redirect and SNAT the traffic that leaves the cluster. Default: false
  # cilium_egress_gateway_enabled = true

  # You can choose the version of Calico that you want. By default, the latest is used.
  # More info on available versions can be found at https://github.com/projectcalico/calico/releases
  # Please note that if you are getting 403s from Github, it's also useful to set the version manually. However there is rarely a need for that!
  # calico_version = "v3.27.2"

  # By default, we allow ICMP ping in to the nodes, for instance to check for liveness. If you do not want to allow that, just set this flag to true (false by default).
  # block_icmp_ping_in = true

  enable_cert_manager = true

  dns_servers = [
    "1.1.1.1",
    "8.8.8.8",
    "2606:4700:4700::1111",
  ]

  # When this is enabled, all external traffic will be routed via a control-plane load balancer rather than via the first node, allowing for high availability.
  # The default is false.
  use_control_plane_lb = true

  # When the above use_control_plane_lb is enabled, you can change the lb type for it, the default is "lb11".
  control_plane_lb_type = "lb21"

  # When the above use_control_plane_lb is enabled, you can disable the public interface for the control plane load balancer, the default is true.
  # control_plane_lb_enable_public_interface = false

  # Let's say you are not using the control plane LB solution above, and still want to have one hostname point to all your control-plane nodes.
  # You could create multiple A records for, let's say, cp.cluster.my.org, pointing to all of your control-plane node IPs.
  # In that case, you need to define that hostname in the k3s TLS-SANs config to allow connections through it. The entries can be hostnames or IP addresses.
  # additional_tls_sans = ["cp.cluster.my.org"]

  # lb_hostname Configuration:
  #
  # Purpose:
  # The lb_hostname setting optimizes communication between services within the Kubernetes cluster
  # when they use domain names instead of direct service names. By associating a domain name directly
  # with the Hetzner Load Balancer, this setting can help reduce potential communication delays.
  #
  # Scenario:
  # If Service B communicates with Service A using a domain (e.g., `a.mycluster.domain.com`) that points
  # to an external Load Balancer, there can be a slowdown in communication.
  #
  # Guidance:
  # - If your internal services use domain names pointing to an external LB, set lb_hostname to a domain
  #   like `mycluster.domain.com`.
  # - Create an A record pointing `mycluster.domain.com` to your LB's IP.
  # - Create a CNAME record for `a.mycluster.domain.com` (or xyz.com) pointing to `mycluster.domain.com`.
  #
  # Technical Note:
  # This setting sets the `load-balancer.hetzner.cloud/hostname` in the Hetzner LB definition, suitable for
  # both Nginx and Traefik ingress controllers.
  #
  # Recommendation:
  # This setting is optional. If services communicate using direct service names, you can leave this unset.
  # For inter-namespace communication, use `service_name.namespace` as per Kubernetes norms.
  #
  # Example:
  # lb_hostname = "mycluster.domain.com"

  # You can enable Rancher (installed by Helm behind the scenes) with the following flag, the default is "false".
  # ⚠️ Rancher currently only supports Kubernetes v1.25 and earlier, you will need to set initial_k3s_channel to a supported version: https://github.com/rancher/rancher/issues/41113
  # When Rancher is enabled, it automatically installs cert-manager too, and it uses rancher's own self-signed certificates.
  # See for options https://rancher.com/docs/rancher/v2.0-v2.4/en/installation/resources/advanced/helm2/helm-rancher/#choose-your-ssl-configuration
  # The easiest thing is to leave everything as is (using the default rancher self-signed certificate) and put Cloudflare in front of it.
  # As for the number of replicas, by default it is set to the number of control plane nodes.
  # You can customize all of the above by adding a rancher_values variable; see the end of this file in the advanced section.
  # After the cluster is deployed, you can always use HelmChartConfig definition to tweak the configuration.
  # IMPORTANT: Rancher's install is quite memory intensive, you will require at least 4GB of RAM, meaning a cx21 server type (for your control plane).
  # ALSO, in order for Rancher to successfully deploy, you have to set the "rancher_hostname".
  # enable_rancher = true

  # If using Rancher you can set the Rancher hostname; it must be a unique hostname even if you do not use it.
  # If you are not pointing DNS at it, you can just port-forward locally via kubectl to get access to the dashboard.
  # If you already set the lb_hostname above and are using a Hetzner LB, you do not need to set this one, as it will be used by default.
  # But if you set this one explicitly, it will have preference over the lb_hostname in rancher settings.
  # rancher_hostname = "rancher.xyz.dev"

  # When Rancher is deployed, by default it uses the "latest" channel. But this can be customized.
  # The allowed values are "stable" or "latest".
  # rancher_install_channel = "stable"

  # Finally, you can specify a bootstrap-password for your rancher instance. Minimum 48 characters long!
  # If you leave empty, one will be generated for you.
  # (Can be used by another rancher2 provider to continue setup of rancher outside this module.)
  # rancher_bootstrap_password = ""

  # Separate from the above Rancher config (only use one or the other). You can import this cluster directly into an
  # already active Rancher install by clicking "Import cluster", choosing "Generic", giving it a name, and pasting
  # the cluster registration URL below. However, you can also ignore that and apply the URL via kubectl as instructed
  # by Rancher in the wizard, and that would register your cluster too.
  # More information about the registration can be found here https://rancher.com/docs/rancher/v2.6/en/cluster-provisioning/registered-clusters/
  # rancher_registration_manifest_url = "https://rancher.xyz.dev/v3/import/xxxxxxxxxxxxxxxxxxYYYYYYYYYYYYYYYYYYYzzzzzzzzzzzzzzzzzzzzz.yaml"

  # Extra commands to be executed after the `kubectl apply -k` (useful for post-install actions, e.g. wait for CRD, apply additional manifests, etc.).
  # extra_kustomize_deployment_commands=""

  # Extra values that will be passed to the `extra-manifests/kustomization.yaml.tpl` if it is present.
  # extra_kustomize_parameters={}

  # See a working example for just a manifest.yaml, a HelmChart and a HelmChartConfig in examples/kustomization_user_deploy/README.md

  # It is best practice to turn this off, but for backwards compatibility it is set to "true" by default.
  # See https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/issues/349
  # When "false". The kubeconfig file can instead be created by executing: "terraform output --raw kubeconfig > cluster_kubeconfig.yaml"
  # Always be careful to not commit this file!
  create_kubeconfig = true

  # Don't create the kustomize backup. This can be helpful for automation.
  # create_kustomization = false

  # Export the values.yaml files used for the deployment of traefik, longhorn, cert-manager, etc.
  # This can be helpful to use them for later deployments like with ArgoCD.
  # The default is false.
  # export_values = true

  # MicroOS snapshot IDs to be used. Empty by default, in which case the most recent image created using createkh will be used.
  # We recommend the default, but if you want to use specific IDs you can.
  # You can fetch the ids with the hcloud cli by running the "hcloud image list --selector 'microos-snapshot=yes'" command.
  # microos_x86_snapshot_id = "1234567"
  # microos_arm_snapshot_id = "1234567"

  ### ADVANCED - Custom helm values for packages above (search for _values to locate where those are mentioned earlier in this file)
  # ⚠️ The _values variables below are examples; it is up to you to find the best helm values possible, we do not provide support for customized helm values.
  # Please understand that the indentation inside the EOTs is very important, as those are proper yaml helm values.
  # We advise you to use the default values, and only change them if you know what you are doing!

  # Cilium, all Cilium helm values can be found at https://github.com/cilium/cilium/blob/master/install/kubernetes/cilium/values.yaml
  # Be careful when maintaining your own cilium_values, as the choice of available settings depends on the Cilium version used. See also the cilium_version setting to fix a specific version.
  # The following is an example, please note that the current indentation inside the EOT is important.
  /*   cilium_values = <<EOT
ipam:
  mode: kubernetes
k8s:
  requireIPv4PodCIDR: true
kubeProxyReplacement: true
routingMode: native
ipv4NativeRoutingCIDR: "10.0.0.0/8"
endpointRoutes:
  enabled: true
loadBalancer:
  acceleration: native
bpf:
  masquerade: true
encryption:
  enabled: true
  type: wireguard
MTU: 1450
  EOT */

  # Cert manager, all cert-manager helm values can be found at https://github.com/cert-manager/cert-manager/blob/master/deploy/charts/cert-manager/values.yaml
  # The following is an example, please note that the current indentation inside the EOT is important.
  /*   cert_manager_values = <<EOT
installCRDs: true
replicaCount: 3
webhook:
  replicaCount: 3
cainjector:
  replicaCount: 3
  EOT */

  # csi-driver-smb, all csi-driver-smb helm values can be found at https://github.com/kubernetes-csi/csi-driver-smb/blob/master/charts/latest/csi-driver-smb/values.yaml
  # The following is an example, please note that the current indentation inside the EOT is important.
  /*   csi_driver_smb_values = <<EOT
controller:
  name: csi-smb-controller
  replicas: 1
  runOnMaster: false
  runOnControlPlane: false
  resources:
    csiProvisioner:
      limits:
        memory: 300Mi
      requests:
        cpu: 10m
        memory: 20Mi
    livenessProbe:
      limits:
        memory: 100Mi
      requests:
        cpu: 10m
        memory: 20Mi
    smb:
      limits:
        memory: 200Mi
      requests:
        cpu: 10m
        memory: 20Mi
  EOT */

}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig
  sensitive = true
}

Screenshots

No response

Platform

Mac

mateuszlewko commented 8 months ago

Now, when running it again, I see the following logs from systemctl status k3s-agent:


el=info msg="Waiting to retrieve agent configuration; server is not ready: CA cert validation failed: Get \"https://127.0.0.1:6444/cacerts\": EOF"
el=info msg="Waiting to retrieve agent configuration; server is not ready: failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": EOF"
el=info msg="Waiting to retrieve agent configuration; server is not ready: failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": EOF"
el=info msg="Waiting to retrieve agent configuration; server is not ready: failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": EOF"
el=info msg="Waiting to retrieve agent configuration; server is not ready: failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": EOF"
el=info msg="Waiting to retrieve agent configuration; server is not ready: failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": EOF"
el=info msg="Waiting to retrieve agent configuration; server is not ready: failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": EOF"
el=info msg="Waiting to retrieve agent configuration; server is not ready: https://127.0.0.1:6444/v1-k3s/serving-kubelet.crt: 503 Service Unavailable"
el=info msg="Waiting to retrieve agent configuration; server is not ready: https://127.0.0.1:6444/v1-k3s/serving-kubelet.crt: 503 Service Unavailable"
el=info msg="Waiting to retrieve agent configuration; server is not ready: https://127.0.0.1:6444/v1-k3s/serving-kubelet.crt: 503 Service Unavailable"
dperetti commented 8 months ago

This terraform thing is too flaky, I'm afraid. As far as I'm concerned, it has worked for a few days. Now I cannot create nodepools anymore. It's stuck in the creating state even though the servers are up and running in the Hetzner console. On the servers I get: "Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"

andi0b commented 8 months ago

Might be connected to the (unresolved) discussion I started recently: #1287

I'm having weird behaviour after updating the nodes with a recent MicroOS update: weird network connectivity issues that I couldn't figure out yet (I just rolled back and disabled updates for now).

Edit: I also saw some "503 Service Unavailable" and "connection refused" in my logs. I know those are very generic errors, but still.

mysticaltech commented 8 months ago

@kube-hetzner/core Any ideas?

@mateuszlewko Try with cni_plugin = "cilium"; I would guess it works better with WireGuard.
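
In your kube.tf that would look roughly like this (just a sketch; note that the commented cilium_values example further down your own file already shows WireGuard encryption handled by Cilium itself):

  # Switch the CNI from the default flannel to Cilium.
  cni_plugin = "cilium"

  # Optionally pin the Cilium version instead of taking the module default.
  # cilium_version = "v1.14.0"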

kimdre commented 8 months ago

Hi, I got the same error, "timed out waiting for the condition on deployments/system-upgrade-controller", with a very similar configuration and Cilium enabled.

valkenburg-prevue-ch commented 8 months ago

Hmm, I might have had the same thing: I had nodes unable to come back to life after a reboot following a k3s upgrade. I replaced the nodes (long live Longhorn) and turned off upgrades. I have not verified that this was the real problem, but I haven't seen it happen again either. No need to roll back anything though: the fresh nodes are on the latest k3s and have MicroOS updating weekly without issues. Just not automatically upgrading k3s.
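
For reference, the switch I flipped in kube.tf was along these lines (variable name as in the module's example file, so double-check it against your version):

  # Keep MicroOS patching as is, but stop the automatic k3s upgrades.
  automatically_upgrade_k3s = false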

mysticaltech commented 8 months ago

Considering this an occasional hiccup, but will monitor the situation.

mysticaltech commented 8 months ago

@kimdre Could you share your kube.tf, please?

mysticaltech commented 8 months ago

@mateuszlewko Did you manage to make it work? What about you @andi0b ?

mateuszlewko commented 8 months ago

I disabled WireGuard and recreated the cluster some time later. I haven't checked whether WireGuard works better with Cilium or whether that was the actual problem.
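
Concretely, that means flipping this in the kube.tf posted above and re-creating:

  # Was "true" in the original kube.tf above.
  enable_wireguard = false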

andi0b commented 8 months ago

@mysticaltech

Did you manage to make it work? What about you @andi0b ?

No, I'm currently on Easter holiday and haven't investigated it further. I just disabled kured (I think with something like kubectl -n kube-system annotate ds kured weave.works/kured-node-lock='{"nodeID":"manual"}') and rolled back the nodes to the last working snapshot (I think with transactional-update rollback [number]).
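
Written out (from memory, so double-check the exact syntax and whether kured lives in kube-system in your cluster before running it):

  # Hold kured's reboot lock manually so it stops rebooting nodes.
  kubectl -n kube-system annotate ds kured weave.works/kured-node-lock='{"nodeID":"manual"}'

  # On each affected node: roll MicroOS back to the last working snapshot (takes effect after a reboot).
  transactional-update rollback [number]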

mysticaltech commented 8 months ago

Folks, this was probably due to a bug in the system-upgrade-controller, which is now fixed. Make sure to upgrade with terraform init -upgrade. If such an issue comes up again, please don't hesitate to open another one with your kube.tf. Closing this one for now.
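
In practice, from your Terraform project directory (note that with an exact module version pinned in kube.tf, like "2.13.4" above, you also need to bump that constraint to pick up the fix):

  terraform init -upgrade
  terraform apply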