kube-hetzner / terraform-hcloud-kube-hetzner

Optimized and Maintenance-free Kubernetes on Hetzner Cloud in one command!

Waiting for MicroOS to reboot and become available... #308

Closed: codeagencybe closed this issue 2 years ago

codeagencybe commented 2 years ago

Hello

Not sure if this is an issue, but how long should it take before the deployment is ready? The terraform script has been running for nearly 15 minutes now, and it is spending a long time waiting for MicroOS to become available.

I can see the VMs have appeared in the Hetzner console, but my terminal keeps looping on "becoming available". Also, I have configured volumes for each agent node, but they are not showing up in the Hetzner console.

I have 3 masters and 3 worker nodes. Is it normal that it takes this long? Or should I start investigating why it hangs at this stage?

(screenshot)

codeagencybe commented 2 years ago

When I interrupt the process, it shows me output like this:

Error: local-exec provisioner error
│ 
│   with module.kube-hetzner.module.control_planes["1-0-control-plane-nbg1"].hcloud_server.server,
│   on .terraform/modules/kube-hetzner/modules/host/main.tf line 93, in resource "hcloud_server" "server":
│   93:   provisioner "local-exec" {
│ 
│ Error running command 'until ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i /tmp/b4duc9jpwlkpv9vu5l6j -o
│ ConnectTimeout=2 -p 22 root@xxx.xxx.xxx.xxx true 2> /dev/null
│ do
│   echo "Waiting for MicroOS to reboot and become available..."
│   sleep 3
│ done
│ ': signal: interrupt. Output: oOS to reboot and become available...

I checked the web console for each VM and it looks like the VM is ready; I can see the MicroOS login. I'm not sure, but could there be some SSH edge case when running this from an Apple M1 MacBook Pro? I have set the path to my SSH key files in kube.tf, and as far as I can see that key shows up in Hetzner under Security > SSH keys. But maybe something goes wrong when the key is generated on an M1 Mac? I can't figure out why, but I have a similar problem when trying Vito Botta's k3s Hetzner tool (the gem version). Running the tool is all good, but it also fails at the connecting part, looping infinitely.

Has anybody else run into the same issue? How did you solve it?

(screenshot)

mysticaltech commented 2 years ago

Hello @codeagencybe, thanks for sharing those details. First of all, happy that you confirmed that you are on a Unix system. It should work from a MacBook, even with an M1 chip (I hope).

First, please read our SSH docs and see if the kind of key you used is supported. What is most likely happening is that you are using an unsupported SSH key, or one of the right kind but with a passphrase, without doing the proper config.
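For reference, the relevant lines in kube.tf look roughly like this (the paths below are placeholders): a plain ed25519 key pair without a passphrase can be passed in directly, while a passphrase-protected key or a hardware key means setting ssh_private_key to null and letting ssh-agent handle authentication, as described in the SSH docs.

  # Plain ed25519 key pair without a passphrase
  ssh_public_key  = file("/path/to/.ssh/id_ed25519.pub")
  ssh_private_key = file("/path/to/.ssh/id_ed25519")

  # Key with a passphrase, or a Yubikey-like device: keep the public key,
  # set the private key to null, and load the key into ssh-agent instead.
  # ssh_private_key = null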

After doing the required changes, if you encounter the same issue, please share your kube.tf in here without the sensitive values.

Good luck!

codeagencybe commented 2 years ago

@mysticaltech

Thanks for your feedback! I did follow all the steps from the SSH docs, but it didn't make any difference. Side note: your docs refer to terraform.tfvars, but the SSH key paths actually go in the kube.tf file; you might want to update the docs.

I also double checked that my SSH key is working.

  1. the SSH key appears fine in the Hetzner console under Security > SSH keys
  2. when I spin up a VM manually and select the same existing SSH key that Terraform pushed to Hetzner Cloud, I can SSH into that manually created VM perfectly fine. So I can conclude that there is 100% for sure nothing wrong with the SSH key itself. I think something in the networking or firewall is preventing my local machine from communicating with the load balancer, the control plane, or both.

This is my kube.tf setup below; I stripped some of the comment blocks to make it shorter. Another side note: I want to use Longhorn with the volumes, but this is also not working. No volumes are created at all, even though I added a Longhorn volume to each agent node. No idea what is wrong here.

I just want 3 control planes and 3 worker nodes, spread across all 3 EU locations, with Longhorn volumes on the worker nodes, Cilium as CNI, and Traefik as ingress.

I think I have the configuration correct, but there might be something wrong with the networking part so that it's completely isolated from the outside.

Maybe you can see what I have wrong?

locals {
  # Fill first and foremost your Hetzner API token, found in your project, Security, API Token, of type Read & Write.
  hcloud_token = "secret"
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = local.hcloud_token

  # Then fill or edit the below values. Only the first values starting with a * are obligatory; the rest can remain with their default values, or you
  # could adapt them to your needs.

  # * For local dev, path to the git repo
  # source = "../../kube-hetzner/" 
  # For normal use, this is the path to the terraform registry
  source = "kube-hetzner/kube-hetzner/hcloud"
  # you can optionally specify a version number
  # version = "1.2.0"

  # Customize the SSH port (by default 22)
  # ssh_port = 2222

  # * Your ssh public key
  ssh_public_key = file("/Users/codeagency/.ssh/id_ed25519.pub")
  # * Your private key must be "ssh_private_key = null" when you want to use ssh-agent for a Yubikey-like device authentication or an SSH key-pair with a passphrase.
  # For more details on SSH see https://github.com/kube-hetzner/kube-hetzner/blob/master/docs/ssh.md
  ssh_private_key = file("/Users/codeagency/.ssh/id_ed25519")
  # You can add additional SSH public Keys to grant other team members root access to your cluster nodes.
  # ssh_additional_public_keys = []

  # These can be customized, or left with the default values
  # * For Hetzner locations see https://docs.hetzner.com/general/others/data-centers-and-connection/
  network_region = "eu-central" # change to `us-east` if location is ash

  # * Example below:

  control_plane_nodepools = [
    {
      name        = "control-plane-fsn1",
      server_type = "cpx11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-nbg1",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-hel1",
      server_type = "cpx11",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "agent-small",
      server_type = "cpx11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 1
      longhorn_volume_size = 10
   },
    {
      name        = "agent-small",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
      longhorn_volume_size = 10
    },
    {
      name        = "agent-small",
      server_type = "cpx11",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
      longhorn_volume_size = 10
    },
    #{
    #  name        = "storage",
    #  server_type = "cpx11",
    #  location    = "nbg1",
    #  # Fully optional, just a demo
    #  labels = [
    #    "node.kubernetes.io/server-usage=storage"
    #  ],
    #  taints = [
    #    "server-usage=storage:NoSchedule"
    #  ],
    #  count = 1
      # In the case of using Longhorn, you can use Hetzner volumes instead of using the node's own storage by specifying a value from 10 to 10000 (in GB)
      # It will create one volume per node in the nodepool, and configure Longhorn to use them.
    #  longhorn_volume_size = 10
    #}
  ]

  # * LB location and type, the latter will depend on how much load you want it to handle, see https://www.hetzner.com/cloud/load-balancer
  load_balancer_type     = "lb11"
  load_balancer_location = "nbg1"

  ### The following values are entirely optional (and can be removed from this if unused)

  # You can define a base domain name to be used in the form nodename.base_domain for setting the reverse DNS inside Hetzner
  base_domain = "cluster.mydomain.cloud"

  # To use local storage on the nodes, you can enable Longhorn, default is "false".
  enable_longhorn = true

  # The file system type for Longhorn, if enabled (ext4 is the default, otherwise you can choose xfs)
  # longhorn_fstype = "xfs"

  # how many replica volumes should longhorn create (default is 3)
  longhorn_replica_count = 3

  # When you enable Longhorn, you can go with the default settings and just modify the above two variables OR you can copy the longhorn_values.yaml.example
  # file to longhorn_values.yaml and put it at the base of your own module, next to your kube.tf; this is Longhorn's own helm values file.
  # If that file is present, the system will use it during the deploy, if not it will use the default values with the two variable above that can be customized.
  # After the cluster is deployed, you can always use HelmChartConfig definition to tweak the configuration.

  # Also, you can choose to create a Hetzner volume to be used with Longhorn. By default, it will use the node's own storage space, BUT if you add an attribute
  # longhorn_volume_size with a value of 10 to 10000 (in GB) to your agent nodepool definition, it will create and use the volume in question.
  # See the agent nodepool section for an example of how to do that.

  # To disable Hetzner CSI storage, you can set the following to true, default is "false".
  # disable_hetzner_csi = true

  # If you want to use a specific Hetzner CCM and CSI version, set them below; otherwise, leave them as-is for the latest versions.
  # hetzner_ccm_version = ""
  # hetzner_csi_version = ""

  # If you want to specify the Kured version, set it below - otherwise it'll use the latest version available.
  # kured_version = ""

  # If you want to enable the Nginx ingress controller (https://kubernetes.github.io/ingress-nginx/) instead of Traefik, you can set this to "true". Default is "false". 
  # FOR THIS TO NOT BE IGNORED, you also need to set "enable_traefik = false".
  # By default we load an optimal Nginx ingress controller config for Hetzner, however you may need to tweak it to your needs; to do so,
  # we allow you to add a nginx_ingress_values.yaml file to the root of your module, next to the kube.tf file, it is simply a helm values config file.
  # See the nginx_ingress_values.yaml.example located at the root of this project.
  # After the cluster is deployed, you can always use HelmChartConfig definition to tweak the configuration.
  # enable_nginx = true

  # If you want to disable the Traefik ingress controller, to use the Nginx ingress controller for instance, you can set this to "false". Default is "true".
  # enable_traefik = false

  # Use the Klipper LB instead of the default Hetzner one; it has the advantage of dropping the cost of the setup.
  # Automatically "true" in the case of a single node cluster.
  # It can work with any ingress controller that you choose to deploy.
  # enable_klipper_metal_lb = "true"

  # We give you the possibility to use letsencrypt directly with Traefik because it's an easy setup, however it's not optimal,
  # as the free version of Traefik causes a little bit of downtime when the certificates get renewed. For proper SSL management,
  # we instead recommend you to use cert-manager, that you can easily deploy with helm; see https://cert-manager.io/.
  # traefik_acme_tls = true
  # traefik_acme_email = "info@mydomain.cloud"

  # If you want to configure additional Arguments for traefik, enter them here as a list and in the form of traefik CLI arguments; see https://doc.traefik.io/traefik/reference/static-configuration/cli/
  # They are the options that go into the additionalArguments section of the Traefik helm values file.
  # Example: traefik_additional_options = ["--log.level=DEBUG", "--tracing=true"]
  # traefik_additional_options = []

  # If you want to disable the metric server, you can! Default is "true".
  # enable_metrics_server = false

  # If you want to allow non-control-plane workloads to run on the control-plane nodes, set "true" below. The default is "false".
  # True by default for single node clusters.
  # allow_scheduling_on_control_plane = true

  # If you want to disable the automatic upgrade of k3s, you can set this to false. The default is "true".
  # automatically_upgrade_k3s = false

  # Allows you to specify either stable, latest, testing or supported minor versions (defaults to stable)
  # see https://rancher.com/docs/k3s/latest/en/upgrades/basic/ and https://update.k3s.io/v1-release/channels
  # initial_k3s_channel = "latest"

  # The cluster name, by default "k3s"
  cluster_name = "mycloud"

  # Whether to use the cluster name in the node name, in the form of {cluster_name}-{nodepool_name}, the default is "true".
  # use_cluster_name_in_node_name = false

  # Adding extra firewall rules, like opening a port
  # More info on the format here https://registry.terraform.io/providers/hetznercloud/hcloud/latest/docs/resources/firewall
  # extra_firewall_rules = [
  #   # For Postgres
  #   {
  #     direction       = "in"
  #     protocol        = "tcp"
  #     port            = "5432"
  #     source_ips      = ["0.0.0.0/0", "::/0"]
  #     destination_ips = [] # Won't be used for this rule 
  #   },
  #   # To Allow ArgoCD access to resources via SSH
  #   {
  #     direction       = "out"
  #     protocol        = "tcp"
  #     port            = "22"
  #     source_ips      = [] # Won't be used for this rule 
  #     destination_ips = ["0.0.0.0/0", "::/0"]
  #   }
  # ]

  # If you want to configure a different CNI for k3s, use this flag
  # possible values: flannel (Default), calico, and cilium
  # CAVEATS: Calico is not supported when not using the Hetzner LB (like when enable_klipper_metal_lb is set to true or when using a single node cluster),
  # because of the following issue https://github.com/k3s-io/klipper-lb/issues/6.
  # As for Cilium, we allow infinite configurations, please check the CNI section of the readme over at https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/#cni.
  cni_plugin = "cilium"

  # If you want to disable the k3s default network policy controller, use this flag!
  # Both Calico and Cilium cni_plugin values override this value to true automatically, the default is "false".
  # disable_network_policy = true

  # If you want to disable the automatic use of placement group "spread". See https://docs.hetzner.com/cloud/placement-groups/overview/
  # That may be useful if you need to deploy more than 500 nodes! The default is "false".
  # placement_group_disable = true

  # By default, we allow ICMP ping in to the nodes, to check for liveness for instance. If you do not want to allow that, you can. Just set this flag to true (false by default).
  # block_icmp_ping_in = true

  # You can enable cert-manager (installed by Helm behind the scenes) with the following flag, the default is "false".
  enable_cert_manager = true

  # IP addresses to use for the DNS servers, set to an empty list to use the ones provided by Hetzner, defaults to ["1.1.1.1", "1.0.0.1", "8.8.8.8"].
  # For rancher installs, best to leave it as default.
  # dns_servers = []

  # When this is enabled, rather than the first node, all external traffic will be routed via a control-plane loadbalancer, allowing for high availability.
  # The default is false.
  use_control_plane_lb = true

  # You can enable Rancher (installed by Helm behind the scenes) with the following flag, the default is "false".
  # When Rancher is enabled, it automatically installs cert-manager too, and it uses rancher's own self-signed certificates.
  # See for options https://rancher.com/docs/rancher/v2.0-v2.4/en/installation/resources/advanced/helm2/helm-rancher/#choose-your-ssl-configuration
  # The easiest thing is to leave everything as is (using the default rancher self-signed certificate) and put Cloudflare in front of it.
  # As for the number of replicas, by default it is set to the number of control plane nodes.
  # You can customize all of the above by adding a rancher_values.yaml file at the root of your module, which is just a helm values file.
  # See the rancher_values.yaml.example file located at the root of the project.
  # After the cluster is deployed, you can always use HelmChartConfig definition to tweak the configuration.
  # IMPORTANT: Rancher's install is quite memory intensive, you will require at least 4GB of RAM, meaning cx21 server type (for your control plane).
  # ALSO, in order for Rancher to successfully deploy, you have to set the "rancher_hostname".
  enable_rancher = true

  # If using Rancher you can set the Rancher hostname; it must be a unique hostname even if you do not use it.
  # If not pointing the DNS, you can just port-forward locally via kubectl to get access to the dashboard.
  rancher_hostname = "rancher.mydomain.cloud"

  # When Rancher is deployed, by default it uses the "latest" channel. But this can be customized.
  # The allowed values are "stable" or "latest".
  # rancher_install_channel = "stable"

  # Finally, you can specify a bootstrap-password for your rancher instance. Minimum 48 characters long!
  # If you leave empty, one will be generated for you.
  # (Can be used by another rancher2 provider to continue setup of rancher outside this module.)
  # rancher_bootstrap_password = "secret"

  # Separate from the above Rancher config (only use one or the other). You can import this cluster directly into
  # an already active Rancher install by clicking "import cluster", choosing "generic", giving it a name, and pasting
  # the cluster registration url below. However, you can also ignore that and apply the url via kubectl as instructed
  # by Rancher in the wizard, and that would register your cluster too.
  # More information about the registration can be found here https://rancher.com/docs/rancher/v2.6/en/cluster-provisioning/registered-clusters/
  # rancher_registration_manifest_url = "https://rancher.xyz.dev/v3/import/xxxxxxxxxxxxxxxxxxYYYYYYYYYYYYYYYYYYYzzzzzzzzzzzzzzzzzzzzz.yaml"
}

provider "hcloud" {
  token = local.hcloud_token
}

terraform {
  required_version = ">= 1.2.0"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.35.1"
    }
  }
}

codeagencybe commented 2 years ago

One more thing I just noticed,

While my terminal keeps repeating that it is waiting for MicroOS to become available, I noticed that the LB service itself in the Hetzner console has *no targets* at all.

Maybe this gives you a clue about what it could be? I also tried to SSH into every single node directly, but that immediately returns this:


❯ ssh -i id_ed25519 root@xxx.xxx.xxx.xxx
The authenticity of host 'xxx.xxx.xxx.xxx (xxx.xxx.xxx.xxx)' can't be established.
ED25519 key fingerprint is SHA256:secret.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'xxx.xxx.xxx.xxx' (ED25519) to the list of known hosts.
Received disconnect from xxx.xxx.xxx.xxx port 22:2: Too many authentication failures
Disconnected from xxx.xxx.xxx.xxx port 22

This is most likely the reason why Terraform can't see the status of MicroOS and thus keeps waiting forever. My machine got blocked.

mysticaltech commented 2 years ago

@codeagencybe For Longhorn, that happens later, so it's expected that you do not see it yet.

As for the kube.tf, it seems ok. SSH being blocked is also normal; after a few tries, the security in MicroOS kicks in.

Something I did not mention was the need to ALWAYS run terraform destroy before trying again, so that you start fresh on every attempt!

Please try WITHOUT a passphrase; for id_ed25519 it's secure without one, and this is what I use personally. And let me know.

codeagencybe commented 2 years ago

@mysticaltech

I did use terraform destroy after each attempt. I even created a fresh new Hetzner project with a different API key just to verify nothing had gone wrong in there. Still the same output.

I don't have a passphrase set, never used one. As I said earlier, the exact same existing SSH key works fine when I spin up a VM manually with Ubuntu. So I run terraform init, validate and then apply. I let it do its thing for about 15 minutes and it keeps looping at "waiting for MicroOS to become available". At that point, I go to console.hetzner.com, create 1 new VM manually, and for the SSH key I select the existing one that your terraform script already created.

I manually run ssh -i /path/to/same/key root@IP and boom, I'm in immediately. So this confirms nothing is wrong with the SSH key itself. There is no passphrase to deal with.

I had somebody else try yesterday evening in a Zoom call; he also confirmed everything is done correctly in relation to the SSH key. The problem cannot be the SSH key.

I think the problem is coming from either the firewall, the load balancer, or a mix of things. I can repeat this problem over and over. As I said, the load balancer has NO targets at all after the terraform apply. And when I try to manually SSH into one of the control plane nodes, it immediately returns that I'm blocked due to too many attempts.

If you want, I can show you personally from my MacBook in a Zoom call or Google Meet so you can see for yourself. But I really think there is something going on with the firewall or the LB or a combination. I tried with Flannel and Cilium as CNI, same result. What else can I try?

codeagencybe commented 2 years ago

@mysticaltech

I just did the whole thing again from an Ubuntu 22.04 LTS machine, using the same SSH key. Now I'm getting some progress, but a different error:

Error: attach server to network: provided IP is not available (ip_not_available)
│ 
│   with module.kube-hetzner.module.agents["1-0-agent-small"].hcloud_server_network.server,
│   on .terraform/modules/kube-hetzner/modules/host/main.tf line 164, in resource "hcloud_server_network" "server":
│  164: resource "hcloud_server_network" "server" {
│ 
╵
╷
│ Error: attach server to network: provided IP is not available (ip_not_available)
│ 
│   with module.kube-hetzner.module.agents["2-0-agent-small"].hcloud_server_network.server,
│   on .terraform/modules/kube-hetzner/modules/host/main.tf line 164, in resource "hcloud_server_network" "server":
│  164: resource "hcloud_server_network" "server" {
│ 

But now I see that the load balancer is getting targets, and the private network is also getting the servers in its resources. I think there is some Apple M1-specific problem, because from Ubuntu it seems to work better.

mysticaltech commented 2 years ago

@codeagencybe Thanks for trying that hard. The "IP not available" error from Ubuntu comes from the FSN1 location, which is often overloaded on Hetzner's side, so it's best to change to another one.

Yes, it really could be coming from the M1 chip! But just to be sure, I will try your config, as I also believe SSH is not involved.

As for the LB targets, that has nothing to do with the installation at this point (all is good), as the LB IS NOT USED DURING THE SETUP; it is only deployed after the main setup by the CCM.

I will try on my end, read more on M1 problems, and let you know.

mysticaltech commented 2 years ago

I am trying now, but the first thing I am seeing (and missed sooner) is that you are using the CP LB; now I understand your previous mentions of targets, but even that one only gets added later on. However, what is not ok is that your setup uses Rancher with such small nodes... Rancher, from testing, requires at least CX21 nodes, i.e. at least 4GB of RAM, and you are using CPX11.

See the note on top of the enable_rancher setting: # IMPORTANT: Rancher's install is quite memory intensive, you will require at least 4GB of RAM, meaning cx21 server type (for your control plane).

I am trying now and will shortly either confirm or deny the above.

mysticaltech commented 2 years ago

Got the same error as your Ubuntu try, but here's the problem (I was wrong above about the location thing): you are using the same name for all agent nodepools; you need to differentiate each of them.

(screenshot)

(screenshot)

mysticaltech commented 2 years ago

Ok, all good. I differentiated the agent nodepool names, like so, and everything worked!

(screenshot)

(screenshot)
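In text form (the screenshots don't copy over), the change is simply giving each agent nodepool a unique name; the exact names below are just examples:

agent_nodepools = [
  {
    name        = "agent-small-fsn1",
    server_type = "cpx11",
    location    = "fsn1",
    labels      = [],
    taints      = [],
    count       = 1
    longhorn_volume_size = 10
  },
  {
    name        = "agent-small-nbg1",
    server_type = "cpx11",
    location    = "nbg1",
    labels      = [],
    taints      = [],
    count       = 1
    longhorn_volume_size = 10
  },
  {
    name        = "agent-small-hel1",
    server_type = "cpx11",
    location    = "hel1",
    labels      = [],
    taints      = [],
    count       = 1
    longhorn_volume_size = 10
  }
]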

Even Longhorn volumes show up well: (screenshot)

mysticaltech commented 2 years ago

So, to go back to the question of node type for Rancher: indeed, it's important. With cx11, things crash when Rancher runs:

(screenshot)

But when I switched all CP nodes to CX21, everything runs just fine (don't worry about the helm operation, that goes away after some time): (screenshot)

So it's worth switching the CP nodes to CX21 if you are going to use Rancher on that cluster directly. Personally, I just deploy a single node cluster (1 CX21 control plane) for it and add all of my clusters to it, to manage them easily.
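In the kube.tf shared earlier, that change would just be bumping the server_type of the control plane nodepools, along these lines (one pool shown as a sketch; the same change applies to the nbg1 and hel1 pools):

control_plane_nodepools = [
  {
    name        = "control-plane-fsn1",
    server_type = "cx21", # was cpx11; Rancher needs at least 4GB of RAM on the control plane
    location    = "fsn1",
    labels      = [],
    taints      = [],
    count       = 1
  },
  # ... same server_type change for control-plane-nbg1 and control-plane-hel1
]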

Also, a heads up in case you already deployed: when you "terraform destroy" the cluster (if ever needed), it will hang at the network deletion. That is explained in the docs in the "destroy" section and is due to the CCM-requested LB, which in your case you can delete with the command hcloud load-balancer delete mycloud (or via the UI, but it's best to learn to use the hcloud CLI) when you see that the deletion is pending on the network.

mysticaltech commented 2 years ago

Hope the above helps @codeagencybe. When all the changes are done on your side, both for the agent nodepool names and the type of CP nodes, please try again on your Mac. Maybe it will work this time (who knows)!

If not, I found this and that.

Maybe you could clone the project repo, point the module in your kube.tf to the path where the repo is located, then find all instances of ssh calls in command form and add -o IPQoS=throughput to these SSH commands (as suggested in one of the articles), to see if that works. Let me know if that does it, and I could create a variable to enable that fix for M1 Macs. Or also try the NVRAM reset that was suggested by the other article.
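As a sketch, assuming the provisioner matches the command visible in the error output earlier (modules/host/main.tf), the patched local-exec would look something like this; the key path and node address are placeholders that the module normally interpolates:

provisioner "local-exec" {
  command = <<-EOT
    until ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o IPQoS=throughput \
        -i /path/to/generated/key -o ConnectTimeout=2 -p 22 root@<node-ip> true 2> /dev/null
    do
      echo "Waiting for MicroOS to reboot and become available..."
      sleep 3
    done
  EOT
}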

One thing is sure, it can - and will - run on your Mac! We just need to figure out how! But on Ubuntu at least, it should be fine from the get-go!

codeagencybe commented 2 years ago

@mysticaltech

Aha, that could indeed explain the problem. I just changed everything to CPX31 and changed the agent pool names, and I think we have a winner now! On Ubuntu everything is working fine.

(screenshot)

My plan was to use CPX51 for production anyway, but for a quick test I assumed CPX11 would be sufficient since it's only testing. Turns out it really does matter :p

So Ubuntu is confirmed, it works fine from here. I'm now going to see if I can replicate my success on the MacBook Pro M1 as well.


Some side questions/remarks:

  1. I noticed that after the run completed, I have 2 LBs? Is that correct? (screenshot) One points only to the control planes and the other points to all 7 servers, including those same control planes again. I don't know if this is intended behaviour; I was expecting only 1 LB to show up.

  2. I had to pick 1 LB type, so I picked the default LB11, but I'm not sure in what way this can scale automatically. Is it easy to upgrade this to an LB31 if required? Is it even required? If I deploy applications like multiple WordPress, Magento, ... is it going to route all the traffic to the same single LB service? Or is it going to create a new service inside the LB for every application? Not so clear to me yet at this point.

  3. Can the LB also automatically scale or change location in case of issues? Because I can only select 1 location for the LB. I don't know how "reliable" it is if there is a problem with the single LB11. Will it make everything running in the cluster unavailable? Or is it going to self-heal by changing to another location?

  4. I noticed a mistake in your example kube.tf at line 139: # base_domain = mycluster.example.com should be # base_domain = "mycluster.example.com", with quotes. If I do it without quotes, terraform validate returns an error about unknown mycluster example.com.

  5. I now have block storage attached to every VM itself, but I also have an extra VM "storage", and I'm confused about what the proper setup for this is. What is the best-performing setup with Longhorn? I think using block storage is the easiest and most scalable vs using the node storage, but is it? And does it need a storage pool if I want to use Longhorn, or can I simply comment out the storage part? Or is the idea to spin up another pool only for storage and add block storage only to the storage pool? This part keeps confusing me because it can be used in many different ways. I would like the best performance and resilience for recovery in production, but what is the recommendation here?

  6. If I use the block storage and add the line per node, how can I scale it up if necessary? I just did a simple test for now with 10GB, but how does it handle increasing storage limits? Do I need to do this manually from Terraform? Or can I also do it manually from the Rancher web UI?

Thanks! I will post in a moment results from Macbook also.

codeagencybe commented 2 years ago

@mysticaltech

On Mac it is unfortunately still the same problem. I tried with the exact same SSH key from my Ubuntu machine, and it fails immediately on Apple. So this is definitely a macOS-specific issue.

I also tried the suggestions you gave. They actually solved another problem I had with rsync (thanks for that), but unfortunately not this one. I did a few hours of Googling and found that many people seem to experience problems with SSH from macOS Monterey and upwards. Most of those stem from issues with some beta version, but I'm not running any beta. It's a brand new MacBook Pro M1 from 2 weeks ago, everything stock so far. I'll keep searching for the Mac problem, because it seems the root cause is coming from Monterey and its SSH updates.

mysticaltech commented 2 years ago

@codeagencybe , very good to hear!

Will answer your questions one by one.

  1. I noticed after the run is completed, I have 2 LB's? Is that correct?

This is because you set the flag use_control_plane_lb to true. So we deploy another LB just for the control plane; otherwise, if your first control plane node reboots to update itself, or God forbid crashes, you have to manually point any services that rely on the Kube API, like kubectl, to another control plane node IP (in your case, the second or third CP node). That LB does this for you automatically, making access to your cluster HA. Personally, I just do not feel the need for it in my own setups.

  2. I had to pick 1 LB type, so I picked the default LB11 but not sure in what way this can scale automatically?

No, it won't scale automatically; more details below.

Is it as easy to upgrade this is required to an LB31? Is it even required? If I deploy applications like multilple Wordpress, Magento, ... it's going to route all the traffic to the same single LB service? Or is it going to create a new service inside the LB for every application?

  3. Can the LB also automatically scale or change location in case of issues? Because I can only select 1 location for the LB. I don't know how "reliable" it is in case there is a problem with the single LB11. Will it make everything unavailable that is running in the cluster? Or is it going to self heal by changing to another location?

Now I am talking about the default LB, the one related to the workloads and services on your cluster, the one connected by default to the Traefik ingress controller (or Nginx for others). There is nothing automatic.

You have to change the variables for both the type and the location of the LB and apply again. Though I am not sure if that will work; please try and let me know. This is important to keep your state in sync. But if the changes are not applied, you can always apply a HelmChartConfig for the Traefik controller; there, both of these options can be set manually, and you will find for instance a "load-balancer.hetzner.cloud/type" key.
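In kube.tf that amounts to editing the two variables shown earlier and re-applying (whether Hetzner swaps the LB in place is exactly the part I am unsure about):

load_balancer_type     = "lb21" # example of a bigger type than the initial lb11
load_balancer_location = "nbg1" # can also be changed here if needed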

  4. I noticed a mistake in your example kube.tf at line 139 -> # base_domain = mycluster.example.com should be: # base_domain = "mycluster.example.com" with quotes If I do without quotes, then terraform validate returns an error about unknown mycluster example.com

Ok, good catch... Just fixed it now thanks to you!

  5. I now have block storage attached to every VM itself, but I also have an extra VM "storage" and I'm confused what is the proper setup for this? What is the best performance setup with Longhorn? I think using block storage is the easiest and scalable vs using the node storage. But is it? And does it need a storage pool if I want to use Longhorn or can I simply comment out the storage part? Or is the idea to spin up another pool only for storage and add block storage only to the storage pool? This part keeps confusing me because it can be used in many different ways. I would like to have the best performance and resilience for recovery in production but what is the recommendation here?

Please do not hesitate to post a screenshot of the extra "storage" VM you see. Normally, in your case, since you specified longhorn_volume_size, it creates 1 volume for each agent node with that attribute and mounts /var/longhorn on it. If the size attribute is absent, it just uses /var/longhorn, taking space from the node. For me personally, that is enough, as depending on the nodes you have loads of space, and you avoid the tiny network latencies on read and write ops.

And since you have replication, all is good even if a node crashes and burns, lol. But for those who want volumes, just set the size as you did, and that's it; it will use volumes. Otherwise, you can just use the Hetzner CSI, which is also deployed in your case and uses volumes too. If you are going to use just Longhorn instead (as it's more flexible), you can disable the Hetzner CSI; there is a flag for that, but it's always good to have options IMHO (so best leave it).

  6. If I use the block storage and add the line per node, how can I scale it up if necessary? I just did a simple test for now with 10GB but how does it handle increasing storage limits? Do I need to do this from Terraform manually? Or can I do this from Rancher web UI manually also?

Honestly, this is a new feature; I don't know if increasing the size after cluster deploy will work. Normally it should, but with a destructive result on the data present, if I am not mistaken; however, since you have replication, just do one size increase at a time IMHO. Try it and let me know; I am curious whether this works as it should. Never hesitate to open new issues to compartmentalize matters well.
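For what it's worth, the change itself would just be the one attribute in the agent nodepool definition from earlier (a sketch, reusing the example names from above; whether this resizes in place or recreates the volume is exactly the open question):

{
  name        = "agent-small-fsn1",
  server_type = "cpx11",
  location    = "fsn1",
  labels      = [],
  taints      = [],
  count       = 1
  longhorn_volume_size = 20 # was 10; do one size increase at a time
},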

Thanks! I will post in a moment results from Macbook also.

Awesome, if you ever find a fix for that, just send a PR or open a new issue detailing the steps; that would be greatly appreciated. Closing this issue for now, as the essential matters have been resolved! :) (Comments remain open.)

codeagencybe commented 2 years ago

@mysticaltech

OK, thanks for the feedback! I'm definitely going to play around with some of the storage parameters. I do want to use Longhorn for sure, but for some reason block storage volumes sound "safer" to use. I don't have any idea how they compare in performance though; I need to test more with some actual applications first. But volumes do sound "easier" to recover from, just a gut feeling. Maybe it's just as simple with node storage. I just need to figure out how it works if I deploy a simple application with just 1 replica instead of 3. If something happens to the node that holds that data, then what happens? Because the node is not accessible, my data is also inaccessible. How does K3s recover the application from such a state? I know Longhorn has built-in support for backups to S3, so I could use that (I think), but that would mean there can be a data discrepancy in time (between the last backup and the crash/recovery), so that could result in some data loss. Anyway, plenty to play around with now :D

Regarding the macOS problem, I think I found and solved it by running some SSH tests with heavy verbose output. It seems to be coming from 1Password. I don't know how or why yet, but I did some testing by disabling it completely and uninstalling it, and all of a sudden it started working. I reinstalled 1Password and the problem came back. I think it adds some integration into the terminal/CLI that loops through SSH keys to try authenticating, instead of using the SSH key I already explicitly set. Maybe I can find an option in its settings to disable this feature or configure it properly. But for now, this seems to be the root cause on my end. Maybe other users can relate to the same problem and are also using some kind of password manager that integrates into the CLI. Perhaps you can copy this part and pin it somewhere in your readme to remind people about it. This can save many hours of head-breaking hahaha

mysticaltech commented 2 years ago

Ah, awesome to know that you found the cause of the macOS M1 issue with "waiting for MicroOS to become available"! I would have never guessed that a password manager could interfere with SSH...

About Longhorn, no need to recover; just set your replication to 3, so your data will be copied to three different nodes. That way, even if one crashes, Longhorn will have two more copies just in case, and it recovers all on its own!

As for Kube, even if a node or agent crashes, you have other nodes that will take over. For your cluster data to be lost, something catastrophic must take out all your control planes simultaneously, which is quite unlikely!

But backups of the entire cluster are never a bad idea, even though I doubt you will ever need them. This utility can help you with it and is also helpful in upgrading future major versions of this cluster if we ever need to introduce breaking changes (hopefully never): https://github.com/vmware-tanzu/velero.

As you said, playing around is the best way to get to know something! Enjoy :)

codeagencybe commented 2 years ago

@mysticaltech Ah yes, that one left me completely puzzled too, but it seems to be a thing now with Apple security and their "keychain" or something that ties in password managers.

In any case, here are the screenshots from 1Password; they might help with understanding this issue.

(screenshots)

Regarding Longhorn, things are becoming clearer already. So basically I can run 1 pod (no replication) with e.g. WordPress but set the volume for that 1 pod to 3 replicas. In that case, if the node gets lost, it can recover the pod immediately on a different node.

In terms of storage, when using node storage I can become limited, since it is not possible to increase node storage, vs block storage volumes that can scale very high out of the box. On the other hand, if I pick CPX51 machines, I get 360 GB of storage included with each node. That can hold quite a few applications already. And I assume that if I grow the agent node pool with more servers, Longhorn will distribute the storage over all the agent nodes anyway.

I read that Longhorn has disaster recovery built in too. I already use Wasabi S3 for bucket storage, so all backups go there by default anyway. I can look into Velero, but I think Longhorn already has it covered out of the box: https://longhorn.io/docs/1.3.1/snapshots-and-backups/setup-disaster-recovery-volumes/

The only thing I haven't figured out yet is how to access the Rancher web UI after deployment. I have it enabled, so I assume it has been installed. I have also enabled cert-manager, but no idea if it did anything. I have set a custom domain for it, and also for the cluster base domain. Do I still need to expose something manually for this, or is it handled out of the box?

mysticaltech commented 2 years ago

@codeagencybe Rancher should work out of the box with its self-generated SSL certificate (so you have to trust it and proceed). To get there, just point the hostname to the IP of your LB in your domain provider's zone (via an A record), and then navigate there with "https://" :)

For Longhorn replication, what I was saying was not about the pod replicas but the longhorn_replica_count variable that you can set; by default, it's 3, so that should be enough. Your data is replicated three times; it doesn't matter how many pod replicas you have!

About the 1Password screenshots, thank God for Google Translate, which can translate the text inside images 😂. If I were you, I would try disabling all of this, basically completely removing 1Password from your SSH flows (only temporarily if you want, just to be able to deploy).

Good to learn about the built-in recovery of Longhorn! 🙏

codeagencybe commented 2 years ago

@mysticaltech

I already found it :D It was a DNS cache issue in my browser. When I tried another browser, the Rancher web UI showed up fine.

Yes, I have disabled all that stuff in 1Password and that fixed the problem. 1Password (and maybe others too) is becoming "invasive" because they want to offer passwordless features in the terminal too, with TouchID, YubiKey, etc. It's a cool feature, but it seems "not there yet", breaking things instead of solving problems. So they are hijacking terminal SSH sessions, I think. In one of the verbose outputs I saw it trying to authenticate with several dozen keys from 1Password, completely ignoring the SSH key I specified in kube.tf. Hence the reason I got blocked: it was trying with 50+ keys.

mysticaltech commented 2 years ago

Ah, wonderful! 50+ keys; immediately the MicroOS instance blacklisted you, lol. Good that this is now resolved, and there are tons of valuable answers on this page for folks that need them.