kube-hetzner / terraform-hcloud-kube-hetzner

Optimized and Maintenance-free Kubernetes on Hetzner Cloud in one command!

[Bug]: remote-exec provisioner error #1135

Closed · carstenblt closed 11 months ago

carstenblt commented 11 months ago

Description

Applying fails with

╷
│ Error: remote-exec provisioner error
│
│   with module.kube-hetzner.null_resource.kustomization,
│   on .terraform/modules/kube-hetzner/init.tf line 278, in resource "null_resource" "kustomization":
│  278:   provisioner "remote-exec" {
│
│ error executing "/tmp/terraform_579678025.sh": Process exited with status 1
╵
╷
│ Error: remote-exec provisioner error
│
│   with module.kube-hetzner.null_resource.kustomization_user_deploy[0],
│   on .terraform/modules/kube-hetzner/kustomization_user.tf line 80, in resource "null_resource" "kustomization_user_deploy":
│   80:   provisioner "remote-exec" {
│
│ error executing "/tmp/terraform_1586559340.sh": Process exited with status 1
╵

As this might have something to do with extra-manifests, here is my extra-manifests/kustomization.yaml.tpl:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ./external-secrets.yaml

and my extra-manifests/external-secrets.yaml.tpl:

apiVersion: v1
kind: Namespace
metadata:
  name: external-secrets
---
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: external-secrets
  namespace: external-secrets
spec:
  chart: external-secrets
  targetNamespace: external-secrets
  repo: https://charts.external-secrets.io
  valuesContent: |-
    installCRDs: true

Kube.tf file

locals {
  hcloud_token = "xxx"
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token

  source = "kube-hetzner/kube-hetzner/hcloud"

  version = "2.11.0"

  ssh_public_key = file("~/.ssh/id_ed25519.pub")
  ssh_private_key = file("~/.ssh/id_ed25519")
  network_region = "eu-central" # change to `us-east` if location is ash

  control_plane_nodepools = [
    {
      name        = "control-plane-fsn1",
      server_type = "cax11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 2
    },
    {
      name        = "control-plane-nbg1",
      server_type = "cax11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "agent-small",
      server_type = "cax21",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 5
    }
  ]
  load_balancer_type     = "lb11"
  load_balancer_location = "fsn1"

  base_domain = "xxx"
  etcd_s3_backup = {
     etcd-s3-endpoint        = "xxx"
     etcd-s3-access-key      = "xxx"
     etcd-s3-secret-key      = "xxx"
     etcd-s3-bucket          = "xxx"
  }
  enable_longhorn = true
  disable_hetzner_csi = true
  ingress_controller = "none"
  kured_options = {
     "reboot-days": "sa,su"
     "start-time": "3am"
     "end-time": "8am"
     "time-zone": "Local"
   }

  cluster_name = "xxx"
  restrict_outbound_traffic = false
  additional_tls_sans = ["xxx"]
  lb_hostname = "xxx"
  create_kubeconfig = false
  nginx_values = <<EOT
controller:
  kind: "DaemonSet"
  service:
    annotations:
      "load-balancer.hetzner.cloud/name": "tarabao"
      "load-balancer.hetzner.cloud/use-private-ip": "true"
      "load-balancer.hetzner.cloud/protocol": "https"
      "load-balancer.hetzner.cloud/certificate-type": "managed"
      "load-balancer.hetzner.cloud/http-managed-certificate-name": "xxx"
      "load-balancer.hetzner.cloud/http-managed-certificate-domains": "*.xxx"
      "load-balancer.hetzner.cloud/http-redirect-http": "false"
      "load-balancer.hetzner.cloud/type": "lb11"
      "load-balancer.hetzner.cloud/uses-proxyprotocol": "true"
    ports:
      http: 443
  EOT
}

provider "hcloud" {
  token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
}

terraform {
  required_version = ">= 1.3.3"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.38.2"
    }
  }
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig
  sensitive = true
}

variable "hcloud_token" {
  sensitive = true
  default   = ""
}

Screenshots

No response

Platform

MacOS

mysticaltech commented 11 months ago

@carstenblt The best way forward here is to SSH into the server when that error happens (see the debug section of the readme) and run the script directly. In that previous attempt you would have run /tmp/terraform_1586559340.sh, just executing it in your terminal; that will show the error message. Please share that, and also the content of that file (without sensitive values).
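
For reference, a minimal sketch of that debug flow, assuming the default key path from the kube.tf below and a placeholder node IP:

ssh -i ~/.ssh/id_ed25519 root@<control-plane-ip>
# on the node: re-run the failed script with tracing to surface the failing command
sh -x /tmp/terraform_1586559340.sh
# then capture its content to share (redact sensitive values first)
cat /tmp/terraform_1586559340.sh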

carstenblt commented 11 months ago

@mysticaltech Thank you for your reply. The two scripts that fail:

#!/bin/sh
set -ex
sed -i 's/^- |[0-9]\+$/- |/g' /var/post_install/kustomization.yaml
timeout 360 bash <<EOF
  until [[ "\$(kubectl get --raw='/readyz' 2> /dev/null)" == "ok" ]]; do
    echo "Waiting for the cluster to become ready..."
    sleep 2
  done
EOF

kubectl apply -k /var/post_install
echo 'Waiting for the system-upgrade-controller deployment to become available...'
kubectl -n system-upgrade wait --for=condition=available --timeout=360s deployment/system-upgrade-controller
sleep 7
kubectl -n system-upgrade apply -f /var/post_install/plans.yaml

fails with

+ sed -i 's/^- |[0-9]\+$/- |/g' /var/post_install/kustomization.yaml
sed: can't read /var/post_install/kustomization.yaml: Not a directory

and

#!/bin/sh
rm -f /var/user_kustomize/*.yaml.tpl
echo 'Deploying manifests from /var/user_kustomize/:' && ls -alh /var/user_kustomize
kubectl kustomize /var/user_kustomize/ | kubectl apply --wait=true -f -

fails with

Deploying manifests from /var/user_kustomize/:
-rw-r--r--. 1 root root 334 Dec 22 11:50 /var/user_kustomize
error: must build at directory: '/var/user_kustomize/': file is not directory
error: no objects passed to apply

Indeed, /var/user_kustomize is not a directory but rather the contents of my extra-manifests/external-secrets.yaml.tpl file, and /var/post_install is a kured DaemonSet definition.
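
A quick way to confirm that state on the node (a sketch; on a healthy node both paths are directories):

ls -ld /var/post_install /var/user_kustomize
# broken:  -rw-r--r--. 1 root root ... /var/post_install   (a regular file)
# healthy: drwxr-xr-x. 2 root root ... /var/post_install   (a directory)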

mysticaltech commented 11 months ago

@carstenblt Very interesting, I will try to reproduce and look into it ASAP.

libracoder commented 11 months ago

@mysticaltech have you had time to look at this? I am experiencing the same issue.

mysticaltech commented 11 months ago

@libracoder Sorry not yet, will try tonight.

libracoder commented 11 months ago

Oh great, thank you!

libracoder commented 11 months ago

I think I found the issue @mysticaltech

There is a weird "" character appended to the private keys each time they are created, so the local-exec is unable to find the SSH key

(screenshot)

libracoder commented 11 months ago

Strange discovery. This works:

command = "install -b -m 600 /dev/null /tmp/${random_string.identity_file.id} && echo ${file("~/.ssh/id_ed25519")} > /tmp/${random_string.identity_file.id}"

This doesn't:

  provisioner "local-exec" {
    command = <<-EOT
      install -b -m 600 /dev/null /tmp/${random_string.identity_file.id}
      echo "${local.ssh_client_identity}" > /tmp/${random_string.identity_file.id}
    EOT
  }

mysticaltech commented 11 months ago

@libracoder Your issue may be different from @carstenblt's. Which platform are you on? If on Windows, please make sure to use WSL.

libracoder commented 11 months ago

Yeah, I think so. I am having a myriad of issues. I am on Windows using WSL. After fixing the issue with the file names, I now have these errors:

module.kube-hetzner.module.agents["0-0-agent-small"].hcloud_server.server (local-exec): Executing: ["/bin/sh" "-c" "until ssh -o UserKnownHostsFile=/dev/null -o StrictHos
tKeyChecking=no -o IdentitiesOnly=yes -o PubkeyAuthentication=yes -i /tmp/3rgtcngppy7cskz6vtq8 -o ConnectTimeout=2 -p 22 root@128.140.58.213 true 2> /dev/null\r\ndo\r\n  echo \"Waiting for MicroOS to become available...\"\r\n  sleep 3\r\ndone\r\n"]
module.kube-hetzner.module.agents["0-0-agent-small"].hcloud_server.server (local-exec): /bin/sh: 6: Syntax error: end of file unexpected (expecting "do")
╷
│ Error: local-exec provisioner error
│
│   with module.kube-hetzner.module.control_planes["0-0-control-plane-fsn1"].hcloud_server.server,
│   on .terraform/modules/kube-hetzner/modules/host/main.tf line 60, in resource "hcloud_server" "server":
│   60:   provisioner "local-exec" {
│
│ Error running command 'until ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o IdentitiesOnly=yes -o PubkeyAuthentication=yes -i
│ /tmp/4tc950ymt3qc42lez113 -o ConnectTimeout=2 -p 22 root@49.12.185.102 true 2> /dev/null
│ do
│   echo "Waiting for MicroOS to become available..."
│   sleep 3
│ done
│ ': exit status 2. Output: /bin/sh: 6: Syntax error: end of file unexpected (expecting "do")
│
╵
╷
│ Error: local-exec provisioner error
│
│   with module.kube-hetzner.module.agents["0-0-agent-small"].hcloud_server.server,
│   on .terraform/modules/kube-hetzner/modules/host/main.tf line 60, in resource "hcloud_server" "server":
│   60:   provisioner "local-exec" {
│
│ Error running command 'until ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o IdentitiesOnly=yes -o PubkeyAuthentication=yes -i
│ /tmp/3rgtcngppy7cskz6vtq8 -o ConnectTimeout=2 -p 22 root@128.140.58.213 true 2> /dev/null
│ do
│   echo "Waiting for MicroOS to become available..."
│   sleep 3
│ done
│ ': exit status 2. Output: /bin/sh: 6: Syntax error: end of file unexpected (expecting "do")
│
╵
libracoder@Libracoder-Surface8-Pro:/var/www/ht-cloud$

mysticaltech commented 11 months ago

Ok @libracoder, please search this repo; others have successfully deployed via WSL 2, it definitely works. You may also want to have a look at this short but important SSH guide; your keys need to be created in that manner from within WSL too: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/blob/master/docs/ssh.md
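
The \r\n sequences in the log above suggest CRLF line endings in the key file; a hedged way to check and strip them from within WSL (assuming the key path from the kube.tf above):

grep -c $'\r' ~/.ssh/id_ed25519        # a non-zero count means CRLF endings are present
tr -d '\r' < ~/.ssh/id_ed25519 > /tmp/id_ed25519.lf
mv /tmp/id_ed25519.lf ~/.ssh/id_ed25519 && chmod 600 ~/.ssh/id_ed25519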

libracoder commented 11 months ago

Thank you so much for your help. I will go through the docs you shared.

mysticaltech commented 11 months ago

@libracoder That's one issue that just showed up on my radar, with a solution, and there are others too: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/issues/1140

mysticaltech commented 11 months ago

@carstenblt FYI, just pushed the fix in v2.11.2, and tested with your own example, it worked like a charm. Good luck!

mysticaltech commented 11 months ago

(Screenshot from 2023-12-31 00-02-44)

carstenblt commented 11 months ago

> @carstenblt FYI, just pushed the fix in v2.11.2, and tested with your own example, it worked like a charm. Good luck!

+ sed -i 's/^- |[0-9]\+$/- |/g' /var/post_install/kustomization.yaml
sed: can't read /var/post_install/kustomization.yaml: Not a directory

is still the same. This is the kured plan.

The other changed:

/tmp/terraform_1365493471.sh: line 3: unexpected EOF while looking for matching `''

This is because the file looks like this; note the typo (the unclosed quote on the echo line):

#!/bin/sh
rm -f /var/user_kustomize/*.yaml.tpl
echo 'Applying user kustomization...
kubectl apply -k /var/user_kustomize/ --wait=true

Sorry I'm not of much help, I don't know any Terraform. But I believe the problem might be that the creation of the directories is only done in the resource "null_resource" "first_control_plane" section, whereas it should be done on all control planes. The file provisioner then does not work, because folders are apparently not created when connecting over SSH: https://developer.hashicorp.com/terraform/language/resources/provisioners/file. It might be that the problem arose after I moved one control node to a new nodepool because I wanted it in another location.
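
If that analysis is right, a hedged manual workaround until a fix lands would be to restore the expected directories on each affected control plane before re-applying:

# remove the stray regular files and recreate them as directories
rm -f /var/post_install /var/user_kustomize
mkdir -p /var/post_install /var/user_kustomize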

mysticaltech commented 11 months ago

@carstenblt Run terraform destroy, make sure you are on a proper Linux shell (like WSL), and try again. If it worked for me with this very same setup, it should work for you.

mysticaltech commented 11 months ago

@carstenblt Thanks for your PR #1143 , it was merged and deployed in v2.11.3.

carstenblt commented 11 months ago

@mysticaltech I believe this is what breaks it:

I'll try to check later if this reproduces it.

mysticaltech commented 11 months ago

@carstenblt Of course, that's not supposed to happen. See the readme and kube.tf.example: you can only scale nodepools up and down, and it should be done very carefully, with draining, proper node deletion, etc. And once HA, you cannot scale down to non-HA in my experience. What you could do is create a new nodepool with a count of 1, bringing the total number of control planes to 4, then drain the last node of the first nodepool, run kubectl delete node, and scale the count down to 2.
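
A short sketch of that drain-and-remove sequence (the node name is a placeholder):

kubectl drain <node-to-remove> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-to-remove>
# then lower that nodepool's count in kube.tf and run terraform apply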

otavio commented 11 months ago

It seems that something is wrong. Please see the details:

module.infra.null_resource.kustomization (remote-exec): Connecting to remote host via SSH...
module.infra.null_resource.kustomization (remote-exec):   Host: xxxx
module.infra.null_resource.kustomization (remote-exec):   User: root
module.infra.null_resource.kustomization (remote-exec):   Password: false
module.infra.null_resource.kustomization (remote-exec):   Private key: false
module.infra.null_resource.kustomization (remote-exec):   Certificate: false
module.infra.null_resource.kustomization (remote-exec):   SSH Agent: true
module.infra.null_resource.kustomization (remote-exec):   Checking Host Key: false
module.infra.null_resource.kustomization (remote-exec):   Target Platform: unix
module.infra.null_resource.kustomization (remote-exec): Connected!
module.infra.null_resource.kustomization: Still creating... [40s elapsed]
module.infra.null_resource.kustomization (remote-exec): + sed -i 's/^- |[0-9]\+$/- |/g' /var/post_install/kustomization.yaml
module.infra.null_resource.kustomization (remote-exec): sed: can't read /var/post_install/kustomization.yaml: Not a directory
╷
│ Error: remote-exec provisioner error
│
│   with module.infra.null_resource.kustomization,
│   on .terraform/modules/infra/init.tf line 288, in resource "null_resource" "kustomization":
│  288:   provisioner "remote-exec" {
│
│ error executing "/tmp/terraform_148135017.sh": Process exited with status 2
╵
% ssh root@xxxx -p 60022
infra-control-plane-vnc:~ # less /var/post_install
infra-control-plane-vnc:~ # ls -l /var/post_install
-rw-r--r--. 1 root root 642 Jan  2 18:16 /var/post_install
infra-control-plane-vnc:~ #

The file /var/post_install should have been a directory instead.

mysticaltech commented 11 months ago

@otavio Please open a new issue with reproducible steps, and please try to debug it yourself too.