rancher / terraform-provider-rke

Terraform provider plugin for deploying a Kubernetes cluster with RKE (Rancher Kubernetes Engine)
Mozilla Public License 2.0

Cluster must have at least one etcd plane host #289

Closed. a0s closed this issue 3 years ago

a0s commented 3 years ago

I have an issue similar to https://github.com/rancher/terraform-provider-rke/issues/139 and https://github.com/rancher/terraform-provider-rke/issues/9.

Versions:

Terraform v0.14.8
+ provider registry.terraform.io/hashicorp/local v2.1.0
+ provider registry.terraform.io/hashicorp/null v3.1.0
+ provider registry.terraform.io/hashicorp/template v2.2.0
+ provider registry.terraform.io/hashicorp/time v0.7.0
+ provider registry.terraform.io/hashicorp/tls v3.1.0
+ provider registry.terraform.io/hetznercloud/hcloud v1.25.2
+ provider registry.terraform.io/rancher/rke v1.2.1

My RKE config:

resource "rke_cluster" "this" {
  ssh_agent_auth   = true

  dynamic "nodes" {
    for_each = var.name_ip4_public

    content {
      address = var.name_ip4_public[nodes.key]
      internal_address = var.name_ip4_private[nodes.key]
      hostname_override = nodes.key
      node_name = nodes.key
      user = "root"
      role = local.roles[nodes.key]
      ssh_key = var.ssk_key_private
    }
  }

  upgrade_strategy {
    drain = true
    max_unavailable_worker = "100%"
    max_unavailable_controlplane = "100%"

    drain_input {
      ignore_daemon_sets = true
      delete_local_data = true
      timeout = 600
    }
  }

  ignore_docker_version = true
  cluster_name = "k8s"
  kubernetes_version = "v1.19.6-rancher1-1"

  addons = <<EOL
---
apiVersion: v1
kind: Secret
metadata:
  name: hcloud-csi
  namespace: kube-system
stringData:
  token: ${var.token}
EOL

  addons_include = [
    "https://raw.githubusercontent.com/hetznercloud/csi-driver/v1.5.1/deploy/kubernetes/hcloud-csi.yml",
  ]

  ingress {
    provider = "none"
  }

  network {
    plugin = "calico"
  }
}

My cloud-init config is:

#cloud-config
packages:
  - apt-transport-https
  - ca-certificates
  - curl
  - gnupg-agent
  - software-properties-common

write_files:
  - path: /etc/sysctl.d/enabled_ipv4_forwarding.conf
    content: |
      net.ipv4.conf.all.forwarding=1
groups:
  - docker

runcmd:
  - curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
  - add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
  - apt-get update -y
  - apt-get install -y docker-ce docker-ce-cli containerd.io
  - systemctl start docker
  - systemctl enable docker

system_info:
  default_user:
    groups: [docker]

After that, I wait for the sockets in a null_resource:

resource "null_resource" "waiter" {
  for_each = {
    for server in var.servers : server.name => server
  }

  depends_on = [hcloud_server_network.this]

  provisioner "remote-exec" {
    connection {
      type = "ssh"
      host = hcloud_server.this[each.key].ipv4_address
      user = "root"
      private_key = tls_private_key.this.private_key_pem
      timeout = var.waiting_for_network
    }

    inline = [
      "while [ ! -f /var/lib/cloud/instance/boot-finished ]; do echo 'Waiting for cloud-init...'; sleep 1; done;",
      "while [ ! -S /var/run/docker.sock ]; do echo 'Waiting for docker.sock...'; sleep 1; done;",
      "while [ ! -S /run/containerd/containerd.sock ]; do echo 'Waiting for containerd.sock...'; sleep 1; done;",
    ]
  }
}

The main part of the log:

....
module.terraform-hetzner-kubernetes.module.hcloud.null_resource.waiter["k8s-nbg1-master-1"] (remote-exec): Waiting for cloud-init...
module.terraform-hetzner-kubernetes.module.hcloud.null_resource.waiter["k8s-nbg1-master-1"] (remote-exec): Waiting for cloud-init...
module.terraform-hetzner-kubernetes.module.hcloud.null_resource.waiter["k8s-nbg1-master-1"]: Still creating... [1m10s elapsed]
module.terraform-hetzner-kubernetes.module.hcloud.null_resource.waiter["k8s-nbg1-master-1"] (remote-exec): Waiting for cloud-init...
module.terraform-hetzner-kubernetes.module.hcloud.null_resource.waiter["k8s-nbg1-master-1"] (remote-exec): Waiting for cloud-init...
module.terraform-hetzner-kubernetes.module.hcloud.null_resource.waiter["k8s-nbg1-master-1"] (remote-exec): Waiting for cloud-init...
module.terraform-hetzner-kubernetes.module.hcloud.null_resource.waiter["k8s-nbg1-master-1"] (remote-exec): Waiting for cloud-init...
module.terraform-hetzner-kubernetes.module.hcloud.null_resource.waiter["k8s-nbg1-master-1"] (remote-exec): Waiting for cloud-init...
module.terraform-hetzner-kubernetes.module.hcloud.null_resource.waiter["k8s-nbg1-master-1"]: Creation complete after 1m16s [id=8937222847915215564]

Error:
============= RKE outputs ==============
time="2021-03-18T13:36:25+03:00" level=info msg="[rke_provider] rke cluster changed arguments: map[addons:true addons_include:true cluster_name:true ingress:true kubernetes_version:true network:true nodes:true ssh_agent_auth:true]"
time="2021-03-18T13:36:25+03:00" level=info msg="Creating RKE cluster..."
time="2021-03-18T13:36:25+03:00" level=info msg="Initiating Kubernetes cluster"
time="2021-03-18T13:36:25+03:00" level=info msg="[dialer] Setup tunnel for host [135.181.201.197]"
time="2021-03-18T13:36:25+03:00" level=info msg="[dialer] Setup tunnel for host [135.181.97.222]"
time="2021-03-18T13:36:27+03:00" level=warning msg="Failed to set up SSH tunneling for host [135.181.201.197]: Can't retrieve Docker Info: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
time="2021-03-18T13:36:30+03:00" level=warning msg="Failed to set up SSH tunneling for host [135.181.97.222]: Can't retrieve Docker Info: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
time="2021-03-18T13:36:30+03:00" level=warning msg="Removing host [135.181.201.197] from node lists"
time="2021-03-18T13:36:30+03:00" level=warning msg="Removing host [135.181.97.222] from node lists"
time="2021-03-18T13:36:30+03:00" level=warning msg="[state] can't fetch legacy cluster state from Kubernetes: Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s) []"
time="2021-03-18T13:36:30+03:00" level=info msg="[certificates] Generating CA kubernetes certificates"
time="2021-03-18T13:36:30+03:00" level=info msg="[certificates] Generating Kubernetes API server aggregation layer requestheader client CA certificates"
time="2021-03-18T13:36:30+03:00" level=info msg="[certificates] GenerateServingCertificate is disabled, checking if there are unused kubelet certificates"
time="2021-03-18T13:36:30+03:00" level=info msg="[certificates] Generating Kubernetes API server certificates"
time="2021-03-18T13:36:31+03:00" level=info msg="[certificates] Generating Service account token key"
time="2021-03-18T13:36:31+03:00" level=info msg="[certificates] Generating Kube Controller certificates"
time="2021-03-18T13:36:31+03:00" level=info msg="[certificates] Generating Kube Scheduler certificates"
time="2021-03-18T13:36:31+03:00" level=info msg="[certificates] Generating Kube Proxy certificates"
time="2021-03-18T13:36:31+03:00" level=info msg="[certificates] Generating Node certificate"
time="2021-03-18T13:36:31+03:00" level=info msg="[certificates] Generating admin certificates and kubeconfig"
time="2021-03-18T13:36:31+03:00" level=info msg="[certificates] Generating Kubernetes API server proxy client certificates"
time="2021-03-18T13:36:32+03:00" level=info msg="Successfully Deployed state file at [/Users/a0s/my/oh_my_product_infra/cluster/terraform-provider-rke-tmp-174243540/cluster.rkestate]"
time="2021-03-18T13:36:32+03:00" level=info msg="Building Kubernetes cluster"

Failed running cluster err:Cluster must have at least one etcd plane host: please specify one or more etcd in cluster config
========================================

What else can I check? How can I emulate this step (Can't retrieve Docker Info: error during connect: Get \"http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info\": Failed to dial ssh using address) in Terraform?
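One way to emulate that check from Terraform is to probe `docker info` over the same SSH connection RKE would use. A minimal sketch, reusing the connection settings from the waiter above (the resource name docker_ready is made up for illustration; it assumes docker is on the default PATH for root):

resource "null_resource" "docker_ready" {
  for_each = {
    for server in var.servers : server.name => server
  }

  depends_on = [null_resource.waiter]

  provisioner "remote-exec" {
    connection {
      type        = "ssh"
      host        = hcloud_server.this[each.key].ipv4_address
      user        = "root"
      private_key = tls_private_key.this.private_key_pem
      timeout     = var.waiting_for_network
    }

    inline = [
      # The socket file can exist before the daemon accepts API calls;
      # 'docker info' keeps failing until the daemon actually answers,
      # which is the same condition RKE checks when it sets up the SSH tunnel.
      "until docker info > /dev/null 2>&1; do echo 'Waiting for docker daemon...'; sleep 1; done;",
    ]
  }
}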

a0s commented 3 years ago

It seems it was a race condition. Making the wait (null_resource -> rke_cluster) more explicit fixed the issue.
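For reference, a minimal sketch of what making that wait explicit can look like, assuming the waiter and the rke_cluster resource are visible to each other (when they live in different modules, the same ordering has to be passed across the module boundary, e.g. via a module-level depends_on or an output fed into a variable):

resource "rke_cluster" "this" {
  # Explicitly order cluster creation after the waiter, so RKE does not
  # start dialing the hosts before cloud-init has finished and Docker is up.
  depends_on = [null_resource.waiter]

  # ... rest of the cluster configuration unchanged ...
}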