rancher / quickstart

Azure quickstart fails with errors regarding etcd #146

Closed: gingters closed this issue 2 years ago

gingters commented 3 years ago

Hi,

I tried running the Azure quickstart. I created a service principal and configured the .tfvars file accordingly. However, when I run terraform apply --auto-approve, it fails with the following error message (some names in paths are x'ed out).

Error: 
============= RKE outputs ==============
time="2020-12-11T11:52:43+01:00" level=info msg="Deleting RKE cluster..."
time="2020-12-11T11:52:43+01:00" level=info msg="Tearing down Kubernetes cluster"
time="2020-12-11T11:52:43+01:00" level=info msg="[dialer] Setup tunnel for host [20.52.52.111]"
time="2020-12-11T11:53:04+01:00" level=warning msg="Failed to set up SSH tunneling for host [20.52.52.111]: Can't retrieve Docker Info: error during connect: Get \"http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.24/info\": Failed to dial ssh using address [20.52.52.111:22]: dial tcp 20.52.52.111:22: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond."
time="2020-12-11T11:53:04+01:00" level=warning msg="Removing host [20.52.52.111] from node lists"
time="2020-12-11T11:56:12+01:00" level=info msg="[rke_provider] rke cluster changed arguments: map[cluster_name:true kubernetes_version:true nodes:true]"
time="2020-12-11T11:56:12+01:00" level=info msg="Creating RKE cluster..."
time="2020-12-11T11:56:12+01:00" level=info msg="Initiating Kubernetes cluster"
time="2020-12-11T11:56:12+01:00" level=info msg="[dialer] Setup tunnel for host [20.52.135.101]"
time="2020-12-11T11:56:13+01:00" level=info msg="Checking if container [cluster-state-deployer] is running on host [20.52.135.101], try #1"
time="2020-12-11T11:56:13+01:00" level=info msg="Pulling image [rancher/rke-tools:v0.1.66] on host [20.52.135.101], try #1"
time="2020-12-11T11:56:21+01:00" level=info msg="Image [rancher/rke-tools:v0.1.66] exists on host [20.52.135.101]"
time="2020-12-11T11:56:43+01:00" level=info msg="Starting container [cluster-state-deployer] on host [20.52.135.101], try #1"
time="2020-12-11T11:56:44+01:00" level=info msg="[state] Successfully started [cluster-state-deployer] container on host [20.52.135.101]"
time="2020-12-11T11:56:44+01:00" level=info msg="[certificates] Generating CA kubernetes certificates"
time="2020-12-11T11:56:44+01:00" level=info msg="[certificates] Generating Kubernetes API server aggregation layer requestheader client CA certificates"
time="2020-12-11T11:56:44+01:00" level=info msg="[certificates] GenerateServingCertificate is disabled, checking if there are unused kubelet certificates"
time="2020-12-11T11:56:44+01:00" level=info msg="[certificates] Generating Kubernetes API server certificates"
time="2020-12-11T11:56:44+01:00" level=info msg="[certificates] Generating Service account token key"
time="2020-12-11T11:56:44+01:00" level=info msg="[certificates] Generating Kube Controller certificates"
time="2020-12-11T11:56:44+01:00" level=info msg="[certificates] Generating Kube Scheduler certificates"
time="2020-12-11T11:56:44+01:00" level=info msg="[certificates] Generating Kube Proxy certificates"
time="2020-12-11T11:56:44+01:00" level=info msg="[certificates] Generating Node certificate"
time="2020-12-11T11:56:44+01:00" level=info msg="[certificates] Generating admin certificates and kubeconfig"
time="2020-12-11T11:56:45+01:00" level=info msg="[certificates] Generating Kubernetes API server proxy client certificates"
time="2020-12-11T11:56:45+01:00" level=info msg="[certificates] Generating kube-etcd-10-0-0-4 certificate and key"
time="2020-12-11T11:56:45+01:00" level=info msg="Successfully Deployed state file at [C:\\Dev\\cst\\xxxx\\xxxx\\xxxx\\xxxx-rancher\\quickstart\\azure\\terraform-provider-rke-tmp-169182102/cluster.rkestate]"
time="2020-12-11T11:56:45+01:00" level=info msg="Building Kubernetes cluster"
time="2020-12-11T11:56:45+01:00" level=info msg="[dialer] Setup tunnel for host [20.52.135.101]"
time="2020-12-11T11:56:46+01:00" level=info msg="[network] Deploying port listener containers"
time="2020-12-11T11:56:46+01:00" level=info msg="Image [rancher/rke-tools:v0.1.66] exists on host [20.52.135.101]"
time="2020-12-11T11:56:49+01:00" level=info msg="Starting container [rke-etcd-port-listener] on host [20.52.135.101], try #1"
time="2020-12-11T11:56:50+01:00" level=info msg="[network] Successfully started [rke-etcd-port-listener] container on host [20.52.135.101]"
time="2020-12-11T11:56:50+01:00" level=info msg="Image [rancher/rke-tools:v0.1.66] exists on host [20.52.135.101]"
time="2020-12-11T11:56:53+01:00" level=info msg="Starting container [rke-cp-port-listener] on host [20.52.135.101], try #1"
time="2020-12-11T11:56:53+01:00" level=info msg="[network] Successfully started [rke-cp-port-listener] container on host [20.52.135.101]"
time="2020-12-11T11:56:53+01:00" level=info msg="Image [rancher/rke-tools:v0.1.66] exists on host [20.52.135.101]"
time="2020-12-11T11:56:56+01:00" level=info msg="Starting container [rke-worker-port-listener] on host [20.52.135.101], try #1"
time="2020-12-11T11:56:57+01:00" level=info msg="[network] Successfully started [rke-worker-port-listener] container on host [20.52.135.101]"
time="2020-12-11T11:56:57+01:00" level=info msg="[network] Port listener containers deployed successfully"
time="2020-12-11T11:56:57+01:00" level=info msg="[network] Running control plane -> etcd port checks"
time="2020-12-11T11:56:57+01:00" level=info msg="Image [rancher/rke-tools:v0.1.66] exists on host [20.52.135.101]"
time="2020-12-11T11:57:00+01:00" level=info msg="Starting container [rke-port-checker] on host [20.52.135.101], try #1"
time="2020-12-11T11:57:01+01:00" level=info msg="[network] Successfully started [rke-port-checker] container on host [20.52.135.101]"
time="2020-12-11T11:57:01+01:00" level=info msg="Removing container [rke-port-checker] on host [20.52.135.101], try #1"
time="2020-12-11T11:57:01+01:00" level=info msg="[network] Running control plane -> worker port checks"
time="2020-12-11T11:57:01+01:00" level=info msg="Image [rancher/rke-tools:v0.1.66] exists on host [20.52.135.101]"
time="2020-12-11T11:57:04+01:00" level=info msg="Starting container [rke-port-checker] on host [20.52.135.101], try #1"
time="2020-12-11T11:57:04+01:00" level=info msg="[network] Successfully started [rke-port-checker] container on host [20.52.135.101]"
time="2020-12-11T11:57:04+01:00" level=info msg="Removing container [rke-port-checker] on host [20.52.135.101], try #1"
time="2020-12-11T11:57:05+01:00" level=info msg="[network] Running workers -> control plane port checks"
time="2020-12-11T11:57:05+01:00" level=info msg="Image [rancher/rke-tools:v0.1.66] exists on host [20.52.135.101]"
time="2020-12-11T11:57:08+01:00" level=info msg="Starting container [rke-port-checker] on host [20.52.135.101], try #1"
time="2020-12-11T11:57:08+01:00" level=info msg="[network] Successfully started [rke-port-checker] container on host [20.52.135.101]"
time="2020-12-11T11:57:08+01:00" level=info msg="Removing container [rke-port-checker] on host [20.52.135.101], try #1"
time="2020-12-11T11:57:08+01:00" level=info msg="[network] Checking KubeAPI port Control Plane hosts"
time="2020-12-11T11:57:08+01:00" level=info msg="[network] Removing port listener containers"
time="2020-12-11T11:57:08+01:00" level=info msg="Removing container [rke-etcd-port-listener] on host [20.52.135.101], try #1"
time="2020-12-11T11:57:09+01:00" level=info msg="[remove/rke-etcd-port-listener] Successfully removed container on host [20.52.135.101]"
time="2020-12-11T11:57:09+01:00" level=info msg="Removing container [rke-cp-port-listener] on host [20.52.135.101], try #1"
time="2020-12-11T11:57:09+01:00" level=info msg="[remove/rke-cp-port-listener] Successfully removed container on host [20.52.135.101]"
time="2020-12-11T11:57:09+01:00" level=info msg="Removing container [rke-worker-port-listener] on host [20.52.135.101], try #1"
time="2020-12-11T11:57:10+01:00" level=info msg="[remove/rke-worker-port-listener] Successfully removed container on host [20.52.135.101]"
time="2020-12-11T11:57:10+01:00" level=info msg="[network] Port listener containers removed successfully"
time="2020-12-11T11:57:10+01:00" level=info msg="[certificates] Deploying kubernetes certificates to Cluster nodes"
time="2020-12-11T11:57:10+01:00" level=info msg="Checking if container [cert-deployer] is running on host [20.52.135.101], try #1"
time="2020-12-11T11:57:10+01:00" level=info msg="Image [rancher/rke-tools:v0.1.66] exists on host [20.52.135.101]"
time="2020-12-11T11:57:13+01:00" level=info msg="Starting container [cert-deployer] on host [20.52.135.101], try #1"
time="2020-12-11T11:57:14+01:00" level=info msg="Checking if container [cert-deployer] is running on host [20.52.135.101], try #1"
time="2020-12-11T11:57:19+01:00" level=info msg="Checking if container [cert-deployer] is running on host [20.52.135.101], try #1"
time="2020-12-11T11:57:19+01:00" level=info msg="Removing container [cert-deployer] on host [20.52.135.101], try #1"
time="2020-12-11T11:57:19+01:00" level=info msg="[reconcile] Rebuilding and updating local kube config"
time="2020-12-11T11:57:19+01:00" level=info msg="Successfully Deployed local admin kubeconfig at [C:\\Dev\\cst\\xxxx\\xxxx\\xxxx\\xxxx-rancher\\quickstart\\azure\\terraform-provider-rke-tmp-169182102/kube_config_cluster.yml]"
time="2020-12-11T11:57:21+01:00" level=info msg="[certificates] Successfully deployed kubernetes certificates to Cluster nodes"
time="2020-12-11T11:57:21+01:00" level=info msg="[file-deploy] Deploying file [/etc/kubernetes/audit-policy.yaml] to node [20.52.135.101]"
time="2020-12-11T11:57:21+01:00" level=info msg="Image [rancher/rke-tools:v0.1.66] exists on host [20.52.135.101]"
time="2020-12-11T11:57:24+01:00" level=info msg="Starting container [file-deployer] on host [20.52.135.101], try #1"
time="2020-12-11T11:57:25+01:00" level=info msg="Successfully started [file-deployer] container on host [20.52.135.101]"
time="2020-12-11T11:57:25+01:00" level=info msg="Waiting for [file-deployer] container to exit on host [20.52.135.101]"
time="2020-12-11T11:57:25+01:00" level=info msg="Waiting for [file-deployer] container to exit on host [20.52.135.101]"
time="2020-12-11T11:57:25+01:00" level=info msg="Removing container [file-deployer] on host [20.52.135.101], try #1"
time="2020-12-11T11:57:25+01:00" level=info msg="[remove/file-deployer] Successfully removed container on host [20.52.135.101]"
time="2020-12-11T11:57:25+01:00" level=info msg="[/etc/kubernetes/audit-policy.yaml] Successfully deployed audit policy file to Cluster control nodes"
time="2020-12-11T11:57:25+01:00" level=info msg="[reconcile] Reconciling cluster state"
time="2020-12-11T11:57:25+01:00" level=info msg="[reconcile] This is newly generated cluster"
time="2020-12-11T11:57:25+01:00" level=info msg="Pre-pulling kubernetes images"
time="2020-12-11T11:57:25+01:00" level=info msg="Pulling image [rancher/hyperkube:v1.19.3-rancher1] on host [20.52.135.101], try #1"
time="2020-12-11T12:00:19+01:00" level=info msg="Image [rancher/hyperkube:v1.19.3-rancher1] exists on host [20.52.135.101]"
time="2020-12-11T12:00:19+01:00" level=info msg="Kubernetes images pulled successfully"
time="2020-12-11T12:00:19+01:00" level=info msg="[etcd] Building up etcd plane.."
time="2020-12-11T12:00:19+01:00" level=info msg="Image [rancher/rke-tools:v0.1.66] exists on host [20.52.135.101]"
time="2020-12-11T12:01:09+01:00" level=warning msg="Failed to create Docker container [etcd-fix-perm] on host [20.52.135.101]: Cannot connect to the Docker daemon at npipe:////./pipe/docker_engine. Is the docker daemon running?"
time="2020-12-11T12:01:10+01:00" level=warning msg="Failed to create Docker container [etcd-fix-perm] on host [20.52.135.101]: Error response from daemon: Conflict. The container name \"/etcd-fix-perm\" is already in use by container \"6b2133859f518f3912419a19092c643ea4208d7b9c1ac953e7bcdec287c50d9f\". You have to remove (or rename) that container to be able to reuse that name."
time="2020-12-11T12:01:10+01:00" level=warning msg="Failed to create Docker container [etcd-fix-perm] on host [20.52.135.101]: Error response from daemon: Conflict. The container name \"/etcd-fix-perm\" is already in use by container \"6b2133859f518f3912419a19092c643ea4208d7b9c1ac953e7bcdec287c50d9f\". You have to remove (or rename) that container to be able to reuse that name."

Failed running cluster err:[etcd] Failed to bring up Etcd Plane: Failed to create [etcd-fix-perm] container on host [20.52.135.101]: Failed to create Docker container [etcd-fix-perm] on host [20.52.135.101]: Error response from daemon: Conflict. The container name "/etcd-fix-perm" is already in use by container "6b2133859f518f3912419a19092c643ea4208d7b9c1ac953e7bcdec287c50d9f". You have to remove (or rename) that container to be able to reuse that name.
========================================

  on ..\rancher-common\rke.tf line 4, in resource "rke_cluster" "rancher_cluster":
   4: resource "rke_cluster" "rancher_cluster" {

How can this be solved?

gingters commented 3 years ago

Thanks to the people in the Slack, we figured out that the Standard_LRS storage on the VMs is the issue: the disks are too slow. When I changed the infra.tf file to use Premium_LRS storage disks for the VMs, the system came up in one go.
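
For anyone hitting the same error, here is a rough sketch of the kind of change meant here. It is not a copy of the quickstart's infra.tf: the resource type (azurerm_linux_virtual_machine), the resource name rancher_server, and the caching value are assumptions for illustration, so compare against what your copy of the file actually contains. The only line that matters is storage_account_type.

```hcl
# Sketch only. The resource type and the name "rancher_server" are assumed
# for illustration; match them to the VM resources in your own infra.tf.
resource "azurerm_linux_virtual_machine" "rancher_server" {
  # ... all other arguments stay as they are in your file ...

  os_disk {
    caching              = "ReadWrite"   # assumed value, leave yours unchanged
    storage_account_type = "Premium_LRS" # was "Standard_LRS"
  }
}
```

Note that Azure only allows premium disks on VM sizes that support premium storage, so the VM size in the template may also need checking if terraform apply rejects the change.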

So I'd like to suggest changing these values in the template by default, so that others don't run into this issue. Yes, it's a bit more expensive for the person trying it out on Azure, but if you want to try out Rancher, you really want it to work on the first go rather than having to troubleshoot something like this.

anttivikman commented 3 years ago

Can confirm that switching to Premium_LRS fixed the issue.