remche / terraform-openstack-rke

Terraform Openstack RKE

Adding a worker doesn't add a node in RKE cluster #76

Closed steffansluis closed 3 years ago

steffansluis commented 4 years ago

This might be a bug in the RKE provider, but since I am not using it directly, I figured it makes sense to report it here first. Using the following config:

data "openstack_images_image_v2" "ubuntu" {
  name = "Ubuntu-18.04"
  most_recent = true
}

resource "openstack_compute_keypair_v2" "keypair" {
  name = "my-application-keypair-${var.environment}"
}

module "rke" {
  cluster_name       = "my-application-${var.environment}"
  source             = "remche/rke/openstack"
  version            = "0.5.4"
  image_name         = data.openstack_images_image_v2.ubuntu.name
  public_net_name    = "external"
  master_flavor_name = "m1.medium"
  worker_flavor_name = "m1.large"
  os_auth_url        = "https://myopenstackprovider.com:5000"
  os_password        = var.os_password
  edge_count         = 0
  worker_count       = 4
  master_count       = 1
  use_ssh_agent      = true
  ssh_keypair_name   = openstack_compute_keypair_v2.keypair.name
  master_labels      = { "node-role.kubernetes.io/master" = "true" }
  edge_labels        = { "node-role.kubernetes.io/edge" = "true" }
  user_data_file     = "cloud-init.yaml"
  system_user        = "ubuntu"
  nodes_config_drive = true
  deploy_traefik     = true
  deploy_nginx       = false
}

When I increase worker_count to 5 and run terraform apply -auto-approve, it spins up a new instance on my OpenStack provider; however, the instance does not register as a node with the RKE cluster that is already running on the existing instances. Adding a worker this way used to work when I was still on 0.4.2 of this module, but no longer does with 0.5.4. I've tested on two separate existing clusters: both successfully create the new instance on OpenStack but fail to recognize the new node. In both cases, the apply gets interrupted with:

time="2020-10-05T14:00:49+02:00" level=error msg="Failed to upgrade hosts: my-application-staging-worker-004 with error [Failed to verify healthcheck: Failed to check http://localhost:10248/healthz for service [kubelet] on host [192.168.42.42]: Get http://localhost:10248/healthz: Unable to access the service on localhost:10248. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: F1005 12:00:45.096391   25275 server.go:274] failed to run Kubelet: could not init cloud provider \"openstack\": Authentication failed]"                                                                                                                                                                 

Failed running cluster err:[workerPlane] Failed to upgrade Worker Plane: [Failed to verify healthcheck: Failed to check http://localhost:10248/healthz for service [kubelet] on host [192.168.42.42]: Get http://localhost:10248/healthz: Unable to access the service on localhost:10248. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: F1005 12:00:45.096391   25275 server.go:274] failed to run Kubelet: could not init cloud provider "openstack": Authentication failed]                                   
========================================                                                    

on .terraform/modules/rke/modules/rke/main.tf line 54, in resource "rke_cluster" "cluster":             
54: resource "rke_cluster" "cluster" {

However, in both cases just retrying terraform apply -auto-approve eventually results in Apply complete! Resources: 1 added, 0 changed, 0 destroyed..

Terraform v0.13.2
+ provider registry.terraform.io/hashicorp/local v1.4.0
+ provider registry.terraform.io/hashicorp/null v2.1.2
+ provider registry.terraform.io/rancher/rke v1.1.2
+ provider registry.terraform.io/terraform-provider-openstack/openstack v1.32.0
+ provider registry.terraform.io/terraform-providers/openstack v1.32.0
remche commented 4 years ago

Hi, I feel this comes from a misuse of the keypair / ssh-agent. The module will create a key for you if you don't specify ssh_keypair_name, so why are you creating one first?

In any case, if you create a new key, it won't be in ssh-agent and the module won't be able to connect to the nodes. I'm surprised this code works even for creating the cluster... Are you able to connect via SSH to the newly created VM?
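
For reference, a minimal sketch of what I mean, assuming every other input stays exactly as in your module call above:

module "rke" {
  source          = "remche/rke/openstack"
  version         = "0.5.4"
  cluster_name    = "my-application-${var.environment}"
  image_name      = data.openstack_images_image_v2.ubuntu.name
  public_net_name = "external"
  os_auth_url     = "https://myopenstackprovider.com:5000"
  os_password     = var.os_password
  # ssh_keypair_name intentionally omitted: the module then generates its own
  # keypair, so there is no externally created key that first has to be loaded
  # into ssh-agent.
  # (master/worker counts, labels, user_data_file, etc. unchanged from above)
}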

steffansluis commented 4 years ago

There is no real reason I create the key pair first, but I don't think it should be a problem; I don't mind it being explicit rather than implicitly created by the module. I did have to add the key to my SSH agent manually (during the wait_for_ssh period) to get the cluster to create (I couldn't get it to work with use_ssh_agent = false using the ssh_keypair_name), and I can connect to the cluster through SSH fine after that. The main reason I have been doing it this way is that when I set it up I couldn't get it to work with (or didn't try) the other methods. I still find this very strange behavior though: why shouldn't this work, and why does the command complete successfully? I can give it another shot with use_ssh_agent = false, but I tried that about a week ago and couldn't get it to create a cluster at all.

Edit: Without using the agent I get:

Error: timeout - last error: Error connecting to bastion: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
remche commented 4 years ago

I was not sure of your workflow, but I agree this should definitely work. I did not manage to reproduce your issue, though. Can you provide a debug log and confirm that you can connect to the newly spawned node during the module.rke.rke_cluster.cluster creation?

dhrp commented 3 years ago

I've tried to reproduce this issue and found that it comes down to whether TF_VAR_os_password is set (or not).

The key is also in @steffansluis's output: failed to run Kubelet: could not init cloud provider \"openstack\": Authentication failed]. This relates not to SSH, but to authentication against OpenStack: the cloud provider (driver) fails to authenticate.

Empirically, I have found that if I have OS_PASSWORD set, Terraform is able to connect to OpenStack and start the machines, but TF_VAR_os_password also needs to be set for the cloud provider to work.

Note: Terraform uses the default OpenStack client and therefore connects successfully if only OS_PASSWORD is set. That's why it only fails later, when trying to use the cloud provider.
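
To illustrate, here is a rough sketch of the two authentication paths, assuming a provider block like the one below (the block itself is not part of the reported configuration):

# Path 1: Terraform's own OpenStack provider. With no explicit credentials it
# falls back to the standard OS_* environment variables (OS_AUTH_URL,
# OS_PASSWORD, ...), which is why plain instance creation succeeds when only
# OS_PASSWORD is exported.
provider "openstack" {}

# Path 2: the in-cluster cloud provider configured by the module only receives
# what is passed in explicitly (os_password = var.os_password above), and that
# variable is only populated when TF_VAR_os_password is set.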

remche commented 3 years ago

@dhrp thanks for pointing this out !

If you want to use the cloud provider, you need to set the os_auth_url and os_password TF variables. That's because we can't retrieve them from the identity_auth_scope_v3 data source.
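
As a rough sketch of what that looks like in the root module (the declarations below are only an illustration, not taken from the reported configuration):

variable "os_auth_url" {
  type        = string
  description = "Keystone auth URL, populated e.g. via TF_VAR_os_auth_url"
}

variable "os_password" {
  type        = string
  description = "OpenStack password, populated e.g. via TF_VAR_os_password"
}

# then passed through to the module:
#   os_auth_url = var.os_auth_url
#   os_password = var.os_password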

The USAGE.md file already states that, but I might add something to the README.

@steffansluis did you set the TF variables when trying to scale up the cluster?

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.