rancher / terraform-provider-rancher2

Terraform Rancher2 provider
https://www.terraform.io/docs/providers/rancher2/
Mozilla Public License 2.0

[BUG] Nodes are not added to the external load balancer backend pool after load balancer is active #972

Closed: principekiss closed this issue 2 years ago

principekiss commented 2 years ago

Rancher Server Setup

Information about the Cluster

User Information

Describe the bug When creating the downstream RKE cluster in Azure using node pools (1 master pool with etcd+control plane roles and 3 worker pools), the master node is created first, and the load balancer is then created by the user addon. The master registers and is added to the load balancer backend pool. The first worker also registers, but most likely only because the load balancer is not yet active in the virtual network at that point, so the worker still has a working gateway.

Meanwhile, the load balancer becomes active, and any new workers do not get the load balancer as their gateway, which breaks their registration. The remaining worker nodes get stuck in the "Registering" state, and any node (master and/or worker) added through the Rancher UI scaling feature gets stuck in "IP Resolved" until it times out and is deleted.

The expected logic would be that the load balancer is created first, and that Rancher waits for it to be active (or verifies that it is) in the virtual network before it starts adding nodes (masters and/or workers).

That is not happening, which makes me believe there is a logic bug in Rancher itself.

To Reproduce

Result Only the initial master node(s) and the first worker node are registered into the Kubernetes cluster. The other worker nodes get stuck in the "Registering" state, and no additional nodes can be added through the Rancher UI; they get stuck in "IP Resolved".

Expected Result All nodes are registered and I can scale up nodes through the Rancher UI.

Screenshots

(Screenshots captured 2022-09-05, including the rke-downstream-rg resource group.)

Additional context The following Terraform code creates the downstream RKE1 cluster with 1 master node pool (control plane+etcd) and 3 worker pools (system, kafka, and general), with an external load balancer deployed by a user addon job:

## Create Resource Group and Network
resource "azurerm_resource_group" "rke" {
  name     = "${var.rke_name_prefix}-rg"
  location = var.azure_region
}

resource "azurerm_virtual_network" "rke" {
  name                = "${var.rke_name_prefix}-vnet"
  address_space       = var.rke_address_space
  location            = var.azure_region
  resource_group_name = azurerm_resource_group.rke.name
}

resource "azurerm_subnet" "rke" {
  name                 = "${var.rke_name_prefix}-subnet"
  resource_group_name  = azurerm_resource_group.rke.name
  virtual_network_name = azurerm_virtual_network.rke.name
  address_prefixes     = var.rke_address_prefixes
}

## Create Peering Between Rancher and downstream RKE1 Virtual Networks 

resource "azurerm_virtual_network_peering" "rancher" {
  name                       = "rancher-vnet-peering"
  resource_group_name        = azurerm_resource_group.rke.name
  virtual_network_name       = azurerm_virtual_network.rke.name
  remote_virtual_network_id  = var.rancher_vnet_id
}

resource "azurerm_virtual_network_peering" "rke" {
  name                       = "rke-vnet-peering"
  resource_group_name        = var.rancher_rg_name
  virtual_network_name       = var.rancher_vnet_name
  remote_virtual_network_id  = azurerm_virtual_network.rke.id
}

## Create Network Security Groups

resource "azurerm_network_security_group" "worker" {
  name                 = "worker-nsg"
  location             = azurerm_resource_group.rke.location
  resource_group_name  = azurerm_resource_group.rke.name

  security_rule {
    name                       = "SSH_IN"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = 22
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "CanalOverlay_IN"
    priority                   = 110
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Udp"
    source_port_range          = "*"
    destination_port_range     = 8472
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "CanalProbe_IN"
    priority                   = 120
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = 9099
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "IngressProbe_IN"
    priority                   = 130
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = 10254
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "NodePort_UDP_IN"
    priority                   = 140
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Udp"
    source_port_range          = "*"
    destination_port_range     = "30000-32767"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "NodePort_TCP_IN"
    priority                   = 150
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "30000-32767"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "HttpsIngress_IN"
    priority                   = 160
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = 443
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "HttpIngress_IN"
    priority                   = 170
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = 80
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "DockerDaemon_IN"
    priority                   = 180
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = 2376
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "Metrics_IN"
    priority                   = 190
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = 10250
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "KubeAPI_IN"
    priority                   = 200
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = 6443
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

resource "azurerm_network_security_group" "control_plane" {
  name                 = "control-plane-nsg"
  location             = azurerm_resource_group.rke.location
  resource_group_name  = azurerm_resource_group.rke.name

  security_rule {
    name                       = "SSH_IN"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = 22
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "CanalOverlay_IN"
    priority                   = 110
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Udp"
    source_port_range          = "*"
    destination_port_range     = 8472
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "CanalProbe_IN"
    priority                   = 120
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = 9099
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "IngressProbe_IN"
    priority                   = 130
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = 10254
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "Etcd_IN"
    priority                   = 140
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "2379-2380"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "DockerDaemon_IN"
    priority                   = 170
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = 2376
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "Metrics_IN"
    priority                   = 180
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = 10250
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "HttpsIngress_IN"
    priority                   = 190
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = 443
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "HttpIngress_IN"
    priority                   = 200
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = 80
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "KubeAPI_IN"
    priority                   = 210
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = 6443
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "NodePort_UDP_IN"
    priority                   = 220
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Udp"
    source_port_range          = "*"
    destination_port_range     = "30000-32767"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "NodePort_TCP_IN"
    priority                   = 230
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "30000-32767"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

## Create Availability Sets

resource "azurerm_availability_set" "control_plane" {
  name                 = "control-plane-availset"
  location             = azurerm_resource_group.rke.location
  resource_group_name  = azurerm_resource_group.rke.name
}

resource "azurerm_availability_set" "worker" {
  name                 = "worker-availset"
  location             = azurerm_resource_group.rke.location
  resource_group_name  = azurerm_resource_group.rke.name
}

## Create downstream RKE1 Cluster

resource "rancher2_cluster" "rke" {
  name =  "${var.rke_name_prefix}-cluster"
  description = "Downstream RKE Cluster"
  cluster_auth_endpoint {
    enabled = true
  }

  rke_config {
    ignore_docker_version  = false
    kubernetes_version     = "v${var.kubernetes_version}-rancher1-1"

    authentication {
      strategy = "x509|webhook"
    }

    network {
      plugin = "canal"
    }

    ingress {
      provider         = "nginx"
      network_mode     = "none"
      http_port        = 8080
      https_port       = 8443
      default_backend  = false
    }

    services {
      etcd {
        backup_config {
          enabled         = true
          interval_hours  = 12
          retention       = 6
        }

        creation   = "12h"
        retention  = "72h"
        snapshot   = false
      }

      kube_api {
        pod_security_policy      = false
        service_node_port_range  = "30000-32767"
      }
    }

    addons = file("${path.module}/addons/loadbalancer.yaml")

    cloud_provider {
      name = "azure"
      azure_cloud_provider {
        aad_client_id                   = azuread_application.app.application_id
        aad_client_secret               = azuread_service_principal_password.auth.value
        subscription_id                 = data.azurerm_subscription.subscription.subscription_id
        tenant_id                       = data.azurerm_subscription.subscription.tenant_id
        load_balancer_sku               = "standard"
        subnet_name                     = azurerm_subnet.rke.name
        vnet_name                       = azurerm_virtual_network.rke.name
        resource_group                  = azurerm_resource_group.rke.name
        use_instance_metadata           = true
        vm_type                         = "standard"
        primary_availability_set_name   = azurerm_availability_set.worker.name
      }
    }
  }

  provider = rancher2.admin
}

## Create Node Templates

resource "rancher2_node_template" "control_plane" {
  name                 = "control-plane-template"
  description          = "Node Template for RKE Cluster on Azure"
  cloud_credential_id  = rancher2_cloud_credential.cloud_credential.id
  engine_install_url   = "https://releases.rancher.com/install-docker/20.10.sh"
  azure_config {
    managed_disks         = var.control_plane_template.managed_disks
    location              = azurerm_resource_group.rke.location
    image                 = var.control_plane_template.image
    size                  = var.control_plane_template.size
    storage_type          = var.control_plane_template.storage_type
    resource_group        = azurerm_resource_group.rke.name
    no_public_ip          = var.control_plane_template.no_public_ip
    subnet                = azurerm_subnet.rke.name
    vnet                  = azurerm_virtual_network.rke.name
    nsg                   = azurerm_network_security_group.control_plane.name
    availability_set      = azurerm_availability_set.control_plane.name
  }

  provider = rancher2.admin
}

resource "rancher2_node_template" "system" {
  name                 = "system-template"
  description          = "Node Template for RKE Cluster on Azure"
  cloud_credential_id  = rancher2_cloud_credential.cloud_credential.id
  engine_install_url   = "https://releases.rancher.com/install-docker/20.10.sh"
  azure_config {
    managed_disks         = var.system_template.managed_disks
    location              = azurerm_resource_group.rke.location
    image                 = var.system_template.image
    size                  = var.system_template.size
    storage_type          = var.system_template.storage_type
    resource_group        = azurerm_resource_group.rke.name
    no_public_ip          = var.system_template.no_public_ip
    subnet                = azurerm_subnet.rke.name
    vnet                  = azurerm_virtual_network.rke.name
    nsg                   = azurerm_network_security_group.worker.name
    availability_set      = azurerm_availability_set.worker.name
  }

  provider = rancher2.admin
}

resource "rancher2_node_template" "general" {
  name                 = "general-template"
  description          = "Node Template for RKE Cluster on Azure"
  cloud_credential_id  = rancher2_cloud_credential.cloud_credential.id
  engine_install_url   = "https://releases.rancher.com/install-docker/20.10.sh"
  azure_config {
    managed_disks         = var.general_template.managed_disks
    location              = azurerm_resource_group.rke.location
    image                 = var.general_template.image
    size                  = var.general_template.size
    storage_type          = var.general_template.storage_type
    resource_group        = azurerm_resource_group.rke.name
    no_public_ip          = var.general_template.no_public_ip
    subnet                = azurerm_subnet.rke.name
    vnet                  = azurerm_virtual_network.rke.name
    nsg                   = azurerm_network_security_group.worker.name
    availability_set      = azurerm_availability_set.worker.name
  }

  provider = rancher2.admin
}

resource "rancher2_node_template" "kafka" {
  name                 = "kafka-template"
  description          = "Node Template for RKE Cluster on Azure"
  cloud_credential_id  = rancher2_cloud_credential.cloud_credential.id
  engine_install_url   = "https://releases.rancher.com/install-docker/20.10.sh"
  azure_config {
    managed_disks         = var.kafka_template.managed_disks
    location              = azurerm_resource_group.rke.location
    image                 = var.kafka_template.image
    size                  = var.kafka_template.size
    storage_type          = var.kafka_template.storage_type
    resource_group        = azurerm_resource_group.rke.name
    no_public_ip          = var.kafka_template.no_public_ip
    subnet                = azurerm_subnet.rke.name
    vnet                  = azurerm_virtual_network.rke.name
    nsg                   = azurerm_network_security_group.worker.name
    availability_set      = azurerm_availability_set.worker.name
  }

  provider = rancher2.admin
}

## Create Node Pools

resource "rancher2_node_pool" "control_plane" {
  cluster_id        =  rancher2_cluster.rke.id
  name              = "control-plane-node-pool"
  hostname_prefix   = "control-plane"
  node_template_id  = rancher2_node_template.control_plane.id
  quantity          = var.control_plane_pool.quantity
  control_plane     = true
  etcd              = true
  worker            = false

  provider = rancher2.admin
}

resource "rancher2_node_pool" "system" {
  cluster_id        =  rancher2_cluster.rke.id
  name              = "system-node-pool"
  hostname_prefix   = "system"
  node_template_id  = rancher2_node_template.system.id
  quantity          = var.system_pool.quantity
  control_plane     = false
  etcd              = false
  worker            = true

  provider = rancher2.admin
}

resource "rancher2_node_pool" "general" {
  cluster_id        =  rancher2_cluster.rke.id
  name              = "general-pool"
  hostname_prefix   = "general"
  node_template_id  = rancher2_node_template.general.id
  quantity          = var.general_pool.quantity
  control_plane     = false
  etcd              = false
  worker            = true

  provider = rancher2.admin
}

resource "rancher2_node_pool" "kafka" {
  cluster_id        =  rancher2_cluster.rke.id
  name              = "kafka-node-pool"
  hostname_prefix   = "kafka"
  node_template_id  = rancher2_node_template.kafka.id
  quantity          = var.kafka_pool.quantity
  control_plane     = false
  etcd              = false
  worker            = true

  provider = rancher2.admin
}
Addon used to expose the ingress controller using a cloud load balancer:
# external load balancer

apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  externalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: http
  - name: https
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx
  type: LoadBalancer
Getting nodes and pods with the Rancher CLI:
tuxicorn@pop-os:~/rancher-project$ rancher kubectl get nodes -o wide
NAME             STATUS   ROLES               AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
control-plane1   Ready    controlplane,etcd   28m   v1.23.8   10.100.0.4    <none>        Ubuntu 20.04.4 LTS   5.15.0-1017-azure   docker://20.10.12
kafka1           Ready    worker              25m   v1.23.8   10.100.0.5    <none>        Ubuntu 20.04.4 LTS   5.15.0-1017-azure   docker://20.10.12
tuxicorn@pop-os:~/rancher-project$  rancher nodes
ID                NAME             STATE         POOL            DESCRIPTION
c-2n949:m-2qpdg   general2         registering   general         
c-2n949:m-m8g9b   system1          registering   system          
c-2n949:m-mvc5j   kafka1           active        kafka           
c-2n949:m-zksj9   control-plane1   active        control-plane 
diclonius@pop-os:~/rancher-in-a-box$ rancher kubectl get pod --all-namespaces -o wide
NAMESPACE             NAME                                      READY   STATUS      RESTARTS      AGE   IP           NODE             NOMINATED NODE   READINESS GATES
cattle-fleet-system   fleet-agent-7f8ddd996f-njsxr              1/1     Running     0             24m   10.42.1.8    kafka1           <none>           <none>
cattle-system         cattle-cluster-agent-c84f4cb8b-8pkkh      1/1     Running     0             24m   10.42.0.6    control-plane1   <none>           <none>
cattle-system         cattle-cluster-agent-c84f4cb8b-gqsxf      1/1     Running     6 (25m ago)   28m   10.42.0.5    control-plane1   <none>           <none>
cattle-system         cattle-node-agent-k2279                   1/1     Running     0             26m   10.100.0.5   kafka1           <none>           <none>
cattle-system         cattle-node-agent-zxhqs                   1/1     Running     0             28m   10.100.0.4   control-plane1   <none>           <none>
cattle-system         kube-api-auth-8h7xw                       1/1     Running     0             28m   10.100.0.4   control-plane1   <none>           <none>
ingress-nginx         ingress-nginx-admission-create-l77cb      0/1     Completed   0             28m   10.42.0.4    control-plane1   <none>           <none>
ingress-nginx         ingress-nginx-admission-patch-b9fkm       0/1     Completed   2             28m   10.42.0.3    control-plane1   <none>           <none>
ingress-nginx         nginx-ingress-controller-rlmqg            1/1     Running     0             26m   10.42.1.3    kafka1           <none>           <none>
kube-system           calico-kube-controllers-fc7fcb565-tb27v   1/1     Running     0             29m   10.42.0.2    control-plane1   <none>           <none>
kube-system           canal-t2v7m                               2/2     Running     0             26m   10.100.0.5   kafka1           <none>           <none>
kube-system           canal-tlqlv                               2/2     Running     0             29m   10.100.0.4   control-plane1   <none>           <none>
kube-system           coredns-548ff45b67-fghbj                  1/1     Running     0             29m   10.42.1.2    kafka1           <none>           <none>
kube-system           coredns-autoscaler-d5944f655-xr2fb        1/1     Running     0             29m   10.42.1.4    kafka1           <none>           <none>
kube-system           metrics-server-5c4895ffbd-pql4m           1/1     Running     0             29m   10.42.1.6    kafka1           <none>           <none>
kube-system           rke-coredns-addon-deploy-job-2ktnq        0/1     Completed   0             29m   10.100.0.4   control-plane1   <none>           <none>
kube-system           rke-ingress-controller-deploy-job-5fn5g   0/1     Completed   0             29m   10.100.0.4   control-plane1   <none>           <none>
kube-system           rke-metrics-addon-deploy-job-zh9zp        0/1     Completed   0             29m   10.100.0.4   control-plane1   <none>           <none>
kube-system           rke-network-plugin-deploy-job-zp87f       0/1     Completed   0             29m   10.100.0.4   control-plane1   <none>           <none>
kube-system           rke-user-addon-deploy-job-cff55           0/1     Completed   0             28m   10.100.0.4   control-plane1   <none>
Provisioning Log for the cluster
/etc/resolv.conf

All nodes have the same DNS config.

nameserver 127.0.0.53
options edns0
search yaoknpc5pe5enpgbu3mm1u4sjg.ax.internal.cloudapp.net
rancher-agent container logs

Rancher agent logs of stuck nodes in "Registering" state.

INFO: Arguments: --no-register --only-write-certs --node-name system1 --server https://rancher.sauron.mordor.net --token REDACTED --ca-checksum 928a476fa0b0610ef46217292d51ac438f4ffa56ea3b155f240ee89ff4c1f31b

INFO: Environment: CATTLE_ADDRESS=10.100.0.6 CATTLE_AGENT_CONNECT=true CATTLE_INTERNAL_ADDRESS= CATTLE_NODE_NAME=system1 CATTLE_SERVER=https://rancher.sauron.mordor.net CATTLE_TOKEN=REDACTED CATTLE_WRITE_CERT_ONLY=true

INFO: Using resolv.conf: nameserver 127.0.0.53 options edns0 search u52yollp52yetgbxgli0ra2ssb.ax.internal.cloudapp.net
WARN: Loopback address found in /etc/resolv.conf, please refer to the documentation how to configure your cluster to resolve DNS properly
ERROR: https://rancher.gandalf.mordor.net/ping is not accessible (Failed to connect to rancher.sauron.mordor.net port 443: Connection timed out)
principekiss commented 2 years ago

EDIT

This morning, running the exact same code (verified with git status), 2 worker pools registered successfully and 1 got stuck with the following status:

Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Screenshot from 2022-08-19 09-17-44

Collected logs from containers

EDIT

Yesterday, with an entirely new Rancher cluster and downstream RKE cluster creation, I saw the same behavior with the downstream cluster agents described above, and I was able to retrieve logs from the Rancher cluster controller pods with TRACE log level:

These lines are about the stuck nodes, system1 and general1, of the downstream cluster:

2022/08/22 14:45:36 [TRACE] dialerFactory: Skipping node [system1] for tunneling the cluster connection because nodeConditions are not as expected
2022/08/22 14:45:36 [TRACE] dialerFactory: Skipping node [general1] for tunneling the cluster connection because nodeConditions are not as expected

See also the last line of the attached screenshot (from the Rancher cluster controller logs pasted above).

principekiss commented 2 years ago

EDIT

I added the stuck worker nodes to the load balancer backend pool manually, and all of them got registered.

Screenshot from 2022-08-30 18-59-49

To me, this looks like a Rancher bug. Even after all nodes are registered, scaling nodes does not work because Rancher does not ask Azure to add new nodes to the load balancer backend pool once they are created, which leaves the added nodes stuck in a "Provisioning" state. I assume the network security groups are correct; I checked the open ports on the nodes.

Screenshot from 2022-08-30 19-19-57

This is because Rancher only adds the first worker node to the load balancer backend pool after the cluster is successfully created, to meet the minimum requirement of a cluster with 1 worker node plus the initial etcd+control plane nodes. After the cluster is active (as soon as the first worker node is registered), it stops adding other worker nodes to the load balancer backend pool, so no scaling is possible when using an external load balancer.

I should be able to use the user addon to create a Service of type LoadBalancer pointing to the ingress controller, have all my nodes registered, and have node scaling work.
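
For reference, the manual step described above could presumably also be expressed in Terraform with the azurerm provider. This is only a rough sketch; the load balancer, backend pool, NIC, and ipconfig names are assumptions that would have to match whatever the Azure cloud provider and docker-machine actually created:

data "azurerm_lb" "cloud_provider" {
  name                = "kubernetes"    # assumed name of the LB created by the cloud provider
  resource_group_name = azurerm_resource_group.rke.name
}

data "azurerm_lb_backend_address_pool" "cloud_provider" {
  name            = "kubernetes"        # assumed backend pool name
  loadbalancer_id = data.azurerm_lb.cloud_provider.id
}

data "azurerm_network_interface" "system1" {
  name                = "system1-nic"   # assumed NIC name of the stuck node
  resource_group_name = azurerm_resource_group.rke.name
}

# Attach the node's NIC to the backend pool, mirroring the manual change made in the portal.
resource "azurerm_network_interface_backend_address_pool_association" "system1" {
  network_interface_id    = data.azurerm_network_interface.system1.id
  ip_configuration_name   = "ipconfig1" # assumed ip configuration name on the NIC
  backend_address_pool_id = data.azurerm_lb_backend_address_pool.cloud_provider.id
}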

catherineluse commented 2 years ago

I believe Kubernetes is responsible for adding nodes to the backend pool, but it can only do that if the Azure cloud provider is enabled properly.

I'm not an expert in this area, but I was looking up the docs about the Azure cloud provider for RKE https://rancher.com/docs/rke/latest/en/config-options/cloud-providers/azure/#overriding-the-hostname and searching for anything that might be missing from your configuration. I found this in the linked doc:

Since the Azure node name must match the Kubernetes node name, you override the Kubernetes name on the node by setting the hostname_override for each node.

Maybe the problem is that the hostname_override has not been set.

principekiss commented 2 years ago

hostname_override

Hi, thanks for your answer! I had already looked at this in the docs, but that argument is not present in the rancher2 Terraform provider's node_template resource, and I haven't found any equivalent.

catherineluse commented 2 years ago

The hostname_override is in the cluster resource https://registry.terraform.io/providers/rancher/rancher2/latest/docs/resources/cluster#hostname_override. Basically, that value would have been configured in the cluster config file if you were provisioning the RKE cluster with the RKE CLI, and the cluster resource in Terraform is the equivalent of that file.
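
For illustration, a minimal sketch of what that might look like (the values are placeholders, and it assumes nodes declared statically in rke_config rather than node pools):

resource "rancher2_cluster" "rke_static_example" {
  name = "rke-static-example"

  rke_config {
    # hostname_override only applies to nodes declared statically here,
    # not to nodes created through node templates / node pools.
    nodes {
      address           = "10.100.0.4"      # placeholder node address
      hostname_override = "control-plane1"  # must match the Azure VM name
      user              = "docker-user"     # placeholder SSH user
      role              = ["controlplane", "etcd"]
    }
  }
}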

principekiss commented 2 years ago

The hostname_override is in the cluster resource https://registry.terraform.io/providers/rancher/rancher2/latest/docs/resources/cluster#hostname_override. Basically, that value would have been configured in the cluster config file if you were provisioning the RKE cluster with the RKE CLI. And the cluster resource in terraform is the equivalent of that file

Indeed, but I use node templates and node pools; I do not define nodes in the rancher2_cluster resource. I use node templates/pools so that I can scale nodes in the Rancher UI.

principekiss commented 2 years ago

EDIT

If I deploy the external load balancer only after all nodes are registered and the cluster is active, it adds all of them to the backend pool, but scaling still does not work because nodes are only added to the backend pool after they are registered. Nodes should be added right after they are created, not after they are registered, because they can only register once they are part of the load balancer backend.