rancher / terraform-provider-rancher2

Terraform Rancher2 provider
https://www.terraform.io/docs/providers/rancher2/
Mozilla Public License 2.0

[BUG] Custom Cluster containing additional parameters makes node registration stuck #1348

Open gthieleb opened 1 month ago

gthieleb commented 1 month ago

Rancher Server Setup

Rancher v2.8.3 Dashboard v2.8.3 Helm v2.16.8-rancher2 Machine v0.15.0-rancher110

Information about the Cluster

User Information

Provider Information

Terraform v1.8.1
on linux_amd64
+ provider registry.terraform.io/hashicorp/cloudinit v2.3.4
+ provider registry.terraform.io/hashicorp/helm v2.13.1
+ provider registry.terraform.io/hashicorp/kubernetes v2.29.0
+ provider registry.terraform.io/hashicorp/local v2.5.1
+ provider registry.terraform.io/hashicorp/random v3.6.1
+ provider registry.terraform.io/hashicorp/tls v4.0.5
+ provider registry.terraform.io/hetznercloud/hcloud v1.47.0
+ provider registry.terraform.io/integrations/github v6.2.1
+ provider registry.terraform.io/loafoe/ssh v2.7.0
+ provider registry.terraform.io/rancher/rancher2 v4.1.0
+ provider registry.terraform.io/valodim/desec v0.3.0

Describe the bug

Creating a custom cluster with an additional `machine_global_config` argument (`cloud-provider-name: external`) causes node registration to get stuck.

To Reproduce

  1. Create a custom cluster with the Terraform `cluster_v2` resource:

    resource "rancher2_cluster_v2" "workload" {
      provider = rancher2.admin

      name               = var.workload_cluster_name
      kubernetes_version = var.workload_kubernetes_version
      labels             = var.cluster_labels

      rke_config {
        # rke2 options: https://docs.rke2.io/advanced
        machine_global_config = <<-EOT
          cni: calico
          cloud-provider-name: external
        EOT
      }
    }
  2. Create a node and run the registration command:

curl -fL https://rancher-ci.my.example.host/system-agent-install.sh | sudo  sh -s - --server https://rancher-ci.internal.example.com --label 'cattle.io/os=linux' --token ******************************************** --etcd --controlplane --worker
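For context, the `machine_global_config` from step 1 is applied to the node's RKE2 server configuration. On the node it is roughly equivalent to the following config fragment (a sketch; the exact file path is an assumption, since Rancher's system agent manages the actual file):

```yaml
# Sketch of the RKE2 server config the machine_global_config above produces
# on the node (assumed path: /etc/rancher/rke2/config.yaml.d/50-rancher.yaml)
cni: calico
cloud-provider-name: external
```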

Actual Result

The message on the cluster:

configuring bootstrap node(s) custom-3c45f26babb8: waiting for cluster agent to connect 

The message in the provisioning tab:

[INFO ] waiting for infrastructure ready
[INFO ] waiting for at least one control plane, etcd, and worker node to be registered
[INFO ] configuring bootstrap node(s) custom-3c45f26babb8: waiting for agent to check in and apply initial plan
[INFO ] configuring bootstrap node(s) custom-3c45f26babb8: waiting for probes: calico, etcd, kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-3c45f26babb8: waiting for probes: calico, etcd, kube-apiserver, kube-controller-manager, kube-scheduler
[INFO ] configuring bootstrap node(s) custom-3c45f26babb8: waiting for probes: calico, kube-apiserver, kube-controller-manager, kube-scheduler
[INFO ] configuring bootstrap node(s) custom-3c45f26babb8: waiting for probes: calico
[INFO ] configuring bootstrap node(s) custom-3c45f26babb8: waiting for cluster agent to connect

The error messages (further processing stops after "Tunnel authorizer set Kubelet Port 10250"):

journalctl -u rke2-server -f
May 12 09:04:29 ci-pool1-node1-workload-cluster rke2[2628]: time="2024-05-12T09:04:29Z" level=error msg="error syncing 'kube-system/rke2-coredns': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-coredns\" not found, requeuing"
May 12 09:04:30 ci-pool1-node1-workload-cluster rke2[2628]: time="2024-05-12T09:04:30Z" level=info msg="Active TLS secret kube-system/rke2-serving (ver=298) (count 11): map[listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-157.90.152.100:157.90.152.100 listener.cattle.io/cn-2a01_4f8_c013_3d85__1-3fc80b:2a01:4f8:c013:3d85::1 listener.cattle.io/cn-__1-f16284:::1 listener.cattle.io/cn-ci-pool1-node1-workload-cluster:ci-pool1-node1-workload-cluster listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc:kubernetes.default.svc listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=5290F5FC27DEBA3486700F1AC97B98A35CDB5CF2]"
May 12 09:04:30 ci-pool1-node1-workload-cluster rke2[2628]: time="2024-05-12T09:04:30Z" level=info msg="Labels and annotations have been set successfully on node: ci-pool1-node1-workload-cluster"
May 12 09:04:31 ci-pool1-node1-workload-cluster rke2[2628]: time="2024-05-12T09:04:31Z" level=error msg="error syncing 'kube-system/rke2-ingress-nginx': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-ingress-nginx\" not found, requeuing"
May 12 09:04:32 ci-pool1-node1-workload-cluster rke2[2628]: time="2024-05-12T09:04:32Z" level=error msg="error syncing 'kube-system/rke2-metrics-server': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-metrics-server\" not found, requeuing"
May 12 09:04:33 ci-pool1-node1-workload-cluster rke2[2628]: time="2024-05-12T09:04:33Z" level=error msg="error syncing 'kube-system/rke2-snapshot-controller-crd': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-snapshot-controller-crd\" not found, requeuing"
May 12 09:04:34 ci-pool1-node1-workload-cluster rke2[2628]: time="2024-05-12T09:04:34Z" level=error msg="error syncing 'kube-system/rke2-snapshot-controller': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-snapshot-controller\" not found, requeuing"
May 12 09:04:34 ci-pool1-node1-workload-cluster rke2[2628]: time="2024-05-12T09:04:34Z" level=error msg="error syncing 'kube-system/rke2-snapshot-validation-webhook': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-snapshot-validation-webhook\" not found, requeuing"
May 12 09:04:37 ci-pool1-node1-workload-cluster rke2[2628]: time="2024-05-12T09:04:37Z" level=info msg="Adding node ci-pool1-node1-workload-cluster-8351d36a etcd status condition"
May 12 09:05:15 ci-pool1-node1-workload-cluster rke2[2628]: time="2024-05-12T09:05:15Z" level=info msg="Tunnel authorizer set Kubelet Port 10250"

Error messages (grepped for "error"), see attached log: error.log

Expected Result

Without the additional argument in the cluster v2 resource (recreating the cluster and the control plane), registration goes through fine:

[INFO ] waiting for infrastructure ready
[INFO ] waiting for at least one control plane, etcd, and worker node to be registered
[INFO ] configuring bootstrap node(s) custom-b661aa9bfee4: waiting for agent to check in and apply initial plan
[INFO ] configuring bootstrap node(s) custom-b661aa9bfee4: waiting for probes: calico, etcd, kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-b661aa9bfee4: waiting for probes: calico, etcd, kube-apiserver, kube-controller-manager, kube-scheduler
[INFO ] configuring bootstrap node(s) custom-b661aa9bfee4: waiting for probes: calico, kube-apiserver, kube-controller-manager, kube-scheduler
[INFO ] configuring bootstrap node(s) custom-b661aa9bfee4: waiting for probes: calico
[INFO ] configuring bootstrap node(s) custom-b661aa9bfee4: waiting for cluster agent to connect
[INFO ] custom-b661aa9bfee4
[INFO ] non-ready bootstrap machine(s) custom-b661aa9bfee4 and join url to be available on bootstrap node
[INFO ] waiting for join url to be available on bootstrap node
[INFO ] provisioning done

journalctl -u rke2-server log: registration-succeed.log

resource "rancher2_cluster_v2" "workload" {
  provider = rancher2.admin

  name               = var.workload_cluster_name
  kubernetes_version = var.workload_kubernetes_version
  labels             = var.cluster_labels
}

Screenshots

Additional context

The `cloud-provider-name: external` argument should be set when registering a cluster on Hetzner using hccm (hetzner-cloud-controller-manager): https://github.com/hetznercloud/hcloud-cloud-controller-manager
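For background (not stated in the report above): with `cloud-provider-name: external`, RKE2 starts the kubelet with `--cloud-provider=external`, so a new node carries the standard `node.cloudprovider.kubernetes.io/uninitialized` taint until a cloud controller manager (here hccm) initializes it. Until that happens, pods without a matching toleration, which may include the cluster agent, cannot be scheduled on the node, which would explain the "waiting for cluster agent to connect" hang. A sketch of the taint on such a node:

```yaml
# Sketch: taint present on an uninitialized node when the kubelet runs with
# --cloud-provider=external (standard Kubernetes behavior, removed by the CCM)
spec:
  taints:
    - key: node.cloudprovider.kubernetes.io/uninitialized
      value: "true"
      effect: NoSchedule
```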