rancher / terraform-provider-rancher2

Terraform Rancher2 provider
https://www.terraform.io/docs/providers/rancher2/
Mozilla Public License 2.0

[BUG] RKE1 clusters fail to provision with v4.x of tfp #1258

Closed: git-ival closed this issue 6 months ago

git-ival commented 8 months ago

Rancher Server Setup

Information about the Cluster

User Information

Provider Information

Describe the bug

Cannot provision RKE1 node driver clusters using v4.x of the Terraform provider.

To Reproduce

  1. Set up a local Rancher cluster
  2. Provision an RKE1 node driver cluster via the rancher2 tfp (a provider sketch is shown below)
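
For context, the provider wiring assumed by step 2 would look roughly like the sketch below. The rancher/rancher2 source and the 4.x constraint come from this report; the variable names and the insecure flag are assumptions for a local test setup.

terraform {
  required_providers {
    rancher2 = {
      source  = "rancher/rancher2"
      version = "~> 4.0" # the 4.x line this report is about
    }
  }
}

# Hypothetical variable names; the real values point at the local Rancher
# server from step 1 and an API token created for it.
provider "rancher2" {
  api_url   = var.rancher_api_url
  token_key = var.rancher_token_key
  insecure  = true # assumption: self-signed certificate on a test install
}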

Actual Result

Cluster fails to provision with one of the following errors:

[workerPlane] Failed to bring up Worker Plane: [Failed to verify healthcheck: Failed to check http://localhost:10248/healthz for service [kubelet] on host [52.8.134.174]: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused, log: time="2023-10-25T20:17:09Z" level=info msg="Start cri-dockerd grpc backend"]
[controlPlane] Failed to upgrade Control Plane: [[[controlplane] Error getting node ival-sparrow-rke1-ntrke1-workspace-00001-pool0-node1: "ival-sparrow-rke1-ntrke1-workspace-00001-pool0-node1" not found]]

Expected Result

The cluster provisions successfully.

Screenshots


Additional Context

main.tf:

...
resource "rancher2_cloud_credential" "this" {
  name = var.cred_name
  amazonec2_credential_config {
    access_key     = var.credential_config.access_key
    secret_key     = var.credential_config.secret_key
    default_region = var.credential_config.region
  }
}

resource "rancher2_node_template" "this" {
  name                  = var.template_name
  cloud_credential_id   = rancher2_cloud_credential.this.id
  engine_storage_driver = "overlay2"

  amazonec2_config {
    ami                  = data.aws_ami.ubuntu.id
    ssh_user             = "ubuntu"
    instance_type        = "t3a.xlarge"
    region               = var.region
    security_group       = var.security_groups
    subnet_id            = local.instance_subnet_id
    vpc_id               = data.aws_vpc.default.id
    zone                 = local.selected_az_suffix
    root_size            = "40"
    volume_type          = "gp2"
    iam_instance_profile = var.iam_instance_profile
  }
}

resource "rancher2_node_pool" "np" {
  count            = local.node_pool_count
  cluster_id       = rancher2_cluster.this.id
  name             = "${local.node_pool_name}-${count.index}"
  hostname_prefix  = "${local.node_pool_name}-pool${count.index}-node"
  node_template_id = rancher2_node_template.this.id
  quantity         = 1
  control_plane    = true
  etcd             = true
  worker           = true
}

resource "rancher2_cluster" "this" {
  name                                                       = var.cluster_name
  default_pod_security_admission_configuration_template_name = "rancher-privileged"
  cluster_auth_endpoint {
    enabled  = true
    fqdn     = null
    ca_certs = null
  }

  rke_config {
    kubernetes_version    = var.k8s_version
    ignore_docker_version = false
    addon_job_timeout     = 60
    enable_cri_dockerd    = true

    cloud_provider {
      name = "aws"
    }
    network {
      plugin = "canal"
    }
    upgrade_strategy {
      drain = true
    }
  }
  depends_on = [rancher2_node_template.this]
}
...
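
The elided portion of main.tf references data.aws_ami.ubuntu and data.aws_vpc.default, which are not shown above. A minimal sketch of what those data sources might look like (the AMI filter and owner ID are assumptions, not taken from the original configuration):

# Hypothetical reconstruction of the referenced data sources.
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical (assumed)

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

data "aws_vpc" "default" {
  default = true
}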
Josh-Diamond commented 6 months ago

Unable to reproduce. I'm able to successfully provision downstream RKE1 clusters using tfp-rancher2 v4.0.0-rc5 and Rancher v2.8.0. I tested this with both AmazonEC2 and Linode.

Josh-Diamond commented 6 months ago

Also unable to reproduce w/ tfp-rancher2 v3.2.0 and Rancher v2.8.0 -- I'm able to successfully provision a downstream RKE1 (AWS) Node driver cluster w/ k8s v1.27.6-rancher1-1
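
For comparison, the working combination described here would pin the provider and Kubernetes version roughly as follows; only the version strings come from this comment, and the variable wiring is an assumption:

terraform {
  required_providers {
    rancher2 = {
      source  = "rancher/rancher2"
      version = "3.2.0"
    }
  }
}

variable "k8s_version" {
  # Kubernetes version reported to provision successfully with tfp-rancher2 v3.2.0
  default = "v1.27.6-rancher1-1"
}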

git-ival commented 6 months ago

I honestly haven't attempted it since; if it's not reproducible, it was probably unique to the pre-release code at the time.

Josh-Diamond commented 6 months ago

Closing out the issue now, as this is no longer reproducible.