rancher / terraform-provider-rancher2

Terraform Rancher2 provider
https://www.terraform.io/docs/providers/rancher2/
Mozilla Public License 2.0

[BUG] RKE1 clusters fail to provision with v4.x of tfp #1258

Closed: git-ival closed this issue 6 months ago

git-ival commented 8 months ago

Rancher Server Setup

Information about the Cluster

User Information

Provider Information

Describe the bug

Cannot provision RKE1 node driver clusters using v4.x of the Terraform provider.

To Reproduce

  1. Set up a local Rancher cluster
  2. Provision an RKE1 node driver cluster via the rancher2 tfp (a provider sketch is shown below)
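
For context, the provider wiring assumed by step 2 would look roughly like the sketch below. The rancher/rancher2 source and the 4.x constraint come from this report; the variable names and the insecure flag are assumptions for a local test setup.

terraform {
  required_providers {
    rancher2 = {
      source  = "rancher/rancher2"
      version = "~> 4.0" # the 4.x line this report is about
    }
  }
}

# Hypothetical variable names; the real values point at the local Rancher
# server from step 1 and an API token created for it.
provider "rancher2" {
  api_url   = var.rancher_api_url
  token_key = var.rancher_token_key
  insecure  = true # assumption: self-signed certificate on a test install
}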

Actual Result

Cluster fails to provision with one of the following errors:

[workerPlane] Failed to bring up Worker Plane: [Failed to verify healthcheck: Failed to check http://localhost:10248/healthz for service [kubelet] on host [52.8.134.174]: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused, log: time="2023-10-25T20:17:09Z" level=info msg="Start cri-dockerd grpc backend"]
[controlPlane] Failed to upgrade Control Plane: [[[controlplane] Error getting node ival-sparrow-rke1-ntrke1-workspace-00001-pool0-node1: "ival-sparrow-rke1-ntrke1-workspace-00001-pool0-node1" not found]]

Expected Result

The cluster provisions successfully.

Screenshots


Additional Context

main.tf:

...
resource "rancher2_cloud_credential" "this" {
  name = var.cred_name
  amazonec2_credential_config {
    access_key     = var.credential_config.access_key
    secret_key     = var.credential_config.secret_key
    default_region = var.credential_config.region
  }
}

resource "rancher2_node_template" "this" {
  name                  = var.template_name
  cloud_credential_id   = rancher2_cloud_credential.this.id
  engine_storage_driver = "overlay2"

  amazonec2_config {
    ami                  = data.aws_ami.ubuntu.id
    ssh_user             = "ubuntu"
    instance_type        = "t3a.xlarge"
    region               = var.region
    security_group       = var.security_groups
    subnet_id            = local.instance_subnet_id
    vpc_id               = data.aws_vpc.default.id
    zone                 = local.selected_az_suffix
    root_size            = "40"
    volume_type          = "gp2"
    iam_instance_profile = var.iam_instance_profile
  }
}

resource "rancher2_node_pool" "np" {
  count            = local.node_pool_count
  cluster_id       = rancher2_cluster.this.id
  name             = "${local.node_pool_name}-${count.index}"
  hostname_prefix  = "${local.node_pool_name}-pool${count.index}-node"
  node_template_id = rancher2_node_template.this.id
  quantity         = 1
  control_plane    = true
  etcd             = true
  worker           = true
}

resource "rancher2_cluster" "this" {
  name                                                       = var.cluster_name
  default_pod_security_admission_configuration_template_name = "rancher-privileged"
  cluster_auth_endpoint {
    enabled  = true
    fqdn     = null
    ca_certs = null
  }

  rke_config {
    kubernetes_version    = var.k8s_version
    ignore_docker_version = false
    addon_job_timeout     = 60
    enable_cri_dockerd    = true

    cloud_provider {
      name = "aws"
    }
    network {
      plugin = "canal"
    }
    upgrade_strategy {
      drain = true
    }
  }
  depends_on = [rancher2_node_template.this]
}
...
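
The elided portion of main.tf references data.aws_ami.ubuntu and data.aws_vpc.default, which are not shown above. A minimal sketch of what those data sources might look like (the AMI filter and owner ID are assumptions, not taken from the original configuration):

# Hypothetical reconstruction of the referenced data sources.
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical (assumed)

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

data "aws_vpc" "default" {
  default = true
}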
Josh-Diamond commented 6 months ago

Unable to reproduce. I'm able to successfully provision downstream RKE1 clusters using tfp-rancher2 v4.0.0-rc5 and Rancher v2.8.0. I tested this with both AmazonEC2 and Linode.

Josh-Diamond commented 6 months ago

Also unable to reproduce w/ tfp-rancher2 v3.2.0 and Rancher v2.8.0 -- I'm able to successfully provision a downstream RKE1 (AWS) Node driver cluster w/ k8s v1.27.6-rancher1-1
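
For comparison, the working combination described here would pin the provider and Kubernetes version roughly as follows; only the version strings come from this comment, and the variable wiring is an assumption:

terraform {
  required_providers {
    rancher2 = {
      source  = "rancher/rancher2"
      version = "3.2.0"
    }
  }
}

variable "k8s_version" {
  # Kubernetes version reported to provision successfully with tfp-rancher2 v3.2.0
  default = "v1.27.6-rancher1-1"
}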

git-ival commented 6 months ago

I honestly haven't attempted it since; if it's not reproducible, it was probably unique to the pre-release code at the time.

Josh-Diamond commented 6 months ago

Closing out the issue now, as this is no longer reproducible.