[BUG] [ERROR] waiting for cluster (c-cgq6f) to be updated: unexpected state 'error' while upgrading an RKE cluster from 1.23.10 to 1.24+

Turb0Fly commented 1 year ago

Rancher Server Setup

Rancher version: 2.7.5
Installation option (Docker install/Helm Chart): Helm Chart, RKE cluster. Kubernetes v1.24.10

Information about the Cluster

Kubernetes version: 1.23.10
Cluster Type (Local/Downstream): Downstream
- If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): Custom TF provisioned

User Information

What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
- If custom, define the set of permissions: Admin

Provider Information

What is the version of the Rancher v2 Terraform Provider in use? 3.0.2
What is the version of Terraform in use? Terraform v1.0.6
Also using terragrunt version v0.31.8

Describe the bug

When upgrading a custom RKE cluster and going from dockershim to cri_dockerd, Terraform errors out:

rancher2_cluster.cluster: Modifying... [id=c-8nkxk]
rancher2_cluster.cluster: Still modifying... [id=c-8nkxk, 10s elapsed]
rancher2_cluster.cluster: Still modifying... [id=c-8nkxk, 20s elapsed]
rancher2_cluster.cluster: Still modifying... [id=c-8nkxk, 30s elapsed]
rancher2_cluster.cluster: Still modifying... [id=c-8nkxk, 40s elapsed]
╷
│ Error: [ERROR] waiting for cluster (c-8nkxk) to be updated: unexpected state 'error', wanted target 'active, provisioning, pending'. last error: %!s(<nil>)     
│
│   with rancher2_cluster.cluster,
│   on rancher.tf line 13, in resource "rancher2_cluster" "cluster":
│   13: resource "rancher2_cluster" "cluster" {

This only happens during a cluster upgrade from 1.23 to 1.24/1.25. The cluster does finish upgrading, as shown in the Cluster Management provisioning logs in Rancher, and eventually shows up as Active.
Upgrading the cluster afterwards to a higher Kubernetes version through Terraform, for example 1.26.x, succeeds, and TF doesn't error out with the unexpected state 'error'.
Running a TF plan afterwards shows that Kubernetes components have been changed outside of Terraform and that the infrastructure matches the changes.

To Reproduce

Create a custom RKE cluster with our usual TF process running 1.23.10
Change the kubernetes_version in rke_config to 1.24.10 or 1.25.9
Apply the TF change
Terraform plan:

  ~ resource "rancher2_cluster" "cluster" {
        id                         = "c-8nkxk"
        name                       = "conpl-1439-test-001"
        # (15 unchanged attributes hidden)

      ~ rke_config {
          ~ enable_cri_dockerd    = false -> true
          ~ kubernetes_version    = "v1.23.10-rancher1-1" -> "v1.25.9-rancher2-1"
            # (4 unchanged attributes hidden)

Terraform plan apply: starts applying the plan, then errors out on the rancher2_cluster resource:

rancher2_cluster.cluster: Modifying... [id=c-8nkxk]
rancher2_cluster.cluster: Still modifying... [id=c-8nkxk, 10s elapsed]
rancher2_cluster.cluster: Still modifying... [id=c-8nkxk, 20s elapsed]
rancher2_cluster.cluster: Still modifying... [id=c-8nkxk, 30s elapsed]
rancher2_cluster.cluster: Still modifying... [id=c-8nkxk, 40s elapsed]
rancher2_cluster.cluster: Still modifying... [id=c-8nkxk, 50s elapsed]
rancher2_cluster.cluster: Still modifying... [id=c-8nkxk, 1m0s elapsed]
rancher2_cluster.cluster: Still modifying... [id=c-8nkxk, 1m10s elapsed]
rancher2_cluster.cluster: Still modifying... [id=c-8nkxk, 1m20s elapsed]
╷
│ Warning: Experimental feature "module_variable_optional_attrs" is active
│
│   on main.tf line 2, in terraform:
│    2:   experiments = [module_variable_optional_attrs]
│
│ Experimental features are subject to breaking changes in future minor or
│ patch releases, based on feedback.
│
│ If you have feedback on the design of this feature, please open a GitHub
│ issue to discuss it.
╵
╷
│ Error: [ERROR] waiting for cluster (c-8nkxk) to be updated: unexpected state 'error', wanted target 'active, provisioning, pending'. last error: %!s(<nil>)     
│
│   with rancher2_cluster.cluster,
│   on rancher.tf line 13, in resource "rancher2_cluster" "cluster":
│   13: resource "rancher2_cluster" "cluster" {
│
╵
Releasing state lock. This may take a few moments...
time=2023-07-18T15:34:27-04:00 level=error msg=1 error occurred:
        * exit status 1

Actual Result

Expected Result

Upgrading from 1.23 to 1.24 or later should not error out while using the rancher2 Terraform provider.

Screenshots

Additional context

I suspect it has to do with going from dockershim to cri_dockerd through the Rancher2 provider. Just enabling cri_dockerd through Terraform, without changing the Kubernetes version, gives the same error.
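
For reference, a minimal sketch of the part of the config that changes, using only the values visible in the plan output. Everything else in the resource (the 15 unchanged attributes and the rest of rke_config) is omitted, so this fragment is not a complete cluster definition and is only meant to show which two arguments differ.

```hcl
# Sketch only: trimmed to the two arguments that change in the plan.
# The remaining arguments of the real resource are assumed unchanged and left out.
resource "rancher2_cluster" "cluster" {
  name = "conpl-1439-test-001"

  rke_config {
    # Before the change: "v1.23.10-rancher1-1"
    kubernetes_version = "v1.25.9-rancher2-1"

    # Before the change: false (dockershim)
    enable_cri_dockerd = true
  }
}
```

Flipping only enable_cri_dockerd to true while leaving kubernetes_version untouched should be enough to hit the same 'unexpected state' error, going by the behaviour described above.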

iTaybb commented 8 months ago

This still seems to happen when upgrading from 1.25 to 1.26 (Rancher 2.7.9)

gorantornqvist-sr commented 6 months ago

Same with kubernetes_version = "v1.26.11-rancher2-1" -> "v1.27.11-rancher1-1"
