hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

azurerm_kubernetes_cluster must be replaced due to upgrade_settings drain_timeout_in_minutes #26568

Open Noel-Jones opened 1 month ago

Noel-Jones commented 1 month ago

Terraform Version

1.9.0

AzureRM Provider Version

3.106.0

Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster

Terraform Configuration Files

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "=3.105.0"
    }
  }
  required_version = ">= 1.0"
}

# Configure the default provider.
provider "azurerm" {
  subscription_id = "xxxxxxxx"
  features {}
}

resource "azurerm_kubernetes_cluster" "default" {
  name = "aks-deleteme"
  default_node_pool {
    name       = "default"
    node_count = 1
    vm_size    = "Standard_B2as_v2"
  }
  location            = "uksouth"
  resource_group_name = "xxxxxxxx"
  identity {
    type = "SystemAssigned"
  }
  dns_prefix="aks-delete-me-dns"
}

Debug Output/Panic Output

I believe the relevant part of the debug output is the following, which shows that the upgrade_settings block has changed:

2024-07-08T11:50:50.229+0100 [WARN]  Provider "registry.terraform.io/hashicorp/azurerm" produced an unexpected new value for module.k8s_cluster["staging"].azurerm_kubernetes_cluster.aks during refresh.
      - .cost_analysis_enabled: was null, but now cty.False
      - .default_node_pool[0].upgrade_settings: block count changed from 0 to 1
2024-07-08T11:50:50.272+0100 [WARN]  Provider "registry.terraform.io/hashicorp/azurerm" produced an invalid plan for module.k8s_cluster["staging"].azurerm_kubernetes_cluster.aks, but we are tolerating it because it is using the legacy plugin SDK.
    The following problems may be the cause of any confusing errors from downstream operations:
      - .enable_pod_security_policy: planned value cty.False for a non-computed attribute

Expected Behaviour

There should be no changes to the existing AKS resource. The cluster should certainly not be recreated.

Actual Behaviour

The AKS resource is planned to be replaced:

          - upgrade_settings {
              - drain_timeout_in_minutes      = 30 -> null # forces replacement
              - node_soak_duration_in_minutes = 0 -> null
                # (1 unchanged attribute hidden)
            }

Steps to Reproduce

1. Run terraform apply to create the cluster with provider version 3.105.0.

2. Update the provider version to 3.106 (or anything up to 3.111 at the time of writing); see the snippet after these steps. At this point terraform plan will only show an in-place change.

3. Upgrade the cluster Kubernetes version (in my case it was created with 1.28.10 and upgraded to 1.29.5); I upgraded the control plane first and then the node pool.

4. terraform plan will now want to recreate the cluster.
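For step 2, the only change to the configuration shown above is the provider version constraint; a minimal sketch (the exact constraint is illustrative, any release from 3.106.0 up to 3.111 reproduces it, as described in the steps above):

terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
      # bumped from "=3.105.0" to the first affected release
      version = "=3.106.0"
    }
  }
  required_version = ">= 1.0"
}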

I hesitated to raise this after scanning the other issues opened since 3.106, but they are all closed, and planning to recreate the cluster seems wrong. I'm not sure what the appropriate resolution is; there is clearly an external change to the resource, brought about by the cluster upgrade.

Before the cluster upgrade, az aks nodepool show returns:

  "upgradeSettings": {
    "drainTimeoutInMinutes": null,
    "maxSurge": "10%",
    "nodeSoakDurationInMinutes": null
  },

After the cluster upgrade, az aks nodepool show returns:

  "upgradeSettings": {
    "drainTimeoutInMinutes": 30,
    "maxSurge": "10%",
    "nodeSoakDurationInMinutes": null
  },

It seems to me that the defaults for the upgrade settings should be 30 for drainTimeoutInMinutes (currently optional/null), 10% or 1 for maxSurge (you could argue that 1 would have the least impact on existing clusters), and null for nodeSoakDurationInMinutes (as now).

Maybe drainTimeoutInMinutes and maxSurge should both be required instead, with further explanation in the documentation?
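If both were required, the per-pool block that every configuration would then have to carry might look roughly like this sketch (values taken from the suggestion above):

  upgrade_settings {
    drain_timeout_in_minutes = 30    # proposed default
    max_surge                = "10%" # or "1"
    # node_soak_duration_in_minutes left unset (null), as today
  }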

Important Factoids

No response

References

No response

jwalaszek commented 1 month ago

I'm also experiencing this issue. The change that causes this behavior was added in PR #26137. Here is the line that forces the recreation: https://github.com/hashicorp/terraform-provider-azurerm/pull/26137/files#diff-b7c1d78864b169130b7e4c32a1ff2667efb4a47c9713e26f1a888cb9e8e582bbR71

jwalaszek commented 1 month ago

I think I found a workaround. Adding an upgrade_settings block to the node pool configuration prevents Terraform from trying to recreate the whole cluster. This works for both azurerm_kubernetes_cluster and azurerm_kubernetes_cluster_node_pool:


resource "azurerm_kubernetes_cluster" "default" {
  default_node_pool {
    upgrade_settings {
      max_surge                = "10%"
      drain_timeout_in_minutes = 30
    }
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "node_pool" {
  upgrade_settings {
    max_surge                = "10%"
    drain_timeout_in_minutes = 30
  }
}
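Note that the values in this workaround match what az aks nodepool show reports after the upgrade (drainTimeoutInMinutes: 30, maxSurge: "10%"); if the configured values differ from what is actually set on the pool, the plan will presumably still show a diff.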

d-mankowski-synerise commented 1 week ago

We upgraded our cluster (via the az CLI) and then upgraded the azurerm provider from 3.56.0 to 4.0.1, and now we are stuck in this loop:

[screenshot omitted]

Changing drain_timeout_in_minutes forces the node pool to be recreated. But we cannot update upgrade_settings, because the block requires the max_surge parameter:

[screenshot omitted]

And we cannot set max_surge for spot pools, as it is not allowed:

[screenshot omitted]

And Terraform does not allow ignoring changes inside dynamic blocks.
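For context, the lifecycle block one would normally reach for here looks roughly like this minimal sketch (the resource name and the omitted arguments are placeholders); as noted, this is not usable when the upgrade_settings block is generated via a dynamic block:

resource "azurerm_kubernetes_cluster_node_pool" "spot" {
  # ... other arguments omitted ...

  lifecycle {
    # Ignore drift on the whole upgrade_settings block so that the
    # server-side drain_timeout_in_minutes change does not force a replacement.
    ignore_changes = [upgrade_settings]
  }
}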

In fact, even removing the resource from state and importing it from scratch results in the same behavior. The azurerm provider has some serious bugs.
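For reference, the re-import described above can be expressed with a Terraform 1.5+ import block along these lines (a minimal sketch; the subscription and resource group IDs are placeholders, and the resource address matches the configuration at the top of this issue):

import {
  to = azurerm_kubernetes_cluster.default
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/example-rg/providers/Microsoft.ContainerService/managedClusters/aks-deleteme"
}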