hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

AKS cluster: updating tags (or other non-important parameters) with terraform apply fails with error: Code="NotAllAgentPoolOrchestratorVersionSpecifiedAndUnchanged" #20791

Open slzmruepp opened 1 year ago

slzmruepp commented 1 year ago

Terraform Version

1.3.7

AzureRM Provider Version

3.46.0

Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster_node_pool

Terraform Configuration Files

resource "azurerm_kubernetes_cluster" "aks_cluster" {
  name                            = var.name
  location                        = var.location
  resource_group_name             = var.resource_group_name
  kubernetes_version              = var.kubernetes_version
  dns_prefix                      = var.dns_prefix
  private_cluster_enabled         = var.private_cluster_enabled
  automatic_channel_upgrade       = var.automatic_channel_upgrade
  sku_tier                        = var.sku_tier
  tags                            = local.tags

  default_node_pool {
    name                          = var.default_node_pool_name
    vm_size                       = var.default_node_pool_vm_size
    vnet_subnet_id                = var.vnet_subnet_id
    node_labels                   = var.default_node_pool_node_labels
    only_critical_addons_enabled  = var.default_node_pool_critical_addons_enabled
    enable_auto_scaling           = var.default_node_pool_enable_auto_scaling
    enable_host_encryption        = var.default_node_pool_enable_host_encryption
    enable_node_public_ip         = var.default_node_pool_enable_node_public_ip
    max_pods                      = var.default_node_pool_max_pods
    max_count                     = var.default_node_pool_max_count
    min_count                     = var.default_node_pool_min_count
    node_count                    = var.default_node_pool_node_count
    os_disk_type                  = var.default_node_pool_os_disk_type
    zones                         = var.default_node_pool_availability_zones
    tags                          = local.tags
  }

  azure_active_directory_role_based_access_control {
    managed                       = true
    tenant_id                     = var.tenant_id
    admin_group_object_ids        = var.admin_group_object_ids
    azure_rbac_enabled            = var.azure_rbac_enabled
  }

  identity {
    type                          = "UserAssigned"
    identity_ids                  = toset([azurerm_user_assigned_identity.aks_identity.id])
  }

  key_vault_secrets_provider {
    secret_rotation_enabled       = true
    secret_rotation_interval      = "2m"
  }

  linux_profile {
    admin_username                = var.admin_username
    ssh_key {
        key_data                  = var.ssh_public_key
    }
  }

  oms_agent {
    log_analytics_workspace_id = coalesce(var.oms_agent.log_analytics_workspace_id, var.log_analytics_workspace_id)
/*    oms_agent_identity {
      user_assigned_identity_id   = azurerm_user_assigned_identity.aks_identity.id              
    }*/
  }

  maintenance_window {
    allowed {
      day                         = var.maint_window.day
      hours                       = var.maint_window.hours      
    }
  }

  network_profile {
    docker_bridge_cidr = var.network_docker_bridge_cidr
    dns_service_ip     = var.network_dns_service_ip
    network_plugin     = var.network_plugin
    outbound_type      = var.outbound_type
    service_cidr       = var.network_service_cidr
  }

  lifecycle {
    ignore_changes = [
      kubernetes_version,
      default_node_pool.0.node_count
    ]
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "node_pool" {
  kubernetes_cluster_id        = var.kubernetes_cluster_id
  name                         = var.name
  vm_size                      = var.vm_size
  mode                         = var.mode
  node_labels                  = var.node_labels
  node_taints                  = var.node_taints
  zones                        = var.availability_zones
  vnet_subnet_id               = var.vnet_subnet_id
  enable_auto_scaling          = var.enable_auto_scaling
  enable_host_encryption       = var.enable_host_encryption
  enable_node_public_ip        = var.enable_node_public_ip
  proximity_placement_group_id = var.proximity_placement_group_id
  orchestrator_version         = var.orchestrator_version
  max_pods                     = var.max_pods
  max_count                    = var.max_count
  min_count                    = var.min_count
  node_count                   = var.node_count
  os_disk_size_gb              = var.os_disk_size_gb
  os_disk_type                 = var.os_disk_type
  os_type                      = var.os_type
  priority                     = var.priority
  tags                         = local.tags

  lifecycle {
    ignore_changes = [
      orchestrator_version,
      node_count
    ]
  }
}

Debug Output/Panic Output

╷
│ Error: updating Managed Cluster (Subscription: "XXX"
│ Resource Group Name: "XXX"
│ Managed Cluster Name: "XXX"): managedclusters.ManagedClustersClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="NotAllAgentPoolOrchestratorVersionSpecifiedAndUnchanged" Message="Using managed cluster api, all Agent pools' OrchestratorVersion must be all specified or all unspecified. If all specified, they must be stay unchanged or the same with control plane. For agent pool specific change, please use per agent pool operations: https://aka.ms/agent-pool-rest-api"
│ 
│   with module.aks_cluster.azurerm_kubernetes_cluster.aks_cluster,
│   on modules/aks/main.tf line 32, in resource "azurerm_kubernetes_cluster" "aks_cluster":
│   32: resource "azurerm_kubernetes_cluster" "aks_cluster" {
│ 
╵

Expected Behaviour

It should be possible to update tags on the AKS cluster and its associated node pools without the apply failing.

Actual Behaviour

We run a private AKS cluster with two node pools, user and system, and automatic_channel_upgrade enabled. There is a flaky Azure bug where, after an upgrade, the portal shows different versions under the node pools: for example, the system node pool shows 1.24.9 while the user node pool still shows 1.24.6, yet when you try to upgrade the user node pool the portal reports that its "current" version is already 1.24.9, and kubectl get nodes also shows every node on 1.24.9. Evidently the API that Terraform uses returns the stale version. In earlier runs we did not have orchestrator_version in the node pool's lifecycle ignore_changes, so Terraform tried to upgrade the node pool to 1.24.9. That only worked intermittently; oddly, runs between 08:00 and 17:00 CET usually failed, while runs later in the day sometimes succeeded. At that point the node pool lifecycle block was:

 lifecycle {
    ignore_changes = [
      node_count
    ]
  }

We then changed to:

 lifecycle {
    ignore_changes = [
      orchestrator_version,
      node_count
    ]
  }

However, runs with the orchestrator_version ignore enabled also fail, even though the node pool and cluster are only updating tags:

# module.node_pool.azurerm_kubernetes_cluster_node_pool.node_pool will be updated in-place
  ~ resource "azurerm_kubernetes_cluster_node_pool" "node_pool" {
        id                      = "/subscriptions/XXX/resourceGroups/XXX/providers/Microsoft.ContainerService/managedClusters/XXX/agentPools/user"
        name                    = "user"
      ~ tags                    = {
          ~ "version" = "1.26.3-rc.4" -> "1.26.5-rc.8"
            # (5 unchanged elements hidden)
        }
        # (26 unchanged attributes hidden)
    }
  # module.aks_cluster.azurerm_kubernetes_cluster.aks_cluster will be updated in-place
  ~ resource "azurerm_kubernetes_cluster" "aks_cluster" {
        id                                  = "/subscriptions/XXX/resourceGroups/XXX/providers/Microsoft.ContainerService/managedClusters/XXX"
        name                                = "XXX"
      ~ tags                                = {
          ~ "version" = "1.26.3-rc.4" -> "1.26.5-rc.8"
            # (5 unchanged elements hidden)
        }
        # (29 unchanged attributes hidden)

      ~ default_node_pool {
            name                         = "system"
          ~ tags                         = {
              ~ "version" = "1.26.3-rc.4" -> "1.26.5-rc.8"
                # (5 unchanged elements hidden)
            }
            # (23 unchanged attributes hidden)
        }

        # (11 unchanged blocks hidden)
    }

The error above is from a run with the orchestrator_version lifecycle ignore enabled.
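
For what it's worth, one reading of the error message ("all Agent pools' OrchestratorVersion must be all specified or all unspecified") is that ignoring the attribute leaves some pools half-specified with a stale value. A possible, untested workaround sketch is to pin orchestrator_version explicitly on every pool to the same value as the control plane instead of ignoring it; whether this coexists cleanly with automatic_channel_upgrade is unclear, and all names and values below are illustrative only:

variable "kubernetes_version" {
  description = "Control plane and node pool version, kept in lock-step"
  type        = string
}

resource "azurerm_kubernetes_cluster" "example" {
  name                = "example-aks"
  location            = "westeurope"
  resource_group_name = "example-rg"
  dns_prefix          = "example"
  kubernetes_version  = var.kubernetes_version

  default_node_pool {
    name                 = "system"
    vm_size              = "Standard_D2s_v3"
    node_count           = 1
    # Explicitly specified, same value as the control plane.
    orchestrator_version = var.kubernetes_version
  }

  identity {
    type = "SystemAssigned"
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "user" {
  name                  = "user"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.example.id
  vm_size               = "Standard_D2s_v3"
  node_count            = 1
  # Also explicitly specified, so no pool is left "unspecified".
  orchestrator_version  = var.kubernetes_version
}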

Steps to Reproduce

1. Create a cluster and node pool with automatic_channel_upgrade = "stable".
2. Add lifecycle ignore_changes for kubernetes_version on the cluster and orchestrator_version on the node pool.
3. Wait for Azure to perform a stable-channel version upgrade.
4. Re-apply Terraform with only a tag change (the commands sketched below are one way to confirm which versions the API reports before that apply).
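
A sketch of how to check what the AKS API itself reports before the final apply; resource group and cluster names are placeholders:

# Control-plane version as reported by the managed cluster API
az aks show --resource-group my-rg --name my-aks --query kubernetesVersion -o tsv

# Per-pool versions as reported by the agent pool API
az aks nodepool list --resource-group my-rg --cluster-name my-aks \
  --query "[].{name:name, orchestratorVersion:orchestratorVersion, currentOrchestratorVersion:currentOrchestratorVersion}" -o table

# Versions the nodes themselves are running
kubectl get nodes -o wide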

Important Factoids

No response

References

No response

cippaciong commented 1 year ago

I'm having a similar issue when I try to upgrade a cluster from v1.22 to v1.23 without automatic_channel_upgrade.

My AKS configuration is pretty minimal:

resource "azurerm_kubernetes_cluster" "k8s" {
  resource_group_name               = data.azurerm_resource_group.resource_group.name
  name                              = data.azurerm_resource_group.resource_group.name
  dns_prefix                        = local.cname
  location                          = data.azurerm_resource_group.resource_group.location
  kubernetes_version                = var.k8s_version
  role_based_access_control_enabled = true

  default_node_pool {
    name                = "default"
    min_count           = var.k8s_min_agent_count
    max_count           = var.k8s_max_agent_count
    enable_auto_scaling = true
    vm_size             = var.k8s_agent_size
    os_disk_size_gb     = 50
  }

  identity {
    type = "SystemAssigned"
  }

  tags = {
    terraform = true
  }
}

As you can see, I don't set any orchestrator_version explicitly in the default_node_pool configuration.

According to the docs, orchestrator_version is "(Optional) Version of Kubernetes used for the Agents. If not specified, the default node pool will be created with the version specified by kubernetes_version."
In my case, that should be v1.23 then.

If I inspect the payload of the PUT request made by Terraform to the Azure API with TF_LOG=trace, though, I see that kubernetesVersion and orchestratorVersion are different (I omitted all the unrelated fields):

{
  "properties": {
    "kubernetesVersion": "1.23",
    "agentPoolProfiles": [
      {
        "currentOrchestratorVersion": "1.22.15",
        "orchestratorVersion": "1.22"
      }
    ]
  }
}

And I get the same error from the API:

managedclusters.ManagedClustersClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="NotAllAgentPoolOrchestratorVersionSpecifiedAndUnchanged" Message="Using managed cluster api, all Agentpools' OrchestratorVersion must be all specified or all unspecified. If all specified, they must be stay unchanged or the same with control plane. For agent pool specific change, please use per agent pool operations: https://aka.ms/agent-pool-rest-api"

I'm wondering if this could be related to #18130
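
If the problem is just that the provider sends the old agent pool version while bumping kubernetesVersion, a possible (untested) workaround for a configuration like the one above would be to make the version explicit in the default node pool, e.g.:

  default_node_pool {
    name                = "default"
    min_count           = var.k8s_min_agent_count
    max_count           = var.k8s_max_agent_count
    enable_auto_scaling = true
    vm_size             = var.k8s_agent_size
    os_disk_size_gb     = 50
    # Untested: upgrade the agent pool together with the control plane instead
    # of letting the provider send the stale value unchanged.
    orchestrator_version = var.k8s_version
  }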

aydosman commented 1 year ago

This issue has also affected us, and the behaviour seems to have changed suddenly with the new version of the AzureRM provider released on Friday; it is possible that the new release introduced changes or regressions behind the behaviour many of you have experienced.

I recommend that everyone retest their configurations with the new provider version to determine whether the issue persists or has been resolved. It would also be helpful for the Azure engineers to comment on this issue and on any related changes in the provider.
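
For anyone retesting against a specific provider release, a minimal pinning sketch (the version shown is only an example taken from this issue; substitute the release you want to verify):

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      # Example only; replace with the release being tested.
      version = "= 3.46.0"
    }
  }
}

provider "azurerm" {
  features {}
}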