hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

AKS LTS scenario is blocked with terraform extra client side check. #27245

Open haitch opened 2 weeks ago

haitch commented 2 weeks ago

Is there an existing issue for this?

Community Note

Terraform Version

1.6.6

AzureRM Provider Version

3.106.1

Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster

Terraform Configuration Files

n/a

Debug Output/Panic Output

n/a

Expected Behaviour

No response

Actual Behaviour

No response

Steps to Reproduce

No response

Important Factoids

No response

References

No response

haitch commented 2 weeks ago

AKS is deprecating version 1.27 and now offers it only through the LTS program: https://learn.microsoft.com/en-us/azure/aks/long-term-support

Customers with clusters on LTS currently cannot add a new node pool; the operation is blocked by Terraform's client-side version validation (clusterControlPlaneMustBeUpgradedError in terraform-provider-azurerm/internal/services/containers/kubernetes_cluster_validate.go).

will-iam-gm commented 2 weeks ago

Additional customer details:

We are currently using the following provider configuration:

Terraform = 1.6.6
source  = "hashicorp/azurerm"
version = "3.106.1"

Output

╷
│ Error: 
│ The Kubernetes/Orchestrator Version "1.27" is not available for Node Pool "blueuser".
│ 
│ Please confirm that this version is supported by the Kubernetes Cluster "a241281-p01-musea2-aks"
│ (Resource Group "a241281-p01-musea2-rg") - which may need to be upgraded first.
│ 
│ The Kubernetes Cluster is running version "1.27.16".
│ 
│ The supported Orchestrator Versions for this Node Pool/supported by this Kubernetes Cluster are:
│ 
│ 
│ Node Pools cannot use a version of Kubernetes that is not supported on the Control Plane. More
│ details can be found at https://aka.ms/version-skew-policy.
│ 
│ 
│   with module.aks.azurerm_kubernetes_cluster_node_pool.blue_pool,
│   on ../../../modules/azurerm_kubernetes_service/main.tf line 216, in resource "azurerm_kubernetes_cluster_node_pool" "blue_pool":
│  216: resource "azurerm_kubernetes_cluster_node_pool" "blue_pool" {
│ 

Terraform Config

# create aks cluster
resource "azurerm_kubernetes_cluster" "this" {
  name                                = module.tagging.kubernetes_service_id
  resource_group_name                 = var.resource_group_name
  location                            = var.location
  node_resource_group                 = "${var.resource_group_name}-managed"
  dns_prefix                          = var.dns_prefix
  private_cluster_enabled             = true
  private_dns_zone_id                 = "None"
  private_cluster_public_fqdn_enabled = true
  azure_active_directory_role_based_access_control {
    managed            = true
    tenant_id          = data.azurerm_client_config.current.tenant_id
    azure_rbac_enabled = true
  }
  azure_policy_enabled = true
  default_node_pool {
    name                         = "bluecrit"
    vm_size                      = local.crit_node_pool_configs.vm_size
    enable_auto_scaling          = false
    node_count                   = local.crit_node_pool_configs.node_count
    max_pods                     = 110
    only_critical_addons_enabled = true
    os_disk_type                 = var.system_pool_disk_type
    orchestrator_version         = local.crit_node_pool_configs.kubernetes_version
    # Required when using Azure CNI
    vnet_subnet_id              = var.node_pool_subnet_id
    temporary_name_for_rotation = "bluecrittemp"
    tags                        = module.aks_tagging.tags
    zones                       = local.availability_zones
    upgrade_settings {
      max_surge = var.system_node_upgrade_max_surge
    }
  }
  identity {
    type = "UserAssigned"
    identity_ids = [
      azurerm_user_assigned_identity.this.id
    ]
  }
  key_vault_secrets_provider {
    secret_rotation_enabled = true
  }
  kubelet_identity {
    client_id                 = azurerm_user_assigned_identity.this.client_id
    object_id                 = azurerm_user_assigned_identity.this.principal_id
    user_assigned_identity_id = azurerm_user_assigned_identity.this.id
  }
  kubernetes_version = var.kubernetes_version
  # Reference https://learn.microsoft.com/en-us/azure/aks/managed-aad#disable-local-accounts
  local_account_disabled = false
  network_profile {
    network_plugin      = "azure"
    network_policy      = "azure"
    network_plugin_mode = "overlay"
    # http://aka.ms/aks/outboundtype
    outbound_type  = var.kubernetes_outbound_type
    pod_cidr       = "10.244.0.0/14"
    service_cidr   = "172.25.0.0/16"
    dns_service_ip = "172.25.0.10"
  }
  # Required for workload identity
  oidc_issuer_enabled = true
  workload_autoscaler_profile {
    keda_enabled = local.keda_enabled
  }

  # Container insights
  dynamic "oms_agent" {
    for_each = var.log_analytics_workspace_id != "" ? ["oms_agent"] : []

    content {
      log_analytics_workspace_id      = var.log_analytics_workspace_id
      msi_auth_for_monitoring_enabled = true
    }
  }

  maintenance_window_auto_upgrade {
    duration    = var.aks_maintenance_window_auto_upgrade.duration
    frequency   = var.aks_maintenance_window_auto_upgrade.frequency
    interval    = var.aks_maintenance_window_auto_upgrade.interval
    day_of_week = var.aks_maintenance_window_auto_upgrade.day_of_week
    start_time  = var.aks_maintenance_window_auto_upgrade.start_time
    utc_offset  = "+00:00"
  }

  maintenance_window_node_os {
    frequency   = var.aks_node_patch_window.frequency
    interval    = var.aks_node_patch_window.interval
    duration    = var.aks_node_patch_window.duration
    day_of_week = var.aks_node_patch_window.day_of_week
    start_time  = var.aks_node_patch_window.start_time
    utc_offset  = "+00:00"
  }

  automatic_channel_upgrade = "patch"
  node_os_channel_upgrade   = "SecurityPatch"

  workload_identity_enabled = true
  support_plan              = var.kubernetes_support_plan
  sku_tier                  = var.kubernetes_sku_tier
  tags                      = module.aks_tagging.tags

  depends_on = [
    azurerm_role_assignment.aks_to_itself,
    azurerm_role_assignment.aks_network_contributor_subnet,
    azurerm_role_assignment.aks_network_contributor_route_table,
  ]

  lifecycle {
    ignore_changes = [default_node_pool[0].orchestrator_version, kubernetes_version]
  }
}
# create blue user node pool
resource "azurerm_kubernetes_cluster_node_pool" "blue_pool" {
  name                  = "blueuser"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.this.id
  vm_size               = local.blue_node_pool_configs.vm_size
  enable_auto_scaling   = true
  max_count             = local.blue_node_pool_configs.max_count
  min_count             = local.blue_node_pool_configs.min_count
  node_count            = local.blue_node_pool_configs.node_count
  max_pods              = 110
  mode                  = "User"
  orchestrator_version  = local.blue_node_pool_configs.kubernetes_version
  os_disk_type          = var.user_pool_disk_type
  tags                  = module.aks_tagging.tags
  zones                 = local.availability_zones
  upgrade_settings {
    max_surge = var.user_node_upgrade_max_surge
  }
  vnet_subnet_id = var.node_pool_subnet_id

  lifecycle {
    ignore_changes = [node_count, orchestrator_version]
  }
}

kubernetes_support_plan = "AKSLongTermSupport"
kubernetes_sku_tier     = "Premium"
orchestrator_version    = "1.27"
kubernetes_version      = "1.27"
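
Distilled down, the scenario that trips the client-side check is roughly the following; this is a minimal sketch for illustration only, with hypothetical names, region, and VM size rather than the values from our actual configuration:

# Minimal sketch: an LTS cluster plus a user node pool pinned to 1.27,
# the combination the provider's client-side version check currently rejects.
resource "azurerm_kubernetes_cluster" "lts" {
  name                = "example-aks-lts"
  location            = "westus3"
  resource_group_name = "example-rg"
  dns_prefix          = "exampleakslts"

  kubernetes_version = "1.27"
  sku_tier           = "Premium"            # LTS requires the Premium tier
  support_plan       = "AKSLongTermSupport" # opts the cluster into long-term support

  default_node_pool {
    name       = "system"
    vm_size    = "Standard_D4s_v5"
    node_count = 1
  }

  identity {
    type = "SystemAssigned"
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "user" {
  name                  = "user"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.lts.id
  vm_size               = "Standard_D4s_v5"
  node_count            = 1
  mode                  = "User"
  orchestrator_version  = "1.27" # blocked today by the provider's client-side version validation
}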

Other Details

ms-henglu commented 2 weeks ago

The azurerm provider uses the "availableAgentpoolVersions" API for client-side validation, but it seems this API fails to return the available versions:

az rest -m GET -u 'https://management.azure.com/subscriptions/****/resourceGroups/acctestRG-aks-henglu/providers/Microsoft.ContainerService/managedClusters/acctestakhenglu/availableAgentPoolVersions?api-version=2024-05-01'
{
  "id": "/subscriptions/*****/resourcegroups/acctestRG-aks-henglu/providers/Microsoft.ContainerService/managedClusters/acctestakhenglu/availableagentpoolversions",
  "name": "default",
  "properties": {
    "agentPoolVersions": []
  },
  "type": "Microsoft.ContainerService/managedClusters/availableAgentpoolVersions"
}

Once this API is fixed, the Terraform azurerm provider will be unblocked.

I also have a workaround: use the azapi provider, another Terraform provider for Azure that performs no client-side validation. Here's an example that creates an agent pool.

resource "azapi_resource" "agentPool" {
  type      = "Microsoft.ContainerService/managedClusters/agentPools@2024-05-01"
  parent_id = azurerm_kubernetes_cluster.test.id
  name      = "internal"
  body = {
    properties = {
      count  = 1
      mode   = "User"
      vmSize = "Standard_DS2_v2"
      orchestratorVersion = "1.27"
    }
  }
}

More details: https://learn.microsoft.com/en-us/azure/templates/microsoft.containerservice/managedclusters/agentpools?pivots=deployment-language-terraform
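
If you haven't used azapi before, it needs to be declared alongside azurerm; a minimal sketch (note that older azapi releases expect body = jsonencode({ ... }) rather than the bare HCL object shown above):

terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
    azapi = {
      source = "Azure/azapi"
    }
  }
}

# azapi authenticates the same way azurerm does (Azure CLI, environment
# credentials, managed identity, ...), so an empty block is usually enough.
provider "azapi" {}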

will-iam-gm commented 2 weeks ago

@ms-henglu would you happen to have an ETA on this API fix? We have 96 clusters and need to keep supporting the existing ones; at this time I would rather not change my Terraform templates to swap out AzureRM for AzAPI. For new clusters we are looking to jump to the new LTS version, which is 1.30.

Also, would this API fix be included in a 3.x patch version, or only in 4.x? Thanks for your help.

ms-henglu commented 2 weeks ago

Hi @will-iam-gm, the root cause is on the API side; there's no action needed on the client side.

Please confirm with @haitch about the fix on the API side.

ms-henglu commented 2 weeks ago

Hi @will-iam-gm,

I'd like to share another workaround:

I disabled the version validation in the azurerm provider and pushed the changes to my fork of the provider. You can compile and use it locally.

Branches:

  - Based on v3.116.0: https://github.com/ms-henglu/terraform-provider-azurerm/tree/issue-27245-mitigation-v3.116.0
  - Based on v4.0.1: https://github.com/ms-henglu/terraform-provider-azurerm/tree/issue-27245-mitigation-v4.0.1

If you want to use other versions, you can cherry-pick this commit: https://github.com/hashicorp/terraform-provider-azurerm/compare/main...ms-henglu:terraform-provider-azurerm:issue-27245-mitigation-v3.116.0

How to use a locally compiled provider: https://github.com/hashicorp/terraform-provider-azurerm/blob/main/DEVELOPER.md#developer-using-the-locally-compiled-azure-provider-binary
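
For reference, one way to point Terraform at the locally compiled binary is a dev_overrides block in the Terraform CLI configuration; a minimal sketch (the path is illustrative, see the DEVELOPER.md link above for the full steps):

# ~/.terraformrc (Terraform CLI configuration)
provider_installation {
  dev_overrides {
    # Directory that contains the locally built terraform-provider-azurerm binary
    "hashicorp/azurerm" = "/home/you/go/bin"
  }
  # All other providers keep being installed from the registry as usual
  direct {}
}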

But again, this is just a temporary workaround; this change cannot be included in a public release.

will-iam-gm commented 2 weeks ago

Thanks @ms-henglu

I will be going over our options with my team to see which one best fits our platform.

rovangju commented 1 week ago

Is there anything tracking the API side of this? Is there an Azure ticket or something?

hahewlet commented 1 week ago

@rovangju I asked Microsoft that question on the following post on Microsoft Learn, https://learn.microsoft.com/en-us/answers/questions/2028718/unable-to-deploy-aks-lts-1-27-in-multiple-regions. Here was the response.

[screenshot of Microsoft's response]

haitch commented 1 week ago

I am actually from the API team. The API fix is indeed rolling out, but it will take some time. In the meantime, the suggested options are:

  1. Use the az CLI to add the new node pool.
  2. Use the azapi Terraform provider for Azure, which doesn't perform this extra validation.
  3. Use the workaround @ms-henglu provided above.

rovangju commented 1 week ago

Thanks for the follow-up. It's unfortunate that this happened in such a manner. I have closed-loop production environments that are all under Terraform control, so I'm trying to figure out whether I can just sit tight or need to start pushing for an out-of-band workaround.

hahewlet commented 1 week ago

@haitch At the same time we were debugging this issue with AKS 1.27 LTS, we noticed that 1.27 was removed as an option from eastus and eastus2, and then a few days later from westus2. AKS 1.27 LTS is still missing from these regions. We are getting by with westus and westus3 for now, but we are based on the east coast of the United States. Is all of this related somehow? Will 1.27 be returning to these regions?

e.g.

$ az aks get-versions --location eastus2 --output table
KubernetesVersion    Upgrades
-------------------  --------------------------------------------------------------------------------------------------------
1.30.3               None available
1.30.2               1.30.3
1.30.1               1.30.2, 1.30.3
1.30.0               1.30.1, 1.30.2, 1.30.3
1.29.7               1.30.0, 1.30.1, 1.30.2, 1.30.3
1.29.6               1.29.7, 1.30.0, 1.30.1, 1.30.2, 1.30.3
1.29.5               1.29.6, 1.29.7, 1.30.0, 1.30.1, 1.30.2, 1.30.3
1.29.4               1.29.5, 1.29.6, 1.29.7, 1.30.0, 1.30.1, 1.30.2, 1.30.3
1.29.2               1.29.4, 1.29.5, 1.29.6, 1.29.7, 1.30.0, 1.30.1, 1.30.2, 1.30.3
1.29.0               1.29.2, 1.29.4, 1.29.5, 1.29.6, 1.29.7, 1.30.0, 1.30.1, 1.30.2, 1.30.3
1.28.12              1.29.0, 1.29.2, 1.29.4, 1.29.5, 1.29.6, 1.29.7
1.28.11              1.28.12, 1.29.0, 1.29.2, 1.29.4, 1.29.5, 1.29.6, 1.29.7
1.28.10              1.28.11, 1.28.12, 1.29.0, 1.29.2, 1.29.4, 1.29.5, 1.29.6, 1.29.7
1.28.9               1.28.10, 1.28.11, 1.28.12, 1.29.0, 1.29.2, 1.29.4, 1.29.5, 1.29.6, 1.29.7
1.28.5               1.28.9, 1.28.10, 1.28.11, 1.28.12, 1.29.0, 1.29.2, 1.29.4, 1.29.5, 1.29.6, 1.29.7
1.28.3               1.28.5, 1.28.9, 1.28.10, 1.28.11, 1.28.12, 1.29.0, 1.29.2, 1.29.4, 1.29.5, 1.29.6, 1.29.7
1.28.0               1.28.3, 1.28.5, 1.28.9, 1.28.10, 1.28.11, 1.28.12, 1.29.0, 1.29.2, 1.29.4, 1.29.5, 1.29.6, 1.29.7

will-iam-gm commented 1 day ago

@hahewlet Microsoft is deprecating 1.27 unless you are on the LTS support plan. I don't think you can pick the version from the portal, only through the az CLI or IaC.

[screenshot]

will-iam-gm commented 1 day ago

@ms-henglu @haitch With the API fix rolled out, today I was able to deploy a node pool on version 1.27 using Terraform with no changes to the provider. Thanks for your help here.

hahewlet commented 1 day ago

@will-iam-gm your screenshot gave me the clue I needed. My output for eastus2 did not include 1.27 because my az CLI was too old. Once I upgraded it, I could see the 1.27 versions listed.