rogerioefonseca opened 2 years ago
This is intended behavior from AKS side, see: https://docs.microsoft.com/en-us/azure/aks/upgrade-cluster?tabs=azure-cli#upgrade-an-aks-cluster
Ensure that any PodDisruptionBudgets (PDBs) allow for at least 1 pod replica to be moved at a time otherwise the drain/evict operation will fail. If the drain operation fails, the upgrade operation will fail by design to ensure that the applications are not disrupted. Please correct what caused the operation to stop (incorrect PDBs, lack of quota, and so on) and re-try the operation.
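For reference, a minimal sketch of a budget that satisfies this, assuming a hypothetical Deployment labelled app: my-app that runs at least two replicas (all names here are placeholders, not taken from this issue):

# Hypothetical PDB that always leaves room for one eviction, provided the
# selected Deployment runs 2+ replicas (policy/v1 assumes Kubernetes 1.21+).
kubectl apply -n my-namespace -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
EOF
# Verify a disruption is currently allowed (ALLOWED DISRUPTIONS should be >= 1).
kubectl get pdb my-app-pdb -n my-namespace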
Yep, that is right. I took a deeper look into that configuration and, as far as I can tell, it should not break, since all my PDBs allow at least 1 disruption.
The behavior I noticed that could be the problem is:
- Terraform first tries to destroy all the nodes from my app-pool,
- but that does not happen because of the PDB, and an error is triggered.
- Then the last node cannot be drained, because there are no other nodes left to place the pods on (see the drain sketch below).
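For context, this is roughly the per-node eviction the upgrade performs; the node name below is just a placeholder, not one of my nodes:

# Approximation of what AKS does for each node it replaces during the upgrade.
kubectl cordon aks-apppool-00000000-vmss000000
kubectl drain aks-apppool-00000000-vmss000000 --ignore-daemonsets --delete-emptydir-data
# If every PDB covering the pods on that node reports ALLOWED DISRUPTIONS 0
# (as happens once it is the last node left), the Eviction API keeps answering
# 429 "Too many Requests" and the drain never completes.
kubectl get pdb -A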
Not sure if I made myself clear. Let me know your thoughts...
Ping
This is still an issue.
I am hitting this running a simple "terraform destroy"
I verified that the PDB allowed 1 disruption before running tf destroy, but as soon as I noticed it hanging on deleting the second node pool I checked again and it was set to 0, so the node pool couldn't be removed. It hangs on removing the last metrics-server pod.
This is new behavior on this cluster, and the only thing that changed from the previous TF code is the addition of Helm/Cilium.
Source is here - aks-2 is the offending setup: https://github.com/kvietmeier/Terraform/tree/master/azure/testing
Actual syntax for deleting pdb:
kubectl delete pdb <pod name> -n <namespace>
kubectl delete pdb metrics-server-pdb -n kube-system
Just verified - a basic cluster using the azure plugin, no Helm, no Cilium, destroys fine.
Put Helm/Cilium back in and, as soon as the first node pool removal starts, the metrics-server-pdb allowed disruptions gets set to 0.
Before the node pool removal starts:
KV C:\Users\ksvietme\repos> kubectl get poddisruptionbudget -A
NAMESPACE     NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
kube-system   coredns-pdb          1               N/A               1                     12m
kube-system   metrics-server-pdb   1               N/A               1                     12m

And once it has started:
KV C:\Users\ksvietme\repos> kubectl get poddisruptionbudget -A
NAMESPACE     NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
kube-system   coredns-pdb          1               N/A               1                     12m
kube-system   metrics-server-pdb   1               N/A               0                     12m
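One way to catch that transition as it happens is to watch the budgets while the destroy runs (a sketch; assumes kubeconfig still points at the cluster being torn down):

# Watch the kube-system budgets while the node pool is being deleted.
kubectl get pdb -n kube-system -w
# Or poll just the allowed-disruptions count of the offending budget.
kubectl get pdb metrics-server-pdb -n kube-system -o jsonpath='{.status.disruptionsAllowed}'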
Hi @kvietmeier, have you managed to overcome this issue? I'm going through the same problem that you had, apparently with the same setup (AKS + Cilium installed via Helm).
I'm having the same issue. I suspect this is due to the CriticalAddonsOnly taint assigned to the System/Default pool (while running only 1 node). Still figuring out whether there's something that can be done on the Terraform side (I doubt it; maybe forcing cluster removal, which works when done manually from the portal).
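If that is the cause, the taint should be visible on the system pool's node. A quick check (just a sketch, nothing cluster-specific assumed):

# A system pool created with only-critical-addons-enabled carries the
# CriticalAddonsOnly=true:NoSchedule taint, so evicted add-on pods like
# metrics-server have nowhere else to go while that pool has a single node.
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'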
Is there an existing issue for this?
Community Note
Terraform Version
1.2.4
AzureRM Provider Version
3.12.0
Affected Resource(s)/Data Source(s)
azurerm_kubernetes_cluster
Terraform Configuration Files
Debug Output/Panic Output
Expected Behaviour
The provider should destroy the existing nodes and spawn new ones with the new version.
Actual Behaviour
Terraform just keeps trying to destroy the resources and, after deleting 3 of the 4 nodes in total, it fails:

╷
│ Warning: Experimental feature "module_variable_optional_attrs" is active
│
│   on .terraform/modules/aks/modules/aks/version.tf line 8, in terraform:
│    8: experiments = [module_variable_optional_attrs]
│
│ Experimental features are subject to breaking changes in future minor or
│ patch releases, based on feedback.
│
│ If you have feedback on the design of this feature, please open a GitHub
│ issue to discuss it.
│
│ (and 4 more similar warnings elsewhere)
╵
╷
│ Error: waiting for the deletion of Node Pool: (Agent Pool Name "apppool" / Managed Cluster Name "platform-test-aweu" / Resource Group "platform-test-aweu"): Code="DeleteVMSSAgentPoolFailed" Message="Drain of aks-apppool-27993491-vmss00000t did not complete pods [ingress-nginx-controller-7fdc8d7588-t4f9s]: Too many req pod ingress-nginx-controller-7fdc8d7588-t4f9s on node aks-apppool-27993491-vmss00000t. See http://aka.ms/aks/debugdrainfailures"
│
╵

The pipeline surfaces the same failure:

[error]Terraform command 'apply' failed with exit code '1'.
[error]╷
│ Error: waiting for the deletion of Node Pool: (Agent Pool Name "apppool" / Managed Cluster Name "platform-test-aweu" / Resource Group "platform-test-aweu"): Code="DeleteVMSSAgentPoolFailed" Message="Drain of aks-apppool-27993491-vmss00000t did not complete pods [ingress-nginx-controller-7fdc8d7588-t4f9s]: Too many req pod ingress-nginx-controller-7fdc8d7588-t4f9s on node aks-apppool-27993491-vmss00000t. See http://aka.ms/aks/debugdrainfailures"
│
╵
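Given the pod named in that error, the blocking budget can be located by matching the pod's labels against the PDB selectors. A sketch; the ingress-nginx namespace is an assumption here, adjust it to wherever the controller actually runs:

# Labels on the pod that refused eviction (namespace assumed).
kubectl get pod ingress-nginx-controller-7fdc8d7588-t4f9s -n ingress-nginx --show-labels
# Budgets in that namespace and how many disruptions each currently allows.
kubectl get pdb -n ingress-nginx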
Steps to Reproduce
terraform plan
terraform apply
Important Factoids
To work around it, I needed to delete the PodDisruptionBudgets and then rerun the pipeline:
kubectl delete poddisruptionbudgets --all --all-namespaces
References
No response