rogerioefonseca opened 2 years ago
This is intended behavior from AKS side, see: https://docs.microsoft.com/en-us/azure/aks/upgrade-cluster?tabs=azure-cli#upgrade-an-aks-cluster
Ensure that any PodDisruptionBudgets (PDBs) allow for at least 1 pod replica to be moved at a time otherwise the drain/evict operation will fail. If the drain operation fails, the upgrade operation will fail by design to ensure that the applications are not disrupted. Please correct what caused the operation to stop (incorrect PDBs, lack of quota, and so on) and re-try the operation.
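For reference, a minimal sketch of a budget that satisfies this, assuming a hypothetical Deployment labelled app: my-app that runs at least two replicas (all names here are placeholders, not taken from this issue):

# Hypothetical PDB that always leaves room for one eviction, provided the
# selected Deployment runs 2+ replicas (policy/v1 assumes Kubernetes 1.21+).
kubectl apply -n my-namespace -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
EOF
# Verify a disruption is currently allowed (ALLOWED DISRUPTIONS should be >= 1).
kubectl get pdb my-app-pdb -n my-namespace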
Yep, that is right. I took a deeper look into that configuration and, as far as I can tell, it should not break, since all my PDBs allow at least 1 disruption.
The behavior I noticed that could be the problem is:
- Terraform first tries to destroy all the nodes from my app-pool,
- but that does not happen because of the PDB, and an error is triggered.
- Then the last node cannot be drained, because there are no other nodes left to place the pods on (see the drain sketch below).
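For context, this is roughly the per-node eviction the upgrade performs; the node name below is just a placeholder, not one of my nodes:

# Approximation of what AKS does for each node it replaces during the upgrade.
kubectl cordon aks-apppool-00000000-vmss000000
kubectl drain aks-apppool-00000000-vmss000000 --ignore-daemonsets --delete-emptydir-data
# If every PDB covering the pods on that node reports ALLOWED DISRUPTIONS 0
# (as happens once it is the last node left), the Eviction API keeps answering
# 429 "Too many Requests" and the drain never completes.
kubectl get pdb -A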
Not sure if I made myself clear. Let me know your thoughts...
Ping
This is still an issue.
I am hitting this running a simple "terraform destroy"
I verified that the PDB allowed 1 disruption before running tf destroy, but as soon as I noticed it hanging on deleting the second node pool I checked again and it was set to 0, so the node pool couldn't be removed. It hangs on removing the last metrics-server pod.
This is new behavior on this cluster, and the only thing that changed from the previous TF code is the addition of Helm/Cilium.
Source is here - aks-2 is the offending setup: https://github.com/kvietmeier/Terraform/tree/master/azure/testing
Actual syntax for deleting pdb:
kubectl delete pdb <pod name> -n <namespace>
kubectl delete pdb metrics-server-pdb -n kube-system
Just verified - a basic cluster using the azure plugin, no Helm, no Cilium, destroys fine.
Put Helm/Cilium back in and, as soon as the first node pool removal starts, the metrics-server-pdb allowed disruptions gets set to 0.
Before the node pool removal starts:
KV C:\Users\ksvietme\repos> kubectl get poddisruptionbudget -A
NAMESPACE     NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
kube-system   coredns-pdb          1               N/A               1                     12m
kube-system   metrics-server-pdb   1               N/A               1                     12m

And once it has started:
KV C:\Users\ksvietme\repos> kubectl get poddisruptionbudget -A
NAMESPACE     NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
kube-system   coredns-pdb          1               N/A               1                     12m
kube-system   metrics-server-pdb   1               N/A               0                     12m
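One way to catch that transition as it happens is to watch the budgets while the destroy runs (a sketch; assumes kubeconfig still points at the cluster being torn down):

# Watch the kube-system budgets while the node pool is being deleted.
kubectl get pdb -n kube-system -w
# Or poll just the allowed-disruptions count of the offending budget.
kubectl get pdb metrics-server-pdb -n kube-system -o jsonpath='{.status.disruptionsAllowed}'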
Hi @kvietmeier, have you managed to overcome this issue? I'm going through the same problem that you had, apparently with the same setup (AKS + Cilium installed via Helm).
I'm having the same issue. I suspect this is due to the CriticalAddonsOnly taint assigned to the System/Default pool (while running only 1 node). Still figuring out whether there's something that can be done on the Terraform side (I doubt it; maybe forcing cluster removal, which works when done manually from the portal).
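If that is the cause, the taint should be visible on the system pool's node. A quick check (just a sketch, nothing cluster-specific assumed):

# A system pool created with only-critical-addons-enabled carries the
# CriticalAddonsOnly=true:NoSchedule taint, so evicted add-on pods like
# metrics-server have nowhere else to go while that pool has a single node.
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'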
Is there an existing issue for this?
Community Note
Terraform Version
1.2.4
AzureRM Provider Version
3.12.0
Affected Resource(s)/Data Source(s)
azurerm_kubernetes_cluster
Terraform Configuration Files
Debug Output/Panic Output
Expected Behaviour
The provider should destroy the existing nodes and spawn new ones with the new version.
Actual Behaviour
Terraform just keeps trying to destroy the resources and, after deleting 3 of the 4 nodes in total, it fails:

╷
│ Warning: Experimental feature "module_variable_optional_attrs" is active
│
│   on .terraform/modules/aks/modules/aks/version.tf line 8, in terraform:
│    8: experiments = [module_variable_optional_attrs]
│
│ Experimental features are subject to breaking changes in future minor or
│ patch releases, based on feedback.
│
│ If you have feedback on the design of this feature, please open a GitHub
│ issue to discuss it.
│
│ (and 4 more similar warnings elsewhere)
╵
╷
│ Error: waiting for the deletion of Node Pool: (Agent Pool Name "apppool" / Managed Cluster Name "platform-test-aweu" / Resource Group "platform-test-aweu"): Code="DeleteVMSSAgentPoolFailed" Message="Drain of aks-apppool-27993491-vmss00000t did not complete pods [ingress-nginx-controller-7fdc8d7588-t4f9s]: Too many req pod ingress-nginx-controller-7fdc8d7588-t4f9s on node aks-apppool-27993491-vmss00000t. See http://aka.ms/aks/debugdrainfailures"
│
╵

The pipeline surfaces the same failure:

[error]Terraform command 'apply' failed with exit code '1'.
[error]╷
│ Error: waiting for the deletion of Node Pool: (Agent Pool Name "apppool" / Managed Cluster Name "platform-test-aweu" / Resource Group "platform-test-aweu"): Code="DeleteVMSSAgentPoolFailed" Message="Drain of aks-apppool-27993491-vmss00000t did not complete pods [ingress-nginx-controller-7fdc8d7588-t4f9s]: Too many req pod ingress-nginx-controller-7fdc8d7588-t4f9s on node aks-apppool-27993491-vmss00000t. See http://aka.ms/aks/debugdrainfailures"
│
╵
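Given the pod named in that error, the blocking budget can be located by matching the pod's labels against the PDB selectors. A sketch; the ingress-nginx namespace is an assumption here, adjust it to wherever the controller actually runs:

# Labels on the pod that refused eviction (namespace assumed).
kubectl get pod ingress-nginx-controller-7fdc8d7588-t4f9s -n ingress-nginx --show-labels
# Budgets in that namespace and how many disruptions each currently allows.
kubectl get pdb -n ingress-nginx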
Steps to Reproduce
terraform plan
terraform apply
Important Factoids
To work around it, I needed to delete the PodDisruptionBudgets and then rerun the pipeline:
kubectl delete poddisruptionbudgets --all --all-namespaces
References
No response