Gracefully handle Azure Virtual Machines shutdown within Kubernetes
This tool ensures that the Kubernetes cluster responds appropriately to events that can cause your Azure Virtual Machines to become unavailable, such as evictions of Azure Spot Virtual Machines or reboots. If not handled, your application code may not stop gracefully, recovery to full availability may take longer, or work might accidentally be scheduled to nodes that are shutting down. This tool can also send Telegram, Slack or Webhook messages before Azure Virtual Machines evictions occur.
It is based on Azure Scheduled Events and the Kubernetes procedure Safely Drain a Node.
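As a rough illustration of what the tool automates, the manual equivalent is to poll the Scheduled Events endpoint from the node, decide whether a pending event needs action, and drain the node. The sketch below is not the tool's implementation. The helper names and the policy in `should_drain` (act on everything except `Freeze`) are assumptions made for illustration. The event type names and the metadata endpoint come from the Azure Scheduled Events documentation.

```shell
#!/bin/sh
# Sketch only: helper names and the drain policy are hypothetical.

# Query Azure Scheduled Events from inside the VM (link-local metadata address).
fetch_scheduled_events() {
  curl -s -H "Metadata:true" \
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
}

# Assumed policy: drain on every documented event type except Freeze.
# Documented event types: Freeze, Reboot, Redeploy, Preempt, Terminate.
should_drain() {
  case "$1" in
    Reboot|Redeploy|Preempt|Terminate) return 0 ;;
    *) return 1 ;;
  esac
}

# Cordon the node and evict its pods, as described in "Safely Drain a Node".
drain_node() {
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}
```

For example, `should_drain Preempt && drain_node aks-nodename-to-simulate-eviction` would drain the node when a Spot eviction is pending.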
Supports Linux (amd64, arm64) and Windows 2022, 2019* (amd64) nodes.
helm repo add aks-node-termination-handler https://maksim-paskal.github.io/aks-node-termination-handler/
helm repo update
helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--set priorityClassName=system-node-critical
You can compose your payload with markers that are described here
You need to install the Azure Command-Line Interface and configure kubectl to connect to your AKS cluster.
# Azure CLI version is 2.61.0
az --version
# Choose your AKS node to simulate eviction
kubectl get no
# Identify your node Azure ID
# subscriptions/{}/resourceGroups/{}/providers/Microsoft.Compute/virtualMachineScaleSets/{}/virtualMachines/{}
kubectl get no aks-nodename-to-simulate-eviction -o json | jq -r '.spec.providerID[9:]'
# Append the path /simulateEviction?api-version=2024-03-01 to your node's Azure ID
# and send the request to management.azure.com
az rest --verbose -m post --header "Accept=application/json" -u "https://management.azure.com/{Azure ID}/simulateEviction?api-version=2024-03-01"
You can also test with the Simulate Eviction API, changing the API endpoint so that it addresses the virtualMachineScaleSets resources that AKS uses:
POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmScaleSetName}/virtualMachines/{instanceId}/simulateEviction?api-version=2021-11-01
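The simulation steps above can be collected into a small helper. This is a minimal sketch: the function name `simulate_eviction_url` is hypothetical, and it assumes the node's providerID starts with `azure:///`, which is what the `[9:]` slice in the jq command above implies. The API version defaults to 2024-03-01 and can be overridden with a second argument.

```shell
#!/bin/sh
# Build the simulateEviction URL from a node's providerID.
# Hypothetical helper; assumes a providerID of the form azure:///subscriptions/...
simulate_eviction_url() {
  provider_id="$1"
  api_version="${2:-2024-03-01}"
  # Strip the 9-character "azure:///" prefix, like the jq slice [9:].
  azure_id="${provider_id#azure:///}"
  printf 'https://management.azure.com/%s/simulateEviction?api-version=%s\n' \
    "$azure_id" "$api_version"
}

# Usage (requires kubectl access to the cluster and a logged-in Azure CLI):
# PROVIDER_ID=$(kubectl get no aks-nodename-to-simulate-eviction -o jsonpath='{.spec.providerID}')
# az rest --verbose -m post --header "Accept=application/json" \
#   -u "$(simulate_eviction_url "$PROVIDER_ID")"
```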
The application exposes Prometheus metrics at the /metrics endpoint. Installing the latest chart adds the following annotations to the pods:
annotations:
prometheus.io/port: "17923"
prometheus.io/scrape: "true"
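To check the endpoint by hand, one option is to port-forward to a handler pod and fetch /metrics on the annotated port. The sketch below assumes the pod listens on port 17923, as the annotation suggests. The `metric_samples` helper is hypothetical and only filters out comment and blank lines from the Prometheus text format. The pod name is a placeholder to fill in.

```shell
#!/bin/sh
# Hypothetical helper: keep only sample lines from Prometheus text output,
# dropping "# HELP" / "# TYPE" comments and blank lines.
metric_samples() {
  grep -v -e '^#' -e '^$'
}

# Usage (replace <pod-name> with one of the handler pods in kube-system):
# kubectl -n kube-system port-forward pod/<pod-name> 17923:17923 &
# curl -s http://localhost:17923/metrics | metric_samples
```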
If your cluster has Linux and Windows 2019 nodes, you need to use a different image:
helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--set priorityClassName=system-node-critical \
--set image=paskalmaksim/aks-node-termination-handler:latest-ltsc2019
If your cluster includes Linux, Windows 2022, and Windows 2019 nodes, you will need two separate helm installations of aks-node-termination-handler, each with different values.
# install aks-node-termination-handler for Linux and Windows 2022 nodes
helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--values=linux-windows2022.values.yaml
# install aks-node-termination-handler for Windows 2019 nodes
helm upgrade aks-node-termination-handler-windows-2019 \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--values=linux-windows2019.values.yaml
For OpenShift clusters that run their nodes on Azure compute, you must enable pod hostNetwork support, because OpenShift networking restricts access to the Azure Metadata Service.
This support can be enabled with --set hostNetwork=true:
helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--set priorityClassName=system-node-critical \
--set hostNetwork=true