maksim-paskal / aks-node-termination-handler

Gracefully handle Azure Virtual Machines shutdown within Kubernetes
Apache License 2.0

AKS Node Termination Handler

Motivation

This tool ensures that the Kubernetes cluster responds appropriately to events that can cause your Azure Virtual Machines to become unavailable, such as evictions of Azure Spot Virtual Machines or reboots. If not handled, your application code may not stop gracefully, recovery to full availability may take longer, or work might accidentally be scheduled to nodes that are shutting down. This tool can also send Telegram, Slack or Webhook messages before Azure Virtual Machines evictions occur.

It is based on Azure Scheduled Events and the Kubernetes Safely Drain a Node procedure.
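
To make the mechanism concrete, here is a minimal sketch of checking Azure Scheduled Events from inside a VM. It is not taken from this project's source: the helper name `has_preempt_event` is hypothetical, the API version `2020-07-01` is an assumption, and the `grep` check is a crude stand-in for real JSON parsing.

```bash
#!/usr/bin/env bash
# Azure Instance Metadata Service endpoint for Scheduled Events.
# Assumption: api-version 2020-07-01; adjust to the version you target.
EVENTS_URL="http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"

# Hypothetical helper: reads a Scheduled Events JSON document on stdin and
# succeeds if it mentions an event of type Preempt (a Spot eviction).
has_preempt_event() {
  grep -Eq '"EventType"[[:space:]]*:[[:space:]]*"Preempt"'
}

# Usage from inside an Azure VM (the endpoint is not reachable from outside):
#   curl -s -H "Metadata:true" "$EVENTS_URL" | has_preempt_event && echo "eviction scheduled"
```

When such an event shows up, draining the affected node (for example with `kubectl drain`) is the step that Safely Drain a Node describes.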

Supports Linux (amd64, arm64) and Windows 2022 and 2019* (amd64) nodes.

Create Azure Kubernetes Cluster

Create a basic AKS cluster with the Azure CLI:

```bash
# https://learn.microsoft.com/en-us/azure/aks/learn/quick-kubernetes-deploy-cli

# Azure CLI version is 2.50.0
az --version

# Create resource group
az group create \
--name test-aks-group-eastus \
--location eastus

# Create AKS cluster with regular (non-Spot) instances
az aks create \
--resource-group test-aks-group-eastus \
--name MyManagedCluster \
--node-count 1 \
--node-vm-size Standard_DS2_v2 \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 3

# Create Linux nodepool with Spot Virtual Machines and autoscaling
az aks nodepool add \
--resource-group test-aks-group-eastus \
--cluster-name MyManagedCluster \
--name spotpool \
--priority Spot \
--eviction-policy Delete \
--spot-max-price -1 \
--enable-cluster-autoscaler \
--node-vm-size Standard_DS2_v2 \
--min-count 0 \
--max-count 10

# Create Windows (Windows Server 2022) nodepool with Spot Virtual Machines and autoscaling
az aks nodepool add \
--resource-group test-aks-group-eastus \
--cluster-name MyManagedCluster \
--os-type Windows \
--os-sku Windows2022 \
--priority Spot \
--eviction-policy Delete \
--spot-max-price -1 \
--enable-cluster-autoscaler \
--name spot01 \
--min-count 1 \
--max-count 3

# Create Windows (Windows Server 2019) nodepool with Spot Virtual Machines and autoscaling
az aks nodepool add \
--resource-group test-aks-group-eastus \
--cluster-name MyManagedCluster \
--os-type Windows \
--os-sku Windows2019 \
--priority Spot \
--eviction-policy Delete \
--spot-max-price -1 \
--enable-cluster-autoscaler \
--name spot2 \
--min-count 1 \
--max-count 3

# Get config to connect to cluster
az aks get-credentials \
--resource-group test-aks-group-eastus \
--name MyManagedCluster
```

Installation

```bash
helm repo add aks-node-termination-handler https://maksim-paskal.github.io/aks-node-termination-handler/
helm repo update

helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--set priorityClassName=system-node-critical
```

Send notification events

You can compose your payload using the template markers described in the project documentation.

Send Telegram notification

```bash
# Supply your Telegram bot token and chat ID after the trailing '=' signs
helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--set priorityClassName=system-node-critical \
--set 'args[0]=-telegram.token=' \
--set 'args[1]=-telegram.chatID='
```

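
Each handler flag is passed to the chart as one element of the `args` list, so the index in `args[N]` has to increase with every additional flag. The sketch below generates those `--set` options from a list of flags. `helm_args` is a hypothetical helper written for illustration, and the token and chat ID in the example call are placeholders, not real values.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one "--set args[N]=<flag>" option per argument,
# numbering the flags from 0 in the order they are given.
helm_args() {
  local i=0 flag
  for flag in "$@"; do
    printf -- '--set args[%d]=%s\n' "$i" "$flag"
    i=$((i + 1))
  done
}

# Example call with placeholder values (not real credentials)
helm_args '-telegram.token=TOKEN' '-telegram.chatID=12345'
```

The output is meant for inspection or copy-paste into a `helm upgrade` command. Keep the single quotes around each `args[N]=...` value when pasting, as in the command above, so the shell does not interpret the brackets.
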
Send Slack notification

```bash
# create payload file
cat <
```

Send Prometheus Pushgateway event

```bash
cat <
```

Use an HTTP proxy for making webhook requests

Use the flag `-webhook.http-proxy=http://someproxy:3128` for making requests with a proxy. This flag can use HTTP or HTTPS addresses. You can also use basic auth.

```bash
cat <
```

Simulate eviction

Using Azure CLI

Install the Azure Command-Line Interface (Azure CLI) and configure kubectl to connect to your AKS cluster.

```bash
# Azure CLI version is 2.61.0
az --version

# Choose your AKS node to simulate eviction
kubectl get no

# Identify your node Azure ID
# subscriptions/{}/resourceGroups/{}/providers/Microsoft.Compute/virtualMachineScaleSets/{}/virtualMachines/{}
kubectl get no aks-nodename-to-simulate-eviction -o json | jq -r '.spec.providerID[9:]'

# Append to your node Azure ID additional path /simulateEviction?api-version=2024-03-01
# And execute this simulation with management.azure.com
az rest --verbose -m post --header "Accept=application/json" -u "https://management.azure.com/{Azure ID}/simulateEviction?api-version=2024-03-01"
```
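
The manual steps above (strip the provider prefix, append the `simulateEviction` path) can be sketched as a small shell function. `eviction_url` is a hypothetical helper, and it assumes the node's `providerID` has the form `azure:///subscriptions/...`, which is what the `[9:]` slice in the `jq` expression relies on.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the simulateEviction URL from a node's providerID.
# Assumes the providerID looks like "azure:///subscriptions/...".
eviction_url() {
  local provider_id="$1"
  local api_version="${2:-2024-03-01}"
  # Strip the "azure:///" prefix, mirroring jq's .spec.providerID[9:]
  local azure_id="${provider_id#azure:///}"
  printf 'https://management.azure.com/%s/simulateEviction?api-version=%s\n' \
    "$azure_id" "$api_version"
}

# Usage (the node name is a placeholder):
#   provider_id=$(kubectl get no aks-nodename-to-simulate-eviction -o jsonpath='{.spec.providerID}')
#   az rest --verbose -m post --header "Accept=application/json" -u "$(eviction_url "$provider_id")"
```

The optional second argument overrides the API version if you need a different one.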

Using browser

You can also test with the Simulate Eviction API. Change the API endpoint so that it targets the virtualMachineScaleSets resources that AKS uses:

POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmScaleSetName}/virtualMachines/{instanceId}/simulateEviction?api-version=2021-11-01

Metrics

The application exposes Prometheus metrics at the /metrics endpoint. Installing the latest chart will add annotations to the pods:

```yaml
annotations:
  prometheus.io/port: "17923"
  prometheus.io/scrape: "true"
```
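
To look at the metrics by hand, one option is to port-forward to a handler pod and fetch `/metrics` with `curl`. The sketch below adds a small filter that keeps only the sample lines of the Prometheus text format. `metric_samples` is a hypothetical helper, and `POD` stands for a pod name that you need to look up in your own cluster.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# only the sample lines, dropping "# HELP"/"# TYPE" comments and blank lines.
metric_samples() {
  grep -Ev '^(#|[[:space:]]*$)'
}

# Usage (POD is a placeholder for one of the handler's pod names):
#   kubectl -n kube-system port-forward "pod/$POD" 17923:17923 &
#   curl -s http://localhost:17923/metrics | metric_samples
```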

Windows 2019 support

If your cluster has Linux and Windows 2019 nodes (and no Windows 2022 nodes), use a different image:

```bash
helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--set priorityClassName=system-node-critical \
--set image=paskalmaksim/aks-node-termination-handler:latest-ltsc2019
```

If your cluster includes Linux, Windows 2022, and Windows 2019 nodes, you will need two separate helm installations of aks-node-termination-handler, each with different values.

linux-windows2022.values.yaml

```yaml
priorityClassName: system-node-critical

image: paskalmaksim/aks-node-termination-handler:latest

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.azure.com/os-sku
          operator: NotIn
          values:
          - Windows2019
```

linux-windows2019.values.yaml

```yaml
priorityClassName: system-node-critical

image: paskalmaksim/aks-node-termination-handler:latest-ltsc2019

nodeSelector:
  kubernetes.azure.com/os-sku: Windows2019
```

```bash
# install aks-node-termination-handler for Linux and Windows 2022 nodes
helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--values=linux-windows2022.values.yaml

# install aks-node-termination-handler for Windows 2019 nodes
helm upgrade aks-node-termination-handler-windows-2019 \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--values=linux-windows2019.values.yaml
```
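
The split between the two releases comes down to the `kubernetes.azure.com/os-sku` node label: Windows 2019 nodes get the `latest-ltsc2019` image, everything else gets `latest`. The sketch below expresses that rule as a shell function so the mapping is easy to check. `image_for_os_sku` is a hypothetical helper for illustration and is not part of the chart.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a node's kubernetes.azure.com/os-sku label value to
# the image used by the two values files above.
image_for_os_sku() {
  case "$1" in
    Windows2019) echo "paskalmaksim/aks-node-termination-handler:latest-ltsc2019" ;;
    *)           echo "paskalmaksim/aks-node-termination-handler:latest" ;;
  esac
}

# List nodes with their os-sku label to see which release will cover each one:
#   kubectl get nodes -L kubernetes.azure.com/os-sku
image_for_os_sku Windows2019
image_for_os_sku Windows2022
```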

Red Hat OpenShift support

For OpenShift clusters whose nodes run on Azure compute, you must enable hostNetwork for the pod, because OpenShift networking restricts access to the Azure Metadata Service.

This support can be enabled with `--set hostNetwork=true`:

```bash
helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--set priorityClassName=system-node-critical \
--set hostNetwork=true
```