maksim-paskal / aks-node-termination-handler

Gracefully handle Azure Virtual Machines shutdown within Kubernetes
Apache License 2.0

OpenShift Support #75

Closed jsanchezmartinez closed 3 months ago

jsanchezmartinez commented 4 months ago

Hi. When running in OpenShift, there are no VirtualMachineScaleSets (only VirtualMachines), so the DaemonSet crashes (logs attached below). Could OpenShift support be added?

{"file":"github.com/maksim-paskal/aks-node-termination-handler/cmd/main.go:55","func":"main.main","level":"info","msg":"Starting 1.0.13-d8d5a71-1707463489...","time":"2024-03-07T10:48:45Z"}
{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/alert/alert.go:29","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/alert.Init","level":"warning","msg":"not sending Telegram message, no token","time":"2024-03-07T10:48:45Z"}
{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/client/client.go:45","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/client.Init","level":"info","msg":"No kubeconfig file use incluster","time":"2024-03-07T10:48:45Z"}
{"error":"error in getting azure resource name: azure:///subscriptions/dd6b40ef-de5f-4649-95a7-bd2337c71900/resourceGroups/ocp-azure-uat-euw-8npmn-rg/providers/Microsoft.Compute/virtualMachines/master-1: azureProviderID not valid","file":"github.com/maksim-paskal/aks-node-termination-handler/cmd/main.go:86","func":"main.main","level":"fatal","msg":"","time":"2024-03-07T10:48:45Z"}

maksim-paskal commented 4 months ago

@jsanchezmartinez thanks for opening this issue. By default this tool works with nodes created by AKS (the Azure service). All nodes that AKS creates have a .spec.providerID that corresponds to this format: https://github.com/maksim-paskal/aks-node-termination-handler/blob/d8d5a71ab0612096d46acab1a6a4c24b454619b5/pkg/api/api.go#L37

I have never used OpenShift, but theoretically this tool can work on OpenShift nodes. Can you share a recipe for building an OpenShift cluster on Azure?

I will try to create a cluster on Azure and run this tool on its nodes.

jsanchezmartinez commented 4 months ago

Hi @maksim-paskal, the easiest way is to use the ARO service (https://azure.microsoft.com/en-us/products/openshift/ and https://portal.azure.com/?feature.msaljs=true#view/HubsExtension/BrowseResource/resourceType/Microsoft.RedHatOpenShift%2FOpenShiftClusters). In OpenShift there are no VirtualMachineScaleSets (only VMs), so the Azure provider ID is a bit different: "^azure:///subscriptions/(.+)/resourceGroups/(.+)/providers/Microsoft.Compute/virtualMachines/(.+)$"
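
To illustrate the difference between the two provider ID shapes, here is a minimal Go sketch that accepts both. The standalone-VM pattern is the one quoted in the comment above. The scale-set pattern is an assumption about the usual AKS format and is not quoted in this thread, and the names parseProviderID and azureResource are hypothetical, not taken from the tool's source.

```go
package main

import (
	"fmt"
	"regexp"
)

// vmssProviderID matches the scale-set form. This pattern is an assumption
// about the usual AKS provider ID layout; it does not appear in the thread.
var vmssProviderID = regexp.MustCompile(
	`^azure:///subscriptions/(.+)/resourceGroups/(.+)/providers/Microsoft.Compute/virtualMachineScaleSets/(.+)/virtualMachines/(.+)$`)

// vmProviderID matches the standalone-VM form quoted above for OpenShift.
var vmProviderID = regexp.MustCompile(
	`^azure:///subscriptions/(.+)/resourceGroups/(.+)/providers/Microsoft.Compute/virtualMachines/(.+)$`)

// azureResource holds the parts of a provider ID (hypothetical type).
// ScaleSet stays empty for standalone VMs.
type azureResource struct {
	SubscriptionID string
	ResourceGroup  string
	ScaleSet       string
	VMName         string
}

// parseProviderID tries the scale-set form first, then the standalone-VM form.
func parseProviderID(id string) (azureResource, error) {
	if m := vmssProviderID.FindStringSubmatch(id); m != nil {
		return azureResource{SubscriptionID: m[1], ResourceGroup: m[2], ScaleSet: m[3], VMName: m[4]}, nil
	}
	if m := vmProviderID.FindStringSubmatch(id); m != nil {
		return azureResource{SubscriptionID: m[1], ResourceGroup: m[2], VMName: m[3]}, nil
	}
	return azureResource{}, fmt.Errorf("azureProviderID not valid: %s", id)
}

func main() {
	// Provider ID taken from the fatal log line in the original report.
	id := "azure:///subscriptions/dd6b40ef-de5f-4649-95a7-bd2337c71900/resourceGroups/ocp-azure-uat-euw-8npmn-rg/providers/Microsoft.Compute/virtualMachines/master-1"
	res, err := parseProviderID(id)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", res)
}
```

Trying the more specific scale-set pattern first keeps the two cases unambiguous; how the tool itself resolves this may differ.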

maksim-paskal commented 4 months ago

@jsanchezmartinez can you reboot an OpenShift server (for example, a worker node) from the Azure Portal?

I created an OpenShift cluster on Azure, but all operations on the servers are forbidden for my user (I think the restriction was set on the resource group containing the OpenShift servers when the cluster was created).

I don't know how to test my changes.

jsanchezmartinez commented 4 months ago

OpenShift VMs cannot be restarted when using Azure ARO. I can restart VMs in some of our clusters because they are self-managed. Why do you need to restart a VM from OpenShift to test the changes?

maksim-paskal commented 4 months ago

aks-node-termination-handler listens to all events from Azure Scheduled Events, and a reboot is one of the events that Azure sends...

How do you plan to use this tool? Are you planning to use Azure Spot?
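
As a rough illustration of what reading Scheduled Events involves, the Go sketch below queries the Azure Instance Metadata Service and decodes the response. It is not the tool's actual code: the api-version value and the helper names (fetchEvents, parseEvents, eventsFor) are assumptions, and only a subset of the documented event fields is decoded.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// Scheduled Events endpoint of the Azure Instance Metadata Service.
// The api-version here is an assumption; check the Azure docs for the current value.
const scheduledEventsURL = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"

// scheduledEvent decodes a subset of the fields of one event.
// EventType is one of Freeze, Reboot, Redeploy, Preempt (Spot eviction), Terminate.
type scheduledEvent struct {
	EventID   string   `json:"EventId"`
	EventType string   `json:"EventType"`
	Resources []string `json:"Resources"`
	NotBefore string   `json:"NotBefore"`
}

type eventsDocument struct {
	DocumentIncarnation int              `json:"DocumentIncarnation"`
	Events              []scheduledEvent `json:"Events"`
}

// parseEvents decodes the JSON document returned by the endpoint.
func parseEvents(body []byte) ([]scheduledEvent, error) {
	var doc eventsDocument
	if err := json.Unmarshal(body, &doc); err != nil {
		return nil, fmt.Errorf("decoding scheduled events: %w", err)
	}
	return doc.Events, nil
}

// eventsFor keeps only the events whose Resources list names the given resource.
func eventsFor(events []scheduledEvent, resource string) []scheduledEvent {
	var matched []scheduledEvent
	for _, e := range events {
		for _, r := range e.Resources {
			if r == resource {
				matched = append(matched, e)
				break
			}
		}
	}
	return matched
}

// fetchEvents performs one GET; the "Metadata: true" header is required by the service.
func fetchEvents(client *http.Client) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, scheduledEventsURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Metadata", "true")
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	body, err := fetchEvents(&http.Client{Timeout: 5 * time.Second})
	if err != nil {
		fmt.Println("metadata endpoint not reachable:", err)
		return
	}
	events, err := parseEvents(body)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, e := range events {
		fmt.Printf("%s %s resources=%v notBefore=%q\n", e.EventID, e.EventType, e.Resources, e.NotBefore)
	}
}
```

Outside an Azure VM the request simply fails and the program prints the error, which also makes it a quick check of whether a pod can reach 169.254.169.254.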

jsanchezmartinez commented 4 months ago

Yes. We are currently running spot instances and mainly want to drain nodes when eviction events are detected. Maybe you can test by simulating eviction events through the Azure API: https://learn.microsoft.com/en-us/rest/api/compute/virtual-machines/simulate-eviction?view=rest-compute-2023-10-02&tabs=HTTP
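
For reference, the linked Simulate Eviction operation is a POST against the VM resource in Azure Resource Manager. The Go sketch below only builds such a request and does not send it. The helper names are hypothetical, the bearer token is a placeholder that would have to come from Azure AD, and the api-version is passed in as a parameter because the right value should be taken from the linked documentation rather than from this example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// simulateEvictionURL builds the Azure Resource Manager URL for the
// Simulate Eviction operation on a single VM (hypothetical helper).
func simulateEvictionURL(subscriptionID, resourceGroup, vmName, apiVersion string) string {
	return fmt.Sprintf(
		"https://management.azure.com/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/virtualMachines/%s/simulateEviction?api-version=%s",
		url.PathEscape(subscriptionID),
		url.PathEscape(resourceGroup),
		url.PathEscape(vmName),
		url.QueryEscape(apiVersion),
	)
}

// newSimulateEvictionRequest prepares the POST request with a bearer token.
// Obtaining the token is out of scope for this sketch.
func newSimulateEvictionRequest(token, subscriptionID, resourceGroup, vmName, apiVersion string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost,
		simulateEvictionURL(subscriptionID, resourceGroup, vmName, apiVersion), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	// All values below are placeholders, including the api-version.
	req, err := newSimulateEvictionRequest("<token>", "<subscription-id>", "<resource-group>", "<vm-name>", "2023-09-01")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Print instead of sending, so the sketch has no side effects.
	fmt.Println(req.Method, req.URL.String())
}
```

Sending the request would require an http.Client and sufficient permissions on the resource group that holds the VM.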

maksim-paskal commented 4 months ago

How can I add spot instances to an OpenShift cluster?

jsanchezmartinez commented 4 months ago

Basically, you have to pick an existing worker MachineSet, copy it, and adapt the copy (https://learn.microsoft.com/en-us/azure/openshift/howto-spot-nodes). This is the important part to add:

      providerSpec:
        value:
          spotVMOptions: {}

maksim-paskal commented 4 months ago

The simulation API is not available for OpenShift servers; there are some restrictions on the servers' resource group. If you have an OpenShift cluster with Spot nodes, you can test my change in your cluster.

OpenShift clusters restrict pods from connecting to 169.254.169.254, and aks-node-termination-handler needs access to that address to read events. You must enable this by installing the chart with --set hostNetwork=true:

helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
https://github.com/maksim-paskal/aks-node-termination-handler/releases/download/v1.0.13/aks-node-termination-handler-1.1.5.tgz \
--set priorityClassName=system-node-critical \
--set image=paskalmaksim/aks-node-termination-handler:dev \
--set hostNetwork=true

If you can test it, that would be awesome. After the test I will release the feature.

jsanchezmartinez commented 3 months ago

I'll try to test today or next Monday and come back. Thanks :)

jsanchezmartinez commented 3 months ago

Seems to be working fine (see attached logs screenshot). Do you need/want anything else to validate?

[attached image: logs screenshot]

maksim-paskal commented 3 months ago

Let's keep watching. If something goes wrong, please describe the issue. I will release these changes next week (Thursday) if no issues are found.

maksim-paskal commented 3 months ago

These changes have been released; please switch your dev installation to production:

helm repo add aks-node-termination-handler https://maksim-paskal.github.io/aks-node-termination-handler/
helm repo update

helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--set priorityClassName=system-node-critical \
--set hostNetwork=true