maksim-paskal / aks-node-termination-handler

Gracefully handle Azure Virtual Machines shutdown within Kubernetes
Apache License 2.0

Docker build fails with non existing path for directory #84

Closed by silviuchiric 1 month ago

silviuchiric commented 1 month ago

Hello Maksim,

We tried to build the image from the Dockerfile, but the build fails with:

lstat aks-node-termination-handler: no such file or directory


maksim-paskal commented 1 month ago

The current build scenario relies on the goreleaser utility. To build a Docker image, you first need to build a binary, which requires Go to be installed. After that, you can use the following command:

make build image=somedockeruser/somerepo

This will build a linux/amd64 binary and then push the Docker image to the specified repository.

silviuchiric commented 1 month ago

Thank you, Maksim. I appreciate the quick reply.

silviuchiric commented 1 month ago

Hello Maksim,

We finally deployed, but the aks-node-termination-handler pods failed to start with the error "container has runAsNonRoot and image will run as root". Please see the screenshot.


silviuchiric commented 1 month ago

Events:

Type     Reason     Age               From               Message
----     ------     ----              ----               -------
Normal   Scheduled  28s               default-scheduler  Successfully assigned kube-system/aks-node-termination-handler-265zk to aks-platform2-41202490-vmss000006
Normal   Pulled     28s               kubelet            Successfully pulled image "poc-container-registry.xxx.net/xxx/smartservices/images/aks-node-termination-handler:1.1-snapshot" in 204ms (204ms including waiting)
Normal   Pulled     28s               kubelet            Successfully pulled image "poc-container-registry.xxx.net/xxx/smartservices/images/aks-node-termination-handler:1.1-snapshot" in 220ms (220ms including waiting)
Normal   Pulling    3s (x4 over 28s)  kubelet            Pulling image "poc-container-registry.xxx.net/xxx/smartservices/images/aks-node-termination-handler:1.1-snapshot"
Warning  Failed     2s (x4 over 28s)  kubelet            Error: container has runAsNonRoot and image will run as root (pod: "aks-node-termination-handler-265zk_kube-system(f11cfa14-34fa-4be3-a754-91e646783a3d)", container: aks-node-termination-handler)
Normal   Pulled     2s (x2 over 14s)  kubelet            Successfully pulled image "poc-container-registry.xxx.net/xxx/smartservices/images/aks-node-termination-handler:1.1-snapshot" in 180ms (180ms including waiting)

maksim-paskal commented 1 month ago

Hi, I can't see the screenshots you attached. The problem seems to be in your Dockerfile: it lacks the USER instruction that the original has: https://github.com/maksim-paskal/aks-node-termination-handler/blob/7ced51db99ca3f3c9362be3f22aecbd65817d095/Dockerfile#L11

You can also customize the Helm installation to run as a different non-root user, as shown below:

helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--set priorityClassName=system-node-critical \
--set securityContext.runAsUser=1000

silviuchiric commented 1 month ago

Before running helm install, should I build the image from the Dockerfile and push it to our Nexus repository? Should I then reference the newly built image in values.yaml, on the first line under the image: key?

We cannot deploy from GitHub; all images must go through our internal repository.

silviuchiric commented 1 month ago

I fixed it by building the Docker image, pushing it to our internal Nexus repository, and running helm update.

All pods and the DaemonSet now show as Running. Thanks a lot.

maksim-paskal commented 1 month ago

If you and your team are not familiar with Docker, Helm, and Kubernetes, I recommend periodically making a copy of the latest image to your private repository using Docker:

docker pull paskalmaksim/aks-node-termination-handler:latest
docker tag paskalmaksim/aks-node-termination-handler:latest somehost.com/some/repo:latest
docker push somehost.com/some/repo:latest

Then install it into your Kubernetes cluster with Helm:

helm repo add aks-node-termination-handler https://maksim-paskal.github.io/aks-node-termination-handler/
helm repo update

helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--set priorityClassName=system-node-critical \
--set image=somehost.com/some/repo:latest

Nexus Repository can also copy the paskalmaksim/aks-node-termination-handler:latest image to your internal repository automatically using its proxy feature:

https://help.sonatype.com/en/proxy-repository-for-docker.html

silviuchiric commented 1 month ago

One last question, Maksim, please. We want to get events for this particular endpoint only: "2017-11-01, General Availability: added support for Spot VM eviction EventType 'Preempt'". This is published and documented by Microsoft; I copied and pasted the line we are interested in.

Where do we change this, and how do we redeploy or update the Helm chart for this endpoint?

Kind regards, Silviu Chiric

silviuchiric commented 1 month ago

The polling period now shows up as RequestTimeout 5000000000. Where are these time variables defined? We simulated an eviction for one node but did not see an eviction message in the logs.

kubectl logs pod/aks-node-termination -n kube-system

{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/web/web.go:42","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/web.Start","level":"info","msg":"web.address=:17923","time":"2024-05-22T11:50:19Z"}

{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events/events.go:70","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events.(*Reader).ReadEvents","level":"info","msg":"Start reading events {\"Method\":\"GET\",\"Endpoint\":\"http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01\",\"RequestTimeout\":5000000000,\"Period\":5000000000,\"NodeName\":\"aks-spotcompute-18556904-vmss00001h\",\"AzureResource\":\"aks-spotcompute-18556904-vmss_53\"}","time":"2024-05-22T11:50:19Z"}

maksim-paskal commented 1 month ago

  1. Listening only for Preempt events is not currently available; this tool listens to all events from Azure. Please create a new issue describing why you need to listen to only one event type, and I will add this functionality.
  2. RequestTimeout is the maximum time (5 seconds) to wait for a response from the metadata endpoint. The tool reads the metadata endpoint once every Period (5 seconds).

silviuchiric commented 1 month ago

Thank you, Maksim, for the reply above.

How can we then test that the handler is working and receiving events? We found the Spot eviction simulation in the Microsoft docs and ran it (see below), but nothing appeared in the pod logs. The logged events are actually static: the same ones have been recorded since yesterday, with no updates.

Is something wrong somewhere? I have asked the Microsoft architect who recommended this service and am waiting for a reply.

How do we get these messages, including Spot node evictions?

Testing: [root@xdcf5d39771rlv4 aks-node-termination-handler]# POST https://management.azure.com/subscriptions/subscriptions/cf5d8b7e-bb50-409f-b0bc-de08f76ef1a6/resourceGroups/MC_risklab-aks-new_kdcf5d39771edev8_northeurope/providers/Microsoft.Compute/virtualMachineScaleSets/aks-spotcompute-18556904-vmss/43/simulateEviction?api-version=2021-11-01

Please enter content (application/x-www-form-urlencoded) to be POSTed:

This is a test to test of events are captured in the logs of pods handler

Checking logs:

[root@xdcf5d39771rlv4 ~]# kubectl get pods -n kube-system -owide|grep -i aks-node-termination-handler

aks-node-termination-handler-2zcqz 1/1 Running 0 19h 10.244.0.65 aks-spotcompute-18556904-vmss000017

aks-node-termination-handler-489st 1/1 Running 0 19h 10.244.14.33 aks-platform3-39502874-vmss000006

aks-node-termination-handler-5jmqf 1/1 Running 0 19h 10.244.11.44 aks-compute1-29396086-vmss00003r

aks-node-termination-handler-cdzb6 1/1 Running 0 19h 10.244.7.42 aks-platform3-39502874-vmss000007

aks-node-termination-handler-dqdj6 1/1 Running 0 19h 10.244.3.216 aks-platform1-25078549-vmss00000b

aks-node-termination-handler-fhn8m 1/1 Running 0 19h 10.244.4.184 aks-platform2-41202490-vmss000006

aks-node-termination-handler-p56hz 1/1 Running 0 19h 10.244.10.135 aks-platform1-25078549-vmss00000i

aks-node-termination-handler-xcxd8 1/1 Running 0 19h 10.244.1.159 aks-compute3-30344420-vmss00005v

[root@xdcf5d39771rlv4 ~]# kubectl logs aks-node-termination-handler-2zcqz -n kube-system

{"file":"github.com/maksim-paskal/aks-node-termination-handler/cmd/main.go:55","func":"main.main","level":"info","msg":"Starting 1.0.15-74dce44-1714558462...","time":"2024-05-22T11:50:19Z"}

{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/alert/alert.go:29","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/alert.Init","level":"warning","msg":"not sending Telegram message, no token","time":"2024-05-22T11:50:19Z"}

{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/client/client.go:45","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/client.Init","level":"info","msg":"No kubeconfig file use incluster","time":"2024-05-22T11:50:19Z"}

{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/web/web.go:42","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/web.Start","level":"info","msg":"web.address=:17923","time":"2024-05-22T11:50:19Z"}

{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events/events.go:70","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events.(*Reader).ReadEvents","level":"info","msg":"Start reading events {\"Method\":\"GET\",\"Endpoint\":\"http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01\",\"RequestTimeout\":5000000000,\"Period\":5000000000,\"NodeName\":\"aks-spotcompute-18556904-vmss000017\",\"AzureResource\":\"aks-spotcompute-18556904-vmss_43\"}","time":"2024-05-22T11:50:19Z"}

[root@xdcf5d39771rlv4 ~]# kubectl logs aks-node-termination-handler-489st -n kube-system

{"file":"github.com/maksim-paskal/aks-node-termination-handler/cmd/main.go:55","func":"main.main","level":"info","msg":"Starting 1.0.15-74dce44-1714558462...","time":"2024-05-22T11:50:19Z"}

{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/alert/alert.go:29","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/alert.Init","level":"warning","msg":"not sending Telegram message, no token","time":"2024-05-22T11:50:19Z"}

{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/client/client.go:45","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/client.Init","level":"info","msg":"No kubeconfig file use incluster","time":"2024-05-22T11:50:19Z"}

{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/web/web.go:42","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/web.Start","level":"info","msg":"web.address=:17923","time":"2024-05-22T11:50:19Z"}

{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events/events.go:70","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events.(*Reader).ReadEvents","level":"info","msg":"Start reading events {\"Method\":\"GET\",\"Endpoint\":\"http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01\",\"RequestTimeout\":5000000000,\"Period\":5000000000,\"NodeName\":\"aks-platform3-39502874-vmss000006\",\"AzureResource\":\"aks-platform3-39502874-vmss_6\"}","time":"2024-05-22T11:50:19Z"}

[root@xdcf5d39771rlv4 ~]# kubectl logs aks-node-termination-handler-xcxd8 -n kube-system

{"file":"github.com/maksim-paskal/aks-node-termination-handler/cmd/main.go:55","func":"main.main","level":"info","msg":"Starting 1.0.15-74dce44-1714558462...","time":"2024-05-22T11:50:20Z"}

{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/alert/alert.go:29","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/alert.Init","level":"warning","msg":"not sending Telegram message, no token","time":"2024-05-22T11:50:20Z"}

{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/client/client.go:45","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/client.Init","level":"info","msg":"No kubeconfig file use incluster","time":"2024-05-22T11:50:20Z"}

{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/web/web.go:42","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/web.Start","level":"info","msg":"web.address=:17923","time":"2024-05-22T11:50:20Z"}

{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events/events.go:70","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events.(*Reader).ReadEvents","level":"info","msg":"Start reading events {\"Method\":\"GET\",\"Endpoint\":\"http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01\",\"RequestTimeout\":5000000000,\"Period\":5000000000,\"NodeName\":\"aks-compute3-30344420-vmss00005v\",\"AzureResource\":\"aks-compute3-30344420-vmss_211\"}","time":"2024-05-22T11:50:20Z"}

maksim-paskal commented 1 month ago

Try simulating the node eviction with the Azure CLI (az vmss simulate-eviction).

silviuchiric commented 1 month ago

Hello Maksim,

It is working indeed.

How can we change this notification period from 5 seconds to 1 second?

Kind regards, Silviu Chiric

maksim-paskal commented 1 month ago

In our production clusters, the Azure endpoint sometimes cannot answer this request quickly (within 1s), so a period of 5s is recommended. If you still want a shorter period, try installing the tool with:

helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--set priorityClassName=system-node-critical \
--set 'args[0]=-period=1s'