maksim-paskal / aks-node-termination-handler

Gracefully handle Azure Virtual Machines shutdown within Kubernetes
Apache License 2.0

Windows nodes Support #67

Closed kubebn closed 5 months ago

kubebn commented 5 months ago

Hello,

We've been lucky so far on AWS, where aws-handler supports Windows nodes.

We also have some Windows node pools running in AKS, so I am wondering whether there are any plans for Windows support. Thanks.

maksim-paskal commented 5 months ago

@kubebn The changes have been released. Please try running the pods on Windows nodes (you need to reinstall the stable chart).

kubebn commented 5 months ago

Hi @maksim-paskal, today I was planning to test Windows spot instances, but I am confused by the configuration.

I have these values:

  image: paskalmaksim/aks-node-termination-handler:v1.0.12
  # imagePullPolicy: Always
  priorityClassName: system-node-critical
  securityContext:
    runAsNonRoot: true
    privileged: false
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - ALL
    windowsOptions:
      runAsUserName: "ContainerUser"

  tolerations:
  - key: "kubernetes.azure.com/scalesetpriority"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"
  - effect: NoSchedule
    key: windows
    operator: Equal
    value: "true"

I am getting:

aks-node-termination-handler-4mbxh   1/1     Running            0               21h     10.61.118.114   aks-lmd8spot1e4d-83777213-vmss00003j   <none>           <none>
aks-node-termination-handler-4ncjp   0/1     ErrImagePull       0               113s    10.61.66.36     aksw8s3e400001n                        <none>           <none>
aks-node-termination-handler-4nl89   1/1     Running            0               21h     10.61.116.237   aks-lmd8spot3e4d-27803674-vmss000035   <none>           <none>
aks-node-termination-handler-57rm6   0/1     ErrImagePull       0               113s    10.61.115.207   aksw8s2e400001o                        <none>           <none>
aks-node-termination-handler-58bg4   0/1     ErrImagePull       0               114s    10.61.102.177   aksw8s3e4000021                        <none>           <none>
---
k describe pod aks-node-termination-handler-57rm6
...
Containers:
  aks-node-termination-handler:
    Container ID:
    Image:         paskalmaksim/aks-node-termination-handler:v1.0.12
    Image ID:
    Port:          17923/TCP
    Host Port:     0/TCP
...
  Normal   Pulling          49s (x4 over 2m19s)  kubelet            Pulling image "paskalmaksim/aks-node-termination-handler:v1.0.12"
  Warning  Failed           49s (x4 over 2m19s)  kubelet            Error: ErrImagePull
  Normal   BackOff          23s (x7 over 2m19s)  kubelet            Back-off pulling image "paskalmaksim/aks-node-termination-handler:v1.0.12"

Is there anything else that needs to be added so it can read the Windows image manifest correctly?

maksim-paskal commented 5 months ago

@kubebn We don't have any Windows servers in production. For my test I created a simple cluster with Windows and Linux nodes; see the README.

Your logs don't show why the image was not pulled. Maybe your Windows nodes have some specific network settings, or it is an instance-specific error.

Please try creating a new AKS cluster (see the README) and installing aks-node-termination-handler in it with the default Helm chart settings:

helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--set priorityClassName=system-node-critical

and then install the chart with your own values.yaml.

kubebn commented 5 months ago

I tried installing it directly with the Windows image: paskalmaksim/aks-node-termination-handler:v1.0.12-windows-amd64

Got this error message:

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: hcs::CreateComputeSystem 182f8aea9ccc95e5b750a40a9e3a63bf0188ed4f2bfaff499a3052b25bfe4265: The container operating system does not match the host operating system.: unknown
      Exit Code:    128

Is it actually compatible with Windows 2019?

  OS Image:                   Windows Server 2019 Datacenter
  Operating System:           windows
  Architecture:               amd64

maksim-paskal commented 5 months ago

@kubebn This is a Kubernetes error specific to Windows (more info here). It means that a Docker image built for Windows 2022 can't start on Windows 2019, and vice versa. It can only be fixed by providing separate Docker images for each Windows version.

I see that AKS clusters use Windows 2022 by default:

Windows Server 2022 is the default operating system for Kubernetes versions 1.25.0 and higher. Windows Server 2019 is the default OS for earlier versions.

I built test images for you. You can change the image for your pods to check whether this resolves your issue:

Windows 2022: paskalmaksim/aks-node-termination-handler:test-7772698645-windows-ltsc2022-amd64
Windows 2019: paskalmaksim/aks-node-termination-handler:test-7772698645-windows-ltsc2019-amd64
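
To decide which of the two test images a given node needs, one option is to look at the node's Windows build number. The following is only a sketch: it assumes the nodes carry the well-known `node.kubernetes.io/windows-build` label (not mentioned elsewhere in this thread), and the helper names `ltsc_for_build` and `test_image_for_build` are hypothetical. Build 10.0.17763 corresponds to Windows Server 2019 and 10.0.20348 to Windows Server 2022.

```shell
# Map a Windows build number to the matching LTSC image suffix.
# 10.0.17763 = Windows Server 2019, 10.0.20348 = Windows Server 2022.
ltsc_for_build() {
  case "$1" in
    10.0.17763*) echo "ltsc2019" ;;
    10.0.20348*) echo "ltsc2022" ;;
    *)           echo "unknown" ;;
  esac
}

# Print the test image reference for a given Windows build.
# Returns non-zero (and prints nothing) for builds without a test image.
test_image_for_build() {
  suffix=$(ltsc_for_build "$1")
  if [ "$suffix" = "unknown" ]; then
    return 1
  fi
  echo "paskalmaksim/aks-node-termination-handler:test-7772698645-windows-${suffix}-amd64"
}

# Listing the build per Windows node needs cluster access, so it is left
# commented out here:
# kubectl get nodes -l kubernetes.io/os=windows -L node.kubernetes.io/windows-build
```

For example, `test_image_for_build 10.0.17763` prints the ltsc2019 test image listed above.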

Can you migrate your workloads from Windows 2019 to Windows 2022? Which operating systems does your cluster have (Linux + Windows 2019, or Linux + Windows 2019 + Windows 2022)?

kubebn commented 5 months ago

Hi, yes we are aware that 2019 will be deprecated soon but unfortunately can’t migrate all of them now.

I will try those images on Monday. I guess I will just create two DaemonSets for the different versions.

We have Linux and both Windows versions.

maksim-paskal commented 5 months ago

There is a more elegant way to run pods in your landscape (Linux + Windows 2019 + Windows 2022): you need two installations of aks-node-termination-handler.

values.yaml of the first installation (excludes Windows 2019 nodes):

priorityClassName: system-node-critical

image: paskalmaksim/aks-node-termination-handler:latest

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.azure.com/os-sku
          operator: NotIn
          values:
          - Windows2019

values.yaml of the second installation (only Windows 2019 nodes):

priorityClassName: system-node-critical

image: paskalmaksim/aks-node-termination-handler:latest-ltsc2019

nodeSelector:
  kubernetes.azure.com/os-sku: Windows2019

This is my proof of concept for the new release; I will try to implement it this week.
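
The two-installation layout above could be applied with two Helm releases of the same chart, roughly as sketched here. The release names and values file names are hypothetical, and the sketch assumes the chart derives its resource names from the release name so the two releases do not collide in `kube-system`. The `HELM` variable is an addition for dry runs: setting it to `echo` prints the command instead of running it.

```shell
# Install or upgrade one release of the chart with a given values file.
# $1 = release name (hypothetical), $2 = path to a values file (hypothetical).
install_release() {
  "${HELM:-helm}" upgrade "$1" \
    --install \
    --namespace kube-system \
    aks-node-termination-handler/aks-node-termination-handler \
    --values "$2"
}

# Usage (needs cluster access, so it is left commented out):
# install_release aks-node-termination-handler values-default.yaml
# install_release aks-node-termination-handler-win2019 values-win2019.yaml
```

Running `HELM=echo install_release <release> <values-file>` first is a cheap way to check the exact command before touching the cluster.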

maksim-paskal commented 4 months ago

@kubebn Windows 2019 is now supported; see the README.