kubernetes-sigs / sig-windows-tools

Repository for tools and artifacts related to the sig-windows charter in Kubernetes. Scripts to assist kubeadm and wincat and flannel will be hosted here.
Apache License 2.0

Windows node with ContainerD can't run flannel and kubeproxy daemonsets #128

Closed vitaliy-leschenko closed 2 years ago

vitaliy-leschenko commented 3 years ago

Describe the bug

Flannel and kube-proxy pods do not start.

To Reproduce

  1. Set up an on-premise cluster (v1.19.0) with flannel (vxlan)
    • flannel image: vleschenko/flannel:0.13.0 (it supports WS2004)
    • kube-proxy image: vleschenko/kube-proxy:v1.19.0 (it supports WS2004)
  2. Set up Windows Server Standard 2004
  3. Install containerd
  4. Prepare the node
  5. Run kubeadm join

The node successfully joined the cluster:

PS C:\Users\v.leschenko> kubectl get nodes -owide
NAME           STATUS   ROLES    AGE    VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                  KERNEL-VERSION       CONTAINER-RUNTIME
k8s            Ready    master   241d   v1.19.0   192.168.2.20   <none>        Ubuntu 18.04.5 LTS        4.15.0-123-generic   docker://19.3.12
k8s-us1804-a   Ready    <none>   241d   v1.19.0   192.168.2.25   <none>        Ubuntu 18.04.5 LTS        4.15.0-123-generic   docker://19.3.12
k8s-ws2004-a   Ready    <none>   47m    v1.19.0   192.168.2.30   <none>        Windows Server Standard   10.0.19041.630       containerd://1.4.1

Pods are stuck in ContainerCreating:

kube-flannel-ds-windows-amd64-zclk9   0/1     ContainerCreating   0          18m     192.168.2.30   k8s-ws2004-a   <none>           <none>
kube-proxy-windows-gqjnh              0/1     ContainerCreating   0          18m     <none>         k8s-ws2004-a   <none>           <none>

Error message:

  Warning  FailedCreatePodSandBox  4m22s (x63 over 17m)  kubelet, k8s-ws2004-a  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "fe980b773114549b7144eaa3bbd44f277394ad7be669697a442141372da3f29c": error creating endpoint hcnCreateEndpoint failed in Win32: IP address is either invalid or not part of any configured subnet(s). (0x803b001e) {"Success":false,"Error":"IP address is either invalid or not part of any configured subnet(s). ","ErrorCode":2151350302} : endpoint config &{ fe980b773114549b7144eaa3bbd44f277394ad7be669697a442141372da3f29c_nat 8f77950d-c85e-4b74-b9dc-61ce5e671d3a  [] [{ 0}] { [.] [127.0.0.1] []} [{192.168.2.30 0.0.0.0/0 0}]  0 {2 0}}
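The event above reports the same failure twice: once as a hex HRESULT (0x803b001e) and once as a decimal ErrorCode (2151350302). A minimal Python sketch, not part of the original report, that confirms the two values match and shows the kind of subnet-membership check the message refers to. The helper names `to_hresult` and `in_any_subnet` are hypothetical, and the subnets are assumptions for illustration only (10.244.0.0/16 is flannel's usual default pod network and may not match this cluster).

```python
import ipaddress
from typing import Iterable


def to_hresult(error_code: int) -> str:
    """Format a decimal Windows error code as an 8-digit hex HRESULT string."""
    return f"0x{error_code & 0xFFFFFFFF:08x}"


def in_any_subnet(address: str, subnets: Iterable[str]) -> bool:
    """Return True if the address falls inside at least one of the given CIDR subnets."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(subnet) for subnet in subnets)


# The decimal ErrorCode from the JSON payload in the event.
print(to_hresult(2151350302))  # 0x803b001e

# Assumed subnets, for illustration only: the node's LAN and a flannel-style pod network.
print(in_any_subnet("192.168.2.30", ["192.168.2.0/24"]))  # True
print(in_any_subnet("192.168.2.30", ["10.244.0.0/16"]))   # False
```

A check like this only helps compare the address in the endpoint config against the subnets you believe HNS has configured; it does not query HNS itself.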

Expected behavior

Flannel and kube-proxy start.

Kubernetes:

lippertmarkus commented 2 years ago

@jonaskello you get a "nat network already exists" error, which is different from what OP got. Try removing the existing one:

Get-HNSNetwork | Remove-HNSNetwork

jonaskello commented 2 years ago

@lippertmarkus Yes, you are correct: after we switched to hostprocess, we get a different error from OP. Actually, we got one step further by deleting the CNI config file created by Install-Containerd.ps1, like you do here:

https://github.com/lippertmarkus/vagrant-k8s-win-hostprocess/blob/main/setup-scripts/winworker.ps1#L21

Now flannel and kube-proxy start, but the workload container fails with an invalid architecture error on Server 2022. I will try it on 2019 as well.

K31D commented 2 years ago

Hi @jonaskello, were you able to make it work? I am facing the same issue with ContainerD and Flannel on a Windows node.

jonaskello commented 2 years ago

@K31D Yes we have it working now.

It is not possible to run 2019 images on Server 2022 worker nodes, but I found out this is a limitation of Windows itself.

K31D commented 2 years ago

@jonaskello may I ask how you fixed it, or what the problem was? I am trying with WS2019 + ContainerD 1.5.9 and Flannel (vxlan).

jonaskello commented 2 years ago

@K31D If I recall correctly, deleting the CNI config as per my comment above solved it for us.

jsturtevant commented 2 years ago

@K31D if you are using hostprocess containers, you need containerd 1.6+ and k8s 1.22+ with the feature flag enabled: https://kubernetes.io/docs/tasks/configure-pod-container/create-hostprocess-pod/#before-you-begin
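The version floor above can be checked mechanically. A minimal Python sketch, assuming plain version strings such as "1.5.9" or "v1.19.0"; the names `parse_version` and `meets_hostprocess_minimums` are hypothetical, and the sketch does not check whether the feature flag is enabled.

```python
from typing import Tuple

# Minimums quoted in the comment above: containerd 1.6+ and Kubernetes 1.22+.
MIN_CONTAINERD = (1, 6)
MIN_KUBERNETES = (1, 22)


def parse_version(text: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings like 'v1.19.0' or '1.5.9'."""
    core = text.strip().lstrip("v").split("-")[0]  # drop a leading 'v' and any '-suffix'
    major, minor = core.split(".")[:2]
    return int(major), int(minor)


def meets_hostprocess_minimums(containerd: str, kubernetes: str) -> bool:
    """Return True only if both versions reach the quoted minimums."""
    return (
        parse_version(containerd) >= MIN_CONTAINERD
        and parse_version(kubernetes) >= MIN_KUBERNETES
    )


# Versions mentioned in this thread: the original node and the ContainerD 1.5.9 setup.
print(meets_hostprocess_minimums("1.4.1", "v1.19.0"))  # False
print(meets_hostprocess_minimums("1.5.9", "v1.22.0"))  # False
print(meets_hostprocess_minimums("1.6.0", "v1.22.0"))  # True
```

Comparing (major, minor) tuples avoids the string-comparison trap where "1.10" sorts before "1.6".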

K31D commented 2 years ago

Thank you @jsturtevant. No, they are "regular" containers. Unfortunately, pods keep getting the NAT network instead of the pod CIDR. We want to try either DockerD or WS2022; we are forced to use Vxlan cause we are on Azure IaaS VM.

jsturtevant commented 2 years ago

Yeah, the scripts for containerd had a bug that wasn't fixable with this solution. The only way to get flannel working with containerd is to run flanneld as a Windows service on the host and configure only the flannel network, not the NAT config that the scripts set up.

we are forced to use Vxlan cause we are on Azure IaaS VM.

There is no reason to only use vxlan here. In AKS and CAPZ, the VMs created as nodes are IaaS VMs. We have gotten all of the various CNIs working, including ones that use l2bridge configurations.

K31D commented 2 years ago

Thanks again @jsturtevant. I've tried to find a doc on how to install it as a service. The only guide I found is this one, but it's 3 years old. Also, the official guide on how to add Windows nodes was deleted from the Kubernetes docs.

jsturtevant commented 2 years ago

We know this type of installation guide is missing. It has been challenging to create one due to differences in setups/networks and lack of parity for things like privileged containers. We have discussed within sig-windows potentially filling this gap with some tutorials, but we have limited folks to dedicate to it.

The basics in those scripts you linked are essentially still the same as they were 3 years ago, unless you are trying to use CNIs and kube-proxy as hostprocess containers. Calico also has some good docs and scripts that mostly just work: https://projectcalico.docs.tigera.io/getting-started/windows-calico/quickstart

jonaskello commented 2 years ago

For anyone interested, I made a gist with the scripts that we have working to install Windows worker nodes on our on-prem cluster, which was set up using kubeadm. We use this on freshly installed WS2019 Core machines. It uses ContainerD and flannel with hostprocess containers. I think I got the original scripts from a gist @lippertmarkus had, and then we tweaked them a bit.

K31D commented 2 years ago

Thank you again, both! You have been very helpful. I will try @jonaskello's script.

jsturtevant commented 2 years ago

@jonaskello @lippertmarkus We need to improve our getting-started experience for Windows: https://github.com/kubernetes-sigs/sig-windows-tools/issues/217

Would either of you be interested in helping improve our getting started story with the work you've already done?

lippertmarkus commented 2 years ago

@jsturtevant sure. Next to the sources you listed in https://github.com/kubernetes-sigs/sig-windows-tools/issues/217#issuecomment-1192804934 there's also https://github.com/microsoft/Windows-Containers/tree/Main/helpful_tools by @brasmith-ms

I think it would be better to decide on one installation method we want to provide as "official" and then add all features and configuration options to that. I created the containerd-installer because we wanted an easy way to install containerd, and PowerShell scripts can't be used for a WinGet target. So if we want to move forward with that installer, we could transfer the repo to sig-windows, and I'm happy to create a WinGet package for it. The same could be done for kubelet and kubeadm.

Installation could look like:

winget install containerd --override "--include-cni"
winget install kubelet  # installs kubelet as a service
winget install kubeadm

Related: https://github.com/microsoft/winget-cli/discussions/2361#discussioncomment-3211512

For other CNIs and kube-proxy we should work on making the installation via hostprocess containers the official path as that's similar to how it works on Linux and we wouldn't need windows specific docs.

ghbeta commented 1 year ago

@K31D did you manage to make the cluster work? With the script from @jonaskello I get an error from the kube-proxy-windows pod due to missing CNI configuration; crictl info also reports missing configuration. Maybe I am missing something here.

K31D commented 1 year ago

@ghbeta yes. Unfortunately, it was not me but a colleague. They said they just followed @jonaskello's instructions. Our final environment was 3 control-plane nodes on Ubuntu 18.04 and 2 Windows Server 2019 nodes (patched to the most recent configuration), with Flannel as the CNI.

Krishnankk commented 1 year ago

@K31D @ghbeta Hi, I followed @jonaskello's script and my Windows worker node successfully connected to the Linux master (CentOS 7). Pods are running successfully in the cluster, but I am not able to access them via a NodePort service; the connection is refused.

Any idea?

DanSibbernsen commented 1 year ago

For anyone having a similar issue while running containerd and using host-gw flannel, I found that restarting the box will cause the containerd network to drop out of scope. This can be remedied with these 2 steps:

  1. Create the missing network, as NAT is somehow deleted during reboot: New-HnsNetwork -Type NAT -Name nat
  2. Remove an endpoint that I think was still tied to the old NAT: Get-HnsEndpoint | ? { $_.Name -eq 'cbr0_ep' } | Remove-HnsEndpoint

I got step 2 from #94 after getting the error message: rpc error: code = Internal desc = could not create IP forward entry: The object already exists.