@marosset could you help with it?
After the VM reboot, I can see the following in the kubelet logs:
E1127 10:50:46.560757 3224 remote_runtime.go:113] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to setup network for sandbox "976a7614895c88721d7e5e45d598643dc90f4bd3fc9ad8fb6184374f2b4dfde1": hcnCreateNetwork failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
E1127 10:50:46.560757 3224 kuberuntime_sandbox.go:69] CreatePodSandbox for pod "kube-flannel-ds-windows-amd64-zclk9_kube-system(d168e243-716c-47b4-8ae5-a5d01399ac1c)" failed: rpc error: code = Unknown desc = failed to setup network for sandbox "976a7614895c88721d7e5e45d598643dc90f4bd3fc9ad8fb6184374f2b4dfde1": hcnCreateNetwork failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
E1127 10:50:46.560757 3224 kuberuntime_manager.go:730] createPodSandbox for pod "kube-flannel-ds-windows-amd64-zclk9_kube-system(d168e243-716c-47b4-8ae5-a5d01399ac1c)" failed: rpc error: code = Unknown desc = failed to setup network for sandbox "976a7614895c88721d7e5e45d598643dc90f4bd3fc9ad8fb6184374f2b4dfde1": hcnCreateNetwork failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
E1127 10:50:46.560757 3224 pod_workers.go:191] Error syncing pod d168e243-716c-47b4-8ae5-a5d01399ac1c ("kube-flannel-ds-windows-amd64-zclk9_kube-system(d168e243-716c-47b4-8ae5-a5d01399ac1c)"), skipping: failed to "CreatePodSandbox" for "kube-flannel-ds-windows-amd64-zclk9_kube-system(d168e243-716c-47b4-8ae5-a5d01399ac1c)" with CreatePodSandboxError: "CreatePodSandbox for pod \"kube-flannel-ds-windows-amd64-zclk9_kube-system(d168e243-716c-47b4-8ae5-a5d01399ac1c)\" failed: rpc error: code = Unknown desc = failed to setup network for sandbox \"976a7614895c88721d7e5e45d598643dc90f4bd3fc9ad8fb6184374f2b4dfde1\": hcnCreateNetwork failed in Win32: The object already exists. (0x1392) {\"Success\":false,\"Error\":\"The object already exists. \",\"ErrorCode\":2147947410}"
ipconfig:
Windows IP Configuration
Ethernet adapter Ethernet:
Connection-specific DNS Suffix . : vitaliy.org
IPv6 Address. . . . . . . . . . . : 2a03:e2c0:1801:ff00:2551:7e41:8ae0:4fc6
IPv6 Address. . . . . . . . . . . : 2a03:e2c0:1801:ff00:8c2c:2f9:601a:d106
Link-local IPv6 Address . . . . . : fe80::8c2c:2f9:601a:d106%7
IPv4 Address. . . . . . . . . . . : 192.168.2.30
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Default Gateway . . . . . . . . . : fe80::215:5dff:fe02:903%7
192.168.2.254
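A quick way to see which HNS networks survived the reboot (a diagnostic sketch only; it assumes the hns.psm1 module that is referenced later in this thread):
Import-Module "c:\k\hns.psm1"
Get-HnsNetwork | Select-Object Name, Type, Id   # look for a leftover nat or flannel network
# If a stale network is blocking sandbox creation, it can be removed so the CNI can recreate it:
# Get-HnsNetwork | Where-Object Name -Eq "nat" | Remove-HnsNetwork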
I'll take a look
@marosset I found the reason why the pod doesn't start: after a node restart, the nat network has been removed.
So I modified StartKubelet.ps1 to recreate it, like for Docker. Please see the script in PR #132.
Now the pod starts, but it uses the wrong network: nat instead of flannel.
k8s-ws1809-[a-c] are docker nodes; k8s-ws1809-d is a containerd node.
You can see that the containerd pods use the nat network (172.27.0.0/16) instead of the flannel network (10.244.0.0/16).
192.168.2.0/24 is my local network.
@vitaliy-leschenko I wonder if there is a race between when kube-proxy and the flannel-ds pod are able to start the flannel service. On docker machines I observed that flannel would remove NoSchedule taints on machines to allow other pods like kube-proxy to be scheduled. I don't remember seeing this behavior on nodes running containerd.
Most of my experience with containerd is with setting up CNI during node configuration time, not with having CNI plugins deployed via pods.
@daschott do you have any insight here?
The flannel pod has the same behaviour on docker and containerd nodes.
If the kube-proxy pod is rescheduled after flannel is started, does it join the correct (flannel) network? I'm setting up a new cluster to verify this myself.
I think I remember docker nodes getting joined to the cluster in a NotReady state, then the flannel service would update the node's status to Ready, but containerd nodes would get joined in the Ready state.
Yep. Even if kube-proxy (or any other pod) is started after flannel, it still comes up with a NAT IP address.
I'm able to repro this. Still looking at the issue though...
Sorry for the slow responses here. I've been digging into this and have a lot more context but do not yet have a solution.
When hostNetwork is true, kubelet/dockershim adds the containers to an existing network named host
and skips the CNI-specific config (created by https://github.com/kubernetes-sigs/sig-windows-tools/blob/9aa36e43ef71947b263464b3e657e50340769315/kubeadm/scripts/PrepareNode.ps1#L77)
It looks like containerd for Windows does not have the same behavior https://github.com/containerd/containerd/blob/88f089354009d3df6d5556d59ce4ce2ac0717106/pkg/cri/server/sandbox_run.go#L111-L117
I also ran some testing locally where I configured flannel on the node before joining it to a cluster (installed/ran flanneld and added a flannel CNI config in /etc/cni/net.d), did not schedule a Windows flannel DaemonSet to the node, and everything worked as expected.
I think some possible solutions are to
All of the upstream tests we run against kubernetes/kubernetes install CNI as part of node preparation (for both Azure and GKE), and unfortunately these tutorials missed this behavior.
Describe the bug: Flannel and kube-proxy pods do not start.
To Reproduce
- setup on-premise cluster (v1.19.0) with flannel (vxlan)
- flannel image: vleschenko/flannel:0.13.0 (it supports WS2004)
- kubeproxy image: vleschenko/kube-proxy:v1.19.0 (it supports WS2004)
- setup Windows Server Standard 2004
- install containerd
- prepare node
- kubeadm join
Node successfully joined to cluster:
PS C:\Users\v.leschenko> kubectl get nodes -owide
NAME           STATUS   ROLES    AGE    VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                  KERNEL-VERSION       CONTAINER-RUNTIME
k8s            Ready    master   241d   v1.19.0   192.168.2.20   <none>        Ubuntu 18.04.5 LTS        4.15.0-123-generic   docker://19.3.12
k8s-us1804-a   Ready    <none>   241d   v1.19.0   192.168.2.25   <none>        Ubuntu 18.04.5 LTS        4.15.0-123-generic   docker://19.3.12
k8s-ws2004-a   Ready    <none>   47m    v1.19.0   192.168.2.30   <none>        Windows Server Standard   10.0.19041.630       containerd://1.4.1
Pods stuck on ContainerCreating
kube-flannel-ds-windows-amd64-zclk9   0/1   ContainerCreating   0   18m   192.168.2.30   k8s-ws2004-a   <none>   <none>
kube-proxy-windows-gqjnh              0/1   ContainerCreating   0   18m   <none>         k8s-ws2004-a   <none>   <none>
Error message:
Warning FailedCreatePodSandBox 4m22s (x63 over 17m) kubelet, k8s-ws2004-a (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "fe980b773114549b7144eaa3bbd44f277394ad7be669697a442141372da3f29c": error creating endpoint hcnCreateEndpoint failed in Win32: IP address is either invalid or not part of any configured subnet(s). (0x803b001e) {"Success":false,"Error":"IP address is either invalid or not part of any configured subnet(s). ","ErrorCode":2151350302} : endpoint config &{ fe980b773114549b7144eaa3bbd44f277394ad7be669697a442141372da3f29c_nat 8f77950d-c85e-4b74-b9dc-61ce5e671d3a [] [{ 0}] { [.] [127.0.0.1] []} [{192.168.2.30 0.0.0.0/0 0}] 0 {2 0}}
Expected behavior: flannel and kube-proxy started.
Kubernetes:
- Windows Server version: Windows Server 2004
- Kubernetes Version: 1.19.0
- CNI: flannel (vxlan)
- Container runtime: containerd/1.4.1
Have you solved it? If so, how did you solve it?
Hi, I am running into a similar issue. Any updates?
The current way around this is to install your CNI of choice on the node after the prepare script; then you can run kube-proxy as a daemonset.
We are looking into Windows privileged container support (which comes with host network support). This is a topic for the next sig-windows meeting if you want to join and discuss further: https://docs.google.com/document/d/1Tjxzjjuy4SQsFSUVXZbvqVb64hjNAG5CQX8bK7Yda9w/edit#heading=h.3f5dhus4q8i2
Okay, thanks. I will try that route.
@vitaliy-leschenko Hi, I am having this same problem now with my Windows Server 2019: after restarting the machine, it reports the same error as yours. Could you show where you changed StartKubelet.ps1? In my case I am using just containerd without Docker, since k8s will not support Docker anymore.
@jaderoliver Hi, I have not solved the issue yet. I'm going to try writing scripts to set up the containerd node with the components running as services instead of as pods. Maybe it will work better.
I wasn't able to get it to work either with a manual installation of flannel and a daemonset kube-proxy. Although things deployed, I wasn't able to do cross-node communication between pods.
@sl844 did you install flannel as a service? We have seen success with installing containerd and flannel as services (https://github.com/kubernetes-sigs/sig-windows-tools/issues/128#issuecomment-744840966) instead of running them as pods.
I ran it with the command flanneld --kubeconfig-file=c:\k\config --iface=server_ip --ip-masq=1 --kube-subnet-mgr=1 -v=3. Originally, I had tried to run it following the run script in the flannel yaml file, but I got errors about setting the pod_name and namespace. When I manually set those, it looks for a pod with that pod_name, which isn't deployed. Are there steps for installing it as a service?
@sl844 When running directly on the node, you have to set the NODE_NAME env var instead. For example:
$env:NODE_NAME=$(hostname).toLower()
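Putting that together with the flanneld command quoted above, a rough sketch of running flanneld directly on the node (the flanneld.exe path is an assumption; the flags and IP are only the examples from this thread):
$env:NODE_NAME = $(hostname).ToLower()
# flanneld.exe path is illustrative; --iface should be your node's IP (192.168.2.30 is the node from this thread)
C:\flannel\flanneld.exe --kubeconfig-file=c:\k\config --iface=192.168.2.30 --ip-masq=1 --kube-subnet-mgr=1 -v=3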
When I set the environment variable to use node_name it works, but cross-node communication between pods doesn't work. I tried this twice, but without using nssm.
What do flannelD logs print out?
I don't recall it. I will try to set it up as a service and get the logs.
https://drive.google.com/drive/folders/14G4x-QvtB6fhZ7f0eAD0JwalKEZV29VO In the link above you can find the flannel output, a picture of kube-proxy with its IP, a picture of the pods deployed on Windows, and also the flannel conf.
Steps:
@sl844 It looks like the Windows app (signup-web) is attached to the NAT network instead of the Flannel network, i.e. it is using the network that is intended for the daemonsets, not for workloads. The NAT network will not allow such communication between hosts. Can you switch kubelet over to use the Flannel network config?
Is the switch the --network-plugin=cni in the StartKubelet.ps1 script? That is already set. I remember I used to set node-ip in the StartKubelet.ps1 script because the node gets assigned the cni network. Currently my node is assigned the 10.244.2.2 IP.
There seems to be another CNI config currently in use that points to the NAT CNI plugin. The CNI config gets pointed to in different places:
- ContainerD points to a CNI config in its config.toml
- Kubelet has a parameter cni-conf-dir. If there are multiple CNI configuration files in the directory, the kubelet uses the configuration file that comes first by name in lexicographic order. The default path to this directory is c:/etc/cni/net.d.
You likely have another NAT CNI configuration file, still present from some of the setup used for the DaemonSet workaround, which is taking higher precedence. Can you check the c:/etc/cni/net.d directory?
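A quick way to check which CNI configs are present and which plugin each points to (paths are just the defaults mentioned above):
Get-ChildItem C:\etc\cni\net.d | Sort-Object Name          # kubelet/containerd pick the first file in lexicographic order
Select-String -Path C:\etc\cni\net.d\* -Pattern '"type"'   # shows which CNI plugin (nat, flannel, ...) each file points to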
Also keep in mind you should restart kubelet after any CNI config changes. For a full re-read after a CNI config change:
<kubectl delete pods on problematic node>
Stop-Service Kubelet
Restart-Service ContainerD
Start-Service Kubelet
<wait for node to report as ready>
<reschedule pods>
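For reference, the placeholder steps could look something like this (the node name is the one from earlier in this thread, purely as an example):
kubectl delete pods --all-namespaces --field-selector spec.nodeName=k8s-ws2004-a   # delete pods on the problematic node
kubectl get nodes k8s-ws2004-a -w                                                  # wait until the node reports Ready, then reschedule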
You are right, I have two files. One is created by the Install-Containerd.ps1 script. The other one I placed there following the run script in flannel.yaml. By default both containerd and kubelet use the same directory for CNI. I moved the flannel.conf out of the default path and added --cni-conf-dir=<new directory> in the StartKubelet script. Stopped kubelet, restarted containerd, and started kubelet. I still get the node on the cni network, the proxy on the nat network, and workloads on the nat network.
I am having similar issues.
Added the following to StartKubelet.ps1:
Import-Module "c:\k\hns.psm1"
New-HnsNetwork -Type NAT -Name nat
And with flannel and kube-proxy running as daemonsets, they both join the wrong network.
An interesting thing I noticed is that if I reschedule flannel (i.e. delete the pod), then flannel gets the right IP for a second and then switches over to the wrong one (NAT?).
I would like to try to run flannel as a service - can someone help me out with some docs/step by step guide?
I have faced the same issue as in the previous message: flannel initially gets the correct IP, and then falls back to an invalid one from the NAT network. kube-proxy-windows is started, but also gets an IP from the NAT network. Client pods can be scheduled, but all of them obtain IPs from NAT.
I tried to separate the network configurations for kubelet and containerd, but without success. After removing 0-containerd-nat.json, pods can obtain IPs from the flannel network, but DaemonSets become configured incorrectly.
Can someone provide more information on how to run flannel standalone, switching away from the DaemonSet?
Hi @dfateyev,
before DaemonSets we used these scripts to join a Windows node to a cluster: https://github.com/kubernetes-sigs/sig-windows-tools/blob/master/kubeadm/KubeCluster.ps1 with the configs https://github.com/kubernetes-sigs/sig-windows-tools/blob/master/kubeadm/v1.16.0/Kubeclusterbridge.json or https://github.com/kubernetes-sigs/sig-windows-tools/blob/master/kubeadm/v1.16.0/Kubeclustervxlan.json
With that approach we had kubelet, flannel and kube-proxy as 3 Windows services. Maybe it can help you set up containerd. My attempts have failed.
Has anybody got this working in the meantime?
Have you checked the config file of containerd?
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "C:/opt/cni/bin"
conf_dir = "C:/etc/cni/net.d"
conf_template = ""
max_conf_num = 0
max_conf_num is the number of configuration files loaded at startup by containerd; the default value is 1. You must have the file 0-containerd-nat.json for containerd, and then you also need a config file for flannel. If you set max_conf_num to 0, both files will be loaded and that solves the problem.
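One note here: containerd only reads config.toml at startup, so after changing max_conf_num it needs a restart. A quick way to verify and apply (paths are the defaults shown above):
Select-String -Path "C:\Program Files\containerd\config.toml" -Pattern "max_conf_num"   # confirm the new value
Restart-Service containerd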
@fredericpougnault thanks for the suggestion. I just tried setting max_conf_num = 0, but I'm still getting
Warning FailedCreatePodSandBox 2m13s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "281e81a3d54f574140123ddce86e7a9d325658a1cefc1edab410f6ee8aedc4ba": hcnCreateNetwork failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
(like the other people above)
When I then manually create the NAT network with
Import-Module .\hns.psm1
New-HnsNetwork -Type NAT -Name nat
I get:
Warning FailedCreatePodSandBox 10s (x9 over 118s) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "06cc180b67b3feedba562c0e578003e416173f74101a9df412e759ab1c8751f6": error creating endpoint hcnCreateEndpoint failed in Win32: IP address is either invalid or not part of any configured subnet(s). (0x803b001e) {"Success":false,"Error":"IP address is either invalid or not part of any configured subnet(s). ","ErrorCode":2151350302} : endpoint config &{ 06cc180b67b3feedba562c0e578003e416173f74101a9df412e759ab1c8751f6_nat 51de8be0-dc00-424b-878d-c78afc04d88c [] [{ 0}] { [default.svc.cluster.local svc.cluster.local cluster.local] [10.96.0.10] [ndots:5]} [{10.1.0.101 0.0.0.0/0 0}] 0 {2 0}}
which sounds like it tries to use the NAT network.
Maybe @FriedrichWilken or @jayunit100 can help as you got Calico to work in https://github.com/kubernetes-sigs/sig-windows-dev-tools? While working on the vagrant setup, did you overcome something like that as well?
@knabben is the expert on Calico
I have tried all the different approaches listed here for a week and I run into the same problem: the kube-proxy daemonset and worker nodes get a nat IP. Any more suggestions?
@HimanshuZinzuwadia Can you try deleting all CNI configs except the calico one? For me this helped "force" the pods to use that instead of nat.
I am trying with flannel, and this issue is with flannel. Is switching to calico the only solution?
Installing flannel/calico/(any cni) as services on the host is our current working solution for containerd: https://github.com/kubernetes-sigs/sig-windows-tools/issues/128#issuecomment-780113110 and https://github.com/kubernetes-sigs/sig-windows-tools/issues/128#issuecomment-744840966
The long-term plan is to use hostprocess for the CNIs. We have examples of how this will work in https://github.com/kubernetes-sigs/sig-windows-tools/tree/master/hostprocess. See https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/1672 for a full e2e setup. Note that hostprocess is in alpha, so we are looking for feedback on the solutions.
Is this a correct summary of the comments above? I will try this and update.
Option 1
- Do not run the DaemonSets for flannel (flannel-overlay or flannel-host-gw).
- Clean up all folders and all virtual networks created by previous attempts.
- Run Install-ContainerD.ps1 and PrepareNode.ps1 as per the instructions for adding a Windows node.
- Then change C:\Program Files\containerd\config.toml to ensure max_conf_num = 0 so that it will load more than one config file.
- Download and ensure flannel version 0.12 in the install.ps1 provided by the flannel CNI scripts.
- Run Start.ps1 to install flannel, but exit before it actually joins the cluster or runs flannel.
- Copy the flannel cni.conf from C:\k\cni\config to c:\etc\cni\net.d, so as to add the flannel CNI config to containerd's CNI config files as per the comments above.
- Restart containerd.
- Join the node to the cluster.
- Install flannel as a service and start it (see the sketch after this list).
- Run kube-proxy for Windows as a daemonset, or you can also run it as a service on the Windows node.
- If the nat network is removed after a node restart, then create it again with:
  Import-Module .\hns.psm1
  New-HnsNetwork -Type NAT -Name nat
Option 2: use hostprocess, which is in alpha.
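For the "install flannel as a service" step in Option 1, a minimal sketch using nssm (mentioned earlier in the thread) might look like the following; the flanneld path and flags are assumptions based on the command quoted above, not a verified recipe:
$env:NODE_NAME = $(hostname).ToLower()
nssm install flanneld C:\flannel\flanneld.exe                # assumes nssm is on PATH and flanneld.exe lives here
nssm set flanneld AppDirectory C:\flannel
nssm set flanneld AppParameters "--kubeconfig-file=c:\k\config --ip-masq=1 --kube-subnet-mgr=1 -v=3"
nssm set flanneld AppEnvironmentExtra "NODE_NAME=$env:NODE_NAME"
nssm start flanneld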
Sad to report that even after following all of the above and trying various approaches, I could not get containerd and flannel to work together. I ended up with the same issue as this one: https://issueexplorer.com/issue/kubernetes-sigs/sig-windows-tools/102. Update after a ruined weekend: maybe I was close to a solution, but I can't spend any more cycles on it. Conclusion: containerd, the microsoft.sdn-provided scripts for flannel, and kubernetes do not work well together in their current state. I did not try to use the bridge network because we must use the overlay network due to network requirements.
Reverting to installing Docker Enterprise and following the PrepareNode.ps1 instructions under the Docker tab in the URLs below: https://docker-docs.netlify.app/install/windows/docker-ee/ https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/adding-windows-nodes/
I have published an example Vagrant setup with Hyper-V using Calico and kube-proxy HostProcess pods for Windows nodes: https://github.com/lippertmarkus/vagrant-k8s-win-hostprocess
It's using plain Ubuntu/Windows Server 2022 boxes and the setup scripts are super simple and are only using official resources: https://github.com/lippertmarkus/vagrant-k8s-win-hostprocess/tree/main/setup-scripts
Thought that might help some of you with your setup and the various problems posted here
@lippertmarkus that's great! Have you seen the devbox work the sig has been working on? https://github.com/kubernetes-sigs/sig-windows-dev-tools
We have a task to integrate the hostprocess pods https://github.com/kubernetes-sigs/sig-windows-dev-tools/issues/123. Is that something you would be willing to help out with?
CC: @FriedrichWilken @knabben
edit: I see you already commented on the issue! :-)
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Was someone able to make containerd + (flannel and kube-proxy daemonsets) work?
Those two ways should work flawlessly: https://github.com/kubernetes-sigs/sig-windows-dev-tools https://github.com/lippertmarkus/vagrant-k8s-win-hostprocess
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
I've tried flannel + kube-proxy daemonsets using hostprocess containers (containerd 1.6.4) but still no luck. Tried both on Windows Server 2019 and 2022. The error is still as reported above:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "8e7c5b8d5569ba1c7dcda6807a1a0d3e10d3d9c6ff4873b3c5bd3ddbb460a8d4": plugin type="nat" name="nat" failed (add): hcnCreateNetwork failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
Has anyone gotten flannel working with containerd in any configuration (non-hostprocess containers, hostprocess containers, services)?