Windows pods are not reachable on a hybrid Linux/Windows cluster

kubernetes-sigs / sig-windows-tools

Repository for tools and artifacts related to the sig-windows charter in Kubernetes. Scripts to assist kubeadm and wincat and flannel will be hosted here.

Apache License 2.0


Windows pods are not reachable on a hybrid Linux/Windows cluster #103

Open llyons opened 4 years ago

llyons commented 4 years ago

Hi,

We have a custom "bare-metal" K8s cluster with 2 Linux nodes and 1 Windows node. We are just trying to get familiar with how it works. We have the nginx ingress controller and the MetalLB load balancer running. Both controllers run only on the Linux nodes and the master; they don't run on the Windows node (I was told this wasn't needed).

We have deployed a number of Linux containers on the Linux nodes and they work. The 2 Windows containers on the Windows node start and run, but they are not reachable on the ClusterIP or the provisioned service IP. The containers are definitely running on the Windows node, since I can see them and even docker exec -it into them. Running kubectl get svc, I get these values for the clientportal app:

clientportal LoadBalancer 10.110.61.103 10.243.0.39 80:30875/TCP 21h app=clientportal

I am able to reach the app at http://windows-node-ip:30875, but not at http://10.243.0.39. I also can't curl the app from one of the Linux nodes using the ClusterIP or the service IP. I can with the actual node IP.
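To make the three access paths explicit, here is a minimal Python sketch that splits the service row above into its fields and builds the URLs being compared. The helper names are hypothetical, and the node IP 10.243.1.202 is taken from the flannel log in this thread; the sketch only formats addresses and does not probe the network.

```python
from typing import NamedTuple


class ServiceRow(NamedTuple):
    name: str
    svc_type: str
    cluster_ip: str
    external_ip: str
    port: int
    node_port: int
    protocol: str


def parse_service_row(row: str) -> ServiceRow:
    """Parse one data row of `kubectl get svc` output (first five columns)."""
    name, svc_type, cluster_ip, external_ip, ports = row.split()[:5]
    # PORT(S) looks like "80:30875/TCP": service port, node port, protocol.
    port_pair, protocol = ports.split("/")
    port, node_port = port_pair.split(":")
    return ServiceRow(name, svc_type, cluster_ip, external_ip,
                      int(port), int(node_port), protocol)


def candidate_urls(svc: ServiceRow, node_ip: str) -> dict:
    """Return the three addresses tested in the thread, keyed by path type."""
    return {
        "cluster_ip": f"http://{svc.cluster_ip}:{svc.port}",
        "load_balancer": f"http://{svc.external_ip}:{svc.port}",
        "node_port": f"http://{node_ip}:{svc.node_port}",
    }


row = ("clientportal LoadBalancer 10.110.61.103 10.243.0.39 "
       "80:30875/TCP 21h app=clientportal")
svc = parse_service_row(row)
for kind, url in candidate_urls(svc, node_ip="10.243.1.202").items():
    print(kind, url)
```

In the report above, only the node_port address responds; the cluster_ip and load_balancer addresses are the ones that fail.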

I do notice some errors like this in the /var/log/kubelet log file on the Windows node:

cni_windows.go:59] error while adding to cni network: error while GETHNSNewtorkByName(flannel.4096): Network flannel.4096 not found

file.go:104] Unable to read config path "C:\\var\\lib\\kubelet\\etc\\kubernetes\\manifests": path does not exist, ignoring

kubectl get nodes shows the Windows node is Ready, kubectl get pods shows running pods on the Windows node, and docker container ls on the Windows node shows the containers are running and scheduled.

We did upgrade kubelet and kubeadm to 1.19.2, but it looks like we have had this issue for some time.

C:\k>kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", BuildDate:"2020-09-16T13:38:53Z", GoVersion:"go1.15", Compiler:"gc", Platform:"windows/amd64"}

attached are some of the kubelet log files.

Also the logs from kubectl -n kube-system logs kube-flannel-ds-windows-amd64-vrsg9

`Mode LastWriteTime Length Name

d----- 8/20/2020 4:35 PM serviceaccount
WARNING: The names of some imported commands from the module 'hns' include unapproved verbs that might make them less discoverable. To find the commands with unapproved verbs, run the Import-Module command again with the Verbose parameter. For a list of approved verbs, type Get-Verb.
Invoke-HnsRequest : @{Error=An adapter was not found. ; ErrorCode=2151350278; Success=False}
At C:\k\flannel\hns.psm1:233 char:16

... return Invoke-HnsRequest -Method POST -Type networks -Data $Json ...

+ CategoryInfo          : NotSpecified: (:) [Write-Error], WriteErrorException
+ FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Invoke-HNSRequest

I1008 10:46:56.667299 8008 main.go:518] Determining IP address of default interface
I1008 10:46:57.507782 8008 main.go:531] Using interface with name Ethernet0 2 and address 10.243.1.202
I1008 10:46:57.507782 8008 main.go:548] Defaulting external address to interface address (10.243.1.202)
I1008 10:46:57.545795 8008 kube.go:119] Waiting 10m0s for node controller to sync
I1008 10:46:57.545795 8008 kube.go:306] Starting kube subnet manager
I1008 10:46:58.551449 8008 kube.go:126] Node controller sync successful
I1008 10:46:58.551449 8008 main.go:246] Created subnet manager: Kubernetes Subnet Manager - aabrw-kuber03
I1008 10:46:58.551449 8008 main.go:249] Installing signal handlers
I1008 10:46:58.551449 8008 main.go:390] Found network config - Backend type: vxlan
I1008 10:46:58.551449 8008 vxlan_windows.go:127] VXLAN config: Name=flannel.4096 MacPrefix=0E-2A VNI=4096 Port=4789 GBP=false DirectRouting=false
I1008 10:46:58.619205 8008 device_windows.go:116] Attempting to create HostComputeNetwork &{ flannel.4096 Overlay [] {[]} { [ ] [] []} [{Static [{192.168.2.0/24 [[123 34 84 121 112 101 34 58 34 86 83 73 68 34 44 34 83 101 116 116 105 110 103 115 34 58 123 34 73 115 111 108 97 116 105 111 110 73 100 34 58 52 48 57 54 125 125]] [{192.168.2.1 0.0.0.0/0 0}]}]}] 8 {2 0}}
E1008 10:46:59.972661 8008 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.243.1.202:50491->10.243.1.212:6443: wsarecv: An established connection was aborted by the software in your host machine.
E1008 10:46:59.973662 8008 reflector.go:304] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to watch v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=10547296&timeoutSeconds=582&watch=true: http2: no cached connection was available
E1008 10:47:01.036947 8008 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list v1.Node: Get`

kubelet.exe.logs.zip

Any feedback or guidance would be appreciated.

jsturtevant commented 4 years ago

The issue is that flannel is starting before the external network is created. The workaround is to restart the flannel pod.

The network is created and then flannel is started here:

https://github.com/kubernetes-sigs/sig-windows-tools/blob/1f4abb21ff35d68b1b2c5d49eefb2daa05bc98d8/kubeadm/flannel/flannel-overlay.yml#L34-L36

The external network doesn't finish being created before flannel is started, which causes the bad loop.

llyons commented 4 years ago

Do we restart those pods by running the commands above, or by deleting the pods and letting them restart?

jsturtevant commented 4 years ago

They run in a DaemonSet, so you can either kubectl delete pod or run kubectl rollout restart daemonset.
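As a rough sketch of those two options, the Python below builds the corresponding kubectl invocations and prints them. The DaemonSet name kube-flannel-ds-windows-amd64 is an assumption inferred from the pod name kube-flannel-ds-windows-amd64-vrsg9 earlier in the thread, and the helper functions are hypothetical; nothing is executed unless dry_run is set to False.

```python
import subprocess

NAMESPACE = "kube-system"
# Assumed DaemonSet name, inferred from the pod name
# kube-flannel-ds-windows-amd64-vrsg9 shown earlier in the thread.
DAEMONSET = "kube-flannel-ds-windows-amd64"


def restart_command(namespace: str = NAMESPACE, daemonset: str = DAEMONSET) -> list:
    """Build `kubectl rollout restart` for the flannel DaemonSet."""
    return ["kubectl", "-n", namespace, "rollout", "restart",
            f"daemonset/{daemonset}"]


def delete_pod_command(pod: str, namespace: str = NAMESPACE) -> list:
    """Build `kubectl delete pod`; the DaemonSet recreates the pod."""
    return ["kubectl", "-n", namespace, "delete", "pod", pod]


def run(cmd: list, dry_run: bool = True) -> None:
    """Print the command, and execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


run(restart_command())
run(delete_pod_command("kube-flannel-ds-windows-amd64-vrsg9"))
```

Either command ends with the DaemonSet controller starting a fresh flannel pod on the Windows node.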

llyons commented 4 years ago

So after trying this, it seems we still have the same core issue. The simple Windows web apps running on the Windows node are not reachable from the Linux machines (curl ClusterIP, curl service IP). The Linux apps on the Linux nodes are reachable. The Windows container on the Windows node can be reached with the Windows node IP and port.

The new info in the logs is this:

`Mode LastWriteTime Length Name

d----- 10/9/2020 11:38 AM flannel

Directory: C:\host\k\flannel\var\run\secrets\kubernetes.io

Mode LastWriteTime Length Name

d----- 10/9/2020 10:46 AM serviceaccount
WARNING: The names of some imported commands from the module 'hns' include unapproved verbs that might make them less discoverable. To find the commands with unapproved verbs, run the Import-Module command again with the Verbose parameter. For a list of approved verbs, type Get-Verb.
Invoke-HnsRequest : @{Error=An adapter was not found. ; ErrorCode=2151350278; Success=False}
At C:\k\flannel\hns.psm1:233 char:16

... return Invoke-HnsRequest -Method POST -Type networks -Data $Json ...

+ CategoryInfo          : NotSpecified: (:) [Write-Error], WriteErrorException
+ FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Invoke-HNSRequest

I1009 12:34:16.131480 7100 main.go:518] Determining IP address of default interface
I1009 12:34:17.551409 7100 main.go:531] Using interface with name Ethernet0 2 and address 10.243.1.202
I1009 12:34:17.551409 7100 main.go:548] Defaulting external address to interface address (10.243.1.202)
I1009 12:34:17.597426 7100 kube.go:119] Waiting 10m0s for node controller to sync
I1009 12:34:17.597426 7100 kube.go:306] Starting kube subnet manager
I1009 12:34:18.612968 7100 kube.go:126] Node controller sync successful
I1009 12:34:18.612968 7100 main.go:246] Created subnet manager: Kubernetes Subnet Manager - ssssssss
I1009 12:34:18.612968 7100 main.go:249] Installing signal handlers
I1009 12:34:18.612968 7100 main.go:390] Found network config - Backend type: vxlan
I1009 12:34:18.612968 7100 vxlan_windows.go:127] VXLAN config: Name=flannel.4096 MacPrefix=0E-2A VNI=4096 Port=4789 GBP=false DirectRouting=false
I1009 12:34:19.445335 7100 device_windows.go:116] Attempting to create HostComputeNetwork &{ flannel.4096 Overlay [] {[]} { [ ] [] []} [{Static [{192.168.2.0/24 [[123 34 84 121 112 101 34 58 34 86 83 73 68 34 44 34 83 101 116 116 105 110 103 115 34 58 123 34 73 115 111 108 97 116 105 111 110 73 100 34 58 52 48 57 54 125 125]] [{192.168.2.1 0.0.0.0/0 0}]}]}] 8 {2 0}}
E1009 12:34:21.947399 7100 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.243.1.202:50521->10.243.1.212:6443: wsarecv: An established connection was aborted by the software in your host machine.
E1009 12:34:21.947399 7100 reflector.go:304] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to watch v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=10782832&timeoutSeconds=582&watch=true: http2: no cached connection was available
E1009 12:34:23.075801 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
E1009 12:34:24.125142 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
E1009 12:34:25.203474 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
I1009 12:34:25.966698 7100 device_windows.go:124] Waiting to get ManagementIP from HostComputeNetwork flannel.4096
E1009 12:34:26.214764 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
I1009 12:34:26.522847 7100 device_windows.go:136] Waiting to get net interface for HostComputeNetwork flannel.4096 (10.243.1.202)
E1009 12:34:27.219039 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
E1009 12:34:28.246212 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
I1009 12:34:29.233905 7100 device_windows.go:145] Created HostComputeNetwork flannel.4096
E1009 12:34:29.277916 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
I1009 12:34:29.341929 7100 main.go:313] Changing default FORWARD chain policy to ACCEPT
I1009 12:34:29.343930 7100 main.go:321] Wrote subnet file to /run/flannel/subnet.env
I1009 12:34:29.343930 7100 main.go:325] Running backend.
I1009 12:34:29.344933 7100 main.go:343] Waiting for all goroutines to exit
I1009 12:34:29.344933 7100 vxlan_network_windows.go:63] Watching for new subnet leases
E1009 12:34:30.281059 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
E1009 12:34:31.287135 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
E1009 12:34:32.303437 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available`

Should I attempt to add the node again? Not sure what to do next.

llyons commented 4 years ago

One thing I noticed is that our Operating system is 1809 and we have Kubernetes 1.19.2.

Do we need kubernetes 1.18 to make this work?

llyons commented 4 years ago

So we made sure that our OS version (1809) did not include the September patch that has been shown to cause these issues, as identified in https://github.com/microsoft/Windows-Containers/issues/61

After setting this up we restarted the Windows node and made sure the pods were back to running, and we still have the same issues as before.

Our OS version is now Microsoft Windows [Version 10.0.17763.1397]

The logs still show the same issues. We can't access the simple web apps on the Windows node through the service IP or ClusterIP, but we can get to the running containers on the Windows node using NodeIP:port.

The windows node is running, all the pods are running. Linux web apps on linux nodes are accessible.

Here are the contents of the log files.

c:\var\logs\kubelet on windows node

kubelet.exe.AABRW-KUBER03.OLH_AABRW-KUBER03$.log.ERROR.20201014-120556.zip

results of kubectl -n kube-system logs kube-flannel-ds-windows-amd64-fw8k9

Mode LastWriteTime Length Name

d----- 10/13/2020 11:53 AM serviceaccount
WARNING: The names of some imported commands from the module 'hns' include unapproved verbs that might make them less discoverable. To find the commands with unapproved verbs, run the Import-Module command again with the Verbose parameter. For a list of approved verbs, type Get-Verb.
Invoke-HnsRequest : @{Error=An adapter was not found. ; ErrorCode=2151350278; Success=False}
At C:\k\flannel\hns.psm1:233 char:16

... return Invoke-HnsRequest -Method POST -Type networks -Data $Json ...

+ CategoryInfo          : NotSpecified: (:) [Write-Error], WriteErrorException
+ FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Invoke-HNSRequest

I1014 12:07:44.490869 10480 main.go:518] Determining IP address of default interface
I1014 12:07:46.139972 10480 main.go:531] Using interface with name Ethernet0 2 and address 10.243.1.202
I1014 12:07:46.139972 10480 main.go:548] Defaulting external address to interface address (10.243.1.202)
I1014 12:07:46.158979 10480 kube.go:119] Waiting 10m0s for node controller to sync
I1014 12:07:46.158979 10480 kube.go:306] Starting kube subnet manager
I1014 12:07:47.192086 10480 kube.go:126] Node controller sync successful
I1014 12:07:47.192086 10480 main.go:246] Created subnet manager: Kubernetes Subnet Manager - aabrw-kuber03
I1014 12:07:47.192086 10480 main.go:249] Installing signal handlers
I1014 12:07:47.192086 10480 main.go:390] Found network config - Backend type: vxlan
I1014 12:07:47.192086 10480 vxlan_windows.go:127] VXLAN config: Name=flannel.4096 MacPrefix=0E-2A VNI=4096 Port=4789 GBP=false DirectRouting=false
I1014 12:07:47.416559 10480 device_windows.go:116] Attempting to create HostComputeNetwork &{ flannel.4096 Overlay [] {[]} { [ ] [] []} [{Static [{192.168.2.0/24 [[123 34 84 121 112 101 34 58 34 86 83 73 68 34 44 34 83 101 116 116 105 110 103 115 34 58 123 34 73 115 111 108 97 116 105 111 110 73 100 34 58 52 48 57 54 125 125]] [{192.168.2.1 0.0.0.0/0 0}]}]}] 8 {2 0}}
E1014 12:07:48.689549 10480 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.243.1.202:50567->10.243.1.212:6443: wsarecv: An established connection was aborted by the software in your host machine.
E1014 12:07:48.690549 10480 reflector.go:304] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to watch v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=11876190&timeoutSeconds=582&watch=true: http2: no cached connection was available
E1014 12:07:49.767699 10480 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
E1014 12:07:50.781831 10480 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
E1014 12:07:51.809827 10480 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
I1014 12:07:52.464903 10480 device_windows.go:124] Waiting to get ManagementIP from HostComputeNetwork flannel.4096
E1014 12:07:52.827943 10480 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
I1014 12:07:53.036966 10480 device_windows.go:136] Waiting to get net interface for HostComputeNetwork flannel.4096 (10.243.1.202)
E1014 12:07:53.846512 10480 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
I1014 12:07:54.091535 10480 device_windows.go:145] Created HostComputeNetwork flannel.4096
I1014 12:07:54.175544 10480 main.go:313] Changing default FORWARD chain policy to ACCEPT
I1014 12:07:54.183543 10480 main.go:321] Wrote subnet file to /run/flannel/subnet.env
I1014 12:07:54.183543 10480 main.go:325] Running backend.
I1014 12:07:54.183543 10480 main.go:343] Waiting for all goroutines to exit
I1014 12:07:54.183543 10480 vxlan_network_windows.go:63] Watching for new subnet leases
E1014 12:07:54.850768 10480 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
E1014 12:07:55.853629 10480 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available

vitaliy-leschenko commented 4 years ago

I rolled back my servers (on my test cluster) to 10.0.17763.1294 and can say that all windows pods are reachable after reboot.

PS C:\Users\v.leschenko> kubectl get nodes -owide
NAME           STATUS   ROLES    AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                  KERNEL-VERSION       CONTAINER-RUNTIME
k8s            Ready    master   197d    v1.19.0   192.168.2.20   <none>        Ubuntu 18.04.5 LTS        4.15.0-118-generic   docker://19.3.12
k8s-us1804-a   Ready    <none>   197d    v1.19.0   192.168.2.25   <none>        Ubuntu 18.04.5 LTS        4.15.0-118-generic   docker://19.3.12
k8s-ws1809-a   Ready    <none>   2d21h   v1.19.0   192.168.2.21   <none>        Windows Server Standard   10.0.17763.1294      docker://19.3.12
k8s-ws1809-b   Ready    <none>   2d11h   v1.19.0   192.168.2.22   <none>        Windows Server Standard   10.0.17763.1294      docker://19.3.12
k8s-ws1809-c   Ready    <none>   3d      v1.19.0   192.168.2.23   <none>        Windows Server Standard   10.0.17763.1457      docker://19.3.12

k8s-ws1809-c will stay as it is, without a reboot, until a fix is available, so that I can check that the fix works.

llyons commented 4 years ago

Another related piece of info. With this svc (Windows app on the Windows node):

clientportal LoadBalancer 10.110.61.103 10.243.0.39 80:30875/TCP 8d

I am not able to curl 10.243.0.39 from the linux master but I CAN curl it from the linux worker. clientportal is one of the apps running on the windows node.

With this service (web app on the Linux master):

frontend LoadBalancer 10.110.169.97 10.243.0.36 80:30889/TCP 32d

there is a similar scenario: a web app (frontend) running on the Linux master is reachable from the Linux master node with curl 10.243.0.36. However, on the actual Linux worker node, I am not able to curl 10.243.0.36.

jsturtevant commented 4 years ago

I did some digging on the error messages above and I think there are a few things happening:

HNS issue

Invoke-HnsRequest : @{Error=An adapter was not found. ; ErrorCode=2151350278; Success=False} At C:\k\flannel\hns.psm1:233 char:16

This is happening because you are passing Ethernet0 2 to the HNS module. While this should work, it is being passed through wins.exe: https://github.com/kubernetes-sigs/sig-windows-tools/blob/1f4abb21ff35d68b1b2c5d49eefb2daa05bc98d8/kubeadm/flannel/flannel-overlay.yml#L34 (from our Slack conversation I know you've replaced Ethernet with Ethernet0 2 as described in the documentation).

The issue is wins.exe splits arguments on spaces:

https://github.com/rancher/wins/blob/7c2d5528151cb63355615e1ee02bd59380c1c1e2/cmd/client/process/run.go#L75 https://github.com/rancher/wins/blob/7c2d5528151cb63355615e1ee02bd59380c1c1e2/cmd/cmds/flags/list_value.go#L11-L13 https://github.com/rancher/wins/blob/7c2d5528151cb63355615e1ee02bd59380c1c1e2/cmd/cmds/flags/list_value.go#L30

This causes only Ethernet0 to be passed, and therefore the error.

Another aspect is that the "An adapter was not found." error can also occur when a network and vswitch are already attached to an adapter. One workaround for the next issue (Failed to watch) is to restart the flannel pod. On the first run, at the "Attempting to create HostComputeNetwork" step, flannel.4096 is created and a switch is attached to the adapter, which can also cause the "An adapter was not found." error.
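The effect of splitting on spaces can be illustrated with a small Python sketch. This is not the wins.exe code (which is Go, linked above); it only mimics the behavior described in this comment, and the function names are hypothetical.

```python
def split_on_spaces(args: str) -> list:
    """Naive splitting on spaces, as described for wins.exe in this thread."""
    return args.split(" ")


def parse_flags(tokens: list):
    """Collect --key=value tokens; anything else is a stray argument."""
    flags, stray = {}, []
    for tok in tokens:
        if tok.startswith("--") and "=" in tok:
            key, value = tok[2:].split("=", 1)
            flags[key] = value
        else:
            stray.append(tok)
    return flags, stray


# The adapter name "Ethernet0 2" contains a space, so it is cut in two.
flags, stray = parse_flags(split_on_spaces("--mode=overlay --interface=Ethernet0 2"))
print(flags)  # the interface value has lost its " 2" suffix
print(stray)  # the "2" ends up as a separate token
```

With an adapter name that has no space, the same input would parse cleanly, which is why the problem only shows up on hosts whose NIC is named like Ethernet0 2.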

Flannel Network creation and Failed to list

The creation of the external network by HNS isn't strictly needed since this PR in flannel went in. This is why things start to work after restarting the flannel pod: flannel creates the network attached to the correct adapter (I1014 12:07:46.139972 10480 main.go:531] Using interface with name Ethernet0 2 and address 10.243.1.202).

It seems there is a timing issue with Flannel creating the network though:

I1014 12:07:47.416559 10480 device_windows.go:116] Attempting to create HostComputeNetwork &{ flannel.4096 Overlay [] {[]} { [ ] [] []} [{Static [{192.168.2.0/24 [[123 34 84 121 112 101 34 58 34 86 83 73 68 34 44 34 83 101 116 116 105 110 103 115 34 58 123 34 73 115 111 108 97 116 105 111 110 73 100 34 58 52 48 57 54 125 125]] [{192.168.2.1 0.0.0.0/0 0}]}]}] 8 {2 0}}
E1014 12:07:48.689549 10480 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.243.1.202:50567->10.243.1.212:6443: wsarecv: An established connection was aborted by the software in your host machine.
E1014 12:07:48.690549 10480 reflector.go:304] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to watch *v1.Node: Get

Here we see it creating the network (flannel.4096) and starting to list the Kubernetes nodes via the Go client. Creating the flannel network takes some time and causes a network hiccup when the VM switch is first created (see this comment). The network blip causes the connection to the apiserver to get into a bad cached state, as described in https://github.com/coreos/flannel/issues/1272.
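As a side note on reading that log line: the long run of numbers inside the HostComputeNetwork struct is a Go byte slice printed as decimal values. A minimal Python sketch for decoding it follows. The values are copied from the log, reading the value after the second 58 as 123 (an opening brace); if your copy of the log shows it split as "12 3", that is presumably a line-wrap artifact.

```python
import json

# Decimal byte values from the "Attempting to create HostComputeNetwork" line.
# Assumption: the value after the second 58 is 123, not "12 3".
LOGGED = (
    "123 34 84 121 112 101 34 58 34 86 83 73 68 34 44 34 83 101 116 116 "
    "105 110 103 115 34 58 123 34 73 115 111 108 97 116 105 111 110 73 "
    "100 34 58 52 48 57 54 125 125"
)


def decode_logged_bytes(text: str) -> str:
    """Turn space-separated decimal byte values back into a UTF-8 string."""
    return bytes(int(n) for n in text.split()).decode("utf-8")


policy = json.loads(decode_logged_bytes(LOGGED))
print(policy)
```

The decoded value is the subnet policy for the overlay, and its isolation ID matches the VNI=4096 shown in the VXLAN config line of the same log.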

Workarounds

By creating the external network first via HNS you avoid this issue completely, because there is no network disconnect while the flannel network is being created. One option is to create the External network during node setup, before deploying flannel; this should resolve the issue.

Fixing this in the Docker image requires some extra work, since it looks like wins.exe is no longer taking issues (issues look to be disabled on the repository). The workarounds for the arguments being split are not too elegant: either encode the space and decode it in the setup binary, or pass the value to the setup binary via a file. I have a working version which I will clean up some more before submitting a PR: https://github.com/kubernetes-sigs/sig-windows-tools/compare/master...jsturtevant:wait-for-network?expand=1

Ultimately the fix should go into flannel, to reset connections properly or wait until the network is fully stable. There is a long-standing open issue in the Go Kubernetes client related to this that could potentially fix the issue as well: https://github.com/kubernetes/client-go/issues/374
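The first workaround (encode the space, decode it in the setup binary) could look roughly like the Python sketch below. Percent-encoding is only one possible scheme, chosen here for illustration; the thread does not say which encoding the working version uses, and the helper names are hypothetical.

```python
from urllib.parse import quote, unquote


def encode_arg(value: str) -> str:
    """Percent-encode a value so that it contains no literal spaces."""
    return quote(value, safe="")


def decode_arg(value: str) -> str:
    """Reverse encode_arg on the receiving side (the setup binary's role)."""
    return unquote(value)


adapter = "Ethernet0 2"
args = f"--mode=overlay --interface={encode_arg(adapter)}"

# Splitting on spaces now yields exactly two tokens.
tokens = args.split(" ")
interface = decode_arg(tokens[1].split("=", 1)[1])
print(tokens)
print(interface)
```

The cost of this approach is that both sides must agree on the encoding, which is part of why it is described above as not too elegant.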

jsturtevant commented 4 years ago

/assign

jsturtevant commented 4 years ago

It looks like creating the external network beforehand was meant to solve this issue, but the problem with wins args not being parsed properly keeps it around: https://github.com/kubernetes-sigs/sig-windows-tools/issues/37

jsturtevant commented 4 years ago

To fix this in the Docker image requires some extra work since it looks like wins.exe is no longer taking issues (issues look to be disabled on the repository).

I got connected with folks who work on wins.exe and they are working on a fix. I will open a PR to update this once we have a new package.

FYI, for future wins issues, from the Slack conversation:

submit all issues to the rancher/rancher repo if you find things like wins that have issues turned off. Rancher PMs/engineers watch that repo and will find the right people to do the work

llyons commented 4 years ago

So we now have the latest October CU installed and we still have the same issue as described above. Is there anything else we can try?

jsturtevant commented 4 years ago

The Windows update was not the issue here. It is https://github.com/coreos/flannel/issues/1272.

Until flannel is fixed, wins.exe needs to be updated so that it can create the external network on adapters with spaces in their names, like Ethernet0 2.

The workaround until wins.exe or flannel is fixed is to manually create the External network before starting flannel.

llyons commented 4 years ago

Are there any instructions on how to create the external network before starting flannel? I did try renaming the Ethernet adapter to just ethernet2, and that didn't seem to fix the issue.

I noticed that the wins commands that are run refer to a setup.exe or flanneld.exe, which don't exist in the /k/flannel folder on the Windows node.

wins cli process run --path /k/flannel/setup.exe --args "--mode=overlay --interface=Ethernet2"
wins cli route add --addresses 169.254.169.254
wins cli process run --path /k/flannel/flanneld.exe --args "--kube-subnet-mgr --kubeconfig-file /k/flannel/kubeconfig.yml" --envs "POD_NAME=$env:POD_NAME POD_NAMESPACE=$env:POD_NAMESPACE"

jsturtevant commented 4 years ago

@llyons the setup.exe source is here: https://github.com/kubernetes-sigs/sig-windows-tools/blob/efb98c3fe813613be2caa82ff4ef7537b534a4ba/kubeadm/flannel/setup.go#L13-L17

and includes the powershell to create the external network:

New-HNSNetwork -Type Overlay -AddressPrefix "192.168.255.0/30" -Gateway "192.168.255.1" -Name "External" -AdapterName "Ethernet0 2" -SubnetPolicies @(@{Type = "VSID"; VSID = 9999; });

Note that this will fail if flannel has already created a network on the NIC, in which case you will need to remove the flannel network.

llyons commented 4 years ago

If we set up our CIDR as 192.168.0.0/16, should we (after first removing the existing network) run those PowerShell commands to set up the network in the 192.168.0.0/16 range with a gateway of 192.168.0.1?

Also, if it's already set up, do we want to remove the flannel.4096 network or the External one?

It looks like we have External (tied to Ethernet2), nat, flannel.4096 and 89b601bd3b8b4850bc7711537882a6c9aa3788b6f7c11854518dc4733d686c0e

Is this network that we are creating part of the actual Ethernet network allowing connectivity, or is it a new network that is being created? If I try to run the above New-HNSNetwork command on the existing ethernet2 adapter, it says it already exists.

jsturtevant commented 4 years ago

Which CIDR are you referring to? I think you will want your node/pod CIDR to not overlap with the external network CIDR. My understanding is that this external network is for creating the vswitch, which enables network connectivity via the adapter. It isn't really needed except for the bug in flannel coreos/flannel#1272. @ksubrmnn might be able to explain better.
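A quick way to check for such an overlap is Python's standard ipaddress module. The sketch below compares the 192.168.0.0/16 range mentioned in the question with the 192.168.255.0/30 prefix from the New-HNSNetwork command above; the 10.244.0.0/16 value is only an illustrative alternative, not something taken from this cluster. Whether an overlap actually breaks anything for the External network is not established in this thread.

```python
import ipaddress


def overlaps(cidr_a: str, cidr_b: str) -> bool:
    """Return True if the two networks share at least one address."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


EXTERNAL = "192.168.255.0/30"   # AddressPrefix from the New-HNSNetwork command
CLUSTER = "192.168.0.0/16"      # CIDR mentioned in the question above
ALTERNATIVE = "10.244.0.0/16"   # illustrative non-overlapping range

print(overlaps(CLUSTER, EXTERNAL))
print(overlaps(ALTERNATIVE, EXTERNAL))

# The gateway from the same command falls inside the external prefix.
gateway = ipaddress.ip_address("192.168.255.1")
print(gateway in ipaddress.ip_network(EXTERNAL))
```

Since 192.168.255.0/30 sits inside 192.168.0.0/16, the two ranges in this thread do overlap, which is the situation the comment above suggests avoiding.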

If I try to run the above New-HNSNetwork command on the existing ethernet2 adapter, it says it already exists.

Yes, this should be run on a fresh node, or you will need to clean up all the different networks that might have been created.

llyons commented 4 years ago

I might not have said this properly above. Any help or guidance on this would be appreciated @ksubrmnn

If I have this return from Get-HNSNetwork


ActivityId             : 0481DD58-698B-4829-8FF7-02407876752E
AdditionalParams       :
CurrentEndpointCount   : 0
Extensions             : {@{Id=E7C3B2F0-F3C5-48DF-AF2B-10FED6D72E7A; IsEnabled=False; Name=Microsoft Windows Filtering
                         Platform}, @{Id=E9B59CFA-2BE1-4B21-828F-B6FBDBDDC017; IsEnabled=False; Name=Microsoft Azure
                         VFP Switch Extension}, @{Id=EA24CD6C-D17A-4348-9190-09F0D5BE83DD; IsEnabled=True;
                         Name=Microsoft NDIS Capture}}
Flags                  : 0
Health                 : @{AddressNotificationMissedCount=0; AddressNotificationSequenceNumber=0;
                         InterfaceNotificationMissedCount=0; InterfaceNotificationSequenceNumber=0; LastErrorCode=0;
                         LastUpdateTime=132478756989051774; RouteNotificationMissedCount=0;
                         RouteNotificationSequenceNumber=0}
ID                     : 777F0851-EF37-4D73-BAE3-8F3464294CCB
IPv6                   : False
LayeredOn              : 85D8CB85-C25B-4B8E-82A7-A81110A9EB91
MacPools               : {@{EndMacAddress=00-15-5D-C4-FF-FF; StartMacAddress=00-15-5D-C4-F0-00}}
MaxConcurrentEndpoints : 0
Name                   : nat
NatName                : ICSB758DD0D-1851-4C13-A6A8-3630CEBD4726
Policies               : {}
Resources              : @{AdditionalParams=; AllocationOrder=2; Allocators=System.Object[]; Health=;
                         ID=0481DD58-698B-4829-8FF7-02407876752E; PortOperationTime=0; State=1; SwitchOperationTime=0;
                         VfpOperationTime=0; parentId=BDD4F023-90C0-43FF-BA5F-F9920C901B5C}
State                  : 1
Subnets                : {@{AdditionalParams=; AddressPrefix=172.27.176.0/20; GatewayAddress=172.27.176.1; Health=;
                         ID=FE5C7CE2-3D7D-43EC-9E21-6D217F7C1106; Policies=System.Object[]; State=0}}
TotalEndpoints         : 0
Type                   : nat
Version                : 38654705667

ActivityId             : 7104552C-95E1-49BF-939F-D12E40B386B8
AdditionalParams       :
CurrentEndpointCount   : 1
Extensions             : {@{Id=E7C3B2F0-F3C5-48DF-AF2B-10FED6D72E7A; IsEnabled=False; Name=Microsoft Windows Filtering
                         Platform}, @{Id=E9B59CFA-2BE1-4B21-828F-B6FBDBDDC017; IsEnabled=False; Name=Microsoft Azure
                         VFP Switch Extension}, @{Id=EA24CD6C-D17A-4348-9190-09F0D5BE83DD; IsEnabled=True;
                         Name=Microsoft NDIS Capture}}
Flags                  : 0
Health                 : @{AddressNotificationMissedCount=0; AddressNotificationSequenceNumber=0;
                         InterfaceNotificationMissedCount=0; InterfaceNotificationSequenceNumber=0; LastErrorCode=0;
                         LastUpdateTime=132478756998524596; RouteNotificationMissedCount=0;
                         RouteNotificationSequenceNumber=0}
ID                     : DE8B02C3-F5E4-436A-B2AD-86D3D34A4B12
IPv6                   : False
LayeredOn              : 85D8CB85-C25B-4B8E-82A7-A81110A9EB91
MacPools               : {@{EndMacAddress=00-15-5D-2D-1F-FF; StartMacAddress=00-15-5D-2D-10-00}}
MaxConcurrentEndpoints : 2
Name                   : 69746ee3532666b83adb8edea7f2b9d49d4ea191a7ef620c9bf95b17f5d170d7
NatName                : ICS35158D80-1C7D-4937-AF77-002858685E7D
Policies               : {}
Resources              : @{AdditionalParams=; AllocationOrder=2; Allocators=System.Object[]; Health=;
                         ID=7104552C-95E1-49BF-939F-D12E40B386B8; PortOperationTime=0; State=1; SwitchOperationTime=0;
                         VfpOperationTime=0; parentId=BDD4F023-90C0-43FF-BA5F-F9920C901B5C}
State                  : 1
Subnets                : {@{AdditionalParams=; AddressPrefix=172.22.192.0/20; GatewayAddress=172.22.192.1; Health=;
                         ID=EF24497B-1AAE-4DC6-94AB-937035687E18; Policies=System.Object[]; State=0}}
TotalEndpoints         : 2
Type                   : nat
Version                : 38654705667

ActivityId             : 8A7055F0-550E-4A89-8951-F83DA1A54EC4
AdditionalParams       :
CurrentEndpointCount   : 2
DNSServerCompartment   : 7
DrMacAddress           : 00-15-5D-36-09-79
Extensions             : {@{Id=E7C3B2F0-F3C5-48DF-AF2B-10FED6D72E7A; IsEnabled=False; Name=Microsoft Windows Filtering
                         Platform}, @{Id=E9B59CFA-2BE1-4B21-828F-B6FBDBDDC017; IsEnabled=True; Name=Microsoft Azure
                         VFP Switch Extension}, @{Id=EA24CD6C-D17A-4348-9190-09F0D5BE83DD; IsEnabled=True;
                         Name=Microsoft NDIS Capture}}
Flags                  : 8
Health                 : @{LastErrorCode=0; LastUpdateTime=132478760658733588}
ID                     : A9A09EB9-F565-4E92-B4E0-72CA273F7EF6
IPv6                   : False
InterfaceConstraint    : @{InterfaceGuid=00000000-0000-0000-0000-000000000000}
LayeredOn              : 68C26E4B-B00A-4097-A7A0-5236D358B510
MacPools               : {@{EndMacAddress=00-15-5D-4E-4F-FF; StartMacAddress=00-15-5D-4E-40-00}}
ManagementIP           : 10.243.1.202
MaxConcurrentEndpoints : 2
Name                   : flannel.4096
Policies               : {@{Type=HostRoute}, @{DestinationPrefix=192.168.1.0/24;
                         DistributedRouterMacAddress=6a:60:9e:b2:c9:50; IsolationId=4096;
                         ProviderAddress=10.243.1.213; Type=RemoteSubnetRoute}, @{DestinationPrefix=192.168.0.0/24;
                         DistributedRouterMacAddress=42:72:18:81:ac:6f; IsolationId=4096;
                         ProviderAddress=10.243.1.212; Type=RemoteSubnetRoute}}
Resources              : @{AdditionalParams=; AllocationOrder=1; Allocators=System.Object[]; Health=;
                         ID=8A7055F0-550E-4A89-8951-F83DA1A54EC4; PortOperationTime=0; State=1; SwitchOperationTime=0;
                         VfpOperationTime=0; parentId=D81E13D4-58B9-4D5F-8972-4606BBE27C41}
State                  : 1
Subnets                : {@{AdditionalParams=; AddressPrefix=192.168.2.0/24; GatewayAddress=192.168.2.1; Health=;
                         ID=0B389370-6A1B-4DE8-B4CA-D50A153284CF; ObjectType=5; Policies=System.Object[]; State=0}}
TotalEndpoints         : 5
Type                   : Overlay
Version                : 38654705667

And then this image shows my current adapters.

How should I proceed? Assuming I have the cluster CIDR of 192.168.0.0/16 and MetalLB is also serving up IP addresses from a pool, would I do this?

New-HNSNetwork -Type Overlay -AddressPrefix "192.168.0.0/16" -Gateway "192.168.0.1" -Name "External" -AdapterName "Ethernet" -SubnetPolicies @(@{Type = "VSID"; VSID = 9999; });
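
One thing that may be worth checking before running that command: the proposed 192.168.0.0/16 prefix covers the subnets that the existing flannel.4096 overlay already uses in the output above. Below is a minimal sketch of that check using Python's standard `ipaddress` module. The labels and the `overlapping` helper are made up for illustration, and the sketch only shows that the ranges overlap; it does not confirm how HNS behaves when they do.

```python
import ipaddress

# Prefix proposed for the new "External" overlay network in the command above.
proposed = ipaddress.ip_network("192.168.0.0/16")

# Prefixes already in use according to the flannel.4096 entry in the output above.
# The labels are illustrative only.
existing = {
    "flannel.4096 local subnet": "192.168.2.0/24",
    "remote subnet via 10.243.1.213": "192.168.1.0/24",
    "remote subnet via 10.243.1.212": "192.168.0.0/24",
}


def overlapping(proposed_net, prefixes):
    """Return the labels of all prefixes that overlap proposed_net."""
    return [
        label
        for label, prefix in prefixes.items()
        if proposed_net.overlaps(ipaddress.ip_network(prefix))
    ]


conflicts = overlapping(proposed, existing)
for label in conflicts:
    print(f"{proposed} overlaps {label} ({existing[label]})")
```

All three existing prefixes fall inside the /16, so a new overlay network with that address prefix would claim the same range flannel is already using on this node.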

I am pretty desperate to get the Windows portion working.

Remember, we can curl the service IP and cluster IP of a pod on the Windows node from a Linux node, but we can't get that service IP or cluster IP exposed outside of the 2 Linux cluster nodes. Apps running on the Linux nodes are exposed and do render outside of the cluster using the service IP.

llyons commented 4 years ago

We were able to determine that, in our configuration, MetalLB (which provides an IP from a pool) did not have a speaker on the Windows node, and that prevented the provisioned IP from being accessible from outside.

Instead, we put the Windows containers on the Windows node behind an ingress resource and set up an ingress service of type LoadBalancer to handle this. So in essence we are getting access through the ingress controllers running on the Linux portion of the cluster.
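
For anyone landing here later, a rough sketch of what such an ingress resource could look like, written as a small Python script that emits the manifest as JSON (which `kubectl apply` also accepts). The service name `clientportal` and port 80 come from the `kubectl get svc` output earlier in this issue. The host name, the `nginx` ingress class, and the `build_ingress` helper are assumptions for illustration, and clusters from the time of this issue would likely have needed an older Ingress API version than the `networking.k8s.io/v1` shown here.

```python
import json


def build_ingress(name, service_name, service_port, host, ingress_class="nginx"):
    """Build a networking.k8s.io/v1 Ingress manifest as a plain dict."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {
            # Assumed class name for the nginx ingress controller.
            "ingressClassName": ingress_class,
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "path": "/",
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {
                                        "name": service_name,
                                        "port": {"number": service_port},
                                    }
                                },
                            }
                        ]
                    },
                }
            ],
        },
    }


# The host name is hypothetical; replace it with whatever resolves to the
# ingress controller's LoadBalancer IP.
manifest = build_ingress(
    "clientportal", "clientportal", 80, "clientportal.example.internal"
)
print(json.dumps(manifest, indent=2))
```

With this approach the Windows-backed service itself can stay a plain ClusterIP service, since external traffic enters through the ingress controller on the Linux nodes and only then gets routed to the pod on the Windows node.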

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

jsturtevant commented 3 years ago

/lifecycle frozen