Calico on Windows Server 2022 - 100% packet loss on incoming connections

mkaring commented 1 year ago

Expected Behavior

Following the Calico guides to set up a Kubernetes node on Windows Server 2022 should result in a working network that can communicate with the rest of the cluster and the outside world.

The guide followed can be found here: Install Calico for Windows

Current Behavior

The problem is that communication only works outbound. I'm testing with the command nslookup google.com in a Windows Server 2022 container running on the node that was set up by the Calico scripts. CoreDNS on the control plane receives the request and responds to it, but the response never reaches the container that requested the DNS resolution.

I'm using VXLAN as the overlay for the connection. I chose it because the BGP documentation sounds sketchy and I can't have the outside network know about my pods in any way.

To be precise, no communication of any kind seems to be reaching the pod:

> netsh interface ipv4 show subinterfaces

       MTU  MediaSenseState      Bytes In     Bytes Out  Interface
----------  ---------------  ------------  ------------  -------------
      1450                1             0        326044  vEthernet (a8abd4d612852c86afe71cf9bcc93848c4da974acf15c09e2aaaa58707e60652_Calico)
4294967295                1             0             0  Loopback Pseudo-Interface 7
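The one-way pattern in this output (bytes sent, zero bytes received) can also be checked mechanically. The following is a minimal sketch that parses the text shown above; the function names are hypothetical and the parsing assumes the column layout stays exactly as printed here.

```python
# Sample copied from the output above. The helper names are illustrative
# and are not part of any Calico or Windows tooling.
NETSH_OUTPUT = """\
       MTU  MediaSenseState      Bytes In     Bytes Out  Interface
----------  ---------------  ------------  ------------  -------------
      1450                1             0        326044  vEthernet (a8abd4d612852c86afe71cf9bcc93848c4da974acf15c09e2aaaa58707e60652_Calico)
4294967295                1             0             0  Loopback Pseudo-Interface 7
"""


def parse_subinterfaces(text):
    """Parse `netsh interface ipv4 show subinterfaces` rows into dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 4)  # the interface name may contain spaces
        if len(parts) != 5 or not parts[0].isdigit():
            continue  # skip the header, the separator, and blank lines
        mtu, _state, bytes_in, bytes_out, name = parts
        rows.append({
            "name": name.strip(),
            "mtu": int(mtu),
            "bytes_in": int(bytes_in),
            "bytes_out": int(bytes_out),
        })
    return rows


def one_way_interfaces(rows):
    """Return names of interfaces that have sent traffic but received none."""
    return [r["name"] for r in rows if r["bytes_out"] > 0 and r["bytes_in"] == 0]


if __name__ == "__main__":
    for name in one_way_interfaces(parse_subinterfaces(NETSH_OUTPUT)):
        print("sends but never receives:", name)
```

Only the Calico vEthernet interface is reported, since the loopback interface has not sent anything either.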

The general details about my nodes:

NAME    STATUS   ROLES           AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION      CONTAINER-RUNTIME  
srv31   Ready    <none>          64d   v1.27.3   192.168.251.31   <none>        Windows Server 2022 Standard   10.0.20348.1850     containerd://1.7.1 
srv38   Ready    control-plane   65d   v1.27.3   192.168.251.38   <none>        Ubuntu 22.04.2 LTS             5.15.0-76-generic   containerd://1.6.21

The IP pool is set up correctly, to my understanding:

> kubectl get ippool -o yaml
apiVersion: v1
items:
- apiVersion: projectcalico.org/v3
  kind: IPPool
  metadata:
    creationTimestamp: "2023-06-23T10:57:24Z"
    name: default-ipv4-ippool
    resourceVersion: "5529656"
    uid: f42288b5-f631-46b9-81e4-f8dec0f0da3c
  spec:
    allowedUses:
    - Workload
    - Tunnel
    blockSize: 26
    cidr: 172.19.0.0/16
    ipipMode: Never
    natOutgoing: true
    nodeSelector: all()
    vxlanMode: Always
kind: List
metadata:
  resourceVersion: ""

The logs don't show any problems to me, but I'm running out of ideas about what to look for. If a log from any component would help track down the issue, tell me and I'll get it.

What I did find as a potential issue is the following: I captured the packets on the Ethernet link of the Windows Server (the physical one, not the Hyper-V connection). The capture shows the packets of the DNS request encapsulated in VXLAN. The outbound packets appear fine: the outer headers carry the proper IP and MAC addresses of the physical servers hosting the Kubernetes nodes, and the headers inside the VXLAN encapsulation contain the IP and MAC addresses of the communicating containers, as expected. However, the response packet with the DNS resolution (which does reach the physical Ethernet connection) is addressed inside the VXLAN encapsulation to the IP of the correct container, but to a different MAC. The destination MAC matches the MAC of the Hyper-V switch. This seems wrong, but I have no idea what could be causing it.

For testing, I started a pod on the Linux control-plane server and tried the DNS resolution there. This works just fine.
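For anyone who wants to repeat this comparison without clicking through each packet, the inner Ethernet addresses can be pulled out of a raw captured frame with a few offsets. This is only a sketch under stated assumptions: the frame is an untagged Ethernet II frame carrying IPv4/UDP/VXLAN, and the VXLAN UDP port is the IANA-assigned 4789. The function names are hypothetical.

```python
import struct

VXLAN_PORT = 4789  # IANA-assigned VXLAN port; assumed to be the one in use
MIN_LEN = 14 + 20 + 8 + 8 + 14  # Ethernet + IPv4 + UDP + VXLAN + inner Ethernet


def fmt_mac(raw):
    """Format 6 raw bytes as a lowercase colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def inner_macs(frame):
    """Return (inner_src_mac, inner_dst_mac) of a VXLAN frame, else None.

    Assumes an untagged Ethernet II frame carrying IPv4/UDP/VXLAN.
    """
    if len(frame) < MIN_LEN:
        return None
    if struct.unpack("!H", frame[12:14])[0] != 0x0800:
        return None  # outer payload is not IPv4
    ihl = (frame[14] & 0x0F) * 4  # outer IPv4 header length in bytes
    if frame[14 + 9] != 17:
        return None  # outer payload is not UDP
    udp = 14 + ihl
    dst_port = struct.unpack("!H", frame[udp + 2:udp + 4])[0]
    if dst_port != VXLAN_PORT:
        return None
    inner = udp + 8 + 8  # skip the UDP header and the 8-byte VXLAN header
    if len(frame) < inner + 12:
        return None
    # An Ethernet header starts with the destination MAC, then the source MAC.
    return fmt_mac(frame[inner + 6:inner + 12]), fmt_mac(frame[inner:inner + 6])
```

Feeding it the bytes of each frame exported from the capture gives the inner source and destination MAC pairs to compare against the pod addresses.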

Steps to Reproduce (for bugs)

On the Linux side, the installation is done using the Helm chart as described here: Install using Helm. The additional configuration is:

installation:
  calicoNetwork:
    bgp: Disabled
    ipPools:
    - cidr: 172.19.0.0/16
      encapsulation: VXLAN

Follow the guide here exactly: Install Calico for Windows. The encapsulation is set to VXLAN and the datastore to Kubernetes.

The alternative Calico installation using HostProcess containers also works and has the same outcome.

Context

In general, I'm "only" trying to set up a Kubernetes cluster with a Windows node.

Your Environment

Calico version: 3.26.1
Kubernetes v1.27.3
Operating System and version:
- Ubuntu 22.04.2 LTS (Control-Plane)
- Windows Server 2022 Standard - 10.0.20348.1850 (The node with all the problems)

Final words

At this point I have exhausted all ideas about what the reason for this could be. I'd be thankful for any pointers on where else I could look for problems.

coutinhop commented 1 year ago

Hi @mkaring! Could you share the packet captures you took with the MAC address discrepancies? (Also some details as to the IP and MAC addresses of the interfaces so that we can interpret the captures)

Could you also enable debug logging by setting logSeverityScreen to debug in the default FelixConfiguration? Then, for a non-HPC install, you'd need remote access (e.g. RDP) to the Windows machine to get the Calico logs from there (they should be in c:/CalicoWindows/logs). The kubelet and other logs in c:/k could also be useful.

Finally, is your cluster bare metal, or are you using a cloud service (AWS, GCP, etc)? If you are, could you tell us which one and any details of the setup?

Thanks!

mkaring commented 1 year ago

Hello @coutinhop. Thanks for the response.

My cluster is bare metal. I installed the control plane and the Windows node entirely by myself. Not sure what else I could tell you about it. The only thing of note is that the control plane is running on a VMware virtual machine, but I'm not expecting this to cause major networking trouble.

I changed the logging severity from "Info" to "Debug" using:

kubectl patch FelixConfiguration default -p '{"spec":{"logSeverityScreen":"Debug"}}'

Configuration for felix is now:

apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  creationTimestamp: "2023-06-05T23:40:18Z"
  generation: 1
  name: default
  resourceVersion: "11011825"
  uid: 4b88f98f-0183-4831-845f-3e285f2b56ca
spec:
  bpfLogLevel: ""
  floatingIPs: Disabled
  healthPort: 9099
  ipipEnabled: false
  logSeverityScreen: Debug
  reportingInterval: 0s

I restarted the Windows Server to get clean logs of the entire problem. The startup procedure "seems" to work just fine. The kubelet and the Calico services start just fine and create the HNS network, which disappears at every restart.

Now to the records of the issue. Find the Wireshark capture of the name lookup here. Basically, I captured everything that is encapsulated with VXLAN. records_srv31_windows.pcapng.gz

The following nodes are part of the exchange:

node   os       role           ip              mac
-----  -------  -------------  --------------  -----------------
srv31  Windows  node           192.168.251.31  00:25:90:bc:42:70
srv38  Ubuntu   control-plane  192.168.251.38  00:50:56:b7:c4:95

And the following pods:

pod                       location  ip              mac
------------------------  --------  --------------  ----------------
coredns-5d78c9869d-dqn84  srv38     172.19.232.209  66:bc:63:e6:1b:7d
coredns-5d78c9869d-s7gxh  srv38     172.19.232.206  66:bc:63:e6:1b:7d
windows-dnsutils          srv31     172.19.134.33   0e:2a:ac:13:86:21
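As a quick sanity check on the addressing, the pod IPs can be tested against the pool (cidr 172.19.0.0/16, blockSize 26) from the IPPool output earlier in the thread. A minimal sketch using only the standard library; `ipam_block` is a hypothetical helper name, and the only thing it computes is which /26 block of the pool each address falls into.

```python
import ipaddress

POOL = ipaddress.ip_network("172.19.0.0/16")  # cidr from the IPPool output
BLOCK_SIZE = 26                                # blockSize from the IPPool output

# Pod names and IPs copied from the table above.
PODS = {
    "coredns-5d78c9869d-dqn84": "172.19.232.209",
    "coredns-5d78c9869d-s7gxh": "172.19.232.206",
    "windows-dnsutils": "172.19.134.33",
}


def ipam_block(pod_ip):
    """Return the /26 block of the pool that contains pod_ip."""
    addr = ipaddress.ip_address(pod_ip)
    if addr not in POOL:
        raise ValueError(f"{pod_ip} is outside {POOL}")
    return ipaddress.ip_network(f"{pod_ip}/{BLOCK_SIZE}", strict=False)


if __name__ == "__main__":
    for name, ip in PODS.items():
        print(f"{name:26} {ip:16} {ipam_block(ip)}")
```

Both CoreDNS pods on srv38 land in 172.19.232.192/26 and the Windows pod lands in 172.19.134.0/26, so all three addresses sit inside the pool and the two nodes use distinct blocks.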

Now if you check the packets being sent, you find that the request looks like this:

source             destination
-----------------  -----------------
00:25:90:bc:42:70  00:50:56:b7:c4:95
192.168.251.31     192.168.251.38
===== VXLAN =====  ===== VXLAN =====
0e:2a:ac:13:86:21  66:bc:63:e6:1b:7d
172.19.134.33      172.19.232.209

This looks alright to me. CoreDNS also receives the request and handles it:

172.19.134.33:53047 - 2 "A IN google.com. udp 28 false 512" NOERROR qr,rd,ra 54 0.002791281s

However, the response looks like this:

source             destination
-----------------  -----------------
00:50:56:b7:c4:95  00:25:90:bc:42:70
192.168.251.38     192.168.251.31
===== VXLAN =====  ===== VXLAN =====
66:bc:63:e6:1b:7d  00:15:5d:45:1c:56
172.19.232.209     172.19.134.33

And this is where I'm guessing the problem is located. 00:15:5d:45:1c:56 is unexpected as the MAC address. I found that this exact MAC address is the DrMacAddress assigned to the HNS networks. get-hnsnetwork-response.txt
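The mismatch can be stated compactly in code. The sketch below compares the expected pod MAC with the inner destination MAC seen in the response, and adds two generic observations: the 00:15:5d prefix is the OUI registered to Microsoft and used for Hyper-V virtual adapters, while the pod MACs in the tables have the locally-administered bit set. The function names are hypothetical, and nothing here explains why the Linux side picks that address.

```python
HYPERV_OUI = "00:15:5d"  # OUI registered to Microsoft, used for Hyper-V virtual NICs


def normalize(mac):
    """Lowercase a MAC address and use colons as separators."""
    return mac.strip().lower().replace("-", ":")


def is_locally_administered(mac):
    """True if the locally-administered bit (0x02 of the first octet) is set."""
    first_octet = int(normalize(mac).split(":")[0], 16)
    return bool(first_octet & 0x02)


def classify(expected, observed):
    """Compare the expected pod MAC with the inner destination MAC observed."""
    expected, observed = normalize(expected), normalize(observed)
    if observed == expected:
        return "match"
    if observed.startswith(HYPERV_OUI):
        return "mismatch (Hyper-V OUI)"
    return "mismatch"


if __name__ == "__main__":
    pod_mac = "0e:2a:ac:13:86:21"  # windows-dnsutils, from the pod table
    seen = "00:15:5d:45:1c:56"     # inner destination MAC of the response
    print(classify(pod_mac, seen))
    print(is_locally_administered(pod_mac), is_locally_administered(seen))
```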

As to all the remaining logs, find them in the following zip file: logs_srv31.zip

The entire setup is currently not used for anything productive, so I'm free to test and do whatever is required to figure out what is wrong here.

Thank you for the support, Martin

pramodsharma62 commented 12 months ago

Did you get this resolved? I am facing the exact same issue, except that I am using BGP with my Windows and Linux nodes.

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    ipPools:
    - blockSize: 26
      cidr: 10.48.0.0/24
      encapsulation: None
      natOutgoing: Enabled
      nodeSelector: all()
    bgp: Enabled

install-calico-3.26.0-windows.ps1 with BGP

[parameter(Mandatory = $false)] $ReleaseBaseURL="https://github.com/projectcalico/calico/releases/download/v3.26.0/",
[parameter(Mandatory = $false)] $ReleaseFile="calico-windows-v3.26.0.zip",
[parameter(Mandatory = $false)] $KubeVersion="1.28.1",
[parameter(Mandatory = $false)] $DownloadOnly="no",
[parameter(Mandatory = $false)] $StartCalico="yes",
# As of Kubernetes version v1.24.0, service account token secrets are no longer automatically created. But this installation script uses that secret
# to generate a kubeconfig so default to creating the calico-node token secret if it doesn't exist.
[parameter(Mandatory = $false)] $AutoCreateServiceAccountTokenSecret="yes",
[parameter(Mandatory = $false)] $Datastore="kubernetes",
[parameter(Mandatory = $false)] $EtcdEndpoints="",
[parameter(Mandatory = $false)] $EtcdTlsSecretName="",
[parameter(Mandatory = $false)] $EtcdKey="",
[parameter(Mandatory = $false)] $EtcdCert="",
[parameter(Mandatory = $false)] $EtcdCaCert="",
[parameter(Mandatory = $false)] $ServiceCidr="10.49.0.0/16",
[parameter(Mandatory = $false)] $DNSServerIPs="10.49.0.1",
[parameter(Mandatory = $false)] $CalicoBackend="windows-bgp"

mkaring commented 12 months ago

Sadly not. I had some lengthy discussions with @coutinhop, but we did not find any solution to this problem as of right now.

pramodsharma62 commented 12 months ago

I got it working, I missed some BGP configuration for windows. https://docs.tigera.io/calico/latest/getting-started/kubernetes/windows-calico/kubernetes/standard#install-calico-and-kubernetes-on-windows-nodes

Everything is working as expected now. All my pods can talk to each other, resolve all DNS names in both Linux and Windows nodes.

Since it's Calico with BGP, I integrated it with my router and now I can even reach k8s resources outside the cluster with pod/service/load balancer IPs in both Windows and Linux. I love it!

isarns commented 11 months ago

We use VXLAN as well. With Windows Server 2022, our pod can't ping any external service, for example google.com.

Or one pod can ping and another one can't.

We also noticed that if one pod can ping google.com and we spin up another pod (or even spin one down), the working pod stops communicating 😔

For debugging, we created a server in our network and tried to ping it. The ICMP packet gets to the server and the server replies, but it seems that the node just drops the reply and it never reaches the pod.

coutinhop commented 8 months ago

> I got it working, I missed some BGP configuration for windows. https://docs.tigera.io/calico/latest/getting-started/kubernetes/windows-calico/kubernetes/standard#install-calico-and-kubernetes-on-windows-nodes
>
> Everything is working as expected now. All my pods can talk to each other, resolve all DNS names in both Linux and Windows nodes.
>
> Since it's Calico with BGP, I integrated it with my router and now I can even reach k8s resources outside the cluster with pod/service/load balancer IPs in both Windows and Linux. I love it!

Thanks @pramodsharma62! @mkaring could you please double check these BGP configurations on your setup? That link no longer works due to our docs having been reorganized, but this one does: https://github.com/projectcalico/calico/issues/7875#issuecomment-1799252658

@isarns this seems like a different issue since you're using VXLAN. Could you please open an issue and fill in the template? Try to include calico-node logs if possible (that'd be the first thing I'd ask anyway). Sorry, just trying to keep discussions separate.

mkaring commented 8 months ago

@coutinhop I'm using VXLAN as well. I checked whether I had performed any of the BGP steps at some point, to see if something is interfering with the networking.

Nothing in relation to BGP is installed.

Get-WindowsFeature RemoteAccess, RSAT-RemoteAccess-PowerShell, Routing

Display Name                                            Name                       Install State
------------                                            ----                       -------------
[ ] Remote Access                                       RemoteAccess                   Available
    [ ] Routing                                         Routing                        Available
            [X] Remote Access module for Windows Pow... RSAT-RemoteAccess-Po...        Installed

Remote access is not enabled either:

Get-RemoteAccess

DAStatus          : Uninstalled
VpnStatus         : Uninstalled
VpnS2SStatus      : Uninstalled
SstpProxyStatus   : Uninstalled
RoutingStatus     : Uninstalled
LoadBalancing     :
InternetInterface :
InternalInterface :
SslCertificate    :

I checked the default Installation resource and made sure that calicoNetwork → bgp is set to Disabled.
