Open mkaring opened 1 year ago
Hi @mkaring! Could you share the packet captures you took with the MAC address discrepancies? (Also some details as to the IP and MAC addresses of the interfaces so that we can interpret the captures)
Could you also enable debug logging by setting logSeverityScreen to debug in the default FelixConfiguration? Then, for a non-HPC install, you'd need remote access (e.g. RDP) to the Windows machine to get the Calico logs from there (they should be in c:/CalicoWindows/logs). The kubelet and other logs in c:/k could also be useful.
Finally, is your cluster bare metal, or are you using a cloud service (AWS, GCP, etc)? If you are, could you tell us which one and any details of the setup?
Thanks!
Hello @coutinhop. Thanks for the response.
My cluster is bare metal. I installed both the control plane and the Windows node myself, so I'm not sure what else I could tell you about it. The only thing of note is that the control plane is running on a VMware virtual machine, but I don't expect that to cause any major networking trouble.
I changed the logging severity from "Info" to "Debug" using:
kubectl patch FelixConfiguration default -p '{"spec":{"logSeverityScreen":"Debug"}}'
Configuration for felix is now:
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  creationTimestamp: "2023-06-05T23:40:18Z"
  generation: 1
  name: default
  resourceVersion: "11011825"
  uid: 4b88f98f-0183-4831-845f-3e285f2b56ca
spec:
  bpfLogLevel: ""
  floatingIPs: Disabled
  healthPort: 9099
  ipipEnabled: false
  logSeverityScreen: Debug
  reportingInterval: 0s
I restarted the Windows Server to get clean logs of the entire problem. The startup procedure "seems" to work just fine. The kubelet and the Calico services start without errors and create the HNS network that disappears at every restart.
Now to the records of the issue. Find the Wireshark capture of the nslookup request here. Basically I captured everything that is encapsulated with VXLAN. records_srv31_windows.pcapng.gz
The following nodes are part of the exchange:
node   os       role           ip              mac
-----  -------  -------------  --------------  -----------------
srv31  Windows  node           192.168.251.31  00:25:90:bc:42:70
srv38  Ubuntu   control-plane  192.168.251.38  00:50:56:b7:c4:95
And the following pods:
pod                       location  ip              mac
------------------------  --------  --------------  -----------------
coredns-5d78c9869d-dqn84  srv38     172.19.232.209  66:bc:63:e6:1b:7d
coredns-5d78c9869d-s7gxh  srv38     172.19.232.206  66:bc:63:e6:1b:7d
windows-dnsutils          srv31     172.19.134.33   0e:2a:ac:13:86:21
Now if you check the packets being sent, you find the request looks like this:
source             destination
-----------------  -----------------
00:25:90:bc:42:70  00:50:56:b7:c4:95
192.168.251.31     192.168.251.38
===== VXLAN =====  ===== VXLAN =====
0e:2a:ac:13:86:21  66:bc:63:e6:1b:7d
172.19.134.33      172.19.232.209
This looks alright to me. CoreDNS also receives the request and handles it:
172.19.134.33:53047 - 2 "A IN google.com. udp 28 false 512" NOERROR qr,rd,ra 54 0.002791281s
However, the response looks like this:
source             destination
-----------------  -----------------
00:50:56:b7:c4:95  00:25:90:bc:42:70
192.168.251.38     192.168.251.31
===== VXLAN =====  ===== VXLAN =====
66:bc:63:e6:1b:7d  00:15:5d:45:1c:56
172.19.232.209     172.19.134.33
And this is where I'm guessing the problem is located. 00:15:5d:45:1c:56 is unexpected as the inner destination MAC address; the pod's MAC is 0e:2a:ac:13:86:21.
I found this exact MAC address to be the DrMacAddress assigned to the HNS networks.
get-hnsnetwork-response.txt
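For anyone wanting to repeat the DrMacAddress check on their own node, here is a minimal Python sketch. It assumes the HNS network list has been exported to JSON (for example with something like Get-HNSNetwork | ConvertTo-Json -Depth 10, which is an assumption and not taken from the attachment above). Because the exact nesting of that output is not shown in this thread, the helper simply walks the whole document looking for any key named DrMacAddress. The sample JSON below is hypothetical and only illustrates the idea.

```python
import json
from typing import Any, Iterator, List


def find_key(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key`, at any depth of a JSON document."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def normalize_mac(mac: str) -> str:
    """Lower-case a MAC and use ':' separators so different notations compare equal."""
    return mac.replace("-", ":").lower()


def dr_macs(hns_json: str) -> List[str]:
    """Return all DrMacAddress string values found in the JSON text."""
    doc = json.loads(hns_json)
    return [normalize_mac(v) for v in find_key(doc, "DrMacAddress") if isinstance(v, str)]


# Hypothetical sample; the real Get-HNSNetwork output may nest this differently.
sample = '[{"Name": "example-network", "Nested": {"DrMacAddress": "00-15-5D-45-1C-56"}}]'

# MAC seen as the inner destination in the VXLAN response (from the capture above).
observed_inner_dst = "00:15:5d:45:1c:56"

matches = [mac for mac in dr_macs(sample) if mac == observed_inner_dst]
print(matches)
```

If the list is non-empty, the inner destination MAC seen on the wire is one of the DrMacAddress values configured in HNS, which is the observation reported above.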
As to all the remaining logs, find them in the following zip file: logs_srv31.zip
The entire setup is currently not used for anything productive, so I'm free to test and do whatever is required to figure out what is wrong here.
Thank you for the support, Martin
Did you get this resolved? I am facing the exact same issue, except that I am using BGP with my Windows and Linux nodes.
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    ipPools:
    - blockSize: 26
      cidr: 10.48.0.0/24
      encapsulation: None
      natOutgoing: Enabled
      nodeSelector: all()
    bgp: Enabled
The parameters of my Windows install script:
[parameter(Mandatory = $false)] $ReleaseBaseURL="https://github.com/projectcalico/calico/releases/download/v3.26.0/",
[parameter(Mandatory = $false)] $ReleaseFile="calico-windows-v3.26.0.zip",
[parameter(Mandatory = $false)] $KubeVersion="1.28.1",
[parameter(Mandatory = $false)] $DownloadOnly="no",
[parameter(Mandatory = $false)] $StartCalico="yes",
[parameter(Mandatory = $false)] $AutoCreateServiceAccountTokenSecret="yes",
[parameter(Mandatory = $false)] $Datastore="kubernetes",
[parameter(Mandatory = $false)] $EtcdEndpoints="",
[parameter(Mandatory = $false)] $EtcdTlsSecretName="",
[parameter(Mandatory = $false)] $EtcdKey="",
[parameter(Mandatory = $false)] $EtcdCert="",
[parameter(Mandatory = $false)] $EtcdCaCert="",
[parameter(Mandatory = $false)] $ServiceCidr="10.49.0.0/16",
[parameter(Mandatory = $false)] $DNSServerIPs="10.49.0.1",
[parameter(Mandatory = $false)] $CalicoBackend="windows-bgp"
Sadly not. I had some lengthy discussions with @coutinhop, but we did not find any solution to this problem as of right now.
I got it working; I had missed some BGP configuration for Windows. https://docs.tigera.io/calico/latest/getting-started/kubernetes/windows-calico/kubernetes/standard#install-calico-and-kubernetes-on-windows-nodes
Everything is working as expected now. All my pods can talk to each other and resolve all DNS names on both Linux and Windows nodes.
Since it's Calico with BGP, I integrated it with my router, and now I can even reach k8s resources from outside the cluster via pod/service/LoadBalancer IP on both Windows and Linux. I love it!
We work with VXLAN as well. With Windows 2022, our pods can't ping any external service, for example google.com.
Or one pod can ping and another one can't.
We also noticed that if one pod can ping google.com and we spin up another pod (or even spin one down), the working pod stops communicating 😔
For debugging, we created a server in our network and tried to ping it. The ICMP packet reaches the server and the server replies, but it seems that the node just drops the reply and it never reaches the pod.
Thanks @pramodsharma62! @mkaring could you please double check these BGP configurations on your setup? That link no longer works due to our docs having been reorganized, but this one does: https://github.com/projectcalico/calico/issues/7875#issuecomment-1799252658
@isarns this seems like a different issue since you're using VXLAN. Could you please open an issue and fill in the template? Try to include calico-node logs if possible (that'd be the first thing I'd ask anyway). Sorry, just trying to keep discussions separate.
@coutinhop I'm using VXLAN as well. I checked whether I had performed any of the BGP steps at some point, to see if something is interfering with the networking.
Nothing related to BGP is installed.
Get-WindowsFeature RemoteAccess, RSAT-RemoteAccess-PowerShell, Routing
Display Name                                 Name                     Install State
------------                                 ----                     -------------
[ ] Remote Access                            RemoteAccess             Available
    [ ] Routing                              Routing                  Available
[X] Remote Access module for Windows Powe... RSAT-RemoteAccess-Po...  Installed
Remote access is not enabled either:
Get-RemoteAccess
DAStatus : Uninstalled
VpnStatus : Uninstalled
VpnS2SStatus : Uninstalled
SstpProxyStatus : Uninstalled
RoutingStatus : Uninstalled
LoadBalancing :
InternetInterface :
InternalInterface :
SslCertificate :
I checked the Installation default and made sure that calicoNetwork → bgp is set to Disabled.
Expected Behavior
Following the Calico guides to set up a Kubernetes node on Windows Server 2022 should result in a working network that can communicate with the rest of the cluster and the outside world.
The guide followed can be found here: Install Calico for Windows
Current Behavior
Outbound communication works just fine; the problem is the return path. I'm testing the command nslookup google.com in a Windows Server 2022 container running on the node that was set up by the Calico scripts. CoreDNS on the control plane receives the request and responds to it, but the response never reaches the container that requested the DNS resolution. I'm using VXLAN as the overlay for the connection. I chose it because the BGP documentation sounds sketchy and I can't have the outside network know about my pods in any way.
To be precise, no inbound communication of any kind seems to be reaching the pod:
The general details about my nodes:
The IP Pool is setup correctly to my understanding:
The logs don't show any problems to me, but I'm running out of ideas about what to look for. So if a log from any component would help track down the issue, tell me and I'll get it.
What I did find as a potential issue is the following: I captured the packets on the Ethernet link of the Windows Server (the physical one, not the Hyper-V connection). The capture shows the packets of the DNS request encapsulated in VXLAN. The outbound packets appear fine. The outer headers carry the proper IP and MAC addresses of the physical servers hosting the Kubernetes nodes. The headers inside the VXLAN encapsulation contain the IP and MAC addresses of the communicating containers, as expected. However, the response packet with the DNS resolution (it does reach the physical Ethernet connection) is addressed inside the VXLAN encapsulation to the IP of the correct container, but to a different MAC. The destination MAC matches the MAC of the Hyper-V switch. This seems wrong, but I have no idea what could be causing it.
For testing, I started a pod on the Linux control-plane server and tried the DNS resolution there. This works just fine.
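The check described above (comparing the destination MAC inside the VXLAN encapsulation with the MAC of the target pod) can be sketched in Python using only the standard library. This is an illustrative sketch, not part of any Calico tooling: it assumes an untagged outer Ethernet frame carrying IPv4 and UDP on the standard VXLAN port 4789, and the build_frame helper and its VNI value are hypothetical, used only to produce a synthetic frame with the addresses reported in this issue.

```python
import struct
from typing import Tuple

VXLAN_UDP_PORT = 4789  # IANA-assigned VXLAN port; assumed to be the one in use


def mac_to_bytes(mac: str) -> bytes:
    return bytes.fromhex(mac.replace(":", ""))


def bytes_to_mac(raw: bytes) -> str:
    return ":".join(f"{b:02x}" for b in raw)


def inner_macs(frame: bytes) -> Tuple[str, str]:
    """Return (inner_src_mac, inner_dst_mac) of a VXLAN-encapsulated frame."""
    if struct.unpack("!H", frame[12:14])[0] != 0x0800:
        raise ValueError("outer frame is not IPv4 (VLAN tags are not handled)")
    ihl = (frame[14] & 0x0F) * 4           # IPv4 header length in bytes
    if frame[14 + 9] != 17:                # IPv4 protocol field, 17 = UDP
        raise ValueError("outer packet is not UDP")
    udp = 14 + ihl
    if struct.unpack("!H", frame[udp + 2:udp + 4])[0] != VXLAN_UDP_PORT:
        raise ValueError("UDP destination port is not the VXLAN port")
    vxlan = udp + 8
    if not frame[vxlan] & 0x08:            # the I flag marks a valid VNI
        raise ValueError("VXLAN I flag is not set")
    inner = vxlan + 8                      # inner Ethernet header starts here
    return bytes_to_mac(frame[inner + 6:inner + 12]), bytes_to_mac(frame[inner:inner + 6])


def build_frame(outer_src: str, outer_dst: str, inner_src: str, inner_dst: str,
                vni: int = 4096) -> bytes:
    """Hypothetical helper: build a minimal synthetic VXLAN frame (headers only)."""
    outer_eth = mac_to_bytes(outer_dst) + mac_to_bytes(outer_src) + struct.pack("!H", 0x0800)
    ipv4 = bytes([0x45]) + bytes(8) + bytes([17]) + bytes(10)  # only IHL and protocol set
    udp = struct.pack("!HHHH", 50000, VXLAN_UDP_PORT, 0, 0)
    vxlan = struct.pack("!II", 0x08 << 24, vni << 8)
    inner_eth = mac_to_bytes(inner_dst) + mac_to_bytes(inner_src) + struct.pack("!H", 0x0800)
    return outer_eth + ipv4 + udp + vxlan + inner_eth


# Synthetic DNS response frame using the addresses reported in this issue.
response = build_frame(
    outer_src="00:50:56:b7:c4:95", outer_dst="00:25:90:bc:42:70",
    inner_src="66:bc:63:e6:1b:7d", inner_dst="00:15:5d:45:1c:56",
)
expected_pod_mac = "0e:2a:ac:13:86:21"  # MAC of the windows-dnsutils pod

src, dst = inner_macs(response)
mismatch = dst != expected_pod_mac
print(f"inner dst {dst}, expected {expected_pod_mac}, mismatch={mismatch}")
```

With real traffic, the frame bytes would come from the capture file instead of build_frame; the comparison at the end stays the same.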
Steps to Reproduce (for bugs)
On the Linux side, the installation is done using the Helm chart as described here: Install using Helm The additional configuration is:
Follow the guide here exactly: Install Calico for Windows The setup encapsulation is VXLAN, datastore is Kubernetes.
The alternative Calico installation using HostProcess containers also completes, but has the same outcome.
Context
In general, I'm "only" trying to set up a Kubernetes cluster with a Windows node.
Your Environment
Final words
At this point I have run out of ideas as to what the cause could be. I'd be thankful for any pointers on where else I could look for problems.