projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Calico Windows - Could not find vxlan0 in V2 #9120

Closed sriram-govindarajan89 closed 2 days ago

sriram-govindarajan89 commented 1 month ago

Expected Behavior

Calico on a Windows node runs Windows pods seamlessly.

Current Behavior

Calico on a Windows node finds an existing HNS network, but immediately afterwards reports that it is unable to create the HNS network, and node initialisation fails.

Possible Solution

Unknown

Steps to Reproduce (for bugs)

  1. Install Calico via the operator on the control plane.
  2. Add a Windows node to the cluster as per https://docs.tigera.io/calico/latest/getting-started/kubernetes/windows-calico/operator
  3. Calico node setup completes successfully.
  4. Deploy a sample Windows deployment.
  5. The pod fails to come up due to CNI errors.
  6. Calico Node logs:

     2024-08-09T21:24:47.6978785-07:00 stdout F 2024-08-09 21:24:47.697 [INFO][13468] startup/startup_windows.go 78: Backend networking is vxlan, ensure vxlan network.
     2024-08-09T21:24:47.7089719-07:00 stdout F 2024-08-09 21:24:47.708 [INFO][13468] startup/dataplane_windows.go 326: Found existing HNS network [&{Id:74519449-3C37-4FAB-8630-0FC523540BDA Name:Calico Type:Overlay NetworkAdapterName: SourceMac: Policies:[] MacPools:[{StartMacAddress:00-15-5D-30-D0-00 EndMacAddress:00-15-5D-30-DF-FF}] Subnets:[{AddressPrefix:192.168.30.128/26 GatewayAddress:192.168.30.129 Policies:[[123 34 84 121 112 101 34 58 34 86 83 73 68 34 44 34 86 83 73 68 34 58 52 48 57 54 125]]}] DNSSuffix: DNSServerList: DNSServerCompartment:8 ManagementIP:10.43.38.200 AutomaticDNS:false}] subnet="192.168.30.128/26"
     2024-08-09T21:24:47.709889-07:00 stdout F 2024-08-09 21:24:47.709 [ERROR][13468] startup/dataplane_windows.go 119: Unable to create hns network Calico subnet="192.168.30.128/26"
     2024-08-09T21:24:47.709889-07:00 stdout F 2024-08-09 21:24:47.709 [ERROR][13468] startup/startup.go 218: Unable to ensure network for os error=Could not find vxlan0 in V2: Network ID "74519449-3C37-4FAB-8630-0FC523540BDA" not found
     2024-08-09T21:24:47.709889-07:00 stdout F 2024-08-09 21:24:47.709 [WARNING][13468] startup/utils.go 48: Terminating
     2024-08-09T21:24:47.7173952-07:00 stdout F Calico node initialisation failed, will retry...
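The subnet entry in the "Found existing HNS network" log line prints its policy as a list of decimal byte values, which is hard to read at a glance. As a minimal sketch (the helper name `decode_policy` is hypothetical and not part of Calico), the bytes can be decoded to show the JSON the policy carries:

```python
import json

# Byte values copied verbatim from the "Policies:[[...]]" field of the
# subnet in the "Found existing HNS network" log line.
POLICY_BYTES = [123, 34, 84, 121, 112, 101, 34, 58, 34, 86, 83, 73, 68, 34,
                44, 34, 86, 83, 73, 68, 34, 58, 52, 48, 57, 54, 125]


def decode_policy(byte_values):
    """Hypothetical helper: turn decimal byte values into the JSON object they encode."""
    return json.loads(bytes(byte_values).decode("utf-8"))


policy = decode_policy(POLICY_BYTES)
print(policy)  # {'Type': 'VSID', 'VSID': 4096}
```

The decoded policy shows that the existing network already carries a VSID of 4096, so the logged network looks like a VXLAN overlay network rather than an unrelated one. This only makes the log easier to read; it does not explain why the later lookup by network ID fails.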

Any pointers on this are greatly appreciated.

Context

This is blocking the setup of a hybrid Kubernetes cluster with the Calico CNI using VXLAN.

Your Environment

coutinhop commented 1 month ago

@sriram-govindarajan89 could you provide some more logs and details? Full logs from the install-cni initContainer, felix and node containers, preferably with debug enabled?

sriram-govindarajan89 commented 4 weeks ago

@coutinhop Thanks a lot for the response.

My Windows worker node is a physical server and appears to have had networking problems due to its NIC teaming configuration. After that was fixed and the stale HNS networks were manually cleared (which required restarting the HNS service), the node no longer has this issue.

I am going to attach a few more Windows worker nodes (virtual machines) to see if I can reproduce the issue.

I will report back on this thread with findings.

sriram-govindarajan89 commented 3 weeks ago

I added 2 more Windows nodes and have been unable to replicate this issue. However, I am seeing communication issues from Windows pods to pods on any other node.

I do see HNS endpoints for the pods, but all communication still fails. I can raise this as a separate issue if I should. Any thoughts on this?

PS C:\> test-netconnection 192.168.210.71 -p 80
WARNING: TCP connect to (192.168.210.71 : 80) failed
WARNING: Ping to 192.168.210.71 failed with status: 11050

ComputerName           : 192.168.210.71
RemoteAddress          : 192.168.210.71
RemotePort             : 80
InterfaceAlias         : vEthernet (f550ca5c4ba46f56b71321ba521f1bbae057c3406161afeeeb7f702e01a805e4_Calico)
SourceAddress          : 192.168.157.143
PingSucceeded          : False
PingReplyDetails (RTT) : 0 ms
TcpTestSucceeded       : False
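To check that the failing test really crosses nodes, the source and destination addresses can be mapped to their address blocks. This is a rough sketch that assumes /26 blocks, matching the subnet="192.168.30.128/26" seen in the earlier node log; the helper name `ipam_block` is hypothetical, and a cluster with a different block size would need a different prefix length:

```python
import ipaddress


def ipam_block(address, prefix_len=26):
    """Hypothetical helper: return the block of the given size containing the address.

    The /26 default is an assumption taken from the subnet in the node log.
    """
    return ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)


src_block = ipam_block("192.168.157.143")  # SourceAddress from the output above
dst_block = ipam_block("192.168.210.71")   # RemoteAddress from the output above

print(src_block)               # 192.168.157.128/26
print(dst_block)               # 192.168.210.64/26
print(src_block == dst_block)  # False
```

Under that assumption the two pods sit in different blocks, which is consistent with the traffic having to leave the Windows node over the overlay rather than staying local.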

coutinhop commented 2 days ago

@sriram-govindarajan89 please do raise this as a separate issue and provide more details (logs, preferably with debug enabled, repro steps, IPPool YAML, FelixConfiguration YAML, any policies you might have in place, etc.).