projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

[windows/k8s/cni] Does Calico CNI support k8s pods in Windows Hyper-V mode? #7490

Open danyang-intel opened 1 year ago

danyang-intel commented 1 year ago

Hi Team,

I'd like to ask whether Calico CNI supports k8s pods in Windows Hyper-V isolation mode. Thank you for the help.

The config of my k8s cluster:

1 master: Ubuntu 18.04
1 worker: Windows Server 2019
k8s version: 1.20.2 (the latest version supporting Windows Hyper-V mode containers)
Calico CNI: 3.21 (the latest version supporting k8s 1.20), VXLAN mode

Calico CNI was installed by following https://docs.tigera.io/archive/v3.21/getting-started/windows-calico/quickstart

My pod works well in Windows process isolation mode. However, after enabling the kubelet feature gate HyperVContainer=true, the pod in Hyper-V mode fails to get a network interface.

docker exec -it k8s_k8s-test-108-hv_k8s-test-108-hv-... powershell
PS C:\> ipconfig

Windows IP Configuration

PS C:\> Get-NetAdapterAdvancedProperty
PS C:\> Get-NetAdapter
PS C:\>

I don't see any errors in the kubelet log, but there is an error in the calico-felix log: [image]

Could you please advise if this looks like a calico cni issue? Thank you.

danyang-intel commented 1 year ago

C:\k\cni\config\10-calico.conf

{
  "name": "Calico",
  "windows_use_single_network": true,
  "cniVersion": "0.3.1",
  "type": "calico",
  "mode": "vxlan",
  "vxlan_mac_prefix": "0E-2A",
  "vxlan_vni": 4096,
  "policy": {
    "type": "k8s"
  },
  "log_level": "info",
  "capabilities": {"dns": true},
  "DNS": {
    "Nameservers": ["10.96.0.10"],
    "Search": ["svc.cluster.local"]
  },
  "nodename_file": "C:\\CalicoWindows\\libs\\calico\\..\\..\\nodename",
  "datastore_type": "kubernetes",
  "etcd_endpoints": "",
  "etcd_key_file": "",
  "etcd_cert_file": "",
  "etcd_ca_cert_file": "",
  "kubernetes": {
    "kubeconfig": "C:\\CalicoWindows\\calico-kube-config"
  },
  "ipam": {
    "type": "calico-ipam",
    "subnet": "usePodCidr"
  },
  "policies": [
    {
      "Name": "EndpointPolicy",
      "Value": {
        "Type": "OutBoundNAT",
        "ExceptionList": ["10.96.0.0/12"]
      }
    },
    {
      "Name": "EndpointPolicy",
      "Value": {
        "Type": "ROUTE",
        "DestinationPrefix": "10.96.0.0/12",
        "NeedEncap": true
      }
    }
  ]
}

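One quick sanity check for a CNI config like the one above is to run it through a strict JSON parser. The sketch below is only an illustration: the fragment keeps a few keys from the config, and it assumes the file on disk writes Windows paths with escaped backslashes, which is what strict JSON requires.

```python
import json

# Reduced fragment of the CNI config above; only a few keys are kept for brevity.
# Assumption: Windows paths are written with escaped backslashes on disk.
fragment = r'''
{
  "name": "Calico",
  "mode": "vxlan",
  "nodename_file": "C:\\CalicoWindows\\libs\\calico\\..\\..\\nodename"
}
'''

config = json.loads(fragment)
print(config["nodename_file"])

# The same kind of path with single backslashes is rejected by a strict parser,
# because "\C" is not a valid JSON escape sequence.
unescaped = r'{"nodename_file": "C:\CalicoWindows\nodename"}'
try:
    json.loads(unescaped)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False
print(parsed_ok)
```

If the parse fails, the config itself is a likelier culprit than the Hyper-V isolation mode, so this is worth ruling out first.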
rahulgupta999 commented 1 year ago

In our case, containers are successfully created in both process and Hyper-V isolation modes. However, there is no network inside the Hyper-V containers. The log messages are the same as above.

calico windows config

2023-03-26 15:10:28.132 [INFO][6312] felix/endpoint_mgr.go 162: Refreshing the endpoint cache
2023-03-26 15:10:28.166 [WARNING][6312] felix/endpoint_mgr.go 207: This is a stale endpoint with no container attached id="762efc38-e675-4b73-9b95-c116d88d0d02" name="d4ab04980be1ad060a3817f09c34d6b1d745c92b8aef9449d340c80358bd455a_Calico"
2023-03-26 15:10:28.166 [INFO][6312] felix/endpoint_mgr.go 226: Cache refresh is complete. 2 endpoints were cached
2023-03-26 15:10:28.166 [INFO][6312] felix/endpoint_mgr.go 162: Refreshing the endpoint cache
2023-03-26 15:10:28.195 [WARNING][6312] felix/endpoint_mgr.go 207: This is a stale endpoint with no container attached id="762efc38-e675-4b73-9b95-c116d88d0d02" name="d4ab04980be1ad060a3817f09c34d6b1d745c92b8aef9449d340c80358bd455a_Calico"
2023-03-26 15:10:28.195 [INFO][6312] felix/endpoint_mgr.go 226: Cache refresh is complete. 2 endpoints were cached
2023-03-26 15:10:28.195 [INFO][6312] felix/endpoint_mgr.go 516: Could not resolve hns endpoint id ip="172.21.38.34/32"
2023-03-26 15:10:28.195 [WARNING][6312] felix/endpoint_mgr.go 351: Failed to look up HNS endpoint for workload id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"ns-dev-cxp-common-3p-sampleengine/deploy-dev-sampleengine-1-0-0-dd79cfbd9-crk6j", EndpointId:"eth0"}
2023-03-26 15:10:28.201 [WARNING][6312] felix/endpoint_mgr.go 393: Failed to look up one or more HNS endpoints; will schedule a retry
2023-03-26 15:10:28.201 [WARNING][6312] felix/win_dataplane.go 348: CompleteDeferredWork returned an error - scheduling a retry error=Endpoint could not be found
2023-03-26 15:10:28.201 [INFO][6312] felix/win_dataplane.go 314: Finished applying updates to dataplane. msecToApply=68.85520000000001
2023-03-26 15:10:33.204 [INFO][6312] felix/win_dataplane.go 307: Applying dataplane updates
2023-03-26 15:10:33.204 [INFO][6312] felix/endpoint_mgr.go 162: Refreshing the endpoint cache
2023-03-26 15:10:33.244 [WARNING][6312] felix/endpoint_mgr.go 207: This is a stale endpoint with no container attached id="762efc38-e675-4b73-9b95-c116d88d0d02" name="d4ab04980be1ad060a3817f09c34d6b1d745c92b8aef9449d340c80358bd455a_Calico"
2023-03-26 15:10:33.244 [INFO][6312] felix/endpoint_mgr.go 226: Cache refresh is complete. 2 endpoints were cached
2023-03-26 15:10:33.244 [INFO][6312] felix/endpoint_mgr.go 162: Refreshing the endpoint cache
2023-03-26 15:10:33.275 [WARNING][6312] felix/endpoint_mgr.go 207: This is a stale endpoint with no container attached id="762efc38-e675-4b73-9b95-c116d88d0d02" name="d4ab04980be1ad060a3817f09c34d6b1d745c92b8aef9449d340c80358bd455a_Calico"
2023-03-26 15:10:33.275 [INFO][6312] felix/endpoint_mgr.go 226: Cache refresh is complete. 2 endpoints were cached
2023-03-26 15:10:33.275 [INFO][6312] felix/endpoint_mgr.go 516: Could not resolve hns endpoint id ip="172.21.38.34/32"
2023-03-26 15:10:33.276 [WARNING][6312] felix/endpoint_mgr.go 351: Failed to look up HNS endpoint for workload id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"ns-dev-cxp-common-3p-sampleengine/deploy-dev-sampleengine-1-0-0-dd79cfbd9-crk6j", EndpointId:"eth0"}
2023-03-26 15:10:33.276 [WARNING][6312] felix/endpoint_mgr.go 393: Failed to look up one or more HNS endpoints; will schedule a retry
2023-03-26 15:10:33.276 [WARNING][6312] felix/win_dataplane.go 348: CompleteDeferredWork returned an error - scheduling a retry error=Endpoint could not be found
2023-03-26 15:10:33.277 [INFO][6312] felix/win_dataplane.go 314: Finished applying updates to dataplane. msecToApply=72.7042
2023-03-26 15:10:38.316 [INFO][6312] felix/win_dataplane.go 307: Applying dataplane updates
2023-03-26 15:10:38.316 [INFO][6312] felix/endpoint_mgr.go 162: Refreshing the endpoint cache
2023-03-26 15:10:38.353 [WARNING][6312] felix/endpoint_mgr.go 207: This is a stale endpoint with no container attached id="762efc38-e675-4b73-9b95-c116d88d0d02" name="d4ab04980be1ad060a3817f09c34d6b1d745c92b8aef9449d340c80358bd455a_Calico"
2023-03-26 15:10:38.354 [INFO][6312] felix/endpoint_mgr.go 226: Cache refresh is complete. 2 endpoints were cached
2023-03-26 15:10:38.354 [INFO][6312] felix/endpoint_mgr.go 162: Refreshing the endpoint cache
2023-03-26 15:10:38.385 [WARNING][6312] felix/endpoint_mgr.go 207: This is a stale endpoint with no container attached id="762efc38-e675-4b73-9b95-c116d88d0d02" name="d4ab04980be1ad060a3817f09c34d6b1d745c92b8aef9449d340c80358bd455a_Calico"
2023-03-26 15:10:38.385 [INFO][6312] felix/endpoint_mgr.go 226: Cache refresh is complete. 2 endpoints were cached
2023-03-26 15:10:38.385 [INFO][6312] felix/endpoint_mgr.go 516: Could not resolve hns endpoint id ip="172.21.38.34/32"
2023-03-26 15:10:38.385 [WARNING][6312] felix/endpoint_mgr.go 351: Failed to look up HNS endpoint for workload id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"ns-dev-cxp-common-3p-sampleengine/deploy-dev-sampleengine-1-0-0-dd79cfbd9-crk6j", EndpointId:"eth0"}
2023-03-26 15:10:38.385 [WARNING][6312] felix/endpoint_mgr.go 393: Failed to look up one or more HNS endpoints; will schedule a retry
2023-03-26 15:10:38.385 [WARNING][6312] felix/win_dataplane.go 348: CompleteDeferredWork returned an error - scheduling a retry error=Endpoint could not be found
2023-03-26 15:10:38.385 [INFO][6312] felix/win_dataplane.go 314: Finished applying updates to dataplane. msecToApply=69.4734
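
The log above repeats the same retry cycle roughly every five seconds. A small script can make that pattern easier to see in a longer log. This is a rough sketch: the regular expression is inferred from the lines above rather than taken from any official Felix log format, and the sample holds only four lines copied from the log.

```python
import re
from collections import Counter

# Pattern inferred from the lines above (an assumption, not an official spec):
# "<date> <time> [LEVEL][pid] <file> <line>: <message>"
LINE_RE = re.compile(r"^(\S+ \S+) \[(\w+)\]\[(\d+)\] (\S+) (\d+): (.*)$")

sample = """\
2023-03-26 15:10:28.195 [INFO][6312] felix/endpoint_mgr.go 516: Could not resolve hns endpoint id ip="172.21.38.34/32"
2023-03-26 15:10:28.201 [WARNING][6312] felix/endpoint_mgr.go 393: Failed to look up one or more HNS endpoints; will schedule a retry
2023-03-26 15:10:33.275 [INFO][6312] felix/endpoint_mgr.go 516: Could not resolve hns endpoint id ip="172.21.38.34/32"
2023-03-26 15:10:33.276 [WARNING][6312] felix/endpoint_mgr.go 393: Failed to look up one or more HNS endpoints; will schedule a retry
"""

def summarize(text):
    """Count WARNING entries per source location and collect unresolved IPs."""
    warnings = Counter()
    unresolved_ips = set()
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not follow the assumed layout
        _, level, _, source, lineno, message = m.groups()
        if level == "WARNING":
            warnings[f"{source}:{lineno}"] += 1
        ip = re.search(r'Could not resolve hns endpoint id ip="([^"]+)"', message)
        if ip:
            unresolved_ips.add(ip.group(1))
    return warnings, unresolved_ips

warnings, unresolved_ips = summarize(sample)
print(warnings.most_common(), sorted(unresolved_ips))
```

Run against the full log, the same function should show every retry pointing at the same pod IP, which is consistent with one workload whose HNS endpoint is never resolved.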

rahulgupta999 commented 1 year ago

I think the issue lies in the "RefreshHnsEndpointCache" method in hns_windows.go, where it looks for containers attached to an HNS endpoint. In the case of Hyper-V, containerd attaches a virtual machine to the HNS endpoint instead. Output of Get-HnsEndpoint:

[image]

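
The hypothesis above can be illustrated with a toy model. This is not Felix's actual code: the records are made up, and the field names "SharedContainers" and "VirtualMachine" are assumptions chosen for illustration. It only shows how a check that counts container attachments alone would classify a Hyper-V endpoint as stale.

```python
# Hypothetical endpoint records. The field names "SharedContainers" and
# "VirtualMachine" are assumptions used for illustration only.
endpoints = [
    {"Name": "process-pod_Calico", "SharedContainers": ["c1"], "VirtualMachine": None},
    {"Name": "hyperv-pod_Calico", "SharedContainers": [], "VirtualMachine": "vm-1"},
    {"Name": "orphan_Calico", "SharedContainers": [], "VirtualMachine": None},
]

def is_attached_containers_only(ep):
    """Check as described in the comment above: only containers count."""
    return len(ep["SharedContainers"]) > 0

def is_attached_with_vm(ep):
    """Hypothetical variant that also accepts a virtual machine attachment."""
    return len(ep["SharedContainers"]) > 0 or bool(ep["VirtualMachine"])

stale_before = [ep["Name"] for ep in endpoints if not is_attached_containers_only(ep)]
stale_after = [ep["Name"] for ep in endpoints if not is_attached_with_vm(ep)]
print(stale_before)
print(stale_after)
```

Under these assumptions, the Hyper-V endpoint drops out of the stale list once a VM attachment is accepted, while the truly orphaned endpoint stays in it. Whether the real cache refresh can be changed this way depends on what hcsshim exposes.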
danyang-intel commented 1 year ago

Good catch! Thank you for the help, @rahulgupta999. In my case, I don't see a VirtualMachine item in the Get-HnsEndpoint output. I'll look into hns_windows.go. Thank you again.

Screenshot (74)

coutinhop commented 1 year ago

@danyang-intel and @rahulgupta999, thanks for spotting this and digging into it... I'm not sure, as v3.21/k8s v1.20 predate my work on Calico for Windows, but @song-jiang might know. It looks like hns_windows.go is only a shim over https://github.com/microsoft/hcsshim/, but I'm not very familiar with it. Do you know whether hcsshim supports this, and whether only minor changes to hns_windows.go would be needed?

davhdavh commented 1 year ago

The same problem occurs with Calico CNI 3.25.1 and k8s 1.27.2. Are there any workarounds?