projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0
5.92k stars 1.32k forks source link

Cannot connect to Pod IP when windows hns service restarted #5164

Open kais271 opened 2 years ago

kais271 commented 2 years ago

Expected Behavior

Connections can be automatically restored.

Current Behavior

When hns service restarted:

  1. The pod still running and hold that IP, but hnsendpoints has been removed and is not automatically re-create.
  2. The calico-felix logging 'Failed to look up one or more HNS endpoints; will schedule a retry', and that pod cannot to connect.

Possible Solution

N/A

Steps to Reproduce (for bugs)

1.Have k8s windows node and a running pod 2.hnsdiag.exe list endpoints 3.on windows node restart hns service 4.hnsdiag.exe list endpoints

Context

Your Environment

song-jiang commented 2 years ago

@kais271 Thanks for reporting this issue.

Normally it is the CNI plugin who creates/deletes the endpoint and calico/felix service implements network policies onto it. What CNI you are using? Calico?

calico/felix can't recreate missing endpoints as those are managed by other CNI plugins. You would need to find another way for CNI plugins to get endpoint created again. Restart kubelet may do the job but I'm not 100 sure.

kais271 commented 2 years ago

@song-jiang Hi sir, I using calico 3.20.2. Through further understanding, the hnsendpoint created by docker. When hns service restarted, felix log will record 'Not found endpoint'.

song-jiang commented 2 years ago

If you are using Calico with Calico CNI plugin, the hnsendpoint is created by Calico CNI plugin. The sequence of actions are

user creats a pod --> kubelet receives pod info and invokes Calico CNI plugin --> Calico CNI plugin creates hnsendpoint._ 

You may check kubelet logs to see the details.

kais271 commented 2 years ago

Thank you. We refer to this doc to deploy the calico. https://projectcalico.docs.tigera.io/getting-started/windows-calico/kubernetes/standard As above sequence, the pod still running when hns restarted, so kubelet did not find that the pod’s network is unavailable, so it did not notify Calico CNI plugin to recreate hnsendpoint. We can see the following in felix log when hns restarted. So Calico know the endpoint has been removed, but does not know the pod status, as long as kubelet does not notify Calico to recreate, it will not have any action. 2021-12-15 21:46:37.184 [INFO][6824] felix/endpoint_mgr.go 220: HNS endpoint removed from cache id="10.5.87.85/32" Am I right?

kais271 commented 2 years ago

Add some kubelet logs

create pod

I1215 22:10:33.992410    1760 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-cw2sc\" (UniqueName: \"kubernetes.io/projected/4d65ec63-b272-41b4-a16a-063288aeb00d-kube-api-access-cw2sc\") pod \"iis2-76f8647c6d-6t4dj\" (UID: \"4d65ec63-b272-41b4-a16a-063288aeb00d\") "
**2021-12-15 22:10:38.425 [INFO][7148] plugin.go 261: Calico CNI found existing endpoint**: &{{WorkloadEndpoint projectcalico.org/v3} {win--node--5-k8s-iis2--76f8647c6d--6t4dj-eth0 iis2-76f8647c6d- demo-n1  4d65ec63-b272-41b4-a16a-063288aeb00d 2545077 0 2021-12-15 22:10:33 -0800 PST <nil> <nil> map[pod-template-hash:76f8647c6d project.cpaas.io/name:demo
projectcalico.org/namespace:demo-n1 projectcalico.org/orchestrator:k8s projectcalico.org/serviceaccount:default service.cpaas.io/name:deployment-iis2] map[] [] []  []} {k8s  win-node-5  iis2-76f8647c6d-6t4dj eth0 default [] []   [kns.demo-n1 ksa.demo-n1.default] cali8685d2b43ad  []}}
ContainerID="ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7" Namespace="demo-n1" Pod="iis2-76f8647c6d-6t4dj" WorkloadEndpoint="win--node--5-k8s-iis2--76f8647c6d--6t4dj-"
2021-12-15 22:10:38.425 [INFO][7148] k8s.go 72: Extracted identifiers for CmdAddK8s ContainerID="ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7" Namespace="demo-n1" Pod="iis2-76f8647c6d-6t4dj" WorkloadEndpoint="win--node--5-k8s-iis2--76f8647c6d--6t4dj-eth0"
2021-12-15 22:10:38.633 [INFO][8012] ipam_plugin.go 265: Auto assigning IP ContainerID="ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7" HandleID="Calico.ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7" Workload="win--node--5-k8s-iis2--76f8647c6d--6t4dj-eth0" assignArgs=ipam.AutoAssignArgs{Num4:1, Num6:0,
HandleID:(*string)(0xc000265fa0), Attrs:map[string]string{"namespace":"demo-n1", "node":"win-node-5", "pod":"iis2-76f8647c6d-6t4dj", "timestamp":"2021-12-16 06:10:38.6085326 +0000 UTC"}, Hostname:"win-node-5", IPv4Pools:[]net.IPNet{}, IPv6Pools:[]net.IPNet{}, MaxBlocksPerHost:1, HostReservedAttrIPv4s:(*ipam.HostReservedAttr)(0xc0007cf6b0),
HostReservedAttrIPv6s:(*ipam.HostReservedAttr)(nil)}
2021-12-15 22:10:38.837 [INFO][7148] k8s.go 375: Populated endpoint ContainerID="ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7" Namespace="demo-n1" Pod="iis2-76f8647c6d-6t4dj" WorkloadEndpoint="win--node--5-k8s-iis2--76f8647c6d--6t4dj-eth0" endpoint=&v3.WorkloadEndpoint{TypeMeta:v1.TypeMeta{Kind:"WorkloadEndpoint",
APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"win--node--5-k8s-iis2--76f8647c6d--6t4dj-eth0", GenerateName:"iis2-76f8647c6d-", Namespace:"demo-n1", SelfLink:"", UID:"4d65ec63-b272-41b4-a16a-063288aeb00d", ResourceVersion:"2545077", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63775231833,
loc:(*time.Location)(0x30953c0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"pod-template-hash":"76f8647c6d", "project.cpaas.io/name":"demo", "projectcalico.org/namespace":"demo-n1", "projectcalico.org/orchestrator":"k8s", "projectcalico.org/serviceaccount":"default",
"service.cpaas.io/name":"deployment-iis2"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v3.WorkloadEndpointSpec{Orchestrator:"k8s", Workload:"", Node:"win-node-5", ContainerID:"", Pod:"iis2-76f8647c6d-6t4dj", Endpoint:"eth0",
ServiceAccountName:"default", IPNetworks:[]string{"10.5.87.102/32"}, IPNATs:[]v3.IPNAT(nil), IPv4Gateway:"", IPv6Gateway:"", Profiles:[]string{"kns.demo-n1", "ksa.demo-n1.default"}, InterfaceName:"cali8685d2b43ad", MAC:"", Ports:[]v3.EndpointPort(nil)}}
2021-12-15 22:10:38.837 [INFO][7148] k8s.go 376: Calico CNI using IPs: [10.5.87.102/32] ContainerID="ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7" Namespace="demo-n1" Pod="iis2-76f8647c6d-6t4dj" WorkloadEndpoint="win--node--5-k8s-iis2--76f8647c6d--6t4dj-eth0"
2021-12-15 22:10:38.897 [INFO][7148] dataplane_windows.go 186: Overriding network name, only a single IPAM block will be supported on this host ContainerID="ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7" Namespace="demo-n1" Pod="iis2-76f8647c6d-6t4dj" WorkloadEndpoint="win--node--5-k8s-iis2--76f8647c6d--6t4dj-eth0" name="Calico"
2021-12-15 22:10:38.924 [INFO][7148] dataplane_windows.go 325: Found existing HNS network [&{Id:968D4105-FBBB-467A-8A37-5D51BA98764E Name:Calico Type:Overlay NetworkAdapterName: SourceMac: Policies:[[123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 48 46 53 46 50 49 52 46 49 50 56 47 50 54 34 44 34 68 105 115 116 114 105 98
117 116 101 100 82 111 117 116 101 114 77 97 99 65 100 100 114 101 115 115 34 58 34 48 48 45 49 53 45 53 100 45 49 101 45 99 48 45 53 49 34 44 34 73 115 111 108 97 116 105 111 110 73 100 34 58 52 48 57 54 44 34 80 114 111 118 105 100 101 114 65 100 100 114 101 115 115 34 58 34 49 57 50 46 49 54 56 46 56 53 46 49 53 57 34 44 34 84 121 112 101 34 58 34 82 101
109 111 116 101 83 117 98 110 101 116 82 111 117 116 101 34 125] [123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 48 46 53 46 50 52 54 46 48 47 50 54 34 44 34 68 105 115 116 114 105 98 117 116 101 100 82 111 117 116 101 114 77 97 99 65 100 100 114 101 115 115 34 58 34 54 54 45 54 53 45 99 51 45 102 101 45 56 101 45 55 48 34
44 34 73 115 111 108 97 116 105 111 110 73 100 34 58 52 48 57 54 44 34 80 114 111 118 105 100 101 114 65 100 100 114 101 115 115 34 58 34 49 57 50 46 49 54 56 46 56 53 46 49 53 55 34 44 34 84 121 112 101 34 58 34 82 101 109 111 116 101 83 117 98 110 101 116 82 111 117 116 101 34 125] [123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34
58 34 49 48 46 53 46 49 55 57 46 49 50 56 47 50 54 34 44 34 68 105 115 116 114 105 98 117 116 101 100 82 111 117 116 101 114 77 97 99 65 100 100 114 101 115 115 34 58 34 48 48 45 49 53 45 53 100 45 49 55 45 102 97 45 51 102 34 44 34 73 115 111 108 97 116 105 111 110 73 100 34 58 52 48 57 54 44 34 80 114 111 118 105 100 101 114 65 100 100 114 101 115 115 34 58
34 49 57 50 46 49 54 56 46 56 53 46 49 54 49 34 44 34 84 121 112 101 34 58 34 82 101 109 111 116 101 83 117 98 110 101 116 82 111 117 116 101 34 125] [123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 48 46 53 46 50 50 46 48 47 50 54 34 44 34 68 105 115 116 114 105 98 117 116 101 100 82 111 117 116 101 114 77 97 99 65 100 100
114 101 115 115 34 58 34 48 48 45 49 53 45 53 100 45 98 49 45 56 101 45 52 48 34 44 34 73 115 111 108 97 116 105 111 110 73 100 34 58 52 48 57 54 44 34 80 114 111 118 105 100 101 114 65 100 100 114 101 115 115 34 58 34 49 57 50 46 49 54 56 46 56 53 46 49 54 48 34 44 34 84 121 112 101 34 58 34 82 101 109 111 116 101 83 117 98 110 101 116 82 111 117 116 101 34
125] [123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 48 46 53 46 50 51 46 54 52 47 50 54 34 44 34 68 105 115 116 114 105 98 117 116 101 100 82 111 117 116 101 114 77 97 99 65 100 100 114 101 115 115 34 58 34 48 48 45 49 53 45 53 100 45 50 56 45 49 48 45 98 49 34 44 34 73 115 111 108 97 116 105 111 110 73 100 34 58 52 48 57
54 44 34 80 114 111 118 105 100 101 114 65 100 100 114 101 115 115 34 58 34 49 57 50 46 49 54 56 46 56 53 46 49 53 56 34 44 34 84 121 112 101 34 58 34 82 101 109 111 116 101 83 117 98 110 101 116 82 111 117 116 101 34 125] [123 34 84 121 112 101 34 58 34 72 111 115 116 82 111 117 116 101 34 125]] MacPools:[{StartMacAddress:00-15-5D-B9-D0-00
EndMacAddress:00-15-5D-B9-DF-FF}] Subnets:[{AddressPrefix:10.5.87.64/26 GatewayAddress:10.5.87.65 Policies:[[123 34 84 121 112 101 34 58 34 86 83 73 68 34 44 34 86 83 73 68 34 58 52 48 57 54 125]]}] DNSSuffix: DNSServerList: DNSServerCompartment:5 ManagementIP:192.168.85.162 AutomaticDNS:false}]
ContainerID="ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7" Namespace="demo-n1" Pod="iis2-76f8647c6d-6t4dj" WorkloadEndpoint="win--node--5-k8s-iis2--76f8647c6d--6t4dj-eth0"
2021-12-15 22:10:38.939 [INFO][7148] dataplane_windows.go 534: Deleted stale HNSEndpoint Calico_ep ContainerID="ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7" Namespace="demo-n1" Pod="iis2-76f8647c6d-6t4dj" WorkloadEndpoint="win--node--5-k8s-iis2--76f8647c6d--6t4dj-eth0"
2021-12-15 22:10:39.004 [INFO][7148] dataplane_windows.go 552: Created HNSEndpoint Calico_ep ContainerID="ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7" Namespace="demo-n1" Pod="iis2-76f8647c6d-6t4dj" WorkloadEndpoint="win--node--5-k8s-iis2--76f8647c6d--6t4dj-eth0"
2021-12-15 22:10:39.006 [INFO][7148] dataplane_windows.go 769: Attempting to create HNS endpoint name: ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7_Calico for container ContainerID="ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7" Namespace="demo-n1" Pod="iis2-76f8647c6d-6t4dj"
WorkloadEndpoint="win--node--5-k8s-iis2--76f8647c6d--6t4dj-eth0"
2021-12-15 22:10:39.242 [INFO][7148] dataplane_windows.go 798: Endpoint to container created! &{17e097f2-8b60-443f-ac2e-205210954fe7 ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7_Calico 968d4105-fbbb-467a-8a37-5d51ba98764e Calico [[123 34 69 120 99 101 112 116 105 111 110 76 105 115 116 34 58 91 34 49 48 46 54 46 48 46 48 47 49 54 34 44 34
49 48 46 53 46 48 46 48 47 49 54 34 44 34 49 55 50 46 49 54 46 48 46 48 47 51 48 34 44 34 49 57 50 46 49 54 57 46 48 46 48 47 50 52 34 93 44 34 84 121 112 101 34 58 34 79 117 116 66 111 117 110 100 78 65 84 34 125] [123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 48 46 54 46 48 46 48 47 49 54 34 44 34 78 101 101 100 69 110
99 97 112 34 58 116 114 117 101 44 34 84 121 112 101 34 58 34 82 79 85 84 69 34 125] [123 34 80 65 34 58 34 49 57 50 46 49 54 56 46 56 53 46 49 54 50 34 44 34 84 121 112 101 34 58 34 80 65 34 125]] 0E-2A-0a-05-57-66 10.5.87.102 demo-n1.svc.cluster.local,svc.cluster.local,cluster.local 10.6.0.10 10.5.87.65 false false 26 false false <nil> 50}
ContainerID="ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7" Namespace="demo-n1" Pod="iis2-76f8647c6d-6t4dj" WorkloadEndpoint="win--node--5-k8s-iis2--76f8647c6d--6t4dj-eth0"
2021-12-15 22:10:39.287 [INFO][7148] k8s.go 403: Added Mac, interface name, and active container ID to endpoint ContainerID="ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7" Namespace="demo-n1" Pod="iis2-76f8647c6d-6t4dj" WorkloadEndpoint="win--node--5-k8s-iis2--76f8647c6d--6t4dj-eth0"
endpoint=&v3.WorkloadEndpoint{TypeMeta:v1.TypeMeta{Kind:"WorkloadEndpoint", APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"win--node--5-k8s-iis2--76f8647c6d--6t4dj-eth0", GenerateName:"iis2-76f8647c6d-", Namespace:"demo-n1", SelfLink:"", UID:"4d65ec63-b272-41b4-a16a-063288aeb00d", ResourceVersion:"2545077", Generation:0,
CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63775231833, loc:(*time.Location)(0x30953c0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"pod-template-hash":"76f8647c6d", "project.cpaas.io/name":"demo", "projectcalico.org/namespace":"demo-n1", "projectcalico.org/orchestrator":"k8s",
"projectcalico.org/serviceaccount":"default", "service.cpaas.io/name":"deployment-iis2"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v3.WorkloadEndpointSpec{Orchestrator:"k8s", Workload:"", Node:"win-node-5",
ContainerID:"ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7", Pod:"iis2-76f8647c6d-6t4dj", Endpoint:"eth0", ServiceAccountName:"default", IPNetworks:[]string{"10.5.87.102/32"}, IPNATs:[]v3.IPNAT(nil), IPv4Gateway:"", IPv6Gateway:"", Profiles:[]string{"kns.demo-n1", "ksa.demo-n1.default"}, InterfaceName:"cali8685d2b43ad",
MAC:"0e:2a:0a:05:57:66", Ports:[]v3.EndpointPort(nil)}}

**hns restarted and restart pod**  

2021-12-15 22:16:55.193 [INFO][10244] network_windows.go 243: Endpoint already networked, doing a lookup-only request. ContainerID="ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7" Namespace="demo-n1" Pod="iis2-76f8647c6d-6t4dj" WorkloadEndpoint="win--node--5-k8s-iis2--76f8647c6d--6t4dj-eth0"
2021-12-15 22:16:55.193 [INFO][10244] network_windows.go 250: Status lookup result ContainerID="ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7" Namespace="demo-n1" Pod="iis2-76f8647c6d-6t4dj" WorkloadEndpoint="win--node--5-k8s-iis2--76f8647c6d--6t4dj-eth0" result=&current.Result{CNIVersion:"", Interfaces:[]*current.Interface(nil),
IPs:[]*current.IPConfig{(*current.IPConfig)(0xc0002c4480)}, Routes:[]*types.Route(nil), DNS:types.DNS{Nameservers:[]string(nil), Domain:"", Search:[]string(nil), Options:[]string(nil)}}
E1215 22:16:56.958611    1760 cni.go:361] "Error adding pod to network" err="Endpoint ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7_Calico not found" pod="demo-n1/iis2-76f8647c6d-6t4dj" podSandboxID={Type:docker ID:9a3d05cc94cd65f850fae1c110aa5d48db30db0640b2b6454d3688db71879ae6}
podNetnsPath="container:ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7" networkType="calico" networkName="Calico"
I1215 22:16:56.958611    1760 **docker_sandbox.go:401] "Failed to read pod IP from plugin/docker"** err="networkPlugin cni failed on the status hook for pod \"iis2-76f8647c6d-6t4dj_demo-n1\": Endpoint ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7_Calico not found"
E1215 22:16:57.054466    1760 cni.go:361] "Error adding pod to network" err="Endpoint ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7_Calico not found" pod="demo-n1/iis2-76f8647c6d-6t4dj" podSandboxID={Type:docker ID:b51a9a4dd3f78a28ecca8e296e7e3642612069d754767b1e5ada8190d9f09447}
podNetnsPath="container:ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7" networkType="calico" networkName="Calico"
I1215 22:16:57.054466    1760 docker_sandbox.go:401] "Failed to read pod IP from plugin/docker" err="networkPlugin cni failed on the status hook for pod \"iis2-76f8647c6d-6t4dj_demo-n1\": Endpoint ab013a1c79bac957c378d892d76d7e64a2705f3638e3806e40f74f334e42b4e7_Calico not found"
song-jiang commented 2 years ago

@kais271 Sorry I missed your reply. It seems Calico CNI does not support HNS restart, that is not something we are testing. Could you share the reason why HNS service need to be restarted?

kais271 commented 2 years ago

Hi, the HNS service will be restarted unexpectedly. So I want to know, assuming the HNS service is unstable, does calico have a related solution? image

caseydavenport commented 2 years ago

@song-jiang this sounds like a bug to me - we need to be resilient to these sorts of error conditions (unless I'm misunderstanding something subtle here)