projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Windows: duplicated Calico_ep HNS endpoints found #9142

Closed xinfengliu closed 3 weeks ago

xinfengliu commented 2 months ago

Expected Behavior

There should be only one Calico_ep HNS endpoint on each Windows node.

Current Behavior

Two Calico_ep HNS endpoints are found in hnsdiag.txt.gz

Possible Solution

I suspect this is related to createAndAttachVxlanHostEP (https://github.com/projectcalico/calico/blob/97f526c00bf208aaf5dd05135a03a998bf05dd44/cni-plugin/pkg/dataplane/windows/dataplane_windows.go#L509). The error returned by hcsshim.GetHNSEndpointByName(epName) does not seem to distinguish between “there is no such endpoint” and “the endpoint exists, but something went wrong while accessing HNS”. So it's possible that calico-node creates another Calico_ep HNS endpoint when it gets an error from hcsshim.
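To illustrate the kind of check that would avoid this, here is a minimal sketch. It assumes the lookup can signal "not found" through an error that is distinguishable from other failures. Every name in it (endpointAPI, errEndpointNotFound, errTransient, ensureEndpoint, fakeHNS) is hypothetical and is not taken from Calico or hcsshim; the real code would need to use whatever signal hcsshim actually provides.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels: stand-ins for a "not found" signal and for any
// other (possibly transient) failure of the HNS client.
var (
	errEndpointNotFound = errors.New("endpoint not found")
	errTransient        = errors.New("transient HNS failure")
)

// endpointAPI is a hypothetical abstraction over the HNS calls involved.
type endpointAPI interface {
	GetByName(name string) (id string, err error)
	Create(name string) (id string, err error)
}

// ensureEndpoint creates the endpoint only when the lookup positively
// reports "not found". Any other lookup error is returned to the caller,
// so a failed lookup can never lead to a second endpoint being created.
func ensureEndpoint(api endpointAPI, name string) (string, error) {
	id, err := api.GetByName(name)
	switch {
	case err == nil:
		return id, nil // already exists, reuse it
	case errors.Is(err, errEndpointNotFound):
		return api.Create(name)
	default:
		return "", fmt.Errorf("looking up endpoint %q: %w", name, err)
	}
}

// fakeHNS is an in-memory stand-in used only to exercise the sketch.
type fakeHNS struct {
	endpoints map[string]string
	lookupErr error // when set, GetByName fails with this error
	created   int   // number of Create calls so far
}

func (f *fakeHNS) GetByName(name string) (string, error) {
	if f.lookupErr != nil {
		return "", f.lookupErr
	}
	if id, ok := f.endpoints[name]; ok {
		return id, nil
	}
	return "", errEndpointNotFound
}

func (f *fakeHNS) Create(name string) (string, error) {
	f.created++
	id := fmt.Sprintf("%s-%d", name, f.created)
	f.endpoints[name] = id
	return id, nil
}
```

With this shape, a lookup that fails with errTransient makes ensureEndpoint return an error instead of calling Create, so the caller can retry later without leaving a duplicate behind.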

Steps to Reproduce (for bugs)

I'm not able to reproduce the issue. It was found in a customer's environment.

Context

The issue affects how the source VIP used by kube-proxy is determined.

Your Environment

caseydavenport commented 2 months ago

Probably one for @coutinhop to take a look at

coutinhop commented 2 months ago

the error returned by hcsshim.GetHNSEndpointByName(epName) seems not able to distinguish “there’s no such endpoint” and “there’s such endpoint but something wrong with access HNS”. So it's possible that calico-node creates another Calico_ep HNS endpoint when getting errors from hcsshim.

@xinfengliu the way I see that happening is an err being returned from HNSListEndpointRequest(): https://github.com/microsoft/hcsshim/blob/59e8375cfad4883ea18bc75b765bc4cb64cb7b6b/internal/hns/hnsendpoint.go#L111

I'm not too familiar with the internals of hcsshim, but it looks like a transient failure could result in an error there even though the endpoint exists... Do you have more details on the circumstances of this issue? More details on your customer's setup? Do you know how often this happens? We would ideally need to repro to see if the API call failure is caused by some specific factor...
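If the failure really is transient, one possible mitigation is to retry the lookup a few times before acting on it. The following is only a sketch under that assumption: lookupWithRetry, errNotFound and errTransient are hypothetical names, and the lookup is passed in as a plain function rather than being a real hcsshim call.

```go
package main

import (
	"errors"
	"time"
)

// Hypothetical sentinels used only for this sketch.
var (
	errNotFound  = errors.New("endpoint not found")
	errTransient = errors.New("transient HNS failure")
)

// lookupWithRetry calls lookup up to attempts times (at least once).
// A definite answer, meaning success or "not found", is returned at once.
// Any other error is treated as possibly transient and retried after delay.
// If every attempt fails that way, the last error is returned.
func lookupWithRetry(lookup func() (string, error), attempts int, delay time.Duration) (string, error) {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		id, err := lookup()
		if err == nil || errors.Is(err, errNotFound) {
			return id, err
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(delay) // no sleep after the final attempt
		}
	}
	return "", lastErr
}
```

The caller would only go on to create the endpoint when the returned error is errNotFound, and would give up on any other error.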

xinfengliu commented 1 month ago

@coutinhop

Thanks for looking into this issue.

Do you have more details on the circumstances of this issue? More details on your customer's setup? Do you know how often this happens? We would ideally need to repro to see if the API call failure is caused by some specific factor...

Sorry, I don't have more details. We have several customers using Calico for Windows on Kubernetes, but we have only seen this issue with one of them, and that customer could not reproduce the issue on demand in their cluster either. According to the customer, the issue is more likely to happen when system resource usage (CPU/memory) is very high.

coutinhop commented 1 month ago

the issue is more likely to happen when the system resource usage (CPU/memory) is very high.

@xinfengliu I'm speculating here, but it could be that calico is being force-killed by kubelet and doesn't clean up the HNS endpoint, then when it comes back up it duplicates it. Hard to tell without being able to reproduce it, though...
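Whatever the cause turns out to be, a startup check that looks for endpoints sharing a name would at least make the condition visible. A rough sketch, assuming the list of endpoints has already been fetched; the endpoint type and duplicateNames function are hypothetical and only model the ID and name fields:

```go
package main

import "sort"

// endpoint is a hypothetical, trimmed-down view of an HNS endpoint.
type endpoint struct {
	ID   string
	Name string
}

// duplicateNames returns, for every name used by more than one endpoint,
// the sorted IDs of the endpoints sharing that name.
func duplicateNames(eps []endpoint) map[string][]string {
	byName := make(map[string][]string)
	for _, ep := range eps {
		byName[ep.Name] = append(byName[ep.Name], ep.ID)
	}
	dups := make(map[string][]string)
	for name, ids := range byName {
		if len(ids) > 1 {
			sort.Strings(ids)
			dups[name] = ids
		}
	}
	return dups
}
```

An empty result means every name is unique. A non-empty entry for Calico_ep would be the situation described in this issue, and could be logged or cleaned up before a new endpoint is created.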

coutinhop commented 3 weeks ago

@xinfengliu I'm closing this as "can't reproduce", feel free to reopen in case you are able to gather more info.