Closed xinfengliu closed 3 weeks ago
Probably one for @coutinhop to take a look at
the error returned by hcsshim.GetHNSEndpointByName(epName) does not seem to distinguish between “there’s no such endpoint” and “there’s such an endpoint, but something went wrong while accessing HNS”. So it's possible that calico-node creates another Calico_ep HNS endpoint when it gets an error from hcsshim.
@xinfengliu the way I see that happening is an err being returned from HNSListEndpointRequest(): https://github.com/microsoft/hcsshim/blob/59e8375cfad4883ea18bc75b765bc4cb64cb7b6b/internal/hns/hnsendpoint.go#L111
I'm not too familiar with the internals of hcsshim, but it looks like a transient failure could result in an error there even though the endpoint exists... Do you have more details on the circumstances of this issue? More details on your customer's setup? Do you know how often this happens? We would ideally need to repro to see if the API call failure is caused by some specific factor...
@coutinhop
Thanks for looking into this issue.
Do you have more details on the circumstances of this issue? More details on your customer's setup? Do you know how often this happens? We would ideally need to repro to see if the API call failure is caused by some specific factor...
Sorry, I don't have more details. We have several customers running Calico for Windows on Kubernetes, but we have seen this issue with only one of them, and that customer could not reproduce it on demand in their cluster either. According to the customer, the issue is more likely to happen when the system resource usage (CPU/memory) is very high.
the issue is more likely to happen when the system resource usage (CPU/memory) is very high.
@xinfengliu I'm speculating here, but it could be that calico is being force-killed by kubelet and doesn't clean up the HNS endpoint, then when it comes back up it duplicates it. Hard to tell without being able to reproduce it, though...
@xinfengliu I'm closing this as "can't reproduce", feel free to reopen in case you are able to gather more info.
Expected Behavior
There should be only one Calico_ep HNS endpoint on each Windows node.
Current Behavior
Two Calico_ep HNS endpoints are found in hnsdiag.txt.gz
Possible Solution
I suspect it is related to createAndAttachVxlanHostEP (https://github.com/projectcalico/calico/blob/97f526c00bf208aaf5dd05135a03a998bf05dd44/cni-plugin/pkg/dataplane/windows/dataplane_windows.go#L509): the error returned by hcsshim.GetHNSEndpointByName(epName) does not seem to distinguish between “there’s no such endpoint” and “there’s such an endpoint, but something went wrong while accessing HNS”. So it's possible that calico-node creates another Calico_ep HNS endpoint when it gets an error from hcsshim.
Steps to Reproduce (for bugs)
I'm not able to reproduce the issue. The issue is found in a customer's environment.
Context
The issue affects determining the source VIP used by kube-proxy.
Your Environment