projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

VXLAN mode not working properly #3404

Closed rgarcia89 closed 4 years ago

rgarcia89 commented 4 years ago

I am running multiple Kubernetes clusters which are spread between in-house infrastructure and cloud providers (AWS and GCP). On all of these clusters Calico is running in IPIP mode. As we want to increase our provider variety, we are planning to add Azure. So far the cluster setup as well as the installation of Calico went fine. However, we are seeing issues in the routing.

To test this, we created a pod which runs on node A. CoreDNS is running on nodes A and B. If the test pod on node A performs nslookups against the CoreDNS pod on node A, everything works fine. If the test pod on node A performs nslookups against the CoreDNS pod on node B, we see timeouts from time to time, which means DNS requests occasionally go unanswered. I have run the same test in all our other Kubernetes clusters that we host in-house and on AWS; none of them showed this issue.
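For reference, the test can be reproduced roughly as follows (the image name, the pod name dnsutils, and the k8s-app=kube-dns label are assumptions based on a default CoreDNS deployment; any pod with nslookup works):

# Start a throwaway pod with DNS tools (image is an assumption; any image
# that ships nslookup works).
kubectl run dnsutils --image=tutum/dnsutils --restart=Never -- sleep infinity

# Find the IPs and host nodes of the CoreDNS pods.
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# Query a specific CoreDNS pod directly: first one on the same node as the
# test pod, then one on another node.
kubectl exec -it dnsutils -- nslookup github.com <coredns-pod-ip>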

Expected Behavior

I would expect no timeouts, independent of where the CoreDNS pod is hosted.

Current Behavior

The dnsutils pod is hosted on 21-275-821-1-232fa6c7.

Overview of the CoreDNS pods, their IPs, and the nodes on which they are hosted:

pod/coredns-68655454c5-6xskm                          1/1     Running   5          6d19h   10.101.14.46   5-21-275-821-1-232fa6c7
pod/coredns-68655454c5-xf82h                          1/1     Running   5          6d20h   10.101.7.187   5-21-275-823-1-232fa7e5

To the CoreDNS pod hosted on the same node:

Tue Mar 31 12:56:55 UTC 2020
/ # while sleep .1; do nslookup github.com 10.101.14.46 | grep timed; done
^C
/ # date
Tue Mar 31 12:58:24 UTC 2020

To the CoreDNS pod hosted on another node:

Tue Mar 31 12:59:43 UTC 2020
/ # while sleep .1; do nslookup github.com 10.101.7.187 | grep timed; done
;; connection timed out; no servers could be reached
;; connection timed out; no servers could be reached
;; connection timed out; no servers could be reached
;; connection timed out; no servers could be reached
;; connection timed out; no servers could be reached
;; connection timed out; no servers could be reached
^C
/ # date
Tue Mar 31 13:01:46 UTC 2020

Further Details

Calico deployment file: https://transfer.sh/vfFSP/calico-typha3131Cluster.yaml

Possible Solution

Steps to Reproduce (for bugs)

Shown above.

Context

Your Environment

fasaxc commented 4 years ago

Asked @rgarcia89 to try the test again with calico-node removed, in order to see whether it's an underlying problem or something that Calico is actively causing. It looks like it still occurs even with calico-node removed.
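A minimal sketch of what "with calico-node removed" means in practice, assuming the standard manifest install in kube-system (save the DaemonSet first so it can be restored):

# Save the DaemonSet so it can be restored afterwards.
kubectl -n kube-system get daemonset calico-node -o yaml > calico-node-ds.yaml

# Remove calico-node. Existing pods keep their networking, since the routes
# and interfaces already programmed on the node stay in place; the dataplane
# can then be observed without Felix actively managing it.
kubectl -n kube-system delete daemonset calico-node

# Restore it afterwards.
kubectl apply -f calico-node-ds.yaml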

spikecurtis commented 4 years ago

@fasaxc when you say "with calico-node removed" do you mean with a different CNI plugin, or with Calico CNI but no calico-node?

rgarcia89 commented 4 years ago

@spikecurtis are you also experiencing the same issue?

He meant with the calico-node daemonset removed.

spikecurtis commented 4 years ago

No, I'm just trying to understand the issue and what debugging has already been done.

Is this an issue unique to DNS, or are there other services where you're seeing issues like this?

rgarcia89 commented 4 years ago

@spikecurtis In that case your best bet is to look at the Calico / Kubernetes Slack channel. We started the troubleshooting here: https://calicousers.slack.com/archives/C0BCA117T/p1585660187033100?thread_ts=1585154465.009500&cid=C0BCA117T and the current end of the thread is here: https://calicousers.slack.com/archives/C0BCA117T/p1585669884064400

rgarcia89 commented 4 years ago

It is an issue for any packet that uses the VXLAN overlay.
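For anyone debugging a similar setup, the VXLAN device and routes that Calico programs on each node can be inspected with something like the following (interface and NIC names assume the defaults, vxlan.calico and eth0):

# VXLAN device created by calico-node when VXLAN mode is enabled.
ip -d link show vxlan.calico

# Per-remote-node routes that send pod traffic over the VXLAN device.
ip route | grep vxlan.calico

# Watch the encapsulated VXLAN (UDP/4789) packets leaving the node while
# the test loop is running.
tcpdump -ni eth0 udp port 4789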

fasaxc commented 4 years ago

Microsoft support said to try setting the MTU down to 1400, since that's the MTU of the underlying fabric. @rgarcia89 tried setting the MTU on eth0 to 1400 and our CNI MTU down to 1350, but is still seeing issues.

I asked if he tried restarting pods after the CNI change.

@rgarcia89 you also need to set the felix config parameter VXLANMTU to match the CNI one.
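Roughly, the two Calico-side MTU settings (the CNI veth MTU and Felix's VXLAN MTU) can be applied like this; the commands are illustrative and assume the standard manifest with the calico-config ConfigMap. Alternatively, FELIX_VXLANMTU can be wired to the veth_mtu key via configMapKeyRef, as the reporter does below.

# CNI MTU for newly created pod interfaces (the manifest reads this via the
# veth_mtu key).
kubectl -n kube-system patch configmap calico-config --type merge \
  -p '{"data":{"veth_mtu":"1350"}}'

# Make Felix use the same MTU for the VXLAN device.
kubectl -n kube-system set env daemonset/calico-node FELIX_VXLANMTU=1350

# Restart calico-node so both settings take effect; existing pods must be
# recreated to pick up the new veth MTU.
kubectl -n kube-system rollout restart daemonset/calico-node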

If we can get your set-up working, it'll help us improve our MTU detection etc. in future so please bear with us :-)

rgarcia89 commented 4 years ago

@fasaxc I deleted and redeployed all namespaces after deploying the new MTU and also restarted all nodes.

            - name: FELIX_VXLANMTU
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: veth_mtu

This has also been added to the deployment. The yaml used can be found here: https://transfer.sh/105WYo/calico-typha3131Cluster.yaml
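As a sanity check (assuming the default interface names and the dnsutils test pod from above), the effective MTUs can be verified on a node and inside a pod:

# On the node: host NIC, VXLAN device, and pod-facing veths should line up
# (host MTU >= VXLAN MTU + 50 bytes of VXLAN overhead; pod veth == VXLAN MTU).
ip link show eth0
ip link show vxlan.calico
ip link show | grep cali   # pod-facing veth interfaces

# Inside the test pod: its eth0 should report the CNI MTU (1350 here).
kubectl exec -it dnsutils -- cat /sys/class/net/eth0/mtu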

I have no clue what else can be done... The issue does not always show up. Without any changes or deployments on the cluster, it starts all of a sudden. Yesterday I had the loop running for almost 4 hours without a timeout showing up; then it started again.

We have decided today to go with AWS for this cluster, using IPIP mode. However, I am happy to continue troubleshooting this if someone has an idea. We can also set up a call / screen share if that helps.

spikecurtis commented 4 years ago

@fasaxc do we have some mechanism in mind for how MTU misconfiguration can lead to an intermittent failure like this?

It looks like the test is running a DNS lookup in a loop against the same domain, so it seems like we should be getting consistent packet sizes.

fasaxc commented 4 years ago

@spikecurtis not unless it's a domain that varies its response (say load balancing with a set of IPs that scales up and down)
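One way to rule MTU in or out is to send don't-fragment pings of increasing size from the test pod to the pod IP on the other node and see where they start to fail. This is a sketch that assumes iputils ping is available in the test pod and a pod MTU of 1350 as configured above:

# -M do sets the don't-fragment bit; -s is the ICMP payload size.
# 1322 bytes of payload + 28 bytes of IP/ICMP headers = a 1350-byte packet,
# i.e. exactly the pod MTU. Larger packets should fail cleanly, not hang.
kubectl exec -it dnsutils -- ping -M do -s 1322 -c 5 10.101.7.187
kubectl exec -it dnsutils -- ping -M do -s 1400 -c 5 10.101.7.187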

rgarcia89 commented 4 years ago

This is what I just received from the Azure support team...

Hi Raul, sorry for the delay in response. I was actually trying to get support from the concerned team, which took time.

We got an update that at the moment the Azure containers team only supports AKS (Azure Kubernetes Service); self-created/managed k8s clusters, on-prem or in Azure, are not supported by them. I wanted to assist you in the best way possible, however I am afraid that further debugging is not possible as this is out of our scope.

You need to open a case with Calico for assistance: https://github.com/projectcalico/calico/issues

fasaxc commented 4 years ago

@rgarcia89 Can you give us more details of how you set up your k8s cluster? What installer did you use? How is the Azure network configured (type of network, subnet config etc)?

Does this repro if you spin up a two-node cluster with, say, kubeadm? (That's the easiest for our team to use to repro.)
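A minimal two-node repro along those lines might look like this; the manifest URL, pod CIDR, and env var switches are illustrative for the v3.13-era manifests:

# On the control-plane node (pod CIDR must match the manifest's default).
kubeadm init --pod-network-cidr=192.168.0.0/16

# Install Calico, then switch the default IP pool from IPIP to VXLAN by
# setting CALICO_IPV4POOL_IPIP=Never and CALICO_IPV4POOL_VXLAN=Always in the
# calico-node DaemonSet (or edit the manifest before applying it).
kubectl apply -f https://docs.projectcalico.org/v3.13/manifests/calico.yaml

# On the worker node, join with the token printed by kubeadm init.
kubeadm join <control-plane-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>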

rgarcia89 commented 4 years ago

@fasaxc I will provide you the requested information during the coming days.

caseydavenport commented 4 years ago

@rgarcia89 any update on this issue?

rgarcia89 commented 4 years ago

Hi guys, over the weekend I let a test cluster run using Calico version 3.13.3, collecting the logs of 50 parallel pods that were started every 5 minutes and ran a dig against github.com. Not sure right now whether this is due to the Calico update. However, I am currently not able to invest more time into this. So far everything looks to be working properly on this newly provisioned test cluster.
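For anyone wanting to run a similar soak test, a rough sketch of such a setup (the resource name, image, and API version are assumptions; batch/v1beta1 was current for CronJob at the time, newer clusters use batch/v1):

cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1beta1        # batch/v1 on newer clusters
kind: CronJob
metadata:
  name: dns-soak-test
spec:
  schedule: "*/5 * * * *"        # a new batch every 5 minutes
  jobTemplate:
    spec:
      parallelism: 50            # 50 pods in parallel per run
      completions: 50
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: dig
            image: tutum/dnsutils          # any image with dig works
            command: ["dig", "github.com"]
EOF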

I would have been happy to provide you with further details about the issue. However, I think we are also fine with the outcome of my latest test using version 3.13.3. Have there been any updates that could explain this?

fasaxc commented 4 years ago

@rgarcia89 We fixed a VXLAN route calculation bug: https://github.com/projectcalico/felix/pull/2260, but that would cause persistent failures rather than just one or two dropped packets, I think.

caseydavenport commented 4 years ago

Going to close this for now as not reproducible but please shout if you see issues again.

manojmaharanacore42 commented 3 months ago

Hello, did you find any solution for this? I am facing the same issue with Calico networking, where I am extending the cluster to Azure.

fasaxc commented 3 months ago

@manojmaharanacore42 Please open a fresh issue; this one is very old.