Closed - rtmie closed this issue 2 years ago
Are you able to run a packet capture on one of the nodes that runs coreDNS and nodelocaldns and see the response? Any logs in the corresponding coreDNS pod?
/assign @rtmie
@prameshj I enabled logging via the coredns Corefile.
No failures seen in the coredns pod log (I do see some NXDOMAIN denials, but those are for a known missing service in my cluster, not related to the FQDNs in my dnsperf test).
However, in the log of the node-local-dns pod I do see this error for each SERVFAIL in the dnsperf test:
"[ERROR] plugin/errors: 2 <external.fqdn.mydomain> A: dial tcp 192.168.128.10:53: i/o timeout\n"
(192.168.128.10 is the coredns service)
The i/o timeout most likely means coreDNS pod did not respond in 2 seconds. NodeLocalDNS(which uses coreDNS) has a 2s read timeout for forwarded requests. It sends the request via TCP to coreDNS.
Could this particular coreDNS pod be getting overloaded? How many requests per second is that pod getting? Perhaps the test is hitting the limits of the qps that can be handled.
In the case of external FQDNs, the i/o timeout log message will have the upstream server IP.
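For reference, one way to double-check the forwarding path node-local-dns uses is to dump its Corefile from the configmap; the configmap name and data key below follow the standard node-local-dns manifest and may differ in a customized install:

# Print the Corefile that node-local-dns is running with (standard manifest names assumed).
kubectl -n kube-system get configmap node-local-dns -o jsonpath='{.data.Corefile}'

# The forward block targeting the cluster DNS service should contain force_tcp,
# which is why the i/o timeout above shows "dial tcp 192.168.128.10:53".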
The consistent behaviour I see is that when I run the dnsperf load on a node that does not have a coredns pod running I get zero i/o timeout errors in node-local-dns. When I run it on a node where there is a coredns pod I see the timeouts.
I wonder if conntrack causes the node-local-dns to hotspot the coredns pod that is local to it on the node, causing it to overload? (We are using ipvs for kube-proxy.)
Right now I have dnsperf running with qps of 5000 for 300 seconds.
I will test with a lower qps.
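For reference, the dnsperf invocation for that kind of run looks roughly like the sketch below; the query file name and the node-local-dns listen address 169.254.20.10 are assumptions based on the default install:

# 300-second run capped at 5000 queries/sec against the node-local-dns listen IP.
dnsperf -s 169.254.20.10 -p 53 -d queries.txt -l 300 -Q 5000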
@prameshj can the 2 second timeout on the node-local-dns be configured?
@prameshj when I run the dnsperf job with a QPS of 5k on a node where there is no coredns pod I get no SERVFAIL.
When I run it on a node where coredns is running, I reduce the QPS to 500 and still see SERVFAIL at 0.01%.
I wonder if conntrack causes the node-local-dns to hotspot the coredns pod that is local to it on the node, causing it to overload? (We are using ipvs for kube-proxy.)
Can you check the dns_request_count metric on the various coreDNS pods and see if a specific one is getting more queries?
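As a sketch of how to check that (assuming the default k8s-app=kube-dns label and the CoreDNS metrics port 9153; the counter is named coredns_dns_requests_total in recent CoreDNS releases):

# List the coreDNS pods with their IPs and nodes.
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# From any host that can reach the pod network, scrape the request counter
# from each pod and compare the totals.
curl -s http://<coredns-pod-ip>:9153/metrics | grep coredns_dns_requests_total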
@prameshj can the 2 second timeout on the node-local-dns be configured?
I don't see it as configurable. The forward plugin sets an overall timeout of 5s to handle the query - https://github.com/coredns/coredns/blob/5b9b079dabc7f71463cea3f0c6a92f338935039d/plugin/forward/forward.go#L97
The connect call sets write and read deadlines of maxTimeout and readTimeout respectively. https://github.com/coredns/coredns/blob/5b9b079dabc7f71463cea3f0c6a92f338935039d/plugin/forward/connect.go#L99, https://github.com/coredns/coredns/blob/5b9b079dabc7f71463cea3f0c6a92f338935039d/plugin/forward/connect.go#L109
These are set to 2s right now. https://github.com/coredns/coredns/blob/5b9b079dabc7f71463cea3f0c6a92f338935039d/plugin/forward/persistent.go#L160, https://github.com/coredns/coredns/blob/5b9b079dabc7f71463cea3f0c6a92f338935039d/plugin/forward/proxy.go#L79
So, the behavior is - forward plugin will try to get a response from an upstream within 2s and contact a different one if there was no response from the previous. It will do this for 5s before failing the query. @chrisohaver is that accurate?
@prameshj thanks for the info -I was away for a few days. Will go through your feedback today/tomorrow.
I think I have the same issue.
There are quite a lot of timed-out requests from nodelocaldns to coredns (about 70% of DNS requests time out on the worst-case node).
And I just realized that only nodes running a coredns pod have this issue, as @rtmie said.
@chrisohaver is that accurate?
Yes. A self adjusting dial timeout comes into play too, but that’s the general idea. None of it is configurable.
The i/o timeout most likely means coreDNS pod did not respond in 2 seconds. NodeLocalDNS(which uses coreDNS) has a 2s read timeout for forwarded requests. It sends the request via TCP to coreDNS.
In my case, I think coreDNS is working fine (it is not what is timing out). However, nodelocaldns rarely manages to establish a connection to the coreDNS service (192.168.0.3).
# conntrack -L | grep 192.168.0.3
tcp 6 108 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51422 dport=53 [UNREPLIED] src=192.168.89.1 dst=192.168.126.0 sport=53 dport=19944 mark=0 use=1
tcp 6 107 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51414 dport=53 [UNREPLIED] src=192.168.76.1 dst=192.168.126.0 sport=53 dport=45629 mark=0 use=1
tcp 6 110 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51438 dport=53 [UNREPLIED] src=192.168.87.1 dst=192.168.126.0 sport=53 dport=22125 mark=0 use=1
tcp 6 105 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51394 dport=53 [UNREPLIED] src=192.168.66.2 dst=192.168.126.0 sport=53 dport=42584 mark=0 use=1
tcp 6 111 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51444 dport=53 [UNREPLIED] src=192.168.117.1 dst=192.168.126.0 sport=53 dport=43203 mark=0 use=1
tcp 6 109 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51430 dport=53 [UNREPLIED] src=192.168.72.1 dst=192.168.126.0 sport=53 dport=48859 mark=0 use=1
tcp 6 7 TIME_WAIT src=192.168.0.3 dst=192.168.0.3 sport=50610 dport=53 src=192.168.126.1 dst=10.168.30.38 sport=53 dport=11486 [ASSURED] mark=0 use=1
tcp 6 103 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51372 dport=53 [UNREPLIED] src=192.168.84.1 dst=192.168.126.0 sport=53 dport=35905 mark=0 use=1
tcp 6 86392 ESTABLISHED src=192.168.0.3 dst=192.168.0.3 sport=51454 dport=53 src=192.168.126.1 dst=10.168.30.38 sport=53 dport=42359 [ASSURED] mark=0 use=1
udp 17 20 src=192.168.0.3 dst=192.168.0.3 sport=51515 dport=53 [UNREPLIED] src=192.168.84.1 dst=192.168.126.0 sport=53 dport=53046 mark=0 use=1
tcp 6 102 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51366 dport=53 [UNREPLIED] src=192.168.111.4 dst=192.168.126.0 sport=53 dport=39537 mark=0 use=1
tcp 6 106 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51404 dport=53 [UNREPLIED] src=192.168.93.1 dst=192.168.126.0 sport=53 dport=33050 mark=0 use=1
udp 17 16 src=192.168.0.3 dst=192.168.0.3 sport=49564 dport=53 [UNREPLIED] src=192.168.111.4 dst=192.168.126.0 sport=53 dport=18936 mark=0 use=1
udp 17 23 src=192.168.0.3 dst=192.168.0.3 sport=37426 dport=53 [UNREPLIED] src=192.168.124.1 dst=192.168.126.0 sport=53 dport=65084 mark=0 use=1
tcp 6 67 TIME_WAIT src=192.168.0.3 dst=192.168.0.3 sport=51056 dport=53 src=192.168.126.1 dst=10.168.30.38 sport=53 dport=55891 [ASSURED] mark=0 use=1
tcp 6 104 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51384 dport=53 [UNREPLIED] src=192.168.124.1 dst=192.168.126.0 sport=53 dport=58595 mark=0 use=1
udp 17 27 src=192.168.0.3 dst=192.168.0.3 sport=59777 dport=53 [UNREPLIED] src=192.168.66.2 dst=192.168.126.0 sport=53 dport=55549 mark=0 use=1
udp 17 13 src=192.168.0.3 dst=192.168.0.3 sport=32790 dport=53 [UNREPLIED] src=192.168.78.1 dst=192.168.126.0 sport=53 dport=41575 mark=0 use=1
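A quick way to count how many of those DNS flows to the service VIP never got a reply (a sketch; 192.168.0.3 is the coredns service IP above):

# Count unreplied conntrack entries for TCP and UDP DNS traffic to the VIP.
conntrack -L -p tcp -d 192.168.0.3 --dport 53 | grep -c UNREPLIED
conntrack -L -p udp -d 192.168.0.3 --dport 53 | grep -c UNREPLIED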
Maybe I found a workaround for this - change the ipvs scheduler to sh (source hash) from rr (round robin). However, I am not sure why it addresses this issue.
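For anyone who wants to try the same change, a rough sketch (assuming a kubeadm-style kube-proxy ConfigMap and DaemonSet; names may differ in other installs):

# Set ipvs.scheduler to "sh" in the kube-proxy configuration.
kubectl -n kube-system edit configmap kube-proxy

# Restart kube-proxy so the new scheduler takes effect.
kubectl -n kube-system rollout restart daemonset kube-proxy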
@rtmie @prameshj Any updates on this?
+1
Maybe I found a workaround for this - change ipvs scheduler to sh(source hash) from rr(round robin).
I am not too familiar with the IPVS load-balancing modes, but source hash looks up the destination based on the source IP address of the packet. In the case of NodeLocalDNS reaching out to upstream, the source IP will always be the same (the node IP address, since NodeLocalDNS is in hostNetwork mode). It is possible that it picks the same coreDNS pod every time as a result. With round-robin, it is more spread out.
Are you able to look up the dns request count (exported by the coreDNS pods, NOT nodelocaldns) grouped by coreDNS pod name? Are some pods getting way more than others? Is it possible that those coreDNS pods were scheduled first?
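The per-backend spread can also be checked from the IPVS side on a node (a sketch; 192.168.0.3 is the coredns service IP mentioned above):

# Show the real servers (coreDNS pod IPs) behind the DNS VIP, the scheduler
# in use, and per-backend connection/packet counters.
ipvsadm -Ln --stats | grep -A4 '192.168.0.3:53'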
@prameshj
Are you able to look up the dns request count (exported by the coreDNS pods, NOT nodelocaldns) grouped by coreDNS pod name? Are some pods getting way more than others? Is it possible that those coreDNS pods were scheduled first?
Yes, one coredns pod is getting a lot more requests than the others - which is not desirable for me.
It is possible that it picks the same coreDNS pod every time as a result. With round-robin, it is more spread out.
Also, I am having another problem: DNS lookups from the nodes to the coredns service IP fail.
https://calicousers.slack.com/archives/CPEPF833L/p1633007149311200 https://kubernetes.slack.com/archives/C09QYUH5W/p1632813013251600
DNS lookups from the nodes to the coredns service IP fail, yet an HTTP request from the nodes to http://<coredns service IP>/metrics succeeds (the coredns service is reachable over HTTP).
As far as I know, nodelocaldns is supposed to use the host node's network namespace, so I doubt that nodelocaldns has ever succeeded in a DNS lookup to the coredns service IP.
EDIT: The issue turned out to be a timeout issue rather than a connectivity issue.
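One way to sanity-check that path from the host network namespace is a direct lookup against the service VIP over both UDP and TCP (a sketch; 192.168.0.3 is the coredns service IP from above and the queried name is just an example):

# Run from the node itself, i.e. the same netns as node-local-dns.
dig @192.168.0.3 kubernetes.default.svc.cluster.local +timeout=2
dig @192.168.0.3 kubernetes.default.svc.cluster.local +tcp +timeout=2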
Do you also have NetworkPolicy rules configured? This could be an interaction between iptables rules setup for NodeLocalDNS and network policy maybe.
Is the issue reproducible: 1) in iptables mode? 2) with NodeLocalDNS disabled, in IPVS mode? 3) with NetworkPolicy disabled (in case you have it enabled)?
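For the NodeLocalDNS/iptables angle, the rules that node-local-dns installs can be listed directly on a node (a sketch; 169.254.20.10 is the default node-local-dns listen address and may differ in your setup):

# NOTRACK rules added by node-local-dns in the raw table.
iptables -t raw -S | grep 169.254.20.10

# ACCEPT rules for the same address in the filter table.
iptables -t filter -S | grep 169.254.20.10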
Do you also have NetworkPolicy rules configured? This could be an interaction between iptables rules setup for NodeLocalDNS and network policy maybe.
$ k get globalnetworkpolicies.crd.projectcalico.org -A
No resources found
$ k get networkpolicies.crd.projectcalico.org -A
No resources found
$ k get networkpolicies.networking.k8s.io -A
No resources found
As far as I know, I don't have any NetworkPolicies.
@gjkim42 - changing the ipvs scheduler in kube-proxy from rr to sh did not make a difference on my dnsperf test.
@prameshj wrt your comment on networkpolicies and impact on iptables - I am not clear on what I should be looking for there that would cause the effect I am seeing? We make extensive use of network policies.
@prameshj
Is the issue reproducible in:
- iptables mode?
Will disable nodelocaldns tomorrow.
The conntrack stats show many invalid entries on the failing nodes. (Non-failing nodes have few invalid entries.)
# conntrack -S
cpu=0 found=0 invalid=685 ignore=11379 insert=0 insert_failed=0 drop=0 early_drop=0 error=677 search_restart=3
cpu=1 found=0 invalid=39 ignore=10805 insert=0 insert_failed=0 drop=0 early_drop=0 error=39 search_restart=5
cpu=2 found=0 invalid=214 ignore=10635 insert=0 insert_failed=0 drop=0 early_drop=0 error=212 search_restart=1
cpu=3 found=0 invalid=121 ignore=10403 insert=0 insert_failed=0 drop=0 early_drop=0 error=119 search_restart=6
cpu=4 found=0 invalid=289 ignore=10193 insert=0 insert_failed=0 drop=0 early_drop=0 error=289 search_restart=5
cpu=5 found=0 invalid=43 ignore=10656 insert=0 insert_failed=0 drop=0 early_drop=0 error=41 search_restart=0
cpu=6 found=0 invalid=64 ignore=10285 insert=0 insert_failed=0 drop=0 early_drop=0 error=64 search_restart=5
cpu=7 found=0 invalid=84 ignore=10538 insert=0 insert_failed=0 drop=0 early_drop=0 error=82 search_restart=8
cpu=8 found=0 invalid=274 ignore=11307 insert=0 insert_failed=0 drop=0 early_drop=0 error=274 search_restart=4
cpu=9 found=0 invalid=152 ignore=10371 insert=0 insert_failed=0 drop=0 early_drop=0 error=152 search_restart=4
cpu=10 found=0 invalid=103 ignore=10941 insert=0 insert_failed=0 drop=0 early_drop=0 error=103 search_restart=6
cpu=11 found=0 invalid=697 ignore=10919 insert=0 insert_failed=0 drop=0 early_drop=0 error=697 search_restart=3
cpu=12 found=0 invalid=303 ignore=9353 insert=0 insert_failed=0 drop=0 early_drop=0 error=303 search_restart=0
cpu=13 found=0 invalid=28 ignore=9726 insert=0 insert_failed=0 drop=0 early_drop=0 error=28 search_restart=2
cpu=14 found=0 invalid=414 ignore=9260 insert=0 insert_failed=0 drop=0 early_drop=0 error=414 search_restart=4
cpu=15 found=0 invalid=687 ignore=9864 insert=0 insert_failed=0 drop=0 early_drop=0 error=687 search_restart=2
cpu=16 found=0 invalid=113 ignore=8785 insert=0 insert_failed=0 drop=0 early_drop=0 error=113 search_restart=3
cpu=17 found=0 invalid=521 ignore=9293 insert=0 insert_failed=0 drop=0 early_drop=0 error=521 search_restart=0
cpu=18 found=0 invalid=986 ignore=9984 insert=0 insert_failed=0 drop=0 early_drop=0 error=986 search_restart=3
cpu=19 found=0 invalid=284 ignore=10705 insert=0 insert_failed=0 drop=0 early_drop=0 error=284 search_restart=1
cpu=20 found=0 invalid=59 ignore=9476 insert=0 insert_failed=0 drop=0 early_drop=0 error=59 search_restart=2
cpu=21 found=0 invalid=656 ignore=10483 insert=0 insert_failed=0 drop=0 early_drop=0 error=656 search_restart=4
cpu=22 found=0 invalid=277 ignore=10489 insert=0 insert_failed=0 drop=0 early_drop=0 error=277 search_restart=0
cpu=23 found=0 invalid=238 ignore=11165 insert=0 insert_failed=0 drop=0 early_drop=0 error=238 search_restart=2
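To compare nodes more easily, the per-CPU invalid counters can be summed into a single number (a sketch):

# Sum the "invalid" counters across all CPUs from conntrack -S.
conntrack -S | awk -F'invalid=' '{split($2,a," "); sum+=a[1]} END {print sum}'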
In my case, disabling offloading on the VXLAN interface completely solves the issue. I'm using the Calico CNI in VXLAN mode.
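For anyone hitting the same thing, the offload setting can be checked and toggled with ethtool; in this sketch, vxlan.calico is Calico's VXLAN device name and tx-checksum-ip-generic is the offload usually implicated - verify both for your setup:

# Inspect the current offload settings on the Calico VXLAN interface.
ethtool -k vxlan.calico | grep -i checksum

# Disable TX checksum offload on that interface (the usual workaround).
ethtool -K vxlan.calico tx-checksum-ip-generic off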
Glad you found the issue @gjkim42
@prameshj wrt your comment on networkpolicies and impact on iptables - I am not clear on what I should be looking for there that would cause the effect I am seeing? We make extensive use of network policies.
I was wondering if there is some iptables rule that drops requests from nodelocaldns to the coreDNS pod when it is on the same node. Are requests dropped 100% if they are talking within the same node?
Also is this the same issue as what @gjkim42 mentioned?
@rtmie
@prameshj "Are requests dropped 100% if they are talking within the same node?"
No, just a relatively high SERVFAIL rate by comparison with running the dnsperf workload on a node where there is no coredns - approximately 0.02% vs 0%.
@rtmie could anything in the CNI be handling intra-node traffic differently than inter-node?
I would suggest trying out in a 1-node cluster with a single coreDNS pod as well, so that all requests from nodelocaldns are forced to go to the local coreDNS pod. If requests are dropped more frequently in this case, a packet capture can be the next step.
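If it comes to a capture, something like the following on the affected node would do (a sketch; replace <coredns-pod-ip> with the pod running on that node):

# Capture DNS traffic to and from the local coreDNS pod for later analysis.
tcpdump -i any -n 'port 53 and host <coredns-pod-ip>' -w /tmp/dns-capture.pcap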
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
/reopen
I am facing the same issue and don't see any solution posted here. Is there any way I can fix this?
@rahul-paigavan: You can't reopen an issue/PR unless you authored it or you are a collaborator.
@rahul-paigavan In most modern K8s systems the DNS and TLS layers are abused by not properly using HTTP/TCP keep-alive on connections, causing DNS to be queried and TLS to be re-handshaked on every request - here is a great blog post about that in the context of NodeJS, https://www.lob.com/blog/use-http-keep-alive, but other stacks are similar.
Other than that, you should check and optimize the DNS zones configured inside NodeLocal DNS; look for:
Overall, the DNS and TLS layer is the most abused and most neglected layer by engineers, yet it is one of the most important things in a distributed, clustered system - so put a decent amount of time and effort into it, or you will regret it!
Environment
coredns version: 1.8.0
nodelocaldns version: v1.17.0
kubernetes version: 1.20.8
kube-proxy in ipvs mode
Config
configmap:
Issue Description
We are seeing a high level of i/o timeout errors in the node-local-dns daemonset pods on those nodes where a coredns pod is also running.
Running
When sorting the above output by error count, we see a massive increase in error count for those nodes where coredns is running.
Reproduce
Running a k8s job with a dnsperf pod in it, with 3 FQDNs to look up: one in-cluster service, one service in AWS, and one external service.
On a node where coredns is also running
On a node where coredns is not also running
cluster service resolution vs external resolution
When the test set is all k8s service FQDNs, SERVFAIL errors are much lower than when the test set is all external FQDNs.
On a node where no coredns is running: 0 SERVFAIL
On a node where coredns is running, all on-cluster FQDNs: 8 SERVFAIL from 1500000 queries
On a node where coredns is running, all external FQDNs: 164 SERVFAIL from 1500000 queries
When we scale coredns down to one replica and run dnsperf on the same node, we get no SERVFAILs in the test.