kubernetes/dns (Kubernetes DNS service)

High level of i/o timeout errors in nodelocaldns pod when coredns pod on same node #480

Closed: rtmie closed this issue 2 years ago

rtmie commented 3 years ago

Environment

coredns version: 1.8.0
nodelocaldns version: v1.17.0
kubernetes version: 1.20.8
kube-proxy in ipvs mode

Config

configmap:

data:
  Corefile: |
    .:53 {
        errors
        cache {
                success 9984 30
                denial 9984 5
        }
        reload
        loop
        bind 169.254.20.10
        forward . 192.168.128.10 {
                force_tcp
        }
        prometheus :9253
        health 169.254.20.10:8080
        }

Issue Description

We are seeing an issue with a high level of i/o timeout errors in the node-local-dns daemonset pods on those nodes where the coredns pod is also running.

Running

for pod in $(kubectl -n kube-system get po -l k8s-app=node-local-dns -o name | cut -f2 -d '/'); do
  info=$(kubectl -n kube-system get po "$pod" -o wide --no-headers | awk '{ print $1, $7, $5 }')
  errors=$(kubectl -n kube-system logs "$pod" | grep -c ERRO)
  echo "$info $errors"
done

When sorting the above output by error count, we see a massive increase in error count for the nodes where coredns is running.

Reproduce

Running a k8s job with a dnsperf pod in it, with 3 FQDNs to look up: one on-cluster service, one service in AWS, and one external service.
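
For reference, the dnsperf invocation inside the job looks roughly like this (a sketch, not the exact manifest; the query file names are hypothetical placeholders and the target 169.254.20.10 is the nodelocaldns listen address from the config above):

# query file: one name and record type per line (hypothetical FQDNs standing in for the three test names)
cat > queries.txt <<'EOF'
my-service.my-namespace.svc.cluster.local A
my-db.us-east-1.rds.amazonaws.com A
www.example.com A
EOF

# 5000 qps for 300 seconds against the node-local cache
dnsperf -s 169.254.20.10 -p 53 -d queries.txt -Q 5000 -l 300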

On a node where coredns is also running

Statistics:

  Queries sent:         1500000
  Queries completed:    1500000 (100.00%)
  Queries lost:         0 (0.00%)

  Response codes:       NOERROR 1499884 (99.99%), SERVFAIL 116 (0.01%)
  Average packet size:  request 52, response 224
  Run time (s):         300.000132
  Queries per second:   4999.997800

  Average Latency (s):  0.000260 (min 0.000044, max 1.067598)
  Latency StdDev (s):   0.012459

On a node where coredns is not also running

Statistics:

  Queries sent:         1500000
  Queries completed:    1500000 (100.00%)
  Queries lost:         0 (0.00%)

  Response codes:       NOERROR 1500000 (100.00%)
  Average packet size:  request 52, response 224
  Run time (s):         300.000153
  Queries per second:   4999.997450

  Average Latency (s):  0.000208 (min 0.000045, max 1.082059)
  Latency StdDev (s):   0.010506

Cluster service resolution vs external resolution

When the test set is all k8s service FQDNs, the SERVFAIL count is much lower than when the test set is all external FQDNs.

On a node where no coredns is running: 0 SERVFAIL
On a node where coredns is running, all on-cluster FQDNs: 8 SERVFAIL from 1500000 queries
On a node where coredns is running, all external FQDNs: 164 SERVFAIL from 1500000 queries

When we scale coredns down to one replica and run dnsperf on the same node we get no SERVFAILs in the test.

prameshj commented 3 years ago

Are you able to run a packet capture on one of the nodes that runs coreDNS and nodelocaldns and see the response? Any logs in the corresponding coreDNS pod?
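
Something along these lines on an affected node would capture that traffic (a sketch; 169.254.20.10 and 192.168.128.10 are the nodelocaldns bind address and coredns service IP from the config above):

# capture DNS traffic between nodelocaldns and coredns on this node
tcpdump -i any -n -w /tmp/dns.pcap 'port 53 and (host 169.254.20.10 or host 192.168.128.10)'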

prameshj commented 3 years ago

/assign @rtmie

rtmie commented 3 years ago

@prameshj I enabled logging via the coredns Corefile.

No failures seen in the coredns pod log (I do see some NXDOMAIN denials, but that is for a known missing service in my cluster, not related to the FQDNs in my dnsperf test).

However in the log of the node localdns pod I do see this error for each SERVFAIL in the dnsperf test:

"[ERROR] plugin/errors: 2 <external.fqdn.mydomain> A: dial tcp 192.168.128.10:53: i/o timeout\n"

(192.168.128.10 is the coredns service)

prameshj commented 3 years ago

The i/o timeout most likely means the coreDNS pod did not respond in 2 seconds. NodeLocalDNS (which uses coreDNS) has a 2s read timeout for forwarded requests. It sends the request via TCP to coreDNS.

Could this particular coreDNS pod be getting overloaded? How many requests per second is that pod getting? Perhaps the test is hitting the limits of the qps that can be handled.

In the case of external FQDNs, the i/o timeout log message will have the upstream server IP.

rtmie commented 3 years ago

The consistent behaviour that I see is that when I run the dnsperf load on a node that does not have a coredns pod running, I get zero i/o timeout errors in node-local-dns. When I run it on a node where there is a coredns pod, I see the timeouts.

I wonder if conntrack causes node-local-dns to hotspot the coredns pod that is local to it on the node, causing it to overload (we are using ipvs for kube-proxy)?

Right now I have dnsperf running with qps of 5000 for 300 seconds.

I will test with a lower qps.

@prameshj can the 2 second timeout on node-local-dns be configured?

rtmie commented 3 years ago

@prameshj when I run the dnsperf job with QPS of 5k on a node where there is no coredns pod I get no SERVFAIL.

When I run on a node where coredns is running and reduce the qps to 500, I still see SERVFAIL at ~0.01%.

prameshj commented 3 years ago

I wonder if conntrack causes node-local-dns to hotspot the coredns pod that is local to it on the node, causing it to overload (we are using ipvs for kube-proxy)?

Can you check the dns_request_count metric on the various coreDNS pods and see if a specific one is getting more queries?
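
One way to compare per-pod counts without a Prometheus setup is to scrape each coreDNS pod's metrics endpoint directly (a sketch, assuming the default k8s-app=kube-dns label, the default :9153 metrics port, and the CoreDNS >= 1.7 metric name; run it from somewhere that can reach the pod IPs):

for ip in $(kubectl -n kube-system get po -l k8s-app=kube-dns -o jsonpath='{.items[*].status.podIP}'); do
  # sum the per-pod request counters across all label combinations
  total=$(curl -s "http://$ip:9153/metrics" | awk '/^coredns_dns_requests_total/ { sum += $NF } END { print sum+0 }')
  echo "$ip $total"
done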

@prameshj can the 2 second timeout on node-local-dns be configured?

I don't see it as configurable. The forward plugin sets an overall timeout of 5s to handle the query - https://github.com/coredns/coredns/blob/5b9b079dabc7f71463cea3f0c6a92f338935039d/plugin/forward/forward.go#L97

The connect call sets write and read deadlines of maxTimeout and readTimeout respectively. https://github.com/coredns/coredns/blob/5b9b079dabc7f71463cea3f0c6a92f338935039d/plugin/forward/connect.go#L99, https://github.com/coredns/coredns/blob/5b9b079dabc7f71463cea3f0c6a92f338935039d/plugin/forward/connect.go#L109

These are set to 2s right now. https://github.com/coredns/coredns/blob/5b9b079dabc7f71463cea3f0c6a92f338935039d/plugin/forward/persistent.go#L160, https://github.com/coredns/coredns/blob/5b9b079dabc7f71463cea3f0c6a92f338935039d/plugin/forward/proxy.go#L79

So, the behavior is - forward plugin will try to get a response from an upstream within 2s and contact a different one if there was no response from the previous. It will do this for 5s before failing the query. @chrisohaver is that accurate?

rtmie commented 3 years ago

@prameshj thanks for the info - I was away for a few days. Will go through your feedback today/tomorrow.

gjkim42 commented 3 years ago

I think I have the same issue. There are quite a lot of timed-out requests from nodelocaldns to coredns (about 70% of DNS requests time out on the worst-case node). And I just realized that only nodes running a coredns pod have this issue, as @rtmie said.

chrisohaver commented 3 years ago

@chrisohaver is that accurate?

Yes. A self-adjusting dial timeout comes into play too, but that’s the general idea. None of it is configurable.

gjkim42 commented 3 years ago

(screenshot attached)

The i/o timeout most likely means the coreDNS pod did not respond in 2 seconds. NodeLocalDNS (which uses coreDNS) has a 2s read timeout for forwarded requests. It sends the request via TCP to coreDNS.

In my case, I think coreDNS is working fine (it does not produce any timeouts). However, nodelocaldns rarely manages to establish a connection to the coreDNS service (192.168.0.3).

# conntrack -L | grep 192.168.0.3
tcp      6 108 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51422 dport=53 [UNREPLIED] src=192.168.89.1 dst=192.168.126.0 sport=53 dport=19944 mark=0 use=1
tcp      6 107 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51414 dport=53 [UNREPLIED] src=192.168.76.1 dst=192.168.126.0 sport=53 dport=45629 mark=0 use=1
tcp      6 110 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51438 dport=53 [UNREPLIED] src=192.168.87.1 dst=192.168.126.0 sport=53 dport=22125 mark=0 use=1
tcp      6 105 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51394 dport=53 [UNREPLIED] src=192.168.66.2 dst=192.168.126.0 sport=53 dport=42584 mark=0 use=1
tcp      6 111 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51444 dport=53 [UNREPLIED] src=192.168.117.1 dst=192.168.126.0 sport=53 dport=43203 mark=0 use=1
tcp      6 109 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51430 dport=53 [UNREPLIED] src=192.168.72.1 dst=192.168.126.0 sport=53 dport=48859 mark=0 use=1
tcp      6 7 TIME_WAIT src=192.168.0.3 dst=192.168.0.3 sport=50610 dport=53 src=192.168.126.1 dst=10.168.30.38 sport=53 dport=11486 [ASSURED] mark=0 use=1
tcp      6 103 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51372 dport=53 [UNREPLIED] src=192.168.84.1 dst=192.168.126.0 sport=53 dport=35905 mark=0 use=1
tcp      6 86392 ESTABLISHED src=192.168.0.3 dst=192.168.0.3 sport=51454 dport=53 src=192.168.126.1 dst=10.168.30.38 sport=53 dport=42359 [ASSURED] mark=0 use=1
udp      17 20 src=192.168.0.3 dst=192.168.0.3 sport=51515 dport=53 [UNREPLIED] src=192.168.84.1 dst=192.168.126.0 sport=53 dport=53046 mark=0 use=1
tcp      6 102 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51366 dport=53 [UNREPLIED] src=192.168.111.4 dst=192.168.126.0 sport=53 dport=39537 mark=0 use=1
tcp      6 106 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51404 dport=53 [UNREPLIED] src=192.168.93.1 dst=192.168.126.0 sport=53 dport=33050 mark=0 use=1
udp      17 16 src=192.168.0.3 dst=192.168.0.3 sport=49564 dport=53 [UNREPLIED] src=192.168.111.4 dst=192.168.126.0 sport=53 dport=18936 mark=0 use=1
udp      17 23 src=192.168.0.3 dst=192.168.0.3 sport=37426 dport=53 [UNREPLIED] src=192.168.124.1 dst=192.168.126.0 sport=53 dport=65084 mark=0 use=1
tcp      6 67 TIME_WAIT src=192.168.0.3 dst=192.168.0.3 sport=51056 dport=53 src=192.168.126.1 dst=10.168.30.38 sport=53 dport=55891 [ASSURED] mark=0 use=1
tcp      6 104 SYN_SENT src=192.168.0.3 dst=192.168.0.3 sport=51384 dport=53 [UNREPLIED] src=192.168.124.1 dst=192.168.126.0 sport=53 dport=58595 mark=0 use=1
udp      17 27 src=192.168.0.3 dst=192.168.0.3 sport=59777 dport=53 [UNREPLIED] src=192.168.66.2 dst=192.168.126.0 sport=53 dport=55549 mark=0 use=1
udp      17 13 src=192.168.0.3 dst=192.168.0.3 sport=32790 dport=53 [UNREPLIED] src=192.168.78.1 dst=192.168.126.0 sport=53 dport=41575 mark=0 use=1
gjkim42 commented 3 years ago

Maybe I found a workaround for this - change the ipvs scheduler from rr (round robin) to sh (source hash). However, I am not sure why it addresses this issue.
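
For anyone else who wants to try this, the scheduler is set in the kube-proxy configuration, roughly like so (a sketch, assuming kube-proxy is configured through the kube-system/kube-proxy ConfigMap and runs as a DaemonSet, as in kubeadm clusters):

# set ipvs.scheduler to "sh" (source hash) in the KubeProxyConfiguration
kubectl -n kube-system edit configmap kube-proxy
# restart kube-proxy so it rebuilds the IPVS virtual servers with the new scheduler
kubectl -n kube-system rollout restart daemonset kube-proxy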

@rtmie @prameshj Any updates on this?

vivekpd15 commented 3 years ago

+1

prameshj commented 3 years ago

Maybe I found a workaround for this - change the ipvs scheduler from rr (round robin) to sh (source hash).

I am not too familiar with the IPVS load-balancing modes, but source hash looks up the destination based on the source IP address of the packet. In the case of NodeLocalDNS reaching out to upstream, the source IP will always be the same (the node IP address, since NodeLocalDNS is in hostNetwork mode). It is possible that it picks the same coreDNS pod every time as a result. With round-robin, it is more spread out.
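
The scheduler and backend spread that IPVS is actually using for the DNS service can be checked on a node with ipvsadm (a sketch; <coredns-service-ip> is a placeholder for the kube-dns/coredns ClusterIP):

# show the virtual server for DNS over TCP, its scheduler (rr/sh/...) and per-backend stats
ipvsadm -Ln -t <coredns-service-ip>:53 --stats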

Are you able to look up the dns request count (exported by coreDNS pods, NOT nodelocaldns) grouped by coreDNS pod name? Are some pods getting way more than others? Is it possible that those coreDNS pods were scheduled first?

gjkim42 commented 3 years ago

@prameshj

Are you able to look up the dns request count (exported by coreDNS pods, NOT nodelocaldns) grouped by coreDNS pod name? Are some pods getting way more than others? Is it possible that those coreDNS pods were scheduled first?

Yes, one coredns pod is getting a lot more requests than the others - which is not desirable for me.

It is possible that it picks the same coreDNS pod every time as a result. With round-robin, it is more spread out.

Also, I am having another problem where only DNS lookups from the nodes to the coredns service IP fail.

https://calicousers.slack.com/archives/CPEPF833L/p1633007149311200 https://kubernetes.slack.com/archives/C09QYUH5W/p1632813013251600

(screenshot: DNS lookup)

And an HTTP request from the nodes to http://<coredns service IP>/metrics succeeds (the coredns service is reachable over HTTP).

As far as I know, nodelocaldns is supposed to use the host node's network namespace, so I doubt that nodelocaldns has ever succeeded in a DNS lookup to the coredns service IP.
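
A quick way to compare the two paths from a node's host network namespace (a sketch; <coredns-service-ip> is a placeholder and the :9153 metrics port is an assumption):

# DNS over TCP to the service IP - the same path nodelocaldns uses, since its config sets force_tcp
dig +tcp @<coredns-service-ip> kubernetes.default.svc.cluster.local

# plain HTTP to the metrics endpoint on the same service IP
curl -s http://<coredns-service-ip>:9153/metrics | head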

EDIT: The issue turned out to be a timeout issue rather than a connectivity issue.

prameshj commented 3 years ago

Do you also have NetworkPolicy rules configured? This could be an interaction between iptables rules setup for NodeLocalDNS and network policy maybe.

Is the issue reproducible in:

1. iptables mode?
2. With NodeLocalDNS disabled in IPVS mode?
3. With NetworkPolicy disabled (in case you have it enabled)?

gjkim42 commented 3 years ago

Do you also have NetworkPolicy rules configured? This could be an interaction between iptables rules setup for NodeLocalDNS and network policy maybe.

$ k get globalnetworkpolicies.crd.projectcalico.org -A
No resources found
$ k get networkpolicies.crd.projectcalico.org -A
No resources found
$ k get networkpolicies.networking.k8s.io -A
No resources found

As far as I know, I don't have any NetworkPolicies.

rtmie commented 3 years ago

@gjkim42 - changing the ipvs scheduler in kube-proxy from rr to sh did not make a difference on my dnsperf test.

rtmie commented 3 years ago

@prameshj wrt your comment on network policies and their impact on iptables - I am not clear on what I should be looking for there that would cause the effect I am seeing. We make extensive use of network policies.

gjkim42 commented 3 years ago

@prameshj

Is the issue reproducible in:

  1. iptables mode?

The issue is reproducible in iptables mode.

Will disable nodelocaldns tomorrow.


There are many invalid conntrack records on the failing nodes. (Non-failing nodes have few invalid records.)

# conntrack -S
cpu=0           found=0 invalid=685 ignore=11379 insert=0 insert_failed=0 drop=0 early_drop=0 error=677 search_restart=3 
cpu=1           found=0 invalid=39 ignore=10805 insert=0 insert_failed=0 drop=0 early_drop=0 error=39 search_restart=5 
cpu=2           found=0 invalid=214 ignore=10635 insert=0 insert_failed=0 drop=0 early_drop=0 error=212 search_restart=1 
cpu=3           found=0 invalid=121 ignore=10403 insert=0 insert_failed=0 drop=0 early_drop=0 error=119 search_restart=6 
cpu=4           found=0 invalid=289 ignore=10193 insert=0 insert_failed=0 drop=0 early_drop=0 error=289 search_restart=5 
cpu=5           found=0 invalid=43 ignore=10656 insert=0 insert_failed=0 drop=0 early_drop=0 error=41 search_restart=0 
cpu=6           found=0 invalid=64 ignore=10285 insert=0 insert_failed=0 drop=0 early_drop=0 error=64 search_restart=5 
cpu=7           found=0 invalid=84 ignore=10538 insert=0 insert_failed=0 drop=0 early_drop=0 error=82 search_restart=8 
cpu=8           found=0 invalid=274 ignore=11307 insert=0 insert_failed=0 drop=0 early_drop=0 error=274 search_restart=4 
cpu=9           found=0 invalid=152 ignore=10371 insert=0 insert_failed=0 drop=0 early_drop=0 error=152 search_restart=4 
cpu=10          found=0 invalid=103 ignore=10941 insert=0 insert_failed=0 drop=0 early_drop=0 error=103 search_restart=6 
cpu=11          found=0 invalid=697 ignore=10919 insert=0 insert_failed=0 drop=0 early_drop=0 error=697 search_restart=3 
cpu=12          found=0 invalid=303 ignore=9353 insert=0 insert_failed=0 drop=0 early_drop=0 error=303 search_restart=0 
cpu=13          found=0 invalid=28 ignore=9726 insert=0 insert_failed=0 drop=0 early_drop=0 error=28 search_restart=2 
cpu=14          found=0 invalid=414 ignore=9260 insert=0 insert_failed=0 drop=0 early_drop=0 error=414 search_restart=4 
cpu=15          found=0 invalid=687 ignore=9864 insert=0 insert_failed=0 drop=0 early_drop=0 error=687 search_restart=2 
cpu=16          found=0 invalid=113 ignore=8785 insert=0 insert_failed=0 drop=0 early_drop=0 error=113 search_restart=3 
cpu=17          found=0 invalid=521 ignore=9293 insert=0 insert_failed=0 drop=0 early_drop=0 error=521 search_restart=0 
cpu=18          found=0 invalid=986 ignore=9984 insert=0 insert_failed=0 drop=0 early_drop=0 error=986 search_restart=3 
cpu=19          found=0 invalid=284 ignore=10705 insert=0 insert_failed=0 drop=0 early_drop=0 error=284 search_restart=1 
cpu=20          found=0 invalid=59 ignore=9476 insert=0 insert_failed=0 drop=0 early_drop=0 error=59 search_restart=2 
cpu=21          found=0 invalid=656 ignore=10483 insert=0 insert_failed=0 drop=0 early_drop=0 error=656 search_restart=4 
cpu=22          found=0 invalid=277 ignore=10489 insert=0 insert_failed=0 drop=0 early_drop=0 error=277 search_restart=0 
cpu=23          found=0 invalid=238 ignore=11165 insert=0 insert_failed=0 drop=0 early_drop=0 error=238 search_restart=2
gjkim42 commented 3 years ago

In my case, disabling offloading on the VXLAN interface completely solves the issue. I've used the Calico CNI with VXLAN mode.

https://kubernetes.slack.com/archives/C09QYUH5W/p1633448073291500?thread_ts=1632813013.251600&cid=C09QYUH5W

https://github.com/projectcalico/calico/issues/3145
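
For reference, the workaround amounts to something like this on each affected node (a sketch, assuming Calico's default VXLAN device name vxlan.calico and that TX checksum offload is the feature involved, per the linked Calico issue):

# disable TX checksum offload on the Calico VXLAN interface
ethtool -K vxlan.calico tx-checksum-ip-generic off
# verify the new setting
ethtool -k vxlan.calico | grep tx-checksum-ip-generic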

prameshj commented 3 years ago

Glad you found the issue @gjkim42

@prameshj wrt your comment on network policies and their impact on iptables - I am not clear on what I should be looking for there that would cause the effect I am seeing. We make extensive use of network policies.

I was wondering if there is some iptables rule that drops requests from nodelocaldns to the coreDNS pod if it is on the same node. Are requests dropped 100% if they are talking within the same node?

Also is this the same issue as what @gjkim42 mentioned?

@rtmie

rtmie commented 3 years ago

@prameshj "Are requests dropped 100% if they are talking within the same node?"

No, just a relatively high SERVFAIL rate compared with running the dnsperf workload on a node where there is no coredns - approximately 0.02% vs 0%.

prameshj commented 2 years ago

@rtmie could anything in the CNI be handling intra-node traffic differently than inter-node?

I would suggest trying out in a 1-node cluster with a single coreDNS pod as well, so that all requests from nodelocaldns are forced to go to the local coreDNS pod. If requests are dropped more frequently in this case, a packet capture can be the next step.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes/dns/issues/480#issuecomment-1087985611):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue or PR with `/reopen`
> - Mark this issue or PR as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
rahul-paigavan commented 2 years ago

/reopen

I am facing the same issue and don't see any solution posted here. Is there any way I can fix this?

k8s-ci-robot commented 2 years ago

@rahul-paigavan: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes/dns/issues/480#issuecomment-1234674383):

> /reopen
>
> I am facing the same issue and don't see any solution posted here.
> Is there any way I can fix this?

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
smoke commented 2 years ago

@rahul-paigavan In most modern K8s systems the DNS and TLS layers are abused by not properly using keep-alive (HTTP/TCP) on connections, which causes DNS to be queried and TLS to be handshaked on every request. Here is a great blog post about that in the context of NodeJS, but others are similar: https://www.lob.com/blog/use-http-keep-alive

Other than that, you should check and optimize the DNS zones inside the nodelocal configuration; look for:

Overall, the DNS and TLS layer is the most abused and most neglected layer by engineers, but it is the most important thing in a distributed, clustered system - so put a decent amount of your time and effort into it, or you will regret it!