There are a number of reasons.
When running as cluster DNS, CoreDNS is configured with the Kubernetes plugin. This puts a watch on all EndpointSlices and Services (and other things, depending on your config). This means a persistent connection to the API server for each instance of CoreDNS, and the API server sending watch events down that channel for any changes to those resources. For clusters with thousands of nodes, that would put a substantial burden on the API server.
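For context, a typical cluster-DNS Corefile with the kubernetes plugin looks roughly like this (a simplified sketch along the lines of the common kubeadm defaults; your ConfigMap may differ):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        # The kubernetes plugin is what opens the watches on Services
        # and EndpointSlices (and related resources) described above.
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
```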
NodeLocalDNS, on the other hand, is only a cache and a stub resolver. It does not put a watch on the API server. This makes it much less of a burden on the API server, and also makes it a much smaller process since it does not need to use memory to hold those API resources.
NodeLocalDNS also solves a second problem. Early versions of Kubernetes would sometimes have failures due to the conntrack table filling up. This was found to be because UDP entries need to age out of the conntrack table, so a burst of DNS traffic could fill that table up (I seem to recall some kernel bugs may have also been involved, but this is several years ago). NodeLocalDNS turns off connection tracking for UDP traffic to the node local DNS IP address, and it upgrades requests made to cluster DNS from UDP to TCP. TCP is not subject to this issue since entries can be removed when the connection is closed.
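To make the conntrack/TCP point concrete, the NodeLocal DNSCache Corefile forwards cluster-suffix queries upstream with force_tcp. A trimmed sketch (the real manifest templates the listen and upstream addresses via the __PILLAR__ variables and adds health/metrics plugins):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-local-dns
  namespace: kube-system
data:
  Corefile: |
    cluster.local:53 {
        cache 30
        # force_tcp upgrades the query to cluster DNS from UDP to TCP,
        # so the conntrack entry is removed when the connection closes
        # instead of having to age out.
        forward . __PILLAR__CLUSTER__DNS__ {
            force_tcp
        }
    }
    .:53 {
        cache 30
        forward . __PILLAR__UPSTREAM__SERVERS__
    }
```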
Finally, even if we did use a DaemonSet, it wouldn't work the way you would hope. There is no guarantee that requests from a client would reach the local CoreDNS instance. In fact, at the time NodeLocalDNS was created, that would be rare, because the local node had no higher weight in the kube-proxy based load balancing. So if you had 1000 instances of CoreDNS, only 1/1000 of requests would go to your local CoreDNS instance. I am not sure if that has changed; there has been some work on more topology-aware services, but I am not sure how far it has progressed - you would have to check with SIG Network.
Thanks for the info @johnbelamaric.
Regarding the API server connections, I have addressed that in https://github.com/coredns/helm/issues/86#issuecomment-1510393716 - other daemonsets also perform API calls - kube-proxy, CNI plugins, log collectors etc.
Regarding the conntrack issue - can't we turn off connection tracking for a coredns daemonset?
Regarding directing requests from the client to the local coredns instance - this is now possible with internal traffic policy, but in any case this problem would be present with nodelocaldns as well, which could negate its benefits. One possible issue that could occur when using coredns as a daemonset with internal traffic policy is that until the coredns pod is ready no DNS requests could be made by other pods on that node.
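For reference, the Service side of that would look roughly like this (a sketch assuming a Kubernetes version where internalTrafficPolicy is available; note that with Local, a node whose CoreDNS pod is not ready has its DNS queries dropped rather than rerouted, which is the readiness concern mentioned above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  selector:
    k8s-app: kube-dns
  clusterIP: 10.96.0.10   # placeholder cluster DNS IP
  # Route DNS traffic only to endpoints on the same node as the client.
  internalTrafficPolicy: Local
  ports:
    - name: dns
      port: 53
      protocol: UDP
    - name: dns-tcp
      port: 53
      protocol: TCP
```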
- What is the measured impact of a coredns daemonset's watches on the API server? I know that Zalando is using a coredns daemonset and I'm pretty sure they're running at scale.
- I believe that using coredns as a daemonset might avoid the conntrack issue without any workaround, since each pod would receive far fewer requests than with a coredns deployment.
- It is also possible to have clients bypass the kube-proxy service and send requests to the local coredns pod by using the downward API:
```yaml
- name: HOST_IP
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: status.hostIP
```
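One caveat with the downward API approach: as far as I know, dnsConfig.nameservers only accepts literal IP addresses, so the HOST_IP value has to be consumed by the application itself (or by an entrypoint that rewrites resolv.conf). A hypothetical sketch (the image name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: example.com/app:latest   # placeholder image
      env:
        - name: HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
      # The application (or its entrypoint) must read HOST_IP and use it
      # as its resolver address; it is not applied automatically.
```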
As @johnbelamaric mentioned, it would be the linear dependency. In a CoreDNS daemonset, each CoreDNS instance would initialize watchers for EndpointSlices, Services and ConfigMaps. The overall effect would be roughly how frequently these objects change in your cluster, multiplied by N (the number of nodes).
Nodelocaldns uses TCP to talk to the cluster DNS pods, so it is far less affected by the conntrack issue than UDP would be.
Also keep in mind that with a CoreDNS daemonset there would be no guarantee that a client pod talks to the local CoreDNS pod on the same node. /etc/resolv.conf points to the kube-dns Service, so the traffic could go to any pod in the cluster. Plus, since the default DNS protocol is UDP and the client will communicate with arbitrary CoreDNS pods across the cluster, the conntrack exhaustion issue would reappear in such a setup.
Whereas with nodelocaldns (with the iptables rules) the client is guaranteed to talk to the local NLD pod on the same node.
@dpasiukevich
- That's just an estimation. I don't expect there have been any scalability benchmarks measuring API server performance and resource usage as a function of the daemonset size and the frequency/size of Service/EndpointSlice object changes.
- It definitely can be done. And it's definitely a good optimisation in certain cases at the cost of more DIY.
Why would it require more DIY? Couldn't it be implemented into coredns directly?
Another idea: have a nodelocaldns container and a coredns sidecar container in the same pod and direct traffic from nodelocaldns to coredns via localhost. This would simplify the architecture while preserving the benefits of nodelocaldns without requiring new features or code changes. A possible issue would be if the nodelocaldns container starts before the coredns container, in which case DNS resolution would fail; I assume this can be solved by having nodelocaldns wait for coredns to be available.
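A rough sketch of that pod layout, just to illustrate the idea (the image tags, the listen port and the wiring between the two containers are assumptions, not a tested manifest):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-dns
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: node-dns
  template:
    metadata:
      labels:
        k8s-app: node-dns
    spec:
      containers:
        - name: node-cache
          image: registry.k8s.io/dns/k8s-dns-node-cache:1.23.1   # assumed tag
          # Its Corefile (not shown) would forward cluster-suffix queries
          # to the coredns sidecar at 127.0.0.1:5353 instead of the
          # kube-dns Service.
        - name: coredns
          image: coredns/coredns:1.11.1   # assumed tag
          args: ["-conf", "/etc/coredns/Corefile"]
          # Bound to 127.0.0.1:5353; a readinessProbe here is what would
          # address the startup-ordering concern mentioned above.
```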
@johnbelamaric @dpasiukevich any updates?
It's still not clear to me whether the possible strain on the API server was tested. Did the relevant group in the kubernetes project run tests on a coredns daemonset and find that it produces excessive load on the API server at scale? Or is it just speculation?
Yes. It doesn't scale.
By the way, node local DNS is just a custom build of coredns with minimal plugins and with a little glue to update the iptables. So, effectively, nodelocaldns is what you are saying. It just doesn't run the k8s plugin.
By the way, it's not just the API server that is the issue. It's a simple matter of cost efficiency. Imagine a 10,000 node cluster. If you want to use an extra 500MB on every node to cache the entire cluster's worth of services and headless endpoints, that is 5,000 GB of RAM. It's expensive. Much better to just have the node local DNS cache, with only a small DNS cache needed for the workloads on that node, which takes say < 50MB per node. @prameshj did a very detailed set of analyses before implementing this.
> By the way, node local DNS is just a custom build of coredns with minimal plugins and with a little glue to update the iptables. So, effectively, nodelocaldns is what you are saying. It just doesn't run the k8s plugin.
What I wrote was that the coredns container should be co-located on the same pod as nodelocaldns in order to avoid the extra infrastructure complexity. This is what Zalando is doing, but with dnsmasq instead of nodelocaldns; according to them it performs better.
> By the way, it's not just the API server that is the issue. It's a simple matter of cost efficiency. Imagine a 10,000 node cluster. If you want to use an extra 500MB on every node to cache the entire cluster's worth of services and headless endpoints, that is 5,000 GB of RAM. It's expensive. Much better to just have the node local DNS cache, with only a small DNS cache needed for the workloads on that node, which takes say < 50MB per node. @prameshj did a very detailed set of analyses before implementing this.
Looking at the metrics from our cluster over the course of the last week, coredns did not consume more than 50MB of memory. We can assume that if it ran as a daemonset it would consume even less memory, since there would be much less load on each pod.
> What I wrote was that the coredns container should be co-located on the same pod as nodelocaldns in order to avoid the extra infrastructure complexity.
That adds more infrastructure and complexity. For the sake of argument, it would be simpler and result in less overhead to compile the kubernetes plugin into nodelocaldns, and just run a kubernetes enabled nodelocaldns on each node by itself. Of course with the kubernetes plugin in use, each instance of nodelocaldns would then require more memory (as much as CoreDNS uses). So it is still significantly more resource-expensive than the current solution.
> Looking at the metrics from our cluster over the course of the last week, coredns did not consume more than 50MB of memory.
The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster. 50MB would suggest your cluster is not a large-scale cluster, and thus does not have a large number of services and endpoints.
> We can assume that if it ran as a daemonset it would consume even less memory, since there would be much less load on each pod.
That would not be the case. The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster - not related to the query load.
> What I wrote was that the coredns container should be co-located on the same pod as nodelocaldns in order to avoid the extra infrastructure complexity.
> That adds more infrastructure and complexity. For the sake of argument, it would be simpler and result in less overhead to compile the kubernetes plugin into nodelocaldns, and just run a kubernetes enabled nodelocaldns on each node by itself. Of course with the kubernetes plugin in use, each instance of nodelocaldns would then require more memory (as much as CoreDNS uses). So it is still significantly more resource-expensive than the current solution.
I don't think it is more complex than nodelocaldns daemonset + coredns deployment + dns autoscaler. However, using just nodelocaldns with the kubernetes plugin would be preferable; I'm not sure how it would deal with non-cached responses in that case, though?
> Looking at the metrics from our cluster over the course of the last week, coredns did not consume more than 50MB of memory.
> The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster. 50MB would suggest your cluster is not a large-scale cluster, and thus does not have a large number of services and endpoints.
What is considered a large cluster? There is no info on the number of services/endpoints in https://kubernetes.io/docs/setup/best-practices/cluster-large/.
We are running ~500 services and ~500 endpoints.
> We can assume that if it ran as a daemonset it would consume even less memory, since there would be much less load on each pod.
> That would not be the case. The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster - not related to the query load.
Thanks for the clarification.
> What is considered a large cluster? ... We are running ~500 services and ~500 endpoints.
Per the link 150000 Pods per cluster. Each pod can have multiple services and endpoints.
I expect endpoint churn (per unit time) to be a more useful number than absolute number of endpoints.
There's nothing stopping one from using a DaemonSet with maxSurge=1 and maxUnavailable=0 together with internalTrafficPolicy: Local with the vanilla coredns image.
The suggested way to autoscale coredns is proportional to cluster size, exactly the same as scaling with a daemonset, except with a configurable `coresPerReplica` rather than `coresPerReplica` being equal to the number of cores per machine. The suggested config in the doc is `"coresPerReplica":256,"nodesPerReplica":16`, which is also "linear" any way you look at it; the advantage is that you can choose `K` and run a fraction of CoreDNS pods when you have small nodes. At worst the DaemonSet method results in 16x the load on the API server.
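The rollout side of that would look roughly like this on a DaemonSet (a sketch only; surge-style rolling updates for DaemonSets need a reasonably recent Kubernetes version, and this complements the internalTrafficPolicy: Local Service shown earlier):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: coredns
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Start the replacement pod before removing the old one, so a node
      # never loses its local DNS endpoint during a rollout.
      maxSurge: 1
      maxUnavailable: 0
  # selector and pod template omitted for brevity
```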
As such I see no good argument to not run CoreDNS as a Daemonset.
On the contrary I can think of quite a few advantages to the Daemonset approach:
any updates?
Hello everyone, I just drafted a pull request to show how to use coredns and cilium to implement nodelocal dns. I've tested it and it worked, without duplicated `__PILLAR__` variables, like @vaskozl mentioned. Please see the PR linked above for more info.
Also I noticed that @johnbelamaric said
> node local DNS is just a custom build of coredns with minimal plugins and with a little glue to update the iptables. So, effectively, nodelocaldns is what you are saying. It just doesn't run the k8s plugin.
In that case, if I'm using cilium and bpf to rewrite requests, can I use coredns instead, or are there any hidden pitfalls I'm not aware of?
If you are getting your requests directed locally by cilium/bpf instead of the iptables rules that NodeLocalDNS installs, then yeah, running coredns should be OK. The other thing it does is turn off connection tracking for those requests, so that you don't run into the conntrack overflow issues we have seen in the past. Does your solution handle that? There were some older kernel bugs that this also helped avoid, IIRC - not sure the status of those.
As discussed above, I still would not use the standard K8s DNS Corefile though - I would create a custom one that just enables cache and maybe stub domains for this. Definitely not the K8s plugin, especially if you have a large cluster. I don't recall if NodeLocalDNS has some special stub domain support or not, where it reads stub domain definitions from the api server. That tickles a memory but it's been a long time since I looked.
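Something along these lines is what such a cache-only Corefile could look like (a sketch, not the stock kube-dns config; the upstream IP and the stub domain are placeholders):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-dns-corefile
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        # Cache only: no kubernetes plugin, so no API server watches and
        # a much smaller memory footprint.
        cache 30
        # Placeholder cluster DNS Service IP.
        forward . 10.96.0.10 {
            force_tcp
        }
    }
    # Optional stub domain (placeholder values).
    corp.example.com:53 {
        cache 30
        forward . 10.0.0.53
    }
```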
Also, the NodeLocalDNS build of CoreDNS is stripped down to have as small a memory footprint as possible. If you use the standard CoreDNS, it will take more memory than the node local one.
Of course, you could build your own minimal CoreDNS for this, too.
Thanks for your reply. Here are the advantages of node-cache/NodeLocalDNS as I've summarized them:
I think I can test or handle those issues by:
I'll do more work to see which solution to adopt.
> turn off connection tracking to fix conntrack issues, which affect only kernels older than https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-2065667530
No, I think conntrack can fill up with any kernel version. The issue is that since UDP is connectionless, conntrack entries are expunged by timeout rather than a connection closing. AIUI that issue is unrelated to the kernel bugs which caused problems a few years ago.
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Hi! I'd like to revive the issue. As I mentioned in https://github.com/coredns/helm/issues/86 - I have ONE proper use case for a DaemonSet: when you have a bare-metal (or VM) self-managed cluster, you don't want to use HPA or anything similar, and you want to deploy CoreDNS on the control plane nodes... a daemonset shines in such a case. Yes, I understand this is probably not more than 5% of all k8s users... but why should we ignore them? Anyway, it won't create breaking changes and the change would be relatively easy to maintain.
The current recommended DNS architecture solution within a cluster includes NodeLocal DNS + CoreDNS deployment + DNS autoscaler.
To me it would seem preferable to use a much simpler solution - run CoreDNS as a daemonset.
Is there a downside to such a solution? Why does the recommended solution involve a more complex architecture?
See also https://github.com/coredns/helm/issues/86#issuecomment-1502024775