There are a number of reasons.
When running as cluster DNS, CoreDNS is configured with the Kubernetes plugin. This puts a watch on all EndpointSlices and Services (and other things, depending on your config). This means a persistent connection to the API server for each instance of CoreDNS, and the API server sending watch events down that channel for any changes to those resources. For clusters with thousands of nodes, that would put a substantial burden on the API server.
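For context, a typical cluster-DNS Corefile with the kubernetes plugin looks roughly like this (a simplified sketch along the lines of the common kubeadm defaults; your ConfigMap may differ):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        # The kubernetes plugin is what opens the watches on Services
        # and EndpointSlices (and related resources) described above.
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
```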
NodeLocalDNS, on the other hand, is only a cache and a stub resolver. It does not put a watch on the API server. This makes it much less of a burden on the API server, and also makes it a much smaller process since it does not need to use memory to hold those API resources.
NodeLocalDNS also solves a second problem. Early versions of Kubernetes would sometimes have failures due to the conntrack table filling up. This was found to be because UDP entries need to age out of the conntrack table, so a burst of DNS traffic could fill that table up (I seem to recall some kernel bugs may have also been involved, but this is several years ago). NodeLocalDNS turns off connection tracking for UDP traffic to the node local DNS IP address, and it upgrades requests made to cluster DNS from UDP to TCP. TCP is not subject to this issue since entries can be removed when the connection is closed.
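To make the conntrack/TCP point concrete, the NodeLocal DNSCache Corefile forwards cluster-suffix queries upstream with force_tcp. A trimmed sketch (the real manifest templates the listen and upstream addresses via the __PILLAR__ variables and adds health/metrics plugins):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-local-dns
  namespace: kube-system
data:
  Corefile: |
    cluster.local:53 {
        cache 30
        # force_tcp upgrades the query to cluster DNS from UDP to TCP,
        # so the conntrack entry is removed when the connection closes
        # instead of having to age out.
        forward . __PILLAR__CLUSTER__DNS__ {
            force_tcp
        }
    }
    .:53 {
        cache 30
        forward . __PILLAR__UPSTREAM__SERVERS__
    }
```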
Finally, even if we did use a DaemonSet, it wouldn't work the way you would hope. There is no guarantee that requests from a client would reach the local CoreDNS instance. In fact, at the time NodeLocalDNS was created, that would be rare, because the local node had no higher weight in the kube-proxy based load balancing. So if you had 1000 instances of CoreDNS, only 1/1000 of requests would go to your local CoreDNS instance. I am not sure if that has changed; there has been some work on more topology-aware services, but I am not sure how far it has progressed - you would have to check with SIG Network.
Thanks for the info @johnbelamaric.
Regarding the API server connections, I have addressed that in https://github.com/coredns/helm/issues/86#issuecomment-1510393716 - other daemonsets also perform API calls - kube-proxy, CNI plugins, log collectors etc.
Regarding the conntrack issue - can't we turn off connection tracking for a coredns daemonset?
Regarding directing requests from the client to the local coredns instance - this is now possible with internal traffic policy, but in any case this problem would be present with nodelocaldns as well, which could negate its benefits. One possible issue that could occur when using coredns as a daemonset with internal traffic policy is that until the coredns pod is ready no DNS requests could be made by other pods on that node.
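For reference, the Service side of that would look roughly like this (a sketch assuming a Kubernetes version where internalTrafficPolicy is available; note that with Local, a node whose CoreDNS pod is not ready has its DNS queries dropped rather than rerouted, which is the readiness concern mentioned above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  selector:
    k8s-app: kube-dns
  clusterIP: 10.96.0.10   # placeholder cluster DNS IP
  # Route DNS traffic only to endpoints on the same node as the client.
  internalTrafficPolicy: Local
  ports:
    - name: dns
      port: 53
      protocol: UDP
    - name: dns-tcp
      port: 53
      protocol: TCP
```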
- What is the measured impact of a coredns daemonset's watches on the API server? I know that Zalando is using a coredns daemonset and I'm pretty sure they're running at scale.
- I believe that using coredns as a daemonset might avoid the conntrack issue without any workaround, since each pod would receive far fewer requests than with a coredns deployment.
- It is also possible to have clients bypass the kube-proxy service and send requests to the local coredns pod by using the downward API:
```yaml
- name: HOST_IP
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: status.hostIP
```
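One caveat with the downward API approach: as far as I know, dnsConfig.nameservers only accepts literal IP addresses, so the HOST_IP value has to be consumed by the application itself (or by an entrypoint that rewrites resolv.conf). A hypothetical sketch (the image name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: example.com/app:latest   # placeholder image
      env:
        - name: HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
      # The application (or its entrypoint) must read HOST_IP and use it
      # as its resolver address; it is not applied automatically.
```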
As @johnbelamaric mentioned, it would be the linear dependency. In a CoreDNS daemonset, each CoreDNS instance would initialize watchers for EndpointSlices, Services and ConfigMaps. The overall effect would be roughly how frequently these objects change in your cluster, multiplied by N (the number of nodes).
Nodelocaldns uses TCP to talk to the cluster DNS pods, so it is far less affected by the conntrack issue than UDP would be.
Also keep in mind that with a CoreDNS daemonset there would be no guarantee that a client pod talks to the local CoreDNS pod on the same node. /etc/resolv.conf points to the kube-dns Service, so the traffic could go to any pod in the cluster. Plus, since the default DNS protocol is UDP and the client will communicate with arbitrary CoreDNS pods across the cluster, the conntrack exhaustion issue would reappear in such a setup.
Whereas with nodelocaldns (with the iptables rules) the client is guaranteed to talk to the local NLD pod on the same node.
@dpasiukevich
- That's just an estimation. I don't expect there have been any scalability benchmarks measuring API server performance and resource usage as a function of the daemonset size and the frequency/size of Service/EndpointSlice object changes.
- It definitely can be done. And it's definitely a good optimisation in certain cases at the cost of more DIY.
Why would it require more DIY? Couldn't it be implemented into coredns directly?
Another idea: have a nodelocaldns container and a coredns sidecar container in the same pod and direct traffic from nodelocaldns to coredns via localhost. This would simplify the architecture while preserving the benefits of nodelocaldns without requiring new features or code changes. A possible issue would be if the nodelocaldns container starts before the coredns container, in which case DNS resolution would fail; I assume this can be solved by having nodelocaldns wait for coredns to be available.
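A rough sketch of that pod layout, just to illustrate the idea (the image tags, the listen port and the wiring between the two containers are assumptions, not a tested manifest):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-dns
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: node-dns
  template:
    metadata:
      labels:
        k8s-app: node-dns
    spec:
      containers:
        - name: node-cache
          image: registry.k8s.io/dns/k8s-dns-node-cache:1.23.1   # assumed tag
          # Its Corefile (not shown) would forward cluster-suffix queries
          # to the coredns sidecar at 127.0.0.1:5353 instead of the
          # kube-dns Service.
        - name: coredns
          image: coredns/coredns:1.11.1   # assumed tag
          args: ["-conf", "/etc/coredns/Corefile"]
          # Bound to 127.0.0.1:5353; a readinessProbe here is what would
          # address the startup-ordering concern mentioned above.
```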
@johnbelamaric @dpasiukevich any updates?
It's still not clear to me whether the possible strain on the API server was tested. Did the relevant group in the kubernetes project run tests on a coredns daemonset and find that it produces excessive load on the API server at scale? Or is it just speculation?
Yes. It doesn't scale.
By the way, node local DNS is just a custom build of coredns with minimal plugins and with a little glue to update the iptables. So, effectively, nodelocaldns is what you are saying. It just doesn't run the k8s plugin.
By the way, it's not just the API server that is the issue. It's a simple matter of cost efficiency. Imagine a 10,000 node cluster. If you want to use an extra 500MB on every node to cache the entire cluster's worth of services and headless endpoints, that is 5,000 GB of RAM. It's expensive. Much better to just have the node local DNS cache, with only a small DNS cache needed for the workloads on that node, which takes say < 50MB per node. @prameshj did a very detailed set of analyses before implementing this.
> By the way, node local DNS is just a custom build of coredns with minimal plugins and with a little glue to update the iptables. So, effectively, nodelocaldns is what you are saying. It just doesn't run the k8s plugin.
What I wrote was that the coredns container should be co-located on the same pod as nodelocaldns in order to avoid the extra infrastructure complexity. This is what Zalando is doing, but with dnsmasq instead of nodelocaldns; according to them it performs better.
> By the way, it's not just the API server that is the issue. It's a simple matter of cost efficiency. Imagine a 10,000 node cluster. If you want to use an extra 500MB on every node to cache the entire cluster's worth of services and headless endpoints, that is 5,000 GB of RAM. It's expensive. Much better to just have the node local DNS cache, with only a small DNS cache needed for the workloads on that node, which takes say < 50MB per node. @prameshj did a very detailed set of analyses before implementing this.
Looking at the metrics from our cluster over the course of the last week, coredns did not consume more than 50MB of memory. We can assume that if it ran as a daemonset it would consume even less memory, since there would be much less load on each pod.
> What I wrote was that the coredns container should be co-located on the same pod as nodelocaldns in order to avoid the extra infrastructure complexity.
That adds more infrastructure and complexity. For the sake of argument, it would be simpler and result in less overhead to compile the kubernetes plugin into nodelocaldns, and just run a kubernetes enabled nodelocaldns on each node by itself. Of course with the kubernetes plugin in use, each instance of nodelocaldns would then require more memory (as much as CoreDNS uses). So it is still significantly more resource-expensive than the current solution.
> Looking at the metrics from our cluster over the course of the last week, coredns did not consume more than 50MB of memory.
The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster. 50MB would suggest your cluster is not a large-scale cluster, and thus does not have a large number of services and endpoints.
> We can assume that if it ran as a daemonset it would consume even less memory, since there would be much less load on each pod.
That would not be the case. The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster - not related to the query load.
> What I wrote was that the coredns container should be co-located on the same pod as nodelocaldns in order to avoid the extra infrastructure complexity.
> That adds more infrastructure and complexity. For the sake of argument, it would be simpler and result in less overhead to compile the kubernetes plugin into nodelocaldns, and just run a kubernetes enabled nodelocaldns on each node by itself. Of course with the kubernetes plugin in use, each instance of nodelocaldns would then require more memory (as much as CoreDNS uses). So it is still significantly more resource-expensive than the current solution.
I don't think it is more complex than nodelocaldns daemonset + coredns deployment + dns autoscaler. However, using just nodelocaldns with the kubernetes plugin would be preferable; I'm not sure how it would deal with non-cached responses in that case, though?
> Looking at the metrics from our cluster over the course of the last week, coredns did not consume more than 50MB of memory.
> The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster. 50MB would suggest your cluster is not a large-scale cluster, and thus does not have a large number of services and endpoints.
What is considered a large cluster? There is no info on the number of services/endpoints in https://kubernetes.io/docs/setup/best-practices/cluster-large/.
We are running ~500 services and ~500 endpoints.
> We can assume that if it ran as a daemonset it would consume even less memory, since there would be much less load on each pod.
> That would not be the case. The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster - not related to the query load.
Thanks for the clarification.
> What is considered a large cluster? ... We are running ~500 services and ~500 endpoints.
Per the link 150000 Pods per cluster. Each pod can have multiple services and endpoints.
I expect endpoint churn (per unit time) to be a more useful number than absolute number of endpoints.
There's nothing stopping one from using a DaemonSet with maxSurge=1 and maxUnavailable=0 together with internalTrafficPolicy: Local with the vanilla coredns image.
The suggested way to autoscale coredns is proportional to cluster size, exactly the same as scaling with a daemonset, except with a configurable `coresPerReplica` rather than `coresPerReplica` being equal to the number of cores per machine. The suggested config in the doc is `"coresPerReplica":256,"nodesPerReplica":16`, which is also "linear" any way you look at it; the advantage is that you can choose `K` and run a fraction of CoreDNS pods when you have small nodes. At worst the DaemonSet method results in 16x the load on the API server.
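The rollout side of that would look roughly like this on a DaemonSet (a sketch only; surge-style rolling updates for DaemonSets need a reasonably recent Kubernetes version, and this complements the internalTrafficPolicy: Local Service shown earlier):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: coredns
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Start the replacement pod before removing the old one, so a node
      # never loses its local DNS endpoint during a rollout.
      maxSurge: 1
      maxUnavailable: 0
  # selector and pod template omitted for brevity
```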
As such I see no good argument to not run CoreDNS as a Daemonset.
On the contrary I can think of quite a few advantages to the Daemonset approach:
any updates?
Hello everyone, I just drafted a pull request to show how to use coredns and cilium to implement nodelocal dns. I've tested it and it worked, without duplicated `__PILLAR__` variables, like @vaskozl mentioned. Please see the PR linked above for more info.
Also I noticed that @johnbelamaric said
> node local DNS is just a custom build of coredns with minimal plugins and with a little glue to update the iptables. So, effectively, nodelocaldns is what you are saying. It just doesn't run the k8s plugin.
In that case, if I'm using cilium and bpf to rewrite requests, can I use coredns instead, or are there any hidden pitfalls I'm not aware of?
If you are getting your requests directed locally by cilium/bpf instead of the iptables rules that NodeLocalDNS installs, then yeah, running coredns should be OK. The other thing it does is turn off connection tracking for those requests, so that you don't run into the conntrack overflow issues we have seen in the past. Does your solution handle that? There were some older kernel bugs that this also helped avoid, IIRC - not sure the status of those.
As discussed above, I still would not use the standard K8s DNS Corefile though - I would create a custom one that just enables cache and maybe stub domains for this. Definitely not the K8s plugin, especially if you have a large cluster. I don't recall if NodeLocalDNS has some special stub domain support or not, where it reads stub domain definitions from the api server. That tickles a memory but it's been a long time since I looked.
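Something along these lines is what such a cache-only Corefile could look like (a sketch, not the stock kube-dns config; the upstream IP and the stub domain are placeholders):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-dns-corefile
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        # Cache only: no kubernetes plugin, so no API server watches and
        # a much smaller memory footprint.
        cache 30
        # Placeholder cluster DNS Service IP.
        forward . 10.96.0.10 {
            force_tcp
        }
    }
    # Optional stub domain (placeholder values).
    corp.example.com:53 {
        cache 30
        forward . 10.0.0.53
    }
```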
Also, the NodeLocalDNS build of CoreDNS is stripped down to have as small a memory footprint as possible. If you use the standard CoreDNS, it will take more memory than the node local one.
Of course, you could build your own minimal CoreDNS for this, too.
Thanks for your reply. Here are the advantages of node-cache/NodeLocalDNS as I've summarized them:
I think I can test or handle those issues by:
I'll do more work to see which solution to adopt.
> turn off connection tracking to fix conntrack issues, which affect only kernels older than https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-2065667530
No, I think conntrack can fill up with any kernel version. The issue is that since UDP is connectionless, conntrack entries are expunged by timeout rather than a connection closing. AIUI that issue is unrelated to the kernel bugs which caused problems a few years ago.
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Hi! I'd like to revive the issue. As I mentioned in https://github.com/coredns/helm/issues/86 - I have ONE proper use case for a DaemonSet: when you have a bare-metal (or VM) self-managed cluster, you don't want to use HPA or anything similar, and you want to deploy CoreDNS on the control plane nodes... a daemonset shines in such a case. Yes, I understand this is probably not more than 5% of all k8s users... but why should we ignore them? Anyway, it won't create breaking changes and the change would be relatively easy to maintain.
The current recommended DNS architecture solution within a cluster includes NodeLocal DNS + CoreDNS deployment + DNS autoscaler.
To me it would seem preferable to use a much simpler solution - run CoreDNS as a daemonset.
Is there a downside to such a solution? Why does the recommended solution involve a more complex architecture?
See also https://github.com/coredns/helm/issues/86#issuecomment-1502024775