
Improve DNS lookup performance by working around ndots #656

Open lfrancke opened 1 month ago

lfrancke commented 1 month ago

The default resolv.conf file generated by the kubelet contains the option ndots:5, which can result in 4 extra DNS requests for every lookup. We can prevent these.

Example resolv.conf:

search default.svc.cluster.local svc.cluster.local cluster.local localdomain
nameserver 10.96.0.10
options ndots:5

An example lookup of foo-service.foo.svc.cluster.local from within a Pod results in these DNS requests in CoreDNS:

"A IN foo-service.foo.svc.cluster.local.default.svc.cluster.local. udp 100 false 4096" NXDOMAIN qr,aa,rd 170 0.000081963s
"A IN foo-service.foo.svc.cluster.local.svc.cluster.local. udp 92 false 4096" NXDOMAIN qr,aa,rd 162 0.000039183s
"A IN foo-service.foo.svc.cluster.local.cluster.local. udp 88 false 4096" NXDOMAIN qr,aa,rd 158 0.000030127s
"A IN foo-service.foo.svc.cluster.local.localdomain. udp 86 false 4096" NXDOMAIN qr,rd,ra 63 0.000683273s
"A IN foo-service.foo.svc.cluster.local. udp 74 false 4096" NOERROR qr,aa,rd 100 0.000030406s

The documentation for ndots says:

sets a threshold for the number of dots which must appear in a name before an initial absolute query will be made. The default for n is 1, meaning that if there are any dots in a name, the name will be tried first as an absolute name before any search list elements are appended to it.

So, for any name with fewer dots than the threshold, the resolver tries each of the entries from the search list in order before it tries an absolute query. In our case we try to use FQDNs everywhere to avoid ambiguity, which means these extra lookups are unnecessary. But as the log above shows, even using the FQDN does not help, because foo-service.foo.svc.cluster.local contains only four dots and therefore stays below the ndots:5 threshold.

I see two options to work around this, plus the possibility of combining them:

Option 1: Add "."

If we appended a "." to all our lookups, the name would have five dots and an absolute lookup would be performed immediately. A lookup for foo-service.foo.svc.cluster.local. results in a single query:

"A IN foo-service.foo.svc.cluster.local. udp 74 false 4096" NOERROR qr,aa,rd 100 0.000051045s

Option 2: Change DNS options for our Pods

By creating all our Pods with this additional config we can work around the problem "globally":

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "1"

https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-dns-config
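
With that option set, the kubelet-generated resolv.conf from the example above should end up as (the Pod's dnsConfig takes precedence over the generated defaults):

search default.svc.cluster.local svc.cluster.local cluster.local localdomain
nameserver 10.96.0.10
options ndots:1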

Option 3: Do both

With Option 1 it might be complicated to catch all cases, e.g. if anyone uses pod overrides or if we assemble a name somewhere in a non-standard location. Option 2 might, in theory, be overridden. We could simply apply both (see the sketch below). Everything still works without either change, as this is just an optimization, and I don't see a downside to applying both options.
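
As a sketch, applying both to a Pod could look like this (container name, image and env var are hypothetical):

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "1"
  containers:
    - name: foo                  # hypothetical
      image: example/foo:latest  # hypothetical
      env:
        # The trailing dot forces an absolute lookup even if the
        # ndots setting above were overridden elsewhere.
        - name: FOO_SERVICE_HOST
          value: foo-service.foo.svc.cluster.local.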

nightkr commented 1 month ago

I definitely think we should do at least 1.

I see the argument for 2, but it would break user configuration that relies on it (either accidentally, or as a workaround to keep their environments isolated). I'd rather start moving towards maybe eventually seeing the default changed in a hypothetical Pod/v2 API than be "the one outlier" in terms of what config we accept.

lfrancke commented 4 weeks ago

Can you elaborate? Can you give me an example of a user scenario in which this would break?

eminaktas commented 3 weeks ago

I would suggest setting ndots to 2, which could be the best option. Please see my suggestion for DNS configuration here.

If you go with 1, DNS queries with 1 or more dots are treated as FQDNs. For example, dummy-service.another-namespace will fail because the resolver won't complete the query with svc.cluster.local. You would effectively block the app from reaching apps in other namespaces.
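
For illustration, the same dnsConfig as above but with the threshold at 2; one-dot names like dummy-service.another-namespace still get expanded via the search list, while names with two or more dots are tried as absolute names first:

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"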

I built kubedns-shepherd to have more control over DNS settings, and it even allows customizing them at the host level to optimize further; see, for example, this configuration in the Kubespray repository.

lfrancke commented 3 weeks ago

Hey @eminaktas, thank you for the hints. That's a good point! We haven't made any decisions here yet, so this is coming in at a good time.

As an aside: I did look through the README for kubedns-shepherd and saw your snippet on auto-discovery of clusterDomain etc. I am not entirely sure how you stumbled upon this issue, but you might have seen it pop up in #sig-network on Slack? If not, then this blog post might be of interest to you: https://stackable.tech/en/kubernetes-clusterdomain-setting/ and potentially this follow-up issue: https://github.com/stackabletech/issues/issues/662

eminaktas commented 2 weeks ago

Thanks @lfrancke.

I stumbled on this issue thanks to Antonio Ojea. He shared it on X. I really enjoyed the article, and I have also been looking for an API to get clusterDomain along with clusterDNS. I will definitely keep an eye on this topic.

Can we also include clusterDNS? I am not sure if it already exists via an API.

I am able to get the information like you did in the article:

kubectl get --raw /api/v1/nodes/minikube/proxy/configz | jq .kubeletconfig.clusterDNS
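
The same endpoint presumably exposes clusterDomain as well; an untested sketch using the same approach:

kubectl get --raw /api/v1/nodes/minikube/proxy/configz | jq .kubeletconfig.clusterDomain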

maltesander commented 1 week ago

I'm with Nat, I think Option 1 will instantly and significantly reduce DNS queries in "default" setups, while not having any other real drawbacks. Since we will always work with FQDNs, this is the only "constant" we can count on here.

I'm also fine with simply documenting this from a Stackable point of view, e.g. "operators always write FQDNs ... in order to decrease DNS queries, set ndots to 4" (assuming we do not add another (5th) "." at the end).
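
For illustration, that documented customer-side setting could look like this (a sketch, assuming the FQDNs we write have at least four dots, like foo-service.foo.svc.cluster.local):

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "4"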

Concerning Option 2, or setting/overriding ndots in general: I also do not see the point of "doctoring" around customer-specific settings. I don't think that 1, 2 or 5 is "correct" or "best"; this is simply too nondeterministic. So I'm all for doing Option 1.

lfrancke commented 1 week ago

Thank you.

I think "just" adding a "." to the discovered or configured clusterDomain should be enough to catch most cases, no? It's okay to only have a 40-60% solution here. Anything is better than nothing.

maltesander commented 1 week ago

Yes, if we assume that 40-60% of setups have the "default" ndots:5. I think doing less is more here, since it depends too heavily on customers, cluster setups, etc. Having this properly documented should be the first step; going for Option 1 afterwards is not a huge change either.