telepresenceio / telepresence

Local development against a remote Kubernetes or OpenShift cluster
https://www.telepresence.io

Slow DNS queries for service name w/o namespace #3076

Closed amkartashov closed 1 year ago

amkartashov commented 1 year ago

nslookup is slow: a query for a service name without a namespace takes more than 10 seconds and prints "Question section mismatch" warnings.

$ nslookup kong-kong-admin
;; ;; Question section mismatch: got kong-kong-admin.myns.svc.cluster.local/A/IN
Server:         172.19.80.1
Address:        172.19.80.1#53

Name:   kong-kong-admin.myns.svc.cluster.local
Address: 10.97.115.111
;; ;; Question section mismatch: got kong-kong-admin.myns.svc.cluster.local/AAAA/IN

To Reproduce

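  1. Connect to a remote cluster.
  2. Create an intercept in namespace myns.
  3. Run nslookup for a service name without the namespace, e.g. nslookup kong-kong-admin.
  4. The query takes about 10 seconds and prints two "Question section mismatch" warnings, one for the A query and one for the AAAA query.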

Note that subsequent requests are fast because of the cache. To reproduce the problem again, wait 1 minute so that telepresence cleans up its DNS cache.

Expected behavior

Fast queries, no warnings

Additional context

Python services started with telepresence intercept fail when using async HTTP requests because of a 5-second timeout: https://github.com/encode/httpx/discussions/2167 https://github.com/encode/httpx/discussions/2321

Workaround: set options timeout:1 attempts:2 in /etc/resolv.conf to force a retry after 1 second. The retry will hit the DNS cache in telepresence.

REASON

Here https://github.com/telepresenceio/telepresence/blob/c495f52c451ea8efe69d7a371e6d6c7bf13d20b4/pkg/client/rootd/dns/server_linux.go#L95 you modify the question section, and the modified question is returned to the DNS client (dig, nslookup, an application, etc.). Since this is UDP (no sessions), a response must carry exactly the same question section as the query in order to be matched to it, so a response with a different name in the question section is rejected.
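
To illustrate the matching rule described above, here is a minimal sketch in Go. It is not telepresence or resolver code: the Question type and the questionMatches function are hypothetical, simplified stand-ins that only show why a response whose question name was expanded with a search suffix no longer matches what the client sent.

```go
package main

import (
	"fmt"
	"strings"
)

// Question is a simplified, hypothetical stand-in for the question
// section of a DNS message (name and record type only).
type Question struct {
	Name  string
	Qtype string
}

// questionMatches reports whether the question echoed in a response is
// the one the client sent. DNS names compare case-insensitively.
func questionMatches(sent, got Question) bool {
	return strings.EqualFold(sent.Name, got.Name) && sent.Qtype == got.Qtype
}

func main() {
	sent := Question{Name: "kong-kong-admin.", Qtype: "A"}
	// The name after a search suffix was appended on the server side.
	rewritten := Question{Name: "kong-kong-admin.myns.svc.cluster.local.", Qtype: "A"}

	fmt.Println("echoed unchanged:", questionMatches(sent, sent))
	fmt.Println("echoed rewritten:", questionMatches(sent, rewritten))
}
```

A client applying a check like this would treat the rewritten response as not belonging to its query, which is consistent with the "Question section mismatch" warnings shown above.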

So what happens is:

PATCH

The idea is simple: preserve the original query name and restore it after the DNS lookup in the cluster has finished.

--- a/pkg/client/rootd/dns/server_linux.go
+++ b/pkg/client/rootd/dns/server_linux.go
@@ -89,11 +89,14 @@ func (s *Server) resolveInSearch(c context.Context, q *dns.Question) (dnsproxy.R
        }

        if s.shouldApplySearch(query) {
+               origQuery := q.Name
                for _, sp := range s.search {
                        q.Name = query + sp
                        if rrs, rCode, err := s.resolveInCluster(c, q); err != nil || len(rrs) > 0 {
+                               q.Name = origQuery
                                return rrs, rCode, err
                        }
+                       q.Name = origQuery
                }
        }
        return s.resolveInCluster(c, q)
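
The save-and-restore idea can also be tried outside telepresence. The sketch below is self-contained and makes several assumptions: question, lookup, and resolveInSearch are hypothetical stand-ins (not the real dns.Question or Server methods), and it restores the name with defer, which is a variation on the explicit assignments in the patch.

```go
package main

import "fmt"

// question is a simplified, hypothetical stand-in for dns.Question.
type question struct{ Name string }

// lookup is a hypothetical stand-in for the in-cluster lookup.
// It knows exactly one fully qualified name.
func lookup(q *question) []string {
	if q.Name == "kong-kong-admin.myns.svc.cluster.local." {
		return []string{"10.97.115.111"}
	}
	return nil
}

// resolveInSearch tries the name with each search suffix appended and
// guarantees that q.Name holds the original name again on return.
func resolveInSearch(q *question, search []string) []string {
	orig := q.Name
	defer func() { q.Name = orig }() // restore on every return path
	for _, sp := range search {
		q.Name = orig + sp
		if rrs := lookup(q); len(rrs) > 0 {
			return rrs
		}
	}
	// Fall back to the unmodified name, as the original function does.
	q.Name = orig
	return lookup(q)
}

func main() {
	q := &question{Name: "kong-kong-admin."}
	rrs := resolveInSearch(q, []string{"myns.svc.cluster.local."})
	fmt.Println(rrs, q.Name)
}
```

Whichever form is used, the point is the same: the caller that builds the reply sees the name the client originally asked for.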

amkartashov commented 1 year ago

btw, I tested the patch and it works.

amkartashov commented 1 year ago

similar issue in coredns https://github.com/coredns/coredns/issues/1031

thallgren commented 1 year ago

@amkartashov great work, finding the problem and creating a fix for it. Thanks for doing that. I would recommend that you create a pull request with your fix. That way, you'll be listed as one of the contributors.

amkartashov commented 1 year ago

Done. Please let me know if anything else is needed in the PR. I'm not sure whether this should be mentioned in the changelog, or whether testing the fix on localhost can be considered "adequately tested" in this case :)