siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

1.7.1 with hostdns and forwardKubeDNSToHost doesn't resolve anything #8698

Closed evanrich closed 5 months ago

evanrich commented 6 months ago

Bug Report

Description

This is on a cluster that was upgraded (1.6.x -> 1.7.x), not a fresh install.

After applying the following patch:

machine:
  features:
    hostDNS:
      enabled: true
      forwardKubeDNSToHost: true

nothing seems to resolve DNS, either inside the cluster or externally.

dnsupstream:

talosctl -n 192.168.5.10,192.168.5.11,192.168.5.12,192.168.5.15 get dnsupstream
NODE           NAMESPACE   TYPE          ID            VERSION   HEALTHY   ADDRESS
192.168.5.10   network     DNSUpstream   192.168.5.1   1         true      192.168.5.1:53
192.168.5.11   network     DNSUpstream   192.168.5.1   1         true      192.168.5.1:53
192.168.5.12   network     DNSUpstream   192.168.5.1   1         true      192.168.5.1:53
192.168.5.15   network     DNSUpstream   192.168.5.1   1         true      192.168.5.1:53
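
A quick way to confirm what the table above shows (every node reports its upstream as healthy) is to parse the output and list any row whose HEALTHY column is not `true`. This is only a sketch: it assumes the whitespace-aligned layout printed above, and the helper names `parse_table` and `unhealthy_upstreams` are made up for illustration.

```python
# Sample copied from the `talosctl get dnsupstream` output above.
TABLE = """\
NODE           NAMESPACE   TYPE          ID            VERSION   HEALTHY   ADDRESS
192.168.5.10   network     DNSUpstream   192.168.5.1   1         true      192.168.5.1:53
192.168.5.11   network     DNSUpstream   192.168.5.1   1         true      192.168.5.1:53
192.168.5.12   network     DNSUpstream   192.168.5.1   1         true      192.168.5.1:53
192.168.5.15   network     DNSUpstream   192.168.5.1   1         true      192.168.5.1:53
"""


def parse_table(text):
    """Split a whitespace-aligned table into a list of dicts keyed by header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def unhealthy_upstreams(rows):
    """Return (node, address) pairs whose HEALTHY column is not 'true'."""
    return [(r["NODE"], r["ADDRESS"]) for r in rows if r["HEALTHY"] != "true"]


rows = parse_table(TABLE)
print(len(rows), "upstream rows;", "unhealthy:", unhealthy_upstreams(rows))
```

An empty list here means the upstream health check itself is not what is failing.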

resolv.conf

talosctl -n 192.168.5.10 read /system/resolved/resolv.conf
nameserver 10.96.0.9
talosctl -n 192.168.5.10 read /etc/resolv.conf
nameserver 127.0.0.53
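
The two files matter because the Corefile further down uses `forward . /etc/resolv.conf`, so CoreDNS takes its upstreams from the `nameserver` lines of whichever resolv.conf ends up inside the pod. Which of the two files that is is not shown in this thread; the 10.96.0.9 target in the CoreDNS errors suggests the first one. A minimal sketch of extracting the nameservers from the two outputs above (the `nameservers` helper is hypothetical, not part of any tool):

```python
def nameservers(resolv_conf_text):
    """Return the addresses on `nameserver` lines, skipping comments and blanks."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":
            continue  # comment or empty line
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


# Contents copied from the two `talosctl read` outputs above.
SYSTEM_RESOLVED = "nameserver 10.96.0.9\n"
HOST_ETC = "nameserver 127.0.0.53\n"

print("/system/resolved/resolv.conf ->", nameservers(SYSTEM_RESOLVED))
print("/etc/resolv.conf             ->", nameservers(HOST_ETC))
```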

resolvers

talosctl -n 192.168.5.10 get resolvers
NODE           NAMESPACE   TYPE             ID          VERSION   RESOLVERS
192.168.5.10   network     ResolverStatus   resolvers   2         ["192.168.5.1"]

CoreDNS was restarted twice after applying the patch.

Logs

[ERROR] plugin/errors: 2 radarr.media.svc. AAAA: read udp 10.244.2.33:48571->10.96.0.9:53: i/o timeout
[INFO] 10.244.0.142:37010 - 44799 "AAAA IN radarr.media.svc. udp 34 false 512" - - 0 2.001171487s
[ERROR] plugin/errors: 2 radarr.media.svc. AAAA: read udp 10.244.0.228:41133->10.96.0.9:53: i/o timeout
[INFO] 10.244.0.142:37010 - 44353 "A IN radarr.media.svc. udp 34 false 512" - - 0 2.001187098s
[ERROR] plugin/errors: 2 radarr.media.svc. A: read udp 10.244.0.228:49164->10.96.0.9:53: i/o timeout
[INFO] 10.244.0.30:55153 - 65462 "AAAA IN sonarr.media.svc. udp 34 false 512" - - 0 2.001133409s
[INFO] 10.244.0.30:55153 - 65275 "A IN sonarr.media.svc. udp 34 false 512" - - 0 2.001014136s
[ERROR] plugin/errors: 2 sonarr.media.svc. A: read udp 10.244.2.33:38186->10.96.0.9:53: i/o timeout
[ERROR] plugin/errors: 2 sonarr.media.svc. AAAA: read udp 10.244.2.33:57661->10.96.0.9:53: i/o timeout
[INFO] 10.244.3.16:57244 - 50161 "AAAA IN api.allegion.yonomi.cloud. udp 43 false 512" - - 0 2.001230715s
[ERROR] plugin/errors: 2 api.allegion.yonomi.cloud. AAAA: read udp 10.244.0.228:33430->10.96.0.9:53: i/o timeout
[INFO] 10.244.3.16:57244 - 49553 "A IN api.allegion.yonomi.cloud. udp 43 false 512" - - 0 2.001237302s
[ERROR] plugin/errors: 2 api.allegion.yonomi.cloud. A: read udp 10.244.0.228:47070->10.96.0.9:53: i/o timeout
[INFO] 10.244.1.242:47829 - 1031 "AAAA IN api.doppler.com. udp 44 false 1232" - - 0 2.001031405s
[INFO] 10.244.1.242:50138 - 44842 "A IN api.doppler.com. udp 44 false 1232" - - 0 2.001066446s
[ERROR] plugin/errors: 2 api.doppler.com. AAAA: read udp 10.244.0.228:52401->10.96.0.9:53: i/o timeout
[ERROR] plugin/errors: 2 api.doppler.com. A: read udp 10.244.0.228:48637->10.96.0.9:53: i/o timeout
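
To see at a glance where the timeouts are going, the `[ERROR]` lines above can be tallied by destination. The sketch below assumes the exact error wording shown in these logs; the regular expression and the `timeout_targets` helper are illustrative only.

```python
import re
from collections import Counter

# Matches lines such as:
# [ERROR] plugin/errors: 2 radarr.media.svc. AAAA: read udp 10.244.2.33:48571->10.96.0.9:53: i/o timeout
ERROR_RE = re.compile(
    r"\[ERROR\] plugin/errors: \d+ (?P<name>\S+) (?P<qtype>\w+): "
    r"read udp (?P<src>[\d.]+):\d+->(?P<dst>[\d.]+:\d+): i/o timeout"
)


def timeout_targets(log_text):
    """Count i/o timeouts per upstream address (ip:port)."""
    return Counter(m.group("dst") for m in ERROR_RE.finditer(log_text))


# Three lines copied from the log excerpt above.
SAMPLE = """\
[ERROR] plugin/errors: 2 radarr.media.svc. AAAA: read udp 10.244.2.33:48571->10.96.0.9:53: i/o timeout
[INFO] 10.244.0.142:37010 - 44799 "AAAA IN radarr.media.svc. udp 34 false 512" - - 0 2.001171487s
[ERROR] plugin/errors: 2 api.doppler.com. A: read udp 10.244.0.228:48637->10.96.0.9:53: i/o timeout
"""

for target, count in timeout_targets(SAMPLE).items():
    print(f"{count} timeouts waiting on {target}")
```

In the excerpt above every timeout is a read from 10.96.0.9:53, for cluster-internal and external names alike.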

Environment

Reverting the patch (false/false) fixes dns again.

FWIW, here's my coredns configmap:

.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    log . {
        class error
    }
    prometheus :9153

    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}

evanrich commented 6 months ago

coredns graphs go through the roof as well [image]

DmitriyMV commented 6 months ago

Greetings! Can you provide talosctl -n 192.168.5.10,192.168.5.11,192.168.5.12,192.168.5.15 logs dns-resolve-cache output?

evanrich commented 6 months ago

> Greetings! Can you provide talosctl -n 192.168.5.10,192.168.5.11,192.168.5.12,192.168.5.15 logs dns-resolve-cache output?

sure! with

machine:
  features:
    hostDNS:
      enabled: true
      resolveMemberNames: true
      forwardKubeDNSToHost: false

I get ~13k lines; here are the last few:

192.168.5.12: 2024-05-05T19:15:00.325Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 27405\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1\n\n;; OPT PSEUDOSECTION:\n; EDNS: version 0; flags:; udp: 1232\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.12: 2024-05-05T19:15:00.326Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 27405\n;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.12: 2024-05-05T19:15:20.325Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 30173\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1\n\n;; OPT PSEUDOSECTION:\n; EDNS: version 0; flags:; udp: 1232\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.12: 2024-05-05T19:15:20.326Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 30173\n;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.12: 2024-05-05T19:15:40.325Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 26124\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1\n\n;; OPT PSEUDOSECTION:\n; EDNS: version 0; flags:; udp: 1232\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.12: 2024-05-05T19:15:40.326Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 26124\n;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.12: 2024-05-05T19:16:00.324Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 44019\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1\n\n;; OPT PSEUDOSECTION:\n; EDNS: version 0; flags:; udp: 1232\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.12: 2024-05-05T19:16:00.326Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 44019\n;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.12: 2024-05-05T19:16:20.325Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 26814\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1\n\n;; OPT PSEUDOSECTION:\n; EDNS: version 0; flags:; udp: 1232\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.12: 2024-05-05T19:16:20.326Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 26814\n;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.12: 2024-05-05T19:16:40.324Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 44389\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1\n\n;; OPT PSEUDOSECTION:\n; EDNS: version 0; flags:; udp: 1232\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.12: 2024-05-05T19:16:40.326Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 44389\n;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.12: 2024-05-05T19:17:00.324Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 59770\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1\n\n;; OPT PSEUDOSECTION:\n; EDNS: version 0; flags:; udp: 1232\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.12: 2024-05-05T19:17:00.326Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 59770\n;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.12: 2024-05-05T19:17:20.324Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 20152\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1\n\n;; OPT PSEUDOSECTION:\n; EDNS: version 0; flags:; udp: 1232\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.12: 2024-05-05T19:17:20.326Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 20152\n;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.12: 2024-05-05T19:17:40.325Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 43480\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1\n\n;; OPT PSEUDOSECTION:\n; EDNS: version 0; flags:; udp: 1232\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.12: 2024-05-05T19:17:40.326Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 43480\n;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}

with

machine:
  features:
    hostDNS:
      enabled: true
      resolveMemberNames: true
      forwardKubeDNSToHost: true

I get

192.168.5.10: 2024-05-05T19:20:52.012Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 13650\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;sonarr.media.svc.\tIN\t A\n"}
192.168.5.10: 2024-05-05T19:20:52.012Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 45522\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;sonarr.media.svc.\tIN\t AAAA\n"}
192.168.5.10: 2024-05-05T19:20:52.013Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NXDOMAIN, id: 13650\n;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;sonarr.media.svc.\tIN\t A\n\n;; AUTHORITY SECTION:\n.\t1800\tIN\tSOA\ta.root-servers.net. nstld.verisign-grs.com. 2024050501 1800 900 604800 86400\n"}
192.168.5.10: 2024-05-05T19:20:52.013Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NXDOMAIN, id: 45522\n;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;sonarr.media.svc.\tIN\t AAAA\n\n;; AUTHORITY SECTION:\n.\t1800\tIN\tSOA\ta.root-servers.net. nstld.verisign-grs.com. 2024050501 1800 900 604800 86400\n"}
192.168.5.10: 2024-05-05T19:21:01.928Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 61250\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1\n\n;; OPT PSEUDOSECTION:\n; EDNS: version 0; flags:; udp: 1232\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.10: 2024-05-05T19:21:01.928Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 61250\n;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.10: 2024-05-05T19:21:02.682Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 61066\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;radarr.domain.io.\tIN\t AAAA\n"}
192.168.5.10: 2024-05-05T19:21:02.682Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 37810\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;radarr.domain.io.\tIN\t A\n"}
192.168.5.10: 2024-05-05T19:21:02.683Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 59884\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;sonarr.domain.io.\tIN\t AAAA\n"}
192.168.5.10: 2024-05-05T19:21:02.683Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 35319\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;sonarr.domain.io.\tIN\t A\n"}
192.168.5.10: 2024-05-05T19:21:02.683Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 61066\n;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;radarr.domain.io.\tIN\t AAAA\n\n;; AUTHORITY SECTION:\ndomain.io.\t1710\tIN\tSOA\trose.ns.cloudflare.com. dns.cloudflare.com. 2340201800 10000 2400 604800 1800\n"}
192.168.5.10: 2024-05-05T19:21:02.683Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 37810\n;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;radarr.domain.io.\tIN\t A\n\n;; ANSWER SECTION:\nradarr.domain.io.\t270\tIN\tA\t10.10.5.30\n"}
192.168.5.10: 2024-05-05T19:21:02.686Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 35319\n;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;sonarr.domain.io.\tIN\t A\n\n;; ANSWER SECTION:\nsonarr.domain.io.\t5\tIN\tA\t10.10.5.30\n"}
192.168.5.10: 2024-05-05T19:21:02.686Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 59884\n;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;sonarr.domain.io.\tIN\t AAAA\n\n;; AUTHORITY SECTION:\ndomain.io.\t1710\tIN\tSOA\trose.ns.cloudflare.com. dns.cloudflare.com. 2340201800 10000 2400 604800 1800\n"}
192.168.5.10: 2024-05-05T19:21:12.216Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 37590\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;.\tIN\t NS\n"}
192.168.5.10: 2024-05-05T19:21:12.217Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 37590\n;; flags: qr rd ra; QUERY: 1, ANSWER: 13, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;.\tIN\t NS\n\n;; ANSWER SECTION:\n.\t3600\tIN\tNS\ta.root-servers.net.\n.\t3600\tIN\tNS\tb.root-servers.net.\n.\t3600\tIN\tNS\tc.root-servers.net.\n.\t3600\tIN\tNS\td.root-servers.net.\n.\t3600\tIN\tNS\te.root-servers.net.\n.\t3600\tIN\tNS\tf.root-servers.net.\n.\t3600\tIN\tNS\tg.root-servers.net.\n.\t3600\tIN\tNS\th.root-servers.net.\n.\t3600\tIN\tNS\ti.root-servers.net.\n.\t3600\tIN\tNS\tj.root-servers.net.\n.\t3600\tIN\tNS\tk.root-servers.net.\n.\t3600\tIN\tNS\tl.root-servers.net.\n.\t3600\tIN\tNS\tm.root-servers.net.\n"}
192.168.5.10: 2024-05-05T19:21:19.651Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 20589\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;s3.domain.io.\tIN\t A\n"}
192.168.5.10: 2024-05-05T19:21:19.651Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 26931\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;s3.domain.io.\tIN\t AAAA\n"}
192.168.5.10: 2024-05-05T19:21:19.652Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 20589\n;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;s3.domain.io.\tIN\t A\n\n;; ANSWER SECTION:\ns3.domain.io.\t296\tIN\tA\t104.21.30.117\ns3.domain.io.\t296\tIN\tA\t172.67.172.226\n"}
192.168.5.10: 2024-05-05T19:21:19.652Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 26931\n;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;s3.domain.io.\tIN\t AAAA\n\n;; ANSWER SECTION:\ns3.domain.io.\t296\tIN\tAAAA\t2606:4700:3035::6815:1e75\ns3.domain.io.\t296\tIN\tAAAA\t2606:4700:3037::ac43:ace2\n"}
192.168.5.10: 2024-05-05T19:21:21.928Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 37418\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1\n\n;; OPT PSEUDOSECTION:\n; EDNS: version 0; flags:; udp: 1232\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.10: 2024-05-05T19:21:21.928Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 37418\n;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.10: 2024-05-05T19:21:30.483Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 41265\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;plex.tv.\tIN\t A\n"}
192.168.5.10: 2024-05-05T19:21:30.483Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 3690\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;plex.tv.\tIN\t AAAA\n"}
192.168.5.10: 2024-05-05T19:21:30.484Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 41265\n;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;plex.tv.\tIN\t A\n\n;; ANSWER SECTION:\nplex.tv.\t30\tIN\tA\t34.243.94.189\nplex.tv.\t30\tIN\tA\t34.241.88.179\n"}
192.168.5.10: 2024-05-05T19:21:30.484Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 3690\n;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;plex.tv.\tIN\t AAAA\n\n;; AUTHORITY SECTION:\nplex.tv.\t207\tIN\tSOA\tjeremy.ns.cloudflare.com. dns.cloudflare.com. 2340420772 10000 2400 604800 1800\n"}
192.168.5.10: 2024-05-05T19:21:41.927Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 38917\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1\n\n;; OPT PSEUDOSECTION:\n; EDNS: version 0; flags:; udp: 1232\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.10: 2024-05-05T19:21:41.928Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 38917\n;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;k8s.lab.domain.io.\tIN\t AAAA\n"}
192.168.5.10: 2024-05-05T19:21:45.777Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 29216\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;sonarr.domain.io.\tIN\t A\n"}
192.168.5.10: 2024-05-05T19:21:45.778Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 29216\n;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;sonarr.domain.io.\tIN\t A\n\n;; ANSWER SECTION:\nsonarr.domain.io.\t257\tIN\tA\t10.10.5.30\n"}
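
These lines are easier to read once the JSON payload is unpacked. The sketch below assumes the line layout seen above (node, timestamp, level, message, then a JSON object whose `data` field holds dig-style text). `parse_line` and `unanswered` are hypothetical helpers. Pairing requests with responses by message id shows whether the host DNS server produced an answer at all.

```python
import json
import re

STATUS_RE = re.compile(r"status: (\w+), id: (\d+)")
QUESTION_RE = re.compile(r";; QUESTION SECTION:\n;(\S+)\s+IN\s+(\w+)")


def parse_line(line):
    """Extract node, request/response kind, id, status and question from one log line."""
    prefix, _, payload = line.partition(" {")
    node, _timestamp, _level, *message = prefix.split()
    data = json.loads("{" + payload)["data"]
    status, msg_id = STATUS_RE.search(data).groups()
    name, qtype = QUESTION_RE.search(data).groups()
    return {
        "node": node.rstrip(":"),
        "kind": message[-1],  # "request" or "response"
        "id": int(msg_id),
        "status": status,
        "name": name,
        "qtype": qtype,
    }


def unanswered(lines):
    """Return ids of requests that have no matching response in `lines`."""
    parsed = [parse_line(ln) for ln in lines]
    answered = {p["id"] for p in parsed if p["kind"] == "response"}
    return [p["id"] for p in parsed if p["kind"] == "request" and p["id"] not in answered]


# Two lines copied verbatim from the output above (raw strings keep the JSON escapes).
REQUEST = r'192.168.5.10: 2024-05-05T19:20:52.012Z DEBUG dns request {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 13650\n;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;sonarr.media.svc.\tIN\t A\n"}'
RESPONSE = r'192.168.5.10: 2024-05-05T19:20:52.013Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NXDOMAIN, id: 13650\n;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0\n\n;; QUESTION SECTION:\n;sonarr.media.svc.\tIN\t A\n\n;; AUTHORITY SECTION:\n.\t1800\tIN\tSOA\ta.root-servers.net. nstld.verisign-grs.com. 2024050501 1800 900 604800 86400\n"}'

print(parse_line(RESPONSE))
print("unanswered ids:", unanswered([REQUEST, RESPONSE]))
```

In the excerpt above the requests do get responses within about a millisecond, so the host-side resolver is answering even though CoreDNS logs timeouts for the same names.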

As soon as the patch is applied and CoreDNS is restarted, I immediately start seeing issues, for example in my Home Assistant logs:

 (SyncWorker_14) [custom_components.radarr_upcoming_media.sensor] Host radarr.domain.io is not available
2024-05-05 12:21:37.684 WARNING (SyncWorker_3) [custom_components.sonarr_upcoming_media.sensor] Host sonarr.domain.io is not available
2024-05-05 12:22:07.685 WARNING (SyncWorker_4) [custom_components.radarr_upcoming_media.sensor] Host radarr.domain.io is not available
2024-05-05 12:22:07.687 WARNING (SyncWorker_50) [custom_components.sonarr_upcoming_media.sensor] Host sonarr.domain.io is not available
2024-05-05 12:22:37.690 WARNING (SyncWorker_46) [custom_components.radarr_upcoming_media.sensor] Host radarr.domain.io is not available
2024-05-05 12:22:37.693 WARNING (SyncWorker_15) [custom_components.sonarr_upcoming_media.sensor] Host sonarr.domain.io is not available

and from the CoreDNS deployment logs themselves:

[ERROR] plugin/errors: 2 ps.pndsn.com. AAAA: read udp 10.244.3.183:42921->10.96.0.9:53: i/o timeout
[INFO] 10.244.0.142:37176 - 46031 "A IN radarr.media.svc. udp 34 false 512" - - 0 2.000961528s
[INFO] 10.244.0.142:37176 - 46771 "AAAA IN radarr.media.svc. udp 34 false 512" - - 0 2.000981288s
[ERROR] plugin/errors: 2 radarr.media.svc. AAAA: read udp 10.244.0.58:39689->10.96.0.9:53: i/o timeout
[ERROR] plugin/errors: 2 radarr.media.svc. A: read udp 10.244.0.58:56654->10.96.0.9:53: i/o timeout
[INFO] 10.244.0.142:34045 - 18857 "AAAA IN sonarr.media.svc. udp 34 false 512" - - 0 2.0010946020000002s
[ERROR] plugin/errors: 2 sonarr.media.svc. AAAA: read udp 10.244.3.183:57222->10.96.0.9:53: i/o timeout
[INFO] 10.244.0.142:34045 - 18443 "A IN sonarr.media.svc. udp 34 false 512" - - 0 2.001069037s
[ERROR] plugin/errors: 2 sonarr.media.svc. A: read udp 10.244.3.183:60187->10.96.0.9:53: i/o timeout
[INFO] 10.244.0.171:33221 - 59865 "A IN s3.domain.io. udp 33 false 512" - - 0 2.001200777s
[INFO] 10.244.0.171:33221 - 17887 "AAAA IN s3.domain.io. udp 33 false 512" - - 0 2.001220341s
[ERROR] plugin/errors: 2 s3.domain.io. A: read udp 10.244.3.183:42636->10.96.0.9:53: i/o timeout
[ERROR] plugin/errors: 2 s3.domain.io. AAAA: read udp 10.244.3.183:51402->10.96.0.9:53: i/o timeout
[INFO] 10.244.3.9:44995 - 38535 "AAAA IN ps.pndsn.com. udp 30 false 512" - - 0 2.00101046s
[ERROR] plugin/errors: 2 ps.pndsn.com. AAAA: read udp 10.244.3.183:46826->10.96.0.9:53: i/o timeout
[INFO] 10.244.3.9:44995 - 38373 "A IN ps.pndsn.com. udp 30 false 512" - - 0 2.001172459s
[ERROR] plugin/errors: 2 ps.pndsn.com. A: read udp 10.244.3.183:49263->10.96.0.9:53: i/o timeout
[INFO] 10.244.0.213:33246 - 976 "A IN github.com. udp 39 false 1232" - - 0 2.00063978s
[ERROR] plugin/errors: 2 github.com. A: read udp 10.244.3.183:57869->10.96.0.9:53: i/o timeout
[INFO] 10.244.0.213:41944 - 26300 "AAAA IN github.com. udp 39 false 1232" - - 0 2.001602828s
[ERROR] plugin/errors: 2 github.com. AAAA: read udp 10.244.0.58:52988->10.96.0.9:53: i/o timeout
[INFO] 10.244.3.9:44995 - 38535 "AAAA IN ps.pndsn.com. udp 30 false 512" - - 0 2.000211697s
[ERROR] plugin/errors: 2 ps.pndsn.com. AAAA: read udp 10.244.3.183:48895->10.96.0.9:53: i/o timeout
[INFO] 10.244.3.9:44995 - 38373 "A IN ps.pndsn.com. udp 30 false 512" - - 0 2.000241596s
[ERROR] plugin/errors: 2 ps.pndsn.com. A: read udp 10.244.3.183:41034->10.96.0.9:53: i/o timeout

Changing forwardKubeDNSToHost: true back to false brings things back to normal. I can post my machine config if that helps, but I don't have anything too crazy in there. After restarting the CoreDNS deployment, the logs are clean again:

.:53
[INFO] plugin/reload: Running configuration SHA512 = f43368fe881b6cd37b121f37ba0b71c065df5bfc99b5c5c05d7f95bf82289d7ab7e78d5b98c1f02172d8004a8a8f34027cef04e86c780d40a7c5d1301559f5b3
CoreDNS-1.11.1
linux/amd64, go1.20.7, ae2bbc2
.:53
[INFO] plugin/reload: Running configuration SHA512 = f43368fe881b6cd37b121f37ba0b71c065df5bfc99b5c5c05d7f95bf82289d7ab7e78d5b98c1f02172d8004a8a8f34027cef04e86c780d40a7c5d1301559f5b3
CoreDNS-1.11.1
linux/amd64, go1.20.7, ae2bbc2

chrxmvtik commented 6 months ago

Same issue for me, but unfortunately disabling hostDNS features doesn't resolve the issue.

I am using my own DNS servers, but switching to public DNS servers didn't help.

It worked fine using version 1.6.7, failed to work from 1.7.0, keeps failing in 1.7.1.

smira commented 6 months ago

> It worked fine using version 1.6.7, failed to work from 1.7.0, keeps failing in 1.7.1.

Let's not mix different issues in one ticket please.

smira commented 6 months ago

@evanrich what is the CNI you're using?

evanrich commented 6 months ago

> @evanrich what is the CNI you're using?

Cilium v1.15.4

MathiasPius commented 6 months ago

I'm seeing the same problem on Talos 1.7.1 (also upgraded from earlier versions), Kubernetes 1.29.1, Cilium 1.15.4.

I am using DHCP-discovered public DNS servers run by Hetzner.

Hubble (Cilium packet inspection) reports that the UDP requests from CoreDNS to the Talos DNS service IP (10.96.0.9 in my case) are delivered, but the response packets from 10.96.0.9 to CoreDNS pod are dropped with the reason TTL Exceeded.

pau-campana commented 6 months ago

I have the same error. I'm using Talos v1.7.1 and Cilium v1.14.7.

chrxmvtik commented 6 months ago

> I'm seeing the same problem on Talos 1.7.1 (also upgraded from earlier versions), Kubernetes 1.29.1, Cilium 1.15.4.
>
> I am using DHCP-discovered public DNS servers run by Hetzner.
>
> Hubble (Cilium packet inspection) reports that the UDP requests from CoreDNS to the Talos DNS service IP (10.96.0.9 in my case) are delivered, but the response packets from 10.96.0.9 to CoreDNS pod are dropped with the reason TTL Exceeded.

Check whether you are using bpf.masquerade. If you are, and you did not specify the CIDRs manually, you will get the above error with common private CIDRs.

Try setting the bpf.masquerade option to false and check whether that works.

MathiasPius commented 6 months ago

> > I'm seeing the same problem on Talos 1.7.1 (also upgraded from earlier versions), Kubernetes 1.29.1, Cilium 1.15.4.
> >
> > I am using DHCP-discovered public DNS servers run by Hetzner.
> >
> > Hubble (Cilium packet inspection) reports that the UDP requests from CoreDNS to the Talos DNS service IP (10.96.0.9 in my case) are delivered, but the response packets from 10.96.0.9 to CoreDNS pod are dropped with the reason TTL Exceeded.
>
> Check whether you are using bpf.masquerade. If you are, and you did not specify the CIDRs manually, you will get the above error with common private CIDRs.
>
> Try setting the bpf.masquerade option to false and check whether that works.

Sounds very plausible. However, BPF masquerading is disabled in my setup, though I can see that iptables masquerading for IPv4 is enabled. I would assume disabling this would have the same effect?

Edit: I disabled all masquerading:

$ kubectl -n kube-system exec ds/cilium -- cilium-dbg status | grep Masquerading
Masquerading:            Disabled

But I'm still seeing the exact same issue. I am now seeing the issue with the public IP address of the DNS Server instead.

It seems to me that masquerading is a very likely culprit, but I'm not sure how exactly yet. Will keep digging.

smira commented 5 months ago

The fix is coming, thanks for reporting it. It is indeed the TTL. It's only related to the forwardKubeDNSToHost option, which is not enabled by default in Talos 1.7 (only enabled for Docker-based clusters).
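
For readers unfamiliar with why a TTL can cause drops: each IP router that forwards a packet decrements its TTL by one and discards the packet once the TTL reaches zero, which is what a "TTL Exceeded" drop reason refers to. The thread does not say what TTL the host DNS responses carried or how many hops the Cilium datapath counts, so the values below are purely illustrative, and `survives` is a made-up helper.

```python
def survives(initial_ttl: int, router_hops: int) -> bool:
    """Return True if a packet sent with `initial_ttl` is still forwardable
    after `router_hops` routers have each decremented its TTL by one."""
    ttl = initial_ttl
    for _ in range(router_hops):
        ttl -= 1
        if ttl <= 0:
            return False  # the router drops the packet (TTL exceeded)
    return True


# Illustrative values only: a very low TTL versus a common default of 64.
for ttl in (1, 64):
    for hops in (0, 1, 2):
        print(f"ttl={ttl:>2} hops={hops} delivered={survives(ttl, hops)}")
```

A packet sent with a very low TTL reaches a directly attached receiver but is dropped as soon as anything on the path treats it as a routed hop.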

DmitriyMV commented 5 months ago

Reopened until there is a 1.7 backport.

DmitriyMV commented 5 months ago

Closed per #8758

evanrich commented 5 months ago

@DmitriyMV not sure if this is related, but after upgrading 1.7.1->1.7.2, while better, I now see other errors:

.:53
[INFO] plugin/reload: Running configuration SHA512 = f43368fe881b6cd37b121f37ba0b71c065df5bfc99b5c5c05d7f95bf82289d7ab7e78d5b98c1f02172d8004a8a8f34027cef04e86c780d40a7c5d1301559f5b3
CoreDNS-1.11.1
linux/amd64, go1.20.7, ae2bbc2
.:53
[INFO] plugin/reload: Running configuration SHA512 = f43368fe881b6cd37b121f37ba0b71c065df5bfc99b5c5c05d7f95bf82289d7ab7e78d5b98c1f02172d8004a8a8f34027cef04e86c780d40a7c5d1301559f5b3
CoreDNS-1.11.1
linux/amd64, go1.20.7, ae2bbc2
[INFO] 10.244.0.100:34471 - 62223 "AAAA IN registry.npmjs.org. udp 36 false 512" - - 0 5.00010545s
[ERROR] plugin/errors: 2 registry.npmjs.org. AAAA: dns: buffer size too small
[INFO] 10.244.0.23:42181 - 51958 "AAAA IN api.ring.com. udp 30 false 512" - - 0 5.000088228s
[ERROR] plugin/errors: 2 api.ring.com. AAAA: dns: overflowing header size
[INFO] 10.244.0.23:45773 - 55470 "AAAA IN api.ring.com. udp 30 false 512" - - 0 5.000083312s
[ERROR] plugin/errors: 2 api.ring.com. AAAA: dns: overflowing header size
[INFO] 10.244.0.23:45773 - 55229 "A IN api.ring.com. udp 30 false 512" - - 0 5.000134008s
[ERROR] plugin/errors: 2 api.ring.com. A: dns: overflowing header size
[INFO] 10.244.0.23:45773 - 55229 "A IN api.ring.com. udp 30 false 512" - - 0 5.000123943s
[ERROR] plugin/errors: 2 api.ring.com. A: dns: overflowing header size
[INFO] 10.244.0.23:45773 - 55470 "AAAA IN api.ring.com. udp 30 false 512" - - 0 5.000255962s
[ERROR] plugin/errors: 2 api.ring.com. AAAA: dns: overflowing header size
[INFO] 10.244.0.23:45773 - 55470 "AAAA IN api.ring.com. udp 30 false 512" - - 0 5.000144994s
[ERROR] plugin/errors: 2 api.ring.com. AAAA: dns: overflowing header size
[INFO] 10.244.0.23:40964 - 39446 "AAAA IN api.ring.com. udp 30 false 512" - - 0 5.000402892s
[INFO] 10.244.0.23:40964 - 39215 "A IN api.ring.com. udp 30 false 512" - - 0 5.000479894s
[ERROR] plugin/errors: 2 api.ring.com. A: dns: overflowing header size
[ERROR] plugin/errors: 2 api.ring.com. AAAA: dns: overflowing header size
[INFO] 10.244.0.23:40964 - 39446 "AAAA IN api.ring.com. udp 30 false 512" - - 0 5.000031736s
[ERROR] plugin/errors: 2 api.ring.com. AAAA: dns: overflowing header size
[INFO] 10.244.0.23:40964 - 39215 "A IN api.ring.com. udp 30 false 512" - - 0 5.000118852s
[ERROR] plugin/errors: 2 api.ring.com. A: dns: overflowing header size
[INFO] 10.244.0.23:40964 - 39446 "AAAA IN api.ring.com. udp 30 false 512" - - 0 5.000074868s
[ERROR] plugin/errors: 2 api.ring.com. AAAA: dns: overflowing header size
[INFO] 10.244.0.23:40964 - 39215 "A IN api.ring.com. udp 30 false 512" - - 0 5.000111692s
[ERROR] plugin/errors: 2 api.ring.com. A: dns: overflowing header size
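
The `[INFO]` lines above follow CoreDNS's default query log layout: client, query id, the quoted question (type, class, name, protocol, request size, DO bit, advertised buffer size), then response code, response flags, response size and duration. A sketch of pulling those fields apart follows; the field names are labels chosen here for illustration and `parse_query` is a hypothetical helper.

```python
import re

QUERY_RE = re.compile(
    r'\[INFO\] (?P<client>\S+) - (?P<id>\d+) '
    r'"(?P<qtype>\S+) (?P<qclass>\S+) (?P<name>\S+) (?P<proto>\S+) '
    r'(?P<size>\d+) (?P<do>\S+) (?P<bufsize>\d+)" '
    r'(?P<rcode>\S+) (?P<rflags>\S+) (?P<rsize>\d+) (?P<duration>[\d.]+)s'
)


def parse_query(line):
    """Parse one CoreDNS [INFO] query line into a dict, or return None if it does not match."""
    m = QUERY_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    for key in ("id", "size", "bufsize", "rsize"):
        fields[key] = int(fields[key])
    fields["duration"] = float(fields["duration"])
    return fields


# Line copied from the log excerpt above.
SAMPLE = '[INFO] 10.244.0.100:34471 - 62223 "AAAA IN registry.npmjs.org. udp 36 false 512" - - 0 5.00010545s'

q = parse_query(SAMPLE)
print(q["name"], q["qtype"], "bufsize", q["bufsize"], "rcode", q["rcode"],
      "rsize", q["rsize"], f"{q['duration']:.1f}s")
```

Read this way, each failing query advertised a 512-byte UDP buffer and ended after about five seconds with no response code and a response size of 0. The thread does not identify the root cause of the "buffer size too small" and "overflowing header size" errors, only that 1.7.3 resolves them.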

This is based on the following config:

machine:
  features:
    hostDNS:
      enabled: true
      resolveMemberNames: true
      forwardKubeDNSToHost: true

The only thing I flipped from 1.7.1 to 1.7.2 was forwardKubeDNSToHost.

evanrich commented 5 months ago

1.7.3 fixes the errors above.