pi-hole / FTL

The Pi-hole FTL engine
https://pi-hole.net

issues resolving DNSSEC queries with cloudflared as upstream #1263

Closed BreiteSeite closed 2 years ago

BreiteSeite commented 2 years ago

Versions

Platform

Expected behavior

Can resolve careers.intuitive.com.

Actual behavior / bug

The browser loads forever and dig reports an error:

pi@rpi:~ $ dig @127.0.0.1 careers.intuitive.com
;; Truncated, retrying in TCP mode.
;; communications error to 127.0.0.1#53: end of file

;; communications error to 127.0.0.1#53: end of file

Steps to reproduce

Steps to reproduce the behavior:

  1. run dig careers.intuitive.com against your pihole

Debug Token

Additional context

I run pihole on a single-node docker swarm cluster. The bug also happens when scaling to 1 replica.

Upstream DNS is a cloudflared container in version 2021.12.3. DNSSEC on pi-hole is enabled.

Resolving the record directly via the upstream (cloudflared) works fine:

pi@rpi:~ $ dig @127.0.0.1 -p 5053 careers.intuitive.com

; <<>> DiG 9.16.22-Debian <<>> @127.0.0.1 -p 5053 careers.intuitive.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 41953
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 5b87124b2c655d74fcaa48b961c9ebe59f15269145e14501 (good)
;; QUESTION SECTION:
;careers.intuitive.com.     IN  A

;; ANSWER SECTION:
careers.intuitive.com.  60  IN  CNAME   intuitive.phenompeople.com.
intuitive.phenompeople.com. 60  IN  CNAME   hubsite-prod13-62952224.us-east-1.elb.amazonaws.com.
hubsite-prod13-62952224.us-east-1.elb.amazonaws.com. 60 IN A 34.205.21.19
hubsite-prod13-62952224.us-east-1.elb.amazonaws.com. 60 IN A 35.173.207.80

;; Query time: 31 msec
;; SERVER: 127.0.0.1#5053(127.0.0.1)
;; WHEN: Mon Dec 27 17:37:57 CET 2021
;; MSG SIZE  rcvd: 364

Resolving other queries via pihole works fine:

pi@rpi:~ $ dig @127.0.0.1 duck.com

; <<>> DiG 9.16.22-Debian <<>> @127.0.0.1 duck.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5310
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: bb98f8eaabbd5e4eacc8c68c61c9ee1560abe8faae3dede3 (good)
;; QUESTION SECTION:
;duck.com.          IN  A

;; ANSWER SECTION:
duck.com.       10  IN  A   52.142.124.215

;; Query time: 623 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Mon Dec 27 17:47:17 CET 2021
;; MSG SIZE  rcvd: 89

The web interface shows "N/A" as the Reply


Long-Term query data for affected domain


Pi-Hole Remote output


The domain is not on any blacklist or blocklist


tcpdump

Attached a tcpdump generated via sudo tcpdump 'port 53 and (dst host localhost) or (src host 172.18.0.1)' -i docker_gwbridge -A -w bug.pcap ~bug.pcap.zip~ (removed)

DL6ER commented 2 years ago

My analysis powers are severely limited right now, but I just checked via remote SSH that this domain does indeed work from my Pi-hole with a local unbound instance as the upstream resolver. I don't have cloudflared available myself.

Can you quote the corresponding lines from your log file /var/log/pihole.log, please?

Screenshots I just generated from mobile:


BreiteSeite commented 2 years ago

Thank you for that quick response.

I'm not sure if the upstream is responsible for this.

My /var/log/pihole.log:

Dec 27 18:23:40 dnsmasq[519]: query[A] careers.intuitive.com from 10.0.2.27
Dec 27 18:23:40 dnsmasq[519]: forwarded careers.intuitive.com to 172.17.0.1
Dec 27 18:23:40 dnsmasq[519]: dnssec-query[DNSKEY] phenompeople.com to 172.17.0.1
Dec 27 18:23:40 dnsmasq[519]: reply careers.intuitive.com is <CNAME>
Dec 27 18:23:40 dnsmasq[519]: reply intuitive.phenompeople.com is <CNAME>
Dec 27 18:23:40 dnsmasq[519]: reply hubsite-prod13-62952224.us-east-1.elb.amazonaws.com is 34.205.21.19
Dec 27 18:23:40 dnsmasq[519]: reply hubsite-prod13-62952224.us-east-1.elb.amazonaws.com is 35.173.207.80
Dec 27 18:23:40 dnsmasq[1888]: query[A] careers.intuitive.com from 10.0.2.27
BreiteSeite commented 2 years ago

If I change my upstream DNS to Quad9, everything works, but as soon as I change it back to my cloudflared resolver, I trigger the bug again. So it seems to be a combination of dnsmasq/FTL and cloudflared.

Here is my swarm service config:

version: "3"

# More info at https://github.com/pi-hole/docker-pi-hole/ and https://docs.pi-hole.net/
services:
  pihole:
    image: pihole/pihole:latest
    networks:
      - "traefik"
    environment:
      TZ: 'Europe/Berlin'
    volumes:
      - 'etc-pihole:/etc/pihole'
      - 'etc-dnsmasq:/etc/dnsmasq.d'
    restart: unless-stopped
    deploy:
      labels:
        - "traefik.enable=true"
        - "traefik.tcp.routers.dnstcp.entrypoints=dnstcp"
        - "traefik.tcp.routers.dnstcp.rule=HostSNI(`*`)"
        - "traefik.tcp.services.pihole.loadbalancer.server.port=53"

        - "traefik.udp.routers.dnsudp.entrypoints=dnsudp"
        - "traefik.udp.services.pihole.loadbalancer.server.port=53"

        - "traefik.http.services.pihole.loadbalancer.server.port=80"
        - "traefik.http.services.pihole.loadbalancer.sticky.cookie=true"
        - "traefik.http.routers.pihole.rule=Host(`my-pihole-hostname`)"
        - "traefik.http.routers.pihole.service=pihole"
        - "traefik.http.routers.pihole.entrypoints=https"
        - "traefik.http.routers.pihole.tls=true"

  cloudflared:
    image: raspbernetes/cloudflared:latest
    command: "proxy-dns --address 0.0.0.0 --port 5053 --upstream https://dns11.quad9.net/dns-query"
    networks:
      - "traefik"
    deploy:
      labels:
        - "traefik.enable=true"

        - "traefik.udp.routers.cloudflared.entrypoints=cloudflared"
        - "traefik.udp.services.cloudflared.loadbalancer.server.port=5053"
networks:
  traefik:
    external: true

volumes:
  etc-pihole:
  etc-dnsmasq:
DL6ER commented 2 years ago

Could you record a similar pcap for the case where it works fine? That should make it easier to precisely compare the answers returned by cloudflared and the other resolver, and the corresponding replies from FTL to dig.

Also the pihole log snippets in both cases, please.

Was there anything more with dnsmasq[1888] in the log above? This looks like a TCP retry. Just for completeness' sake.

I'll have a look at your files/logs as soon as possible.

BreiteSeite commented 2 years ago

I tested dnsmasq directly (to see if this might be a dnsmasq bug) on my Raspberry Pi (without swarm, etc.) and couldn't reproduce it. But a lot of variables changed here as well (networking, etc.), so I'm not sure how useful this is.

version: '3.3'
services:
  dnsmasq:
    image: 4km3/dnsmasq:2.86-r0-alpine-edge
    privileged: true
    network_mode: host
    command: "-d --log-queries --no-resolv --no-hosts -S 127.0.0.1#5059 -a 127.0.0.1 -p 5058 -y"
  cloudflared:
    image: raspbernetes/cloudflared:latest
    network_mode: host
    command: "proxy-dns --address 127.0.0.1 --port 5059 --upstream https://dns11.quad9.net/dns-query"

networks:
  pihole-test:
    ipam:
      driver: default
      config:
        - subnet: 172.68.0.0/16
BreiteSeite commented 2 years ago

Was there anything more with dnsmasq[1888] in the log above? This looks like a TCP retry. Just for completeness sake.

No, this appears to be all of it. It's a bit tricky, obviously, because of noise from other devices, but this appears to be everything for that query. I pressed return a couple of times in the tail to make sure I had a visual indicator of what was new, then executed the dig command and copied the block that appeared. I repeated this just to be sure and got the same block. Could it be that yours has more lines because it includes DNSSEC records that my instance might have cached? This is the result of my retry:

Dec 27 20:23:17 dnsmasq[418]: query[A] careers.intuitive.com from 10.0.2.26
Dec 27 20:23:17 dnsmasq[418]: forwarded careers.intuitive.com to 172.17.0.1
Dec 27 20:23:17 dnsmasq[418]: dnssec-query[DNSKEY] phenompeople.com to 172.17.0.1
Dec 27 20:23:17 dnsmasq[418]: reply careers.intuitive.com is <CNAME>
Dec 27 20:23:17 dnsmasq[418]: reply intuitive.phenompeople.com is <CNAME>
Dec 27 20:23:17 dnsmasq[418]: reply hubsite-prod13-62952224.us-east-1.elb.amazonaws.com is 34.205.21.19
Dec 27 20:23:17 dnsmasq[418]: reply hubsite-prod13-62952224.us-east-1.elb.amazonaws.com is 35.173.207.80
Dec 27 20:23:17 dnsmasq[2594]: query[A] careers.intuitive.com from 10.0.2.26
Dec 27 20:23:17 dnsmasq[2595]: query[A] careers.intuitive.com from 10.0.2.28

Could you record a similar pcap for the case where it works fine?

~bug-quad9-upstream.pcap.zip~ (removed)

Also the pihole log snippets in both cases, please.

Error-case see above, success case (direct resolving to quad9) below:

Dec 27 20:32:59 dnsmasq[2854]: query[A] careers.intuitive.com from 10.0.2.27
Dec 27 20:32:59 dnsmasq[2854]: forwarded careers.intuitive.com to 9.9.9.11
Dec 27 20:32:59 dnsmasq[2854]: validation result is INSECURE
Dec 27 20:32:59 dnsmasq[2854]: reply careers.intuitive.com is <CNAME>
Dec 27 20:32:59 dnsmasq[2854]: reply intuitive.phenompeople.com is <CNAME>
Dec 27 20:32:59 dnsmasq[2854]: reply hubsite-prod13-62952224.us-east-1.elb.amazonaws.com is 34.205.21.19
Dec 27 20:32:59 dnsmasq[2854]: reply hubsite-prod13-62952224.us-east-1.elb.amazonaws.com is 35.173.207.80

I'll have a look at your files/logs as soon as possible.

Thank you very much. Let me know how I can help.

DL6ER commented 2 years ago

I tested dnsmasq directly

Did you apply the same config lines used in Pi-hole? We're currently at the bleeding edge of dnsmasq development, as it includes some important fixes that haven't been officially released as we speak. Otherwise, FTL does not influence DNS handling in the slightest.

Could you try some older versions of FTL (it still has to be a v5.x version) in your container, too?


This is to make the comparison with the other dnsmasq release fairer.

BreiteSeite commented 2 years ago

v5.8.1

dig result:

pi@rpi:~ $ dig @127.0.0.1 careers.intuitive.com
;; Truncated, retrying in TCP mode.

; <<>> DiG 9.16.22-Debian <<>> @127.0.0.1 careers.intuitive.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 38687
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;careers.intuitive.com.     IN  A

;; Query time: 3 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Mon Dec 27 20:45:27 CET 2021
;; MSG SIZE  rcvd: 50
Dec 27 20:49:11 dnsmasq[857]: query[A] careers.intuitive.com from 10.0.2.26
Dec 27 20:49:11 dnsmasq[857]: forwarded careers.intuitive.com to 172.17.0.1
Dec 27 20:49:11 dnsmasq[857]: dnssec-query[DNSKEY] phenompeople.com to 172.17.0.1
Dec 27 20:49:11 dnsmasq[857]: reply careers.intuitive.com is <CNAME>
Dec 27 20:49:11 dnsmasq[857]: reply intuitive.phenompeople.com is <CNAME>
Dec 27 20:49:11 dnsmasq[857]: reply hubsite-prod13-62952224.us-east-1.elb.amazonaws.com is 34.205.21.19
Dec 27 20:49:11 dnsmasq[857]: reply hubsite-prod13-62952224.us-east-1.elb.amazonaws.com is 35.173.207.80
Dec 27 20:49:11 dnsmasq[1000]: query[A] careers.intuitive.com from 10.0.2.26
Dec 27 20:49:11 dnsmasq[1000]: config error is REFUSED

~bug-quad9-upstream-5.8.1.pcap.zip~ (removed)

BreiteSeite commented 2 years ago

2021.09 (FTL 5.9)

dig

pi@rpi:~ $ dig @127.0.0.1 careers.intuitive.com
;; Truncated, retrying in TCP mode.
;; communications error to 127.0.0.1#53: end of file

;; communications error to 127.0.0.1#53: end of file
Dec 27 20:53:21 dnsmasq[499]: query[A] careers.intuitive.com from 10.0.2.27
Dec 27 20:53:21 dnsmasq[499]: forwarded careers.intuitive.com to 172.17.0.1
Dec 27 20:53:21 dnsmasq[499]: dnssec-query[DNSKEY] phenompeople.com to 172.17.0.1
Dec 27 20:53:21 dnsmasq[499]: reply careers.intuitive.com is <CNAME>
Dec 27 20:53:21 dnsmasq[499]: reply intuitive.phenompeople.com is <CNAME>
Dec 27 20:53:21 dnsmasq[499]: reply hubsite-prod13-62952224.us-east-1.elb.amazonaws.com is 35.173.207.80
Dec 27 20:53:21 dnsmasq[499]: reply hubsite-prod13-62952224.us-east-1.elb.amazonaws.com is 34.205.21.19
Dec 27 20:53:21 dnsmasq[880]: query[A] careers.intuitive.com from 10.0.2.27
Dec 27 20:53:21 dnsmasq[881]: query[A] careers.intuitive.com from 10.0.2.26

~bug-quad9-upstream-5.9.pcap.zip~ (removed)

BreiteSeite commented 2 years ago

Sorry for the spam, but I hoped to increase overall clarity by splitting this into separate comments.

So, interestingly, even version 5.8.1 cannot resolve it according to dig, while the pcap shows that actual addresses and CNAME records are returned.

Ninja edit:

Did you apply the same config lines used in Pi-hole?

No - all the configuration is passed directly to the daemons via the command, as stated in the docker-compose.yml.

DL6ER commented 2 years ago

Okay, I checked your files very quickly. Unfortunately, they don't contain the reply from the upstream resolver, only the traffic from dig to Pi-hole and back. It'd be helpful to get this in addition but I see that this can get tricky when there is a lot of traffic.

In both cases the DNS reply signals truncation, requesting retry over TCP. The only differences between them are:

where A is the broken case (cloudflared upstream) and B the working case (Quad9 upstream). Nothing special here, it seems.

dig retries over TCP as advised; both pcaps contain the TCP retry from dig. Now the interesting part: the TCP query is replied to in case B, but never in case A. What happens here is that a retry over TCP also triggers a retry to the upstream over TCP, as information might have been lost before (remember, the query already arrived truncated from upstream).

In the Quad9 case, the reply from upstream arrives and Pi-hole passes this on to your client. With cloudflared, this reply seemingly never arrives back at your Pi-hole and, hence, isn't forwarded to your dig client.

This is now the point where it is basically impossible to continue investigation when we don't have the recorded traffic to and from the upstream server. Could you maybe setup a separate container where the bug is still present but you can record the entire traffic as nobody else is using it? It could listen on a non-default port and dig could be told to use this port instead.
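For readers following the truncation logic above: whether a client retries over TCP is decided by the TC (truncated) bit in the DNS header. This is a minimal, hypothetical client-side sketch (not Pi-hole or dnsmasq code) showing how that bit is checked in a raw DNS message:

```python
import struct

# A DNS message starts with a 12-byte header: ID, flags, and four counts
# (RFC 1035 section 4.1.1). The TC bit is 0x0200 within the flags field.
TC_BIT = 0x0200

def is_truncated(dns_message: bytes) -> bool:
    """Return True if the response has the TC bit set,
    i.e. the client should retry the query over TCP."""
    if len(dns_message) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    _msg_id, flags = struct.unpack("!HH", dns_message[:4])
    return bool(flags & TC_BIT)

# Example headers: 0x8200 = QR (response) + TC set; 0x8000 = response only.
truncated_header = struct.pack("!HHHHHH", 0x1234, 0x8200, 1, 0, 0, 0)
normal_header = struct.pack("!HHHHHH", 0x1234, 0x8000, 1, 0, 0, 0)
print(is_truncated(truncated_header))  # True
print(is_truncated(normal_header))     # False
```

In the failing case described above, the UDP reply from upstream already carries this bit, so both dig and the forwarder attempt the TCP path, which never gets answered.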

DL6ER commented 2 years ago

Also worth noting before I forget: your first comment shows that a direct query to cloudflared does not want to retry over TCP due to truncation. This likely happens because Pi-hole requests additional content, as you said DNSSEC is enabled. It'll be very interesting to analyze the traffic from/to upstream.

Could you also try a DNSSEC-enabled query directly to cloudflared to see if we get into the same truncation issue?

dig @127.0.0.1 -p 5053 +dnssec careers.intuitive.com
BreiteSeite commented 2 years ago

Okay - I'm sorry. This whole bug report is a total case of user error. 😅

I set up port-forwarding via traefik to cloudflared but only configured a UDP entrypoint and router, because if you want traefik to listen for both protocols, you have to register them twice - once for each protocol.

This is why pihole could only reach the cloudflared container depending on the protocol used, and I guess some payload sizes in UDP triggered the TCP retry, which then failed because traefik never listened for or routed TCP.
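For reference, the dual registration mentioned above can be sketched as follows. This is a hedged example, not the exact fix applied here: the entrypoint names `dns-tcp`/`dns-udp` and router/service names are assumptions, and the entrypoints must also exist in traefik's static configuration (e.g. as `:5053` and `:5053/udp`, since one traefik entrypoint serves exactly one protocol):

```yaml
# Hypothetical labels on the cloudflared service; both router/service
# pairs are needed so truncated UDP replies can be retried over TCP.
labels:
  - "traefik.enable=true"
  # TCP side (used for the TCP retry after a truncated UDP reply)
  - "traefik.tcp.routers.cloudflared-tcp.entrypoints=dns-tcp"
  - "traefik.tcp.routers.cloudflared-tcp.rule=HostSNI(`*`)"
  - "traefik.tcp.services.cloudflared-tcp.loadbalancer.server.port=5053"
  # UDP side (ordinary DNS queries)
  - "traefik.udp.routers.cloudflared-udp.entrypoints=dns-udp"
  - "traefik.udp.services.cloudflared-udp.loadbalancer.server.port=5053"
```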

So sorry for this, and thank you very much for your extensive replies, which made the actual problem much clearer.

I'm going to delete the pcap files from the issue report.

Sorry and thanks again.

DL6ER commented 2 years ago

Missing TCP traffic forwarding was my first thought but then I thought "well, he surely has thought about that!" ;-)

BreiteSeite commented 2 years ago

Hey, sorry to bother you again, but I'm going crazy over this one. I'm not sure what changed. I accidentally rebooted my Pi today and made some other changes unrelated to my pihole setup, and now it doesn't work anymore.

So.. setup is:

traefik exposes 53 (UDP+TCP) for pihole and 5053 (UDP+TCP) for cloudflared.

So the intended data flow is:

client -> raspberrypi (traefik) :53 ---forwards to---> pihole container ---request-upstream--> 172.17.0.1 (traefik/docker interface) :5053 --> cloudflared --> quad9 upstream

So... when I set 172.17.0.1#5053 as the upstream in pihole, it doesn't work.

root@rpi:~# dig @127.0.0.1 -p 53 +notcp +noignore bit.ly
;; Truncated, retrying in TCP mode.

; <<>> DiG 9.16.22-Debian <<>> @127.0.0.1 -p 53 +notcp +noignore bit.ly
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 23961
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 37148cae44fd561b (echoed)
; EDE: 9 (DNSKEY Missing)
;; QUESTION SECTION:
;bit.ly.                IN  A

;; Query time: 39 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Dec 30 01:09:16 CET 2021
;; MSG SIZE  rcvd: 53
root@rpi:~# dig @127.0.0.1 -p 53 +tcp +noignore bit.ly

; <<>> DiG 9.16.22-Debian <<>> @127.0.0.1 -p 53 +tcp +noignore bit.ly
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 32104
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 0383e948672f44f6 (echoed)
; EDE: 9 (DNSKEY Missing)
;; QUESTION SECTION:
;bit.ly.                IN  A

;; Query time: 131 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Dec 30 01:10:27 CET 2021
;; MSG SIZE  rcvd: 53

However, if I set dig to ignore truncation, it works for UDP:

root@rpi:~# dig @127.0.0.1 -p 53 +notcp +ignore bit.ly

; <<>> DiG 9.16.22-Debian <<>> @127.0.0.1 -p 53 +notcp +ignore bit.ly
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8769
;; flags: qr aa tc rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: f108bc12bc8bcc3e (echoed)
;; QUESTION SECTION:
;bit.ly.                IN  A

;; ANSWER SECTION:
bit.ly.         101 IN  A   67.199.248.11
bit.ly.         101 IN  A   67.199.248.10

;; Query time: 31 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Dec 30 01:10:57 CET 2021
;; MSG SIZE  rcvd: 91

but not for TCP:

root@rpi:~# dig @127.0.0.1 -p 53 +tcp +ignore bit.ly

; <<>> DiG 9.16.22-Debian <<>> @127.0.0.1 -p 53 +tcp +ignore bit.ly
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 8007
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 241a4d785ad4254f (echoed)
; EDE: 9 (DNSKEY Missing)
;; QUESTION SECTION:
;bit.ly.                IN  A

;; Query time: 75 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Dec 30 01:11:26 CET 2021
;; MSG SIZE  rcvd: 53

So I thought okay, maybe the pihole <-> cloudflared link is not working, but that's not the case (this is from inside the pihole container):

root@1c076d91aca4:/# dig @172.17.0.1 -p 5053 +tcp +ignore bit.ly

; <<>> DiG 9.11.5-P4-5.1+deb10u6-Debian <<>> @172.17.0.1 -p 5053 +tcp +ignore bit.ly
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 22631
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 00fe863cadaad34f (echoed)
;; QUESTION SECTION:
;bit.ly.                                IN      A

;; ANSWER SECTION:
bit.ly.                 281     IN      A       67.199.248.11
bit.ly.                 281     IN      A       67.199.248.10

;; Query time: 1 msec
;; SERVER: 172.17.0.1#5053(172.17.0.1)
;; WHEN: Thu Dec 30 01:14:19 CET 2021
;; MSG SIZE  rcvd: 91

(I left out the digs for +tcp +noignore and +udp +[no]ignore for brevity, but they work.)

So that somehow means the problem is between the host and pihole? It's again only affecting some domains (I guess it's related to packet truncation).

I'm actually not sure if it ever worked after my last comment, though the setup was definitely wrong before.

Also, when I change the upstream in pihole to Quad9 directly instead of my cloudflared container, it works again.

Can you help me understand what's going on here? As this works depending on the [no]ignore flag, I assume my network setup is correct (as proven by the successful dig commands from the host to pihole and from pihole to cloudflared).

Any help here is appreciated.

Additionally, I have this warning in the web interface:


Here is again a capture from the docker_gwbridge on the Pi for dig @127.0.0.1 +notcp bit.ly and dig @127.0.0.1 -p 5053 +notcp bit.ly (you can filter by port to show the specific sets). You can see that the query to cloudflared works, while the query to pihole has the server response error (2) set. bit-ly-capture-on-docker-gwbridge.pcapng.gz

I also have a dump here from one of the many interfaces docker swarm creates - I don't know which link it represents, but based on the IPs I think it's host (docker0) <-> pihole, not pihole <-> cloudflared. bit-ly-capture-1.pcapng.gz

And here is another one, though I'm also not sure what communication this represents. 172.18.0.1 is docker_gwbridge, so it faces the outside (host). I guess this is between traefik (which does the port mapping) and some container. bit-ly-capture-veth561d07c.pcapng.gz

BreiteSeite commented 2 years ago

@DL6ER I created this repo to reproduce the issue so you can test it better on your machine: https://github.com/BreiteSeite/pihole-traefik-udp-bug

This behaves exactly like my setup:

pi@rpi:~/pihole-test $ dig @127.0.0.1 -p 5859 +notcp +noignore +short bit.ly
67.199.248.11
67.199.248.10
pi@rpi:~/pihole-test $ dig @127.0.0.1 -p 5858 +notcp +noignore +short bit.ly
pi@rpi:~/pihole-test $ dig @127.0.0.1 -p 5859 +notcp +noignore +short duck.com
52.142.124.215
pi@rpi:~/pihole-test $ dig @127.0.0.1 -p 5858 +notcp +noignore +short duck.com
52.142.124.215

(Note that I had to set the upstream to 172.19.0.1, otherwise I got hit with

root@8872025e5ae2:/# dig @172.17.0.1 -p 5859 +notcp +noignore bit.ly
;; reply from unexpected source: 172.19.0.1#5859, expected 172.17.0.1#5859

which appears to be a known traefik "bug": https://github.com/traefik/traefik/issues/7430)

DL6ER commented 2 years ago

Could you give me some precise commands for how to set up a test system inside, say, an Ubuntu VM given your repo? Sorry to ask, but I'm more of a friend of lxc-based virtualization myself and have never actually used docker compose, so I don't even know where to start or how to set up a system identical to yours. As this will (potentially) be a lot of work, it will also take some time, as I'll have to sneak it in between other things.

What would be interesting, until I can reproduce your system, is the log (/var/log/pihole.log) excerpt corresponding to your tests above. They are all replied to with SERVFAIL. This can have many causes, e.g., the SERVFAIL already comes from upstream, or DNSSEC validation failed. If the upstream couldn't be reached, the reply might have been REFUSED instead (but, then again, not in every case).

Truncation seems unlikely, because in the one case where you use +notcp +ignore the MSG SIZE rcvd: 91 is fairly low; however, the packet might have been much larger before, when DNSSEC information was attached (actually, it shouldn't be that much larger).

Concerning the warning: this is a limitation of a resolver you configured - either your local one or the upstream. Check out this discussion: https://discourse.pi-hole.net/t/dnsmasq-warn-reducing-dns-packet-size/51803

Especially these posts:
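For illustration, dnsmasq's advertised EDNS0 UDP buffer size can also be pinned explicitly via a drop-in file. This is a hedged sketch, not a fix prescribed in this thread: the filename is hypothetical, and 1232 bytes (the common DNS Flag Day 2020 recommendation) may need adjusting to what your upstream actually supports:

```
# /etc/dnsmasq.d/99-edns.conf (hypothetical drop-in file)
# Advertise at most 1232 bytes for EDNS0 UDP payloads, so replies
# fit into unfragmented UDP datagrams instead of being truncated late.
edns-packet-max=1232
```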

BreiteSeite commented 2 years ago

Could you give me some precise commands how to set up a test system inside, say, a Ubuntu VM given your repo?

Sure thing. It should be fairly easy: I would assume apt install docker docker-compose should do the trick for the installation part, then cd <repo> and docker-compose up -d to boot it. That should be it. If you're on some messaging platform (IRC/Telegram/Slack/Signal), I'm also happy to help you directly in case you run into issues.

BreiteSeite commented 2 years ago

What would be interesting, until I can reproduce your system, is the log (/var/log/pihole.log) excerpt corresponding to your tests above.

Running the same commands in the same order as above - output separated by blank lines (the port 5859 commands go directly to cloudflared, so they don't appear in pihole.log):

Dec 30 17:24:25 dnsmasq[446]: query[A] pi.hole from 127.0.0.1
Dec 30 17:24:25 dnsmasq[446]: Pi-hole hostname pi.hole is 0.0.0.0
Dec 30 17:24:32 dnsmasq[446]: query[A] bit.ly from 172.19.0.3
Dec 30 17:24:32 dnsmasq[446]: forwarded bit.ly to 172.19.0.1
Dec 30 17:24:33 dnsmasq[446]: dnssec-query[DS] ly to 172.19.0.1
Dec 30 17:24:33 dnsmasq[446]: reply ly is DS keytag 62311, algo 8, digest 2
Dec 30 17:24:33 dnsmasq[446]: dnssec-query[DS] bit.ly to 172.19.0.1
Dec 30 17:24:33 dnsmasq[446]: dnssec-query[DNSKEY] ly to 172.19.0.1
Dec 30 17:24:33 dnsmasq[446]: reply bit.ly is 67.199.248.10
Dec 30 17:24:33 dnsmasq[446]: reply bit.ly is 67.199.248.11
Dec 30 17:24:33 dnsmasq[511]: query[A] bit.ly from 172.19.0.3
Dec 30 17:24:33 dnsmasq[511]: forwarded bit.ly to 172.19.0.1
Dec 30 17:24:33 dnsmasq[511]: dnssec-query[DS] bit.ly to 172.19.0.1
Dec 30 17:24:33 dnsmasq[511]: dnssec-query[DNSKEY] ly to 172.19.0.1
Dec 30 17:24:33 dnsmasq[511]: validation bit.ly is BOGUS
Dec 30 17:24:33 dnsmasq[511]: reply bit.ly is 67.199.248.10
Dec 30 17:24:33 dnsmasq[511]: reply bit.ly is 67.199.248.11

Dec 30 17:24:56 dnsmasq[446]: query[A] pi.hole from 127.0.0.1
Dec 30 17:24:56 dnsmasq[446]: Pi-hole hostname pi.hole is 0.0.0.0
Dec 30 17:25:02 dnsmasq[446]: query[A] duck.com from 172.19.0.3
Dec 30 17:25:02 dnsmasq[446]: forwarded duck.com to 172.19.0.1
Dec 30 17:25:02 dnsmasq[446]: validation result is INSECURE
Dec 30 17:25:02 dnsmasq[446]: reply duck.com is 40.89.244.232
DL6ER commented 2 years ago

Starting from your repo works on a fresh Ubuntu 20.04 with the latest docker installed:

Creating pihole-traefik-udp-bug_traefik_1     ... done
Creating pihole-traefik-udp-bug_cloudflared_1 ... done
Creating pihole-traefik-udp-bug_pihole_1      ... done

However, it does not really do what we expect it to (at least not out of the box):

dominik@pihole-traefil-udp-bug:~$ dig @127.0.0.1 -p 5859 +notcp +noignore +short bit.ly
67.199.248.10
67.199.248.11

dominik@pihole-traefil-udp-bug:~$ dig @127.0.0.1 -p 5858 +notcp +noignore +short bit.ly
;; connection timed out; no servers could be reached

dominik@pihole-traefil-udp-bug:~$ dig @127.0.0.1 -p 5859 +notcp +noignore +short duck.com
52.142.124.215

dominik@pihole-traefil-udp-bug:~$ dig @127.0.0.1 -p 5858 +notcp +noignore +short duck.com
;; connection timed out; no servers could be reached
dominik@pihole-traefil-udp-bug:~$ sudo docker ps
CONTAINER ID   IMAGE                             COMMAND                  CREATED         STATUS                   PORTS                                                                                                                                  NAMES
c118b4e8e834   pihole/pihole:latest              "/s6-init"               3 minutes ago   Up 3 minutes (healthy)   53/udp, 53/tcp, 67/udp, 0.0.0.0:8300->80/tcp, :::8300->80/tcp                                                                          pihole-traefik-udp-bug_pihole_1
18a86bf28e30   raspbernetes/cloudflared:latest   "cloudflared --no-au…"   3 minutes ago   Up 3 minutes                                                                                                                                                    pihole-traefik-udp-bug_cloudflared_1
f7b0301a20cd   traefik:v2.5                      "/entrypoint.sh --ap…"   3 minutes ago   Up 3 minutes             80/tcp, 0.0.0.0:5858-5859->5858-5859/tcp, 0.0.0.0:5858-5859->5858-5859/udp, :::5858-5859->5858-5859/tcp, :::5858-5859->5858-5859/udp   pihole-traefik-udp-bug_traefik_1

I'm running out of time for today. Any suggestions?

Pi-hole does seem to run:

dominik@pihole-traefil-udp-bug:~$ sudo docker exec c118b4e8e834 pihole status
  [✓] DNS service is listening
     [✓] UDP (IPv4)
     [✓] TCP (IPv4)
     [✓] UDP (IPv6)
     [✓] TCP (IPv6)

  [✓] Pi-hole blocking is enabled

dominik@pihole-traefil-udp-bug:~$ sudo docker exec c118b4e8e834 dig +short google.de
74.125.133.94
BreiteSeite commented 2 years ago

I'm running out of time for today. Any suggestions?

Yes - you need to configure the IP of the cloudflared container as the upstream in pihole. I guess on your setup it got assigned a different IP than on mine. I'm not sure how docker behaves in a VM either. I think the easiest would be to just run the docker containers on your host - I mean, that's the idea of containers, they're kind of isolated from your host already anyway. :)

But if you're not comfortable with that, you can find the IP of the cloudflared container by running sudo docker inspect $(sudo docker-compose ps -q cloudflared). If you have jq installed, this reduces the output to the relevant section: sudo docker inspect $(sudo docker-compose ps -q cloudflared) | jq '.[0].NetworkSettings.Networks'

(needs to be run in the directory of the repository)

DL6ER commented 2 years ago

The VM I'm using is true virtualization; unlike hybrid solutions such as docker, the virtual operating system does not even know it is virtual (obviously, at the extra cost of disk space, memory, etc.). I have several virtualized machines running on a server with enough RAM and simply SSH into them remotely; they appear the same as bare-metal servers from the outside, but with very simple backup/restore/discard/archive/resume capabilities. I don't expect this to cause any issues.

Your command returns:

{
  "pihole-traefik-udp-bug_default": {
    "IPAMConfig": null,
    "Links": null,
    "Aliases": [
      "cloudflared",
      "18a86bf28e30"
    ],
    "NetworkID": "e5a998204b9327b2ec271d160d41c5c6150644e0c435113b2158fd7e8eebe689",
    "EndpointID": "773ee2178de15695c3ec5e58be99c05211f0df716fcb62fb3328f680208c7116",
    "Gateway": "172.18.0.1",
    "IPAddress": "172.18.0.4",
    "IPPrefixLen": 16,
    "IPv6Gateway": "",
    "GlobalIPv6Address": "",
    "GlobalIPv6PrefixLen": 0,
    "MacAddress": "02:42:ac:12:00:04",
    "DriverOpts": null
  }
}

This doesn't seem to work, though:

dominik@pihole-traefil-udp-bug:~$ sudo docker ps
CONTAINER ID   IMAGE                             COMMAND                  CREATED          STATUS                    PORTS                                                                                                                                  NAMES
128ffdbec214   pihole/pihole:latest              "/s6-init"               37 minutes ago   Up 37 minutes (healthy)   53/udp, 67/udp, 0.0.0.0:5333->53/tcp, :::5333->53/tcp, 0.0.0.0:8300->80/tcp, :::8300->80/tcp                                           pihole-traefik-udp-bug_pihole_1
18a86bf28e30   raspbernetes/cloudflared:latest   "cloudflared --no-au…"   57 minutes ago   Up 57 minutes                                                                                                                                                    pihole-traefik-udp-bug_cloudflared_1
f7b0301a20cd   traefik:v2.5                      "/entrypoint.sh --ap…"   57 minutes ago   Up 57 minutes             80/tcp, 0.0.0.0:5858-5859->5858-5859/tcp, 0.0.0.0:5858-5859->5858-5859/udp, :::5858-5859->5858-5859/tcp, :::5858-5859->5858-5859/udp   pihole-traefik-udp-bug_traefik_1
dominik@pihole-traefil-udp-bug:~$ sudo docker exec 128ffdbec214 dig @172.18.0.4 -p 5859 google.de

; <<>> DiG 9.11.5-P4-5.1+deb10u6-Debian <<>> @172.18.0.4 -p 5859 google.de
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
BreiteSeite commented 2 years ago

The VM I'm using is a true virtualisation, unlike the hybrid solutions such as docker, the virtual operating system does not even know it is virtual.

Okay - to be fair, I'm not very familiar with VMs, since it's been about 7 years since I last used them. :) If you don't mind, what do you mean by hybrid solution? Docker is basically just a frontend for process virtualization - kind of similar to LXC, just more user-friendly (IMHO ;)).

This doesn't seem to work

Sorry - I was actually wrong. The upstream IP shouldn't be the one of the cloudflared container directly, as the traffic should go to traefik, which then routes it to (one of the) cloudflared containers based on the entrypoint port (53 tcp+udp). So the IP you need to use is the one of the gateway - 172.18.0.1 in your example.

sudo docker-compose exec pihole dig @172.18.0.1 -p 5859 google.de

Should do the trick.

Thank you for your effort on this investigation.

DL6ER commented 2 years ago

Okay, so this works. TL;DR: Nothing unexpected happens. The problem is upstream and not with Pi-hole.

Let's start with what you have done, too:

dominik@pihole-traefil-udp-bug:~/$ dig @127.0.0.1 -p 5858 +notcp +noignore bit.ly
;; Truncated, retrying in TCP mode.

; <<>> DiG 9.16.1-Ubuntu <<>> @127.0.0.1 -p 5858 +notcp +noignore bit.ly
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 55227
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 7870ad0f6f280043 (echoed)
; OPT=15: 00 09 ("..")
;; QUESTION SECTION:
;bit.ly.                IN  A

;; Query time: 79 msec
;; SERVER: 127.0.0.1#5858(127.0.0.1)
;; WHEN: Fri Dec 31 11:59:03 UTC 2021
;; MSG SIZE  rcvd: 53

Same result as you got. (Interesting to note: +notcp is set but dig is still "retrying in TCP mode".)

dominik@pihole-traefil-udp-bug:~/$ sudo docker exec -it 128ffdbec214 tail /var/log/pihole.log
Dec 31 12:59:02 dnsmasq[6601]: query[A] bit.ly from 172.18.0.2
Dec 31 12:59:02 dnsmasq[6601]: forwarded bit.ly to 172.18.0.1
Dec 31 12:59:02 dnsmasq[6601]: dnssec-query[DS] bit.ly to 172.18.0.1
Dec 31 12:59:03 dnsmasq[6601]: dnssec-query[DNSKEY] ly to 172.18.0.1
Dec 31 12:59:03 dnsmasq[6601]: validation bit.ly is BOGUS
Dec 31 12:59:03 dnsmasq[6601]: reply bit.ly is 67.199.248.11
Dec 31 12:59:03 dnsmasq[6601]: reply bit.ly is 67.199.248.10

The DNSSEC validation returned that bit.ly is BOGUS. Returning SERVFAIL is expected behavior in this case. Interestingly enough, this result is wrong: I checked the DNSSEC path and bit.ly is not actually using DNSSEC at all, so it should be INSECURE instead. As everything works with a different upstream in Pi-hole, I suspect that cloudflared is somehow giving a corrupt response. Enabling log-queries=extra in 01-pihole.conf of the Pi-hole container gives further details:

Dec 31 13:09:21 dnsmasq[6974]: 10 172.18.0.2/49814 query[A] bit.ly from 172.18.0.2
Dec 31 13:09:21 dnsmasq[6974]: 10 172.18.0.2/49814 forwarded bit.ly to 172.18.0.1
Dec 31 13:09:21 dnsmasq[6974]: 11 dnssec-query[DS] bit.ly to 172.18.0.1
Dec 31 13:09:21 dnsmasq[6974]: 12 dnssec-query[DNSKEY] ly to 172.18.0.1
Dec 31 13:09:21 dnsmasq[6974]: 10 172.18.0.2/49814 validation bit.ly is BOGUS (EDE: DNSKEY missing)
Dec 31 13:09:21 dnsmasq[6974]: 10 172.18.0.2/49814 reply bit.ly is 67.199.248.10
Dec 31 13:09:21 dnsmasq[6974]: 10 172.18.0.2/49814 reply bit.ly is 67.199.248.11

(see the EDE message).
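The `OPT=15: 00 09` line in the dig output above is that same Extended DNS Error (EDE, RFC 8914) option on the wire: option code 15, followed by a two-byte info code. A minimal stdlib-only sketch of decoding it (the code table is an abbreviated subset of the RFC 8914 registry, not dnsmasq's actual implementation):

```python
import struct

# Abbreviated subset of the RFC 8914 Extended DNS Error info-code registry.
EDE_CODES = {
    6: "DNSSEC Bogus",
    7: "Signature Expired",
    9: "DNSKEY Missing",
    10: "RRSIGs Missing",
}

def decode_ede(option_data: bytes):
    """Decode an EDE option body: 2-byte info code, optional UTF-8 extra text."""
    (info_code,) = struct.unpack("!H", option_data[:2])
    extra_text = option_data[2:].decode("utf-8", errors="replace")
    return info_code, EDE_CODES.get(info_code, extra_text or "unknown")

# The bytes dig printed as "OPT=15: 00 09":
code, meaning = decode_ede(bytes([0x00, 0x09]))
print(code, meaning)  # 9 DNSKEY Missing
```

Info code 9 ("DNSKEY Missing") matches the `validation bit.ly is BOGUS (EDE: DNSKEY missing)` line in the log.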

This is now a situation that is very difficult to handle, as doing DNSSEC validation by hand is ... time-consuming (to put it mildly) - not something that is really feasible. So I recorded a pcap with filter port (53 or 5859) in the Pi-hole container: dns.zip

Check where my mouse is in the following screenshots:

This is the DNSKEY ly response incoming from traefik - it is truncated: Screenshot from 2021-12-31 13-28-52

Hence, we cannot do DNSSEC validation; we return NOERROR but tell dig that the reply is truncated. Screenshot from 2021-12-31 13-26-50

Now dig indeed retries over TCP. Pi-hole forwards again to traefik and receives a reply:

Screenshot from 2021-12-31 13-34-55

However, even over TCP, where this is not even possible, traefik tells us this is a truncated reply. Setting the TC (truncated) bit in TCP replies is meaningless: all it says, according to the standard, is "retry over TCP" - and over TCP there is no size limit to begin with.

Whether this comes from traefik or actually cloudflared remains unclear. At this point, Pi-hole is correctly labeling the reply as BOGUS + SERVFAIL as the upstream is not behaving as it should. This makes me very certain that we are not actually looking at a Pi-hole bug here.
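For reference, the TC bit lives in the flags word of the fixed 12-byte DNS header, so it is present (and can be checked) regardless of transport. A stdlib-only sketch of pulling the relevant flags out of a raw DNS message (not Pi-hole's actual code, just an illustration of the header layout):

```python
import struct

def dns_header_flags(message: bytes) -> dict:
    """Decode the flags word (bytes 2-3) of a raw wire-format DNS message."""
    (flags,) = struct.unpack("!H", message[2:4])
    return {
        "qr": bool(flags & 0x8000),  # message is a response
        "tc": bool(flags & 0x0200),  # truncated
        "rd": bool(flags & 0x0100),  # recursion desired
        "ra": bool(flags & 0x0080),  # recursion available
        "ad": bool(flags & 0x0020),  # authentic data (DNSSEC validated)
        "rcode": flags & 0x000F,     # 0 = NOERROR, 2 = SERVFAIL, ...
    }

# Header of a NOERROR response with qr, tc, rd, ra set (flags word 0x8380),
# as seen in the bogus reply above:
hdr = struct.pack("!HHHHHH", 0x1234, 0x8380, 1, 0, 0, 1)
print(dns_header_flags(hdr))
```

This is why a TC bit in a TCP reply is immediately visible in both Wireshark and dig's `flags:` line.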

What I'd need next is a way to reliably record a tcpdump of all the containers. Is there a simple way, maybe even integrated into docker, for doing this? While it was straightforward to install tcpdump in the Pi-hole container, I don't know how to install it in the other two.

BreiteSeite commented 2 years ago

What I'd need next is a way to reliably record a tcpdump of all the containers. Is there a simple way, maybe even integrated into docker for doing this?

Interesting question. I found this article with an interesting solution. If you are on x64, you can skip building the (local) docker container and instead just run:

sudo docker run --tty --net=container:CONTAINER_NAME havmand/tcpdump

with CONTAINER_NAME replaced by the container name you are interested in (easily found by running sudo docker-compose ps)

DL6ER commented 2 years ago

Thanks, for reference, here are the files: pihole-traefik-udp-bug_pcaps.zip

This is a cloudflared bug: Screenshot from 2021-12-31 14-53-34 Just for completeness, the request itself did not have the bit set as can be seen here: Screenshot from 2021-12-31 14-56-35

The relevant Internet standard is RFC 2181: Clarifications to the DNS Specification, more specifically the last sentence of section 9:

9. The TC (truncated) header bit

The TC bit should be set in responses only when an RRSet is required as a part of the response, but could not be included in its entirety. The TC bit should not be set merely because some extra information could have been included, but there was insufficient room. This includes the results of additional section processing. In such cases the entire RRSet that will not fit in the response should be omitted, and the reply sent as is, with the TC bit clear. If the recipient of the reply needs the omitted data, it can construct a query for that data and send that separately. Where TC is set, the partial RRSet that would not completely fit may be left in the response. When a DNS client receives a reply with TC set, it should ignore that response, and query again, using a mechanism, such as a TCP connection, that will permit larger replies.

This is further clarified by RFC 5966: DNS Transport over TCP - Implementation Requirements, section 3:

In the absence of EDNS0 (Extension Mechanisms for DNS 0) (see below), the normal behaviour of any DNS server needing to send a UDP response that would exceed the 512-byte limit is for the server to truncate the response so that it fits within that limit and then set the TC flag in the response header. When the client receives such a response, it takes the TC flag as an indication that it should retry over TCP instead.
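Over TCP the truncation mechanism is unnecessary in the first place, because RFC 1035 (section 4.2.2) prescribes a two-byte length prefix in front of every message, so replies of (almost) any size can be sent whole. A stdlib-only sketch of that framing (illustrative only; it assumes you already have a wire-format DNS message):

```python
import struct

def frame_for_tcp(dns_message: bytes) -> bytes:
    """Prepend the 2-byte big-endian length prefix required for DNS over TCP."""
    return struct.pack("!H", len(dns_message)) + dns_message

def unframe_from_tcp(stream: bytes) -> bytes:
    """Strip the length prefix and return exactly one DNS message."""
    (length,) = struct.unpack("!H", stream[:2])
    return stream[2 : 2 + length]

msg = b"\x12\x34" + b"\x00" * 10  # placeholder 12-byte DNS header
framed = frame_for_tcp(msg)
assert unframe_from_tcp(framed) == msg
```

Since the length field is 16 bits, a TCP reply can carry up to 65535 bytes - which is why setting TC there signals nothing a client could act on.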

I suggest you contact cloudflared maintainers and ask how a TCP query can have the truncated bit set and how they expect the client to react.

It can be reproduced easily using

dig @127.0.0.1 -p 5859 +tcp DNSKEY ly +dnssec

which shows flags: qr tc rd ra ad but tc should not be there.

I prepared a tcpdump with reduced noise for their support: pihole-traefik-udp-bug_cloudflared_1_2.zip

With all this, it should be straightforward for them to fix this.

BreiteSeite commented 2 years ago

However, even over TCP, where this is not even possible, traefik tells us this is a truncated reply. Setting the TC (truncated) bit in TCP replies is meaningless.

I wonder if this is because at some point the dns.flags are copied from the original attempt and the truncated bit is not reset.

(interesting to note +notcp is set but dig is still "retrying in TCP mode")

I think this is because of +noignore, which controls the retry behavior:

+[no]ignore This option ignores [or does not ignore] truncation in UDP responses instead of retrying with TCP. By default, TCP retries are performed.

Okay, so from what I understand, there are two issues here:

1. cloudflared inappropriately sets the TC flag for TCP responses
2. cloudflared incorrectly validates the DNSSEC status of bit.ly as BOGUS, when it should be INSECURE because the domain is not using DNSSEC at all?

Two follow-up questions from that:

3. If the TC flag in TCP responses is meaningless - why would this cause Pi-hole to refuse to answer correctly? Shouldn't dnsmasq just ignore the TC flag then?
4. Also, why can cloudflared itself resolve the query correctly while Pi-hole cannot?

pi@rpi:~ $ dig @127.0.0.1 -p 5859 +tcp +dnssec +short bit.ly
67.199.248.10
67.199.248.11
pi@rpi:~ $ dig @127.0.0.1 -p 5858 +tcp +dnssec +short bit.ly
pi@rpi:~ $ 

I suggest you contact cloudflared maintainers and ask how a TCP query can have the truncated bit set and how they expect the client to react.

I would love to help out here, but I feel you are way more knowledgeable about the DNS protocol and the client stack, and I think communication would be more efficient if you could file this bug with them (the affected repository). I would just be a proxy of communication, which I think is not desirable for either side.

DL6ER commented 2 years ago

However, even over TCP, where this is not even possible, traefik tells us this is a truncated reply. Setting the TC (truncated) bit in TCP replies is meaningless.

I wonder if this because at some point the dns.flags are copied from the original attempt and the truncated bit is not reset.

We can get the failure right away on the first query (see the last part of my previous post). There is no original attempt - we performed everything over TCP in the simple test case.


if the TC flags for TCP responses is meaningless - why would this cause pihole to refuse to answer correctly? shouldn't dnsmasq just ignore the tc flag then?

because RFC 2181 explicitly says (see bold text above):

When a DNS client receives a reply with TC set, it should ignore that response [...]

It doesn't say we should do this only for UDP queries. As a consequence, a TC bit set in a TCP reply means a hard fail, as there are no other means we could try here. DNSSEC validation in Pi-hole then fails because critical parts of the chain of trust failed.


also why can cloudlfared itself resolve the query correctly but pihole can not?

Because cloudflared doesn't seem to be doing any DNSSEC validation. It will give you whatever you ask for without any DNSSEC check.

DL6ER commented 2 years ago

Issue ticket submitted. Please subscribe to it, @BreiteSeite, in case they have questions you can answer, too - like the version of cloudflared or other details of the docker setup. I don't really have the time to debug cloudflared; this investigation here already took a lot of time.

BreiteSeite commented 2 years ago

Thanks for filing that bug upstream, and for your patience troubleshooting and explaining this to me. The issue got a lot clearer to me with your last post.

I just subscribed and will help out there as best I can.

One question i still have though:

pi@rpi:~ $ dig @127.0.0.1 -p 5859 +tcp DNSKEY +dnssec ly

; <<>> DiG 9.16.22-Debian <<>> @127.0.0.1 -p 5859 +tcp DNSKEY +dnssec ly
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31066
;; flags: qr tc rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096
; COOKIE: a73e701c7cc2c08cab3ce8c161cf1c8ceb3bc077b6761f8d (good)
;; QUESTION SECTION:
;ly.                IN  DNSKEY

;; Query time: 351 msec
;; SERVER: 127.0.0.1#5859(127.0.0.1)
;; WHEN: Fri Dec 31 16:06:52 CET 2021
;; MSG SIZE  rcvd: 59

Why would dig report udp: 4096 in the OPT PSEUDOSECTION?

DL6ER commented 2 years ago

We're always learning together.

Why would dig report udp: 4096 in the OPT PSEUDOSECTION?

This is fine. It just tells your resolver that payloads of up to 4096 bytes are transmittable over UDP. It's always there. More info, e.g., here (to not always cite Internet standards alone).
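The reason dig prints it in the OPT pseudosection is that EDNS (RFC 6891) repurposes the CLASS field of the OPT pseudo-record to carry the sender's advertised UDP payload size. A stdlib-only sketch of building and reading such a record (illustrative, not dig's actual code):

```python
import struct

def make_opt_record(udp_payload_size: int = 4096) -> bytes:
    """Minimal OPT pseudo-record: root name (0x00), TYPE=41 (OPT),
    CLASS=UDP payload size, TTL=0 (extended RCODE/flags), RDLENGTH=0."""
    return b"\x00" + struct.pack("!HHIH", 41, udp_payload_size, 0, 0)

def opt_payload_size(record: bytes) -> int:
    """Read the advertised payload size back out of the CLASS field."""
    rtype, size = struct.unpack("!HH", record[1:5])
    assert rtype == 41, "not an OPT record"
    return size

print(opt_payload_size(make_opt_record()))  # 4096
```

So "udp: 4096" is simply each side announcing its own buffer size; it says nothing about the answer that was actually returned.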

One other thing that just came to my mind: I've seen you selected --upstream https://dns11.quad9.net/dns-query, which is not Cloudflare. Why did you choose this? And, even more interesting perhaps: does Cloudflare's DoH server show the same abnormal behavior (TC bit + empty response) for the dig above?

BreiteSeite commented 2 years ago

This is fine. It just tells your resolver that up to 4096 bytes payloads are transmittable over UDP. It's always there. More info, e.g. here (to not always cite Internet standards alone).

Thanks. I wasn't aware this was client-side information. I need to get more familiar with dig and DNS at some point, I guess.

One other things that just came to my mind, I've seen you selected --upstream https://dns11.quad9.net/dns-query which is not cloudflare. Why did you chose this?

Well, I chose cloudflared because it seemed like a small DoH client that does everything I need, receives very frequent updates, had an arm64 docker image ready, and does not require a lot of configuration.

Quad9 I chose for privacy reasons.

I was under the assumption that this is a valid combination, as it is even mentioned in the Pi-hole docs:

image

And, even more interesting perhaps: Does Cloudflare's DoH server show the same abnormal behavior (TC bit + empty response) for the dig above?

Actually very interesting, because it doesn't.

pi@rpi:~/pihole-test $ sudo docker-compose logs cloudflared
pihole-test-cloudflared-1  | 2021-12-31T16:38:33Z INF Adding DNS upstream url=https://1.1.1.1/dns-query
pihole-test-cloudflared-1  | 2021-12-31T16:38:33Z INF Starting DNS over HTTPS proxy server address=dns://0.0.0.0:5811
pihole-test-cloudflared-1  | 2021-12-31T16:38:33Z INF Starting metrics server on 127.0.0.1:37025/metrics
pi@rpi:~/pihole-test $ dig @127.0.0.1 -p 5859 +tcp DNSKEY +dnssec ly

; <<>> DiG 9.16.22-Debian <<>> @127.0.0.1 -p 5859 +tcp DNSKEY +dnssec ly
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12381
;; flags: qr aa rd ra ad; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096
; COOKIE: 1ecc3775cfdafd2c (echoed)
;; QUESTION SECTION:
;ly.                IN  DNSKEY

;; ANSWER SECTION:
ly.         3575    IN  DNSKEY  256 3 8 AwEAAYdSpuTFbv0JmMYpI1cWcR/jVIOmPvo1eJnYS+VUStiGfTXvz26R UtU0LPEECV+06X1OXYiLUbt2x2XgqKJQTFvJd6Jo6Yhwr+VCuEPadNe4 4Omhs1Sp8btMWnR57o8VDkV1c+q82QPeD0krwnU6UnYdcztDAUfk75Dq +QuP4fAp0Fvi7ggivxI/nIhxON2GheHBMbU7VysnSVtx7RGt1hTgmtny fwEtZWsbVJUCfXjRANglgtIogul4hmgHwGGXfPlK2u7l/y661rHlzg4B UyT1+iVSKv18ecpJ0RpVnNpPypvGlP/oDaumgoMTB3MZL2BmDIUMh2aE 6VSnd7WcyDU=
ly.         3575    IN  DNSKEY  257 3 8 AwEAAcbhsGvQrDj3AA1GU31NNkDhJe+ghg7C+sHdX/gnBpFWMohDkZIX LVPTZ7i+6uHktDguiIImS+X1F8l6vBguMUeLn82zwtiY05SPDGezNavk Dxp9QkdANVpOhvLruoCOAtLHKNhUwPvdk21ZkTauZYctCBNRXs5r7/8m mbg0O9O1qouP3KovP6Oo7N2VhpzfaqR0zRRYpvTt0oqabTplbVYihU7C xxgXMY5WLSftHV7faedrbPRjgesFELLykxQWC8TPZ9XnqzD9mUkpDjjz 2bfgQKaVdOtu6Q0MH7OF0J3g58NL6tATmj3+gN0vf1nbR515CVsapOOE Vt7Rl0rrPkvb2pSZXfR65BGDVJMA1rRPCW2ZL1x2sU+trVFv1eRonA67 LYipR2I9+wy71OWuiTpxjG74EvlDRMYVMPmJo4AoDo1v5ZpLN9yDH5Fy LV2Cg+IXNXO9jOstxWIjK3xkhbDCr8LcKU7p9JL44WhM6D+dSBqesyGG W97s2pUUxIEL0FnwO5R56vhvRmDy5iVIw/iYx7+RpeJypXjIIBS2lniZ 3SNdFDeuSQOSn+owOvccbFNJcDBBFpfBBerukV+LgN3Z/Q2zFWG9SNwV oQapYM80MB/85RMgN0CO/wvqUitdlnsCOnfYgGw+4tS3LWh5kw4SIMUB 0YdlhJ3BUNhFf+F1
ly.         3575    IN  RRSIG   DNSKEY 8 1 172800 20220115180000 20211229180000 62311 ly. fJlYyZ6kkzOpgI6+zJhjl9SOhTAECZnOpcV1wXmAVhMQHTiyCIDb9vLd YHl+tZMsWoDH0Bya+DhKVvxPVqSfc6wo2w5dFgsSmDOfI+qZrHb10o+9 7t7VaR37VJsL3f9hhueYhEZH4DgRXqxAABp/FG6mx+VSjrIJGUvI8wCv hzR57OIjujQqsiej4zIjYDdwrgCwrRz4Ued85d038d9E+lkvBKK3hYcF 4ILZOcfJ6jXKHDttlk+6NMU18uCnnnRuUz9rbgr1YRgAP/Z+Kvi29qSh Y+Xvml+wOPzJDzu6H2uxD9+tcIFoqrl92lqfHoIh90wq9a6i7IUxFzKa oHr5pSVIX5S50/MFQo4UxXX2IoJzcOCBJRTzyWWOmxlRxk4P7H8BxD8Y 65fbqBNzrFCNuliLxh68uXq/S508lBEuQyEs4dFTC0x2dyBmOV7cAYeZ RlD645MOjxwyiGhYnCgTgZZAoIPor1qUyOg4osJDc9357wN5ehSHHSI9 qoFLxqFmmflitNxIKCkPtUemAnnCTN7NtJjA551/HMAWa2mX0RCoM7RD cJohzJL5RpZKoWDqvEzj1SdzbDomhZZFFyUvfoNXLJjWvrJ/tzAc6QDH A6oNzsLwW29Ft8w02ev4lHriaZXPW5dZswbIC8u8Tm2oDJByY6D28PAn IUo6bqiYOI0=

;; Query time: 3 msec
;; SERVER: 127.0.0.1#5859(127.0.0.1)
;; WHEN: Fri Dec 31 17:39:10 CET 2021
;; MSG SIZE  rcvd: 1403

I will add this to the linked issue.

DL6ER commented 2 years ago

So it's likely even a Quad9 bug, not cloudflared which may just stupidly pass along whatever it gets served from upstream.

BreiteSeite commented 2 years ago

I agree. As you flagged this upstream (thank you), I will close this issue.

I use Cloudflare's DNS upstream as a workaround.

pralor-bot commented 2 years ago

This issue has been mentioned on Pi-hole Userspace. There might be relevant details there:

https://discourse.pi-hole.net/t/debian-org-does-not-seem-to-resolve/55158/2