qdm12 / dns

Docker DNS server on steroids to access DNS-over-TLS from Cloudflare, Google, Quad9, Quadrant or CleanBrowsing
https://hub.docker.com/r/qmcgaw/cloudflare-dns-server
MIT License

DNS resolution failed until restart of container #93

Closed alexandre-abrioux closed 2 years ago

alexandre-abrioux commented 2 years ago

Hi! First of all, thank you for your amazing work, I'm a big fan of your project.

I have a question for you: does the LRU cache save empty responses? I'm asking because something strange happened yesterday.

FYI I upgraded to v2.0.0-beta about two weeks ago. Everything was working fine until yesterday.

For context, some of my services resolve a specific hostname very often (about once every second); let's call it acme.com. So they are basically hitting the LRU cache most of the time, if not all the time.

Yesterday, for some reason, my services could no longer resolve that hostname. After I noticed the issue I tried:

> ping acme.com
ping: acme.com: Name or service not known

But at the same time I could resolve google.com:

> ping google.com
PING google.com(fra24s04-in-x0e.1e100.net (2a00:1450:4001:827::200e)) 56 data bytes
64 bytes from fra24s04-in-x0e.1e100.net (2a00:1450:4001:827::200e): icmp_seq=1 ttl=118 time=5.11 ms
64 bytes from fra24s04-in-x0e.1e100.net (2a00:1450:4001:827::200e): icmp_seq=2 ttl=118 time=5.25 ms
64 bytes from fra24s04-in-x0e.1e100.net (2a00:1450:4001:827::200e): icmp_seq=3 ttl=118 time=5.33 ms

Once I restarted the DNS container, the hostname resolved again right away.

My config is very simple:

version: "3.7"

services:
  dns-server:
    image: qmcgaw/dns:v2.0.0-beta
    container_name: dns-server
    restart: always
    environment:
      - BLOCK_MALICIOUS=on
      - BLOCK_SURVEILLANCE=on
      - BLOCK_ADS=on
      - METRICS_TYPE=prometheus
    ports:
      - 127.0.0.1:53:53/udp
      - 172.17.0.1:53:53/udp
      - 172.17.0.1:9892:9090/tcp
    network_mode: bridge

These are the last logs written before I restarted my DNS container. There are some warnings, but I don't think they are important. FYI the issue started on the 28th around 17:30.

2021/10/27 21:46:39 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:35246->1.1.1.1:853: i/o timeout
2021/10/27 23:00:01 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:53620->1.1.1.1:853: i/o timeout
2021/10/28 01:54:45 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:60710->1.1.1.1:853: i/o timeout
2021/10/28 03:42:38 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:50438->1.1.1.1:853: i/o timeout
2021/10/28 03:54:29 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:54982->1.0.0.1:853: i/o timeout
2021/10/28 09:49:23 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:36820->1.0.0.1:853: i/o timeout
2021/10/28 11:09:58 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:36598->1.1.1.1:853: i/o timeout
2021/10/28 12:46:54 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:33552->1.0.0.1:853: i/o timeout
2021/10/28 13:26:33 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:43944->1.0.0.1:853: i/o timeout
2021/10/28 21:17:11 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:52100->1.1.1.1:853: i/o timeout
2021/10/28 21:29:20 INFO planned periodic restart of DNS server
2021/10/28 21:29:20 INFO downloading and building DNS block lists
2021/10/28 21:29:21 INFO 1220920 hostnames blocked overall
2021/10/28 21:29:21 INFO 29956 IP addresses blocked overall
2021/10/28 21:29:21 INFO 2482 IP networks blocked overall
2021/10/28 21:29:21 INFO starting DNS server
2021/10/28 21:29:21 INFO DNS server listening on :53

Could it be that the DNS server saved an empty resolution in the LRU cache during one of those timeouts and then kept serving that empty resolution to my services indefinitely? What do you think?

Thank you very much for your help.

qdm12 commented 2 years ago

Hey there! I'm glad you're having fun with it.

I had a similar issue where it would not manage to resolve my own DNS record (empty cached response) and my DDNS container would update the record like a madman... Not sure if that's related.

The code adds to the lru cache here: https://github.com/qdm12/dns/blob/master/pkg/cache/lru/lru.go#L34

And it doesn't check if the response is empty; I will add that. The cache's get function, however, respects the TTL of the record, so even if an empty response got cached, it will be evicted from the cache once its TTL is over (usually 1 hour).
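
In sketch form (simplified and hypothetical, not the actual pkg/cache/lru code; the type and function names are made up for illustration), the intended behaviour is an LRU cache whose Add refuses empty responses, so a transient upstream failure is never cached, and whose Get honours the record TTL, so even a bad entry eventually expires:

package lru

import (
	"container/list"
	"time"

	"github.com/miekg/dns"
)

// entry is one cached response plus its expiry time, derived from the
// smallest TTL in the answer section.
type entry struct {
	key     string
	msg     *dns.Msg
	expires time.Time
}

// Cache is a minimal fixed-size LRU cache for DNS responses.
type Cache struct {
	maxEntries int
	ll         *list.List               // most recently used at the front
	items      map[string]*list.Element // key -> element holding *entry
}

func New(maxEntries int) *Cache {
	return &Cache{
		maxEntries: maxEntries,
		ll:         list.New(),
		items:      make(map[string]*list.Element, maxEntries),
	}
}

// Add stores a response, skipping empty answers so that a failed or
// empty resolution is never served back from the cache.
func (c *Cache) Add(key string, msg *dns.Msg, now time.Time) {
	if msg == nil || len(msg.Answer) == 0 {
		return // do not cache empty responses
	}
	expires := now.Add(minTTL(msg))
	if el, ok := c.items[key]; ok {
		c.ll.MoveToFront(el)
		e := el.Value.(*entry)
		e.msg, e.expires = msg, expires
		return
	}
	c.items[key] = c.ll.PushFront(&entry{key: key, msg: msg, expires: expires})
	if c.ll.Len() > c.maxEntries {
		oldest := c.ll.Back()
		c.ll.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

// Get returns a cached response only while its TTL has not expired;
// expired entries are evicted on access.
func (c *Cache) Get(key string, now time.Time) (*dns.Msg, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	e := el.Value.(*entry)
	if now.After(e.expires) {
		c.ll.Remove(el)
		delete(c.items, key)
		return nil, false
	}
	c.ll.MoveToFront(el)
	return e.msg, true
}

// minTTL returns the smallest TTL found in the answer section, as a duration.
func minTTL(msg *dns.Msg) time.Duration {
	min := msg.Answer[0].Header().Ttl
	for _, rr := range msg.Answer[1:] {
		if rr.Header().Ttl < min {
			min = rr.Header().Ttl
		}
	}
	return time.Duration(min) * time.Second
}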

qdm12 commented 2 years ago

Empty responses (or responses with only empty records) will no longer be inserted into the cache as of commit 70229f5b1644416ee99310e653e110b00a33c9bc.

Let me know if you encounter the problem again. Enjoy! 😉

alexandre-abrioux commented 2 years ago

Wow, that was fast! Thank you for your help :)

qdm12 commented 2 years ago

A bit too fast: commit a750d22112a333afae04b702933741f9fed18556 overwrites it since one of my tests was failing 😄 The image should be built soon.

Kampe commented 2 years ago

Assuming this was added to the latest build, I'm still seeing the same issue with version v2.0.0-beta built on 2022-03-31T18:11:07Z (commit ad19e23).

After about a week of functioning normally, this begins happening. I have now seen it on two different Raspberry Pi hosts running Bullseye. The current workaround is to restart the container, as the SERVFAILs just persist indefinitely.

2022/06/22 16:27:10 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/22 16:32:38 WARN cannot dial DoT server: dial tcp 8.8.4.4:853: i/o timeout
2022/06/22 16:32:52 WARN cannot dial DoT server: dial tcp 8.8.4.4:853: i/o timeout
2022/06/22 16:35:29 WARN cannot dial DoT server: dial tcp 8.8.4.4:853: i/o timeout
2022/06/22 16:38:05 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/22 16:40:50 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/22 17:03:46 WARN cannot dial DoT server: dial tcp 8.8.8.8:853: i/o timeout
2022/06/22 17:04:49 WARN cannot dial DoT server: dial tcp 1.1.1.1:853: i/o timeout
2022/06/22 17:14:53 WARN cannot dial DoT server: dial tcp 1.1.1.1:853: i/o timeout
2022/06/22 19:20:55 WARN cannot dial DoT server: dial tcp 1.1.1.1:853: i/o timeout
2022/06/22 21:14:33 WARN cannot exchange over DoT connection: read tcp 192.168.144.2:36378->8.8.4.4:853: i/o timeout
2022/06/22 22:29:00 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/22 22:36:28 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/22 22:37:27 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/22 23:07:55 WARN cannot dial DoT server: dial tcp 8.8.4.4:853: i/o timeout
2022/06/22 23:27:55 WARN cannot dial DoT server: dial tcp 8.8.4.4:853: i/o timeout
2022/06/22 23:52:42 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/22 23:54:19 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/23 00:02:54 WARN cannot dial DoT server: dial tcp 1.1.1.1:853: i/o timeout
2022/06/23 00:29:55 WARN cannot dial DoT server: dial tcp 1.1.1.1:853: i/o timeout
2022/06/23 00:31:14 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/23 00:35:58 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/23 02:05:01 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/23 02:15:40 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/23 02:27:02 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/23 02:48:47 WARN cannot dial DoT server: dial tcp 1.1.1.1:853: i/o timeout

qdm12 commented 2 years ago

@Kampe maybe it loses connectivity for another reason? I have it running on amd64 for weeks without issue here 🤔 Maybe try setting up the Prometheus + Grafana dashboard (https://github.com/qdm12/dns/tree/v2.0.0-beta/readme/metrics) to see why/how it fails?

Kampe commented 2 years ago

Alright, got it all set up to monitor; I didn't know you had added metrics! I'll keep an eye out and relay what I find here :)

Thanks!

Kampe commented 2 years ago

Well, I found the issue is with the cache; it happened much faster this time and just started getting misses.

[Screenshot 2022-06-23 6:20:50 PM]

[Screenshot 2022-06-23 6:19:41 PM]

Any ideas?

qdm12 commented 2 years ago

The cache behaviour is a consequence of the DNS server not working anymore, I would say. Although I don't know why the cache entries went to zero; did you reboot the container at 4:50?

Also, why do you get such high DNS traffic from 4:50, once the DNS server doesn't work anymore? Is it clients retrying hard?

The real problem is those i/o timeouts. Try increasing the timeout with DOT_TIMEOUT=5s? But my guess is that something else on your system/router/ISP is blocking outbound traffic to port 853. Those timeouts usually happen when the connection is down or packets get dropped.
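
To rule the network path in or out, a tiny standalone check along these lines (a sketch using only the Go standard library, not something shipped with the image) can be run from the host: it dials the same resolver addresses on TCP 853 the way the DoT client does, so if these dials also time out outside the container, the problem is upstream of it:

// dotcheck.go: verify outbound TCP 853 (DNS-over-TLS) reachability from the host.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	resolvers := []string{"1.1.1.1:853", "1.0.0.1:853", "8.8.8.8:853", "8.8.4.4:853"}
	for _, addr := range resolvers {
		conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
		if err != nil {
			fmt.Printf("%s: FAILED: %v\n", addr, err)
			continue
		}
		conn.Close()
		fmt.Printf("%s: reachable\n", addr)
	}
}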

I'm running the last v2 beta image on my server (amd64) though, so I'll let you know if I get the same problem.

Kampe commented 2 years ago

Well, heh, through my make commands I just pulled your latest updates to the v2.0.0-beta and now get no response from the server whatsoever, unfortunately :(

As for the spike, yeah, I believe that's my entire network retrying its queries; I also front the server with Pi-hole.

I'll add the timeout and see what I get.

Here's my updated config:

cloudflared:
    container_name: cloudflared
    image: qmcgaw/dns:v2.0.0-beta
    environment:
      - LOG_LEVEL=info
      - BLOCK_MALICIOUS=off
      - BLOCK_SURVEILLANCE=off
      - BLOCK_ADS=off
      - CHECK_DNS=off
      - UPSTREAM_TYPE=DoT
      - DOT_RESOLVERS=cloudflare,google
      - DOH_RESOLVERS=cloudflare,google
      - DNS_FALLBACK_PLAINTEXT_RESOLVERS=cloudflare
      - LISTENING_ADDRESS=:5054
      - UPDATE_PERIOD=48h
      - METRICS_TYPE=prometheus
      - METRICS_PROMETHEUS_ADDRESS=:9091
      - DOT_TIMEOUT=5s
    restart: always
    cap_add:
      - NET_ADMIN
      - NET_RAW
    ports:
      - 5054:5054/tcp
      - 5054:5054/udp

Those graphs were from normal operation without any intervention at all, so the container was not restarted until right before I took those screenshots.

qdm12 commented 2 years ago

> Well, heh, through my make commands I just pulled your latest updates to the v2.0.0-beta and now get no response from the server whatsoever, unfortunately :(

Recent images from today were broken (oops, my bad, but it's beta); try re-pulling, the latest v2.0.0-beta image should be working 😉

I'm aiming to release a stable v2 this weekend-ish (after a year, lol), so I'm definitely interested if it's unstable.

Kampe commented 2 years ago

All good, beta it is! It doesn't seem like my listening ports are being respected anymore?

├── DNS upstream connection: dot
├── DNS server listening address: :53
├── DNS over TLS:
|   ├── DNS over TLS providers: cloudflare and google
|   ├── Plaintext fallback DNS providers: cloudflare
|   ├── Request timeout: 5s
|   └── Connecting over: IPv4

Also, I attempted to set UPDATE_PERIOD to 0 and I get yelled at at run time.

Also, did the cache metrics receive a change? None seem to appear in the dashboard anymore.

qdm12 commented 2 years ago

OK, so I broke a few things when doing #111 recently; they're fixed in this build in progress:

I'll monitor my own dashboard to see if I can spot any oddities as well.

Kampe commented 2 years ago

Pulled the latest and can verify all the above are working again! Thank you very much!

I'll keep a close eye on performance and reference this chat again if needed!

Kampe commented 2 years ago

So, I just had the issue crop up again this evening with the very latest build, unfortunately. I have two Pis receiving the exact same request loads; one is still running just fine while the other is no longer responding. They had both been running fine for about 48 hours.

[Screenshot 2022-06-27 12:07:20 AM] [Screenshot 2022-06-27 12:07:30 AM]

Logs full of:

2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://cloudflare-dns.com/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://cloudflare-dns.com/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://cloudflare-dns.com/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://cloudflare-dns.com/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://dns.google/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://dns.google/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://dns.google/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://dns.google/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://cloudflare-dns.com/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://cloudflare-dns.com/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://dns.google/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://cloudflare-dns.com/dns-query": context deadline exceeded

Here are the metrics from the container that's still functioning normally; you'll notice the spike there when the other went down. [Screenshot 2022-06-27 12:12:21 AM] [Screenshot 2022-06-27 12:12:26 AM]

To be fair, this "functioning" container's logs are full of the very same context deadline exceeded lines as the "no-longer functioning" container.

Next time this happens, is there anything on the container you want me to shell into and poke around at?

qdm12 commented 2 years ago

Are you sure the last log line from the failed container isn't about the DNS server restarting? I had a bug recently where the server would hang at shutting down before restarting to get updated block lists. I cannot reproduce it in development, but I changed the shutdown a bit in https://github.com/qdm12/dns/commit/1008c6d0ee4f63f981929663161f11903c090c10 hoping it would fix it.

If not, I'll change the base image to Alpine so that you can have a shell in there, since there isn't one for now.

Kampe commented 2 years ago

Unfortunately I don't see any log lines at all when this happens that would indicate what's going on. I'll toss on debug logs and see if we can't catch it.

Also, I believe I have block list updates off now with UPDATE_PERIOD=0. What's strange is that when this does occur, DNS never resolves again without manual intervention, so if it's a restart there seems to be an issue on start after shutdown?

qdm12 commented 2 years ago

When the update period isn't zero, the server is meant to restart periodically at runtime. Sometimes I've seen it hang at shutting down (rarely, though) when shutdown takes longer than 100ms. I changed it a bit to not time out, to see if it's more stable.
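
In sketch form (hypothetical, not the actual repo code; the names are made up for illustration), the periodic restart described above looks roughly like this: every update period the server is stopped, block lists are rebuilt and the server is started again, and the shutdown wait now blocks until the server has fully stopped instead of giving up after a short timeout such as 100ms:

package restart

import (
	"context"
	"time"
)

// server is anything that can be started and reports when it has fully stopped.
type server interface {
	// Run starts the DNS server and returns a channel closed once it has shut down.
	Run(ctx context.Context) (done <-chan struct{})
}

// runWithPeriodicRestart restarts srv every updatePeriod, rebuilding the
// block lists in between, until ctx is cancelled.
func runWithPeriodicRestart(ctx context.Context, srv server,
	updatePeriod time.Duration, rebuildBlockLists func()) {
	for {
		serverCtx, stop := context.WithCancel(ctx)
		done := srv.Run(serverCtx)

		select {
		case <-time.After(updatePeriod): // planned periodic restart
			stop()
			<-done // wait for shutdown with no timeout (previously capped at ~100ms)
			rebuildBlockLists()
		case <-ctx.Done(): // program is exiting
			stop()
			<-done
			return
		}
	}
}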

qdm12 commented 2 years ago

Does the issue still happen now? :thinking: It's running fine (amd64) for me so far.

Kampe commented 2 years ago

I don't see it happening anymore on mine :)

Sorry, I didn't see you asked this!

qdm12 commented 2 years ago

Nice, let's close this!