Closed alexandre-abrioux closed 2 years ago
Hey there! I'm glad you're having fun with it.
I had a similar issue where it would not manage to resolve my own DNS record (an empty cached response) and my DDNS container would update the record like a madman... Not sure if that's related.
The code adds to the lru cache here: https://github.com/qdm12/dns/blob/master/pkg/cache/lru/lru.go#L34
And it doesn't check whether the response is empty. I will add that. The cache get function, however, respects the TTL of the record. So if an empty response got cached and the record later got updated, the stale entry will be evicted from the cache once its TTL is over (usually 1 hour).
Empty responses (or responses with N empty records) are no longer inserted into the cache as of commit 70229f5b1644416ee99310e653e110b00a33c9bc.
Let me know if you encounter the problem again. Enjoy!
Wow, that was fast! Thank you for your help :)
A bit too fast: commit a750d22112a333afae04b702933741f9fed18556 overwrites it since one of my tests was failing. The image should be built soon.
Assuming this is added to the latest build, I'm seeing the same issues still with version v2.0.0-beta built on 2022-03-31T18:11:07Z (commit ad19e23)
After about a week of functioning normally and properly, this begins happening. I have now seen it on two different Raspberry Pi hosts running Bullseye. The current solution is to restart the container, as the SERVFAILs just persist indefinitely.
```
2022/06/22 16:27:10 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/22 16:32:38 WARN cannot dial DoT server: dial tcp 8.8.4.4:853: i/o timeout
2022/06/22 16:32:52 WARN cannot dial DoT server: dial tcp 8.8.4.4:853: i/o timeout
2022/06/22 16:35:29 WARN cannot dial DoT server: dial tcp 8.8.4.4:853: i/o timeout
2022/06/22 16:38:05 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/22 16:40:50 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/22 17:03:46 WARN cannot dial DoT server: dial tcp 8.8.8.8:853: i/o timeout
2022/06/22 17:04:49 WARN cannot dial DoT server: dial tcp 1.1.1.1:853: i/o timeout
2022/06/22 17:14:53 WARN cannot dial DoT server: dial tcp 1.1.1.1:853: i/o timeout
2022/06/22 19:20:55 WARN cannot dial DoT server: dial tcp 1.1.1.1:853: i/o timeout
2022/06/22 21:14:33 WARN cannot exchange over DoT connection: read tcp 192.168.144.2:36378->8.8.4.4:853: i/o timeout
2022/06/22 22:29:00 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/22 22:36:28 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/22 22:37:27 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/22 23:07:55 WARN cannot dial DoT server: dial tcp 8.8.4.4:853: i/o timeout
2022/06/22 23:27:55 WARN cannot dial DoT server: dial tcp 8.8.4.4:853: i/o timeout
2022/06/22 23:52:42 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/22 23:54:19 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/23 00:02:54 WARN cannot dial DoT server: dial tcp 1.1.1.1:853: i/o timeout
2022/06/23 00:29:55 WARN cannot dial DoT server: dial tcp 1.1.1.1:853: i/o timeout
2022/06/23 00:31:14 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/23 00:35:58 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/23 02:05:01 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/23 02:15:40 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/23 02:27:02 WARN cannot dial DoT server: dial tcp 1.0.0.1:853: i/o timeout
2022/06/23 02:48:47 WARN cannot dial DoT server: dial tcp 1.1.1.1:853: i/o timeout
```
@Kampe maybe it loses connectivity for another reason? I have it running on amd64 for weeks without issue here. Maybe try setting up the Prometheus + Grafana dashboard https://github.com/qdm12/dns/tree/v2.0.0-beta/readme/metrics to see why/how it fails?
Alright, got it all set up to monitor; I didn't know you had added metrics! I'll keep an eye out and relay what I find here :)
Thanks!
Well, I found the issue is with the cache: it happened much faster this time, and I just started getting misses.
Any ideas?
The cache failure is a consequence of the DNS server not working anymore, I would say. Although I don't know why the cache entries went to zero; did you reboot the container at 4:50?
Also, why do you get such high DNS traffic from 4:50 once the DNS server doesn't work anymore? Is it clients retrying hard?
The problem is really those i/o timeouts. Try increasing the timeout with DOT_TIMEOUT=5s? But my guess is that something else on your system, router, or ISP is blocking outbound traffic to port 853. Those timeouts usually happen when the connection is down or packets get dropped.
I'm running the last v2 beta image on my server (amd64) though, so I'll let you know if I get the same problem.
well, heh, through my make commands I just pulled your latest updates to v2.0.0-beta and no longer get any response from the server whatsoever, unfortunately :(
as for the spike, yeah, I believe that's my entire network retrying its queries; I also front the server with Pi-hole
I'll add the timeout and see what I get
here's my updated config:
```yaml
cloudflared:
  container_name: cloudflared
  image: qmcgaw/dns:v2.0.0-beta
  environment:
    - LOG_LEVEL=info
    - BLOCK_MALICIOUS=off
    - BLOCK_SURVEILLANCE=off
    - BLOCK_ADS=off
    - CHECK_DNS=off
    - UPSTREAM_TYPE=DoT
    - DOT_RESOLVERS=cloudflare,google
    - DOH_RESOLVERS=cloudflare,google
    - DNS_FALLBACK_PLAINTEXT_RESOLVERS=cloudflare
    - LISTENING_ADDRESS=:5054
    - UPDATE_PERIOD=48h
    - METRICS_TYPE=prometheus
    - METRICS_PROMETHEUS_ADDRESS=:9091
    - DOT_TIMEOUT=5s
  restart: always
  cap_add:
    - NET_ADMIN
    - NET_RAW
  ports:
    - 5054:5054/tcp
    - 5054:5054/udp
```
Those graphs were from normal operation without any intervention at all, so the container was not restarted until right before I took those screenshots.
> well, heh, through my make commands I just pulled your latest updates to v2.0.0-beta and no longer get any response from the server whatsoever, unfortunately :(
Recent images from today were broken (oops, my bad, but it's beta); try repulling, the latest v2.0.0-beta image should be working.
I'm aiming at releasing a stable v2 this weekend-ish (after a year lol), so I'm definitely interested if it's unstable.
All good, beta it is! It doesn't seem like my listening ports are being respected anymore?
```
├── DNS upstream connection: dot
├── DNS server listening address: :53
├── DNS over TLS:
│   ├── DNS over TLS providers: cloudflare and google
│   ├── Plaintext fallback DNS providers: cloudflare
│   ├── Request timeout: 5s
│   └── Connecting over: IPv4
```
also I attempted to set UPDATE_PERIOD to 0 and I get yelled at at runtime
Also, did the cache metrics receive a change? None seem to appear in the dashboard anymore.
Ok, so I broke a few things when doing #111 recently; fixed them in a build in progress.
I'll monitor my own dashboard to see if I can spot any oddities as well.
Pulled the latest and can verify all the above are working again! Thank you very much!
I'll keep a close eye on performance and reference this chat again if needed!
So, the issue just cropped up again this evening with the very latest build, unfortunately. I have two Pis receiving the exact same request loads: one is still running just fine while the other is no longer responding. They had both been running fine for about 48 hours.
Logs full of:
```
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://cloudflare-dns.com/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://cloudflare-dns.com/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://cloudflare-dns.com/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://cloudflare-dns.com/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://dns.google/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://dns.google/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://dns.google/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://dns.google/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://cloudflare-dns.com/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://cloudflare-dns.com/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://dns.google/dns-query": context deadline exceeded
2022-06-27T05:07:37Z WARN cannot exchange over DoH connection: Post "https://cloudflare-dns.com/dns-query": context deadline exceeded
```
Here are the metrics from the container that's still functioning normally; you'll notice the spike there when the other went down.
To be fair, this "functioning" container's logs are full of the very same context deadline exceeded log output as the "no-longer-functioning" container.
Next time this happens is there anything on the container you want me to shell in and poke around at?
Are you sure the last log line from the failed container isn't about the DNS server restarting? I recently had a bug where the server would hang at shutting down before restarting to get updated block lists. I cannot reproduce it in development, but I changed the shutdown a bit hoping it would fix it in https://github.com/qdm12/dns/commit/1008c6d0ee4f63f981929663161f11903c090c10
If not, I'll change the base image to Alpine so that you can have a shell in there, since there isn't one for now.
Unfortunately I don't see any log lines at all when this happens that would indicate what's going on. I'll toss on debug logs and see if we can't catch it.
Also, I believe I now have block list updates off with UPDATE_PERIOD=0. What's strange is that when this does occur, DNS will never resolve again without manual intervention; so if it's a restart, there seems to be an issue on start after shutdown?
When the update period isn't zero, the server is meant to restart periodically at runtime. And sometimes I've seen it hang at shutting down (rarely though) when it takes longer than 100ms. I changed it to not time out, to see if it's more stable.
Does the issue still happen now? :thinking: It's running fine (amd64) for me so far.
I don't see it happening anymore on mine :)
Sorry, I didn't see you had asked this!
Nice let's close this!
Hi! First of all, thank you for your amazing work, I'm a big fan of your project.
I have a question for you: does the LRU cache save empty responses? I'm asking because something strange happened yesterday.
FYI I upgraded to v2.0.0-beta about two weeks ago. Everything was working fine until yesterday. For context, some of my services resolve a specific hostname very often (like once every second); let's call it acme.com. So they are basically hitting the LRU cache most of the time, if not all the time.
Yesterday that hostname couldn't be resolved anymore by my services for some reason. After I noticed the issue I tried:
But at the same time I could resolve google.com.
Once I restarted the DNS container, the hostname could get resolved once again right away.
My config is very simple:
Those are the last logs that got written before I restarted my DNS container. I got some warnings, but I don't think they are important. FYI the issue started on the 28th around 17:30.
Could it be that the DNS server saved an empty resolution in the LRU cache during one of those timeouts and then kept serving that empty resolution to my services forever? What do you think?
Thank you very much for your help.