Closed WiredLife closed 4 years ago
I have tried with Cloudflared and stubby. Both crash for me, but I can't get any useful logs from them... I may install dnscrypt in a bit and try.
dschaper: This may be an externally triggered issue, but we'd like to get as much info as we can to prevent others' mistakes from causing us faults.
Thanks for all your reports. Pi-hole's FTL daemon still runs the last released version of dnsmasq, namely v2.80. If you are participating in the public Pi-hole v5.0 beta testing, I may have a solution for you.
If you're already running the beta (or just checked it out now), simply run
pihole checkout ftl update/dnsmasq
to get the most recent dnsmasq code, featuring a great bunch of DNSSEC-related tweaks and fixes.
Note that, as this code is based on the bleeding-edge development version of dnsmasq, it may have its own problems. However, I'd be very interested in whether it resolves this issue because, in that case, we do not need to debug the dnsmasq code for an unknown bug that may already be fixed. So if someone is willing to try it out, I'd highly appreciate any reports.
Thanks for using the Pi-hole and helping us make Pi-hole a better software for us all! You are a great community and we would be doing much worse without all your continued input!
I am running 5 on one of mine, ran the checkout... ATM it seems stable, will report back shortly.
This issue has been mentioned on Pi-hole Userspace. There might be relevant details there:
https://discourse.pi-hole.net/t/pihole-dns-service-crashing/29098/2
@DL6ER Oh wait! I remember having kind of the same issue (dnsmasq segfaulting) with DNSSEC enabled on my Unifi Gateway. I ended up disabling it since it forwards requests to the PiHoles anyway...
I'd bet it's the culprit!
I don't see any updates on the Cloudflare twitter thread. Has anyone checked to see if this has been addressed? I'm afraid to re-enable DNSSEC remotely from work.
I just enabled dnssec on my backup pihole again, and it is still crashing.
pihole checkout ftl update/dnsmasq
This is so far working on my instance. No crashing.
This is with DNSSEC enabled
I don't see any updates on the Cloudflare twitter thread.
Cloudflare is located in the same metro area that has a current COVID-19 state of emergency, a Google employee tested positive, and both Amazon and Microsoft have told their employees to work from home. Combined with the fact that this issue seems to affect only a very small fraction of Cloudflare's paying users, I suspect it's not a front burner issue.
I think it's most likely the fix will have to come from the 5.0 beta that folks are currently having success with. The Cloudflare fix might be a (much) longer way off.
I have tried to reproduce with dnsmasq 2.80 (http://www.thekelleys.org.uk/dnsmasq/dnsmasq-2.80.tar.gz, 9e4a58f816ce0033ce383c549b7d4058ad9b823968d352d2b76614f83ea39adc) and yesterday fixed an issue which caused some retries due to the CD bit handling, but couldn't get it to crash the same way as described here. The only difference between 8.8.8.8 and 1.1.1.1 I see is padding size, so I wonder if that's possibly the problem and dnsmasq has trouble handling it. It would be helpful to capture queries causing this.
As has been reported before, the crash happens at
https://github.com/pi-hole/FTL/blob/b60d63f448179cb139755eb8dcfd9d5335df3fd4/dnsmasq/forward.c#L313
See @antila's comment here for further details on why it crashed. I vaguely remember having seen the exact same crash before and, if I'm not mistaken, it might have even been me reporting this to dnsmasq upstream. This might have been months ago.
Does it crash for anyone running the experimental update/dnsmasq branch? It looks like the dnsmasq release might be close, so we may be able to get the most recent dnsmasq version released just in time to ship it with Pi-hole v5.0. All users will then benefit from getting the most recent version, likely months before the other sources update to it.
It seems I was right with my assumption of having seen the exact same crash before. Together with @TC1977 and Simon Kelley, we worked out the fix, which is included in update/dnsmasq through https://github.com/pi-hole/FTL/commit/f6aff056775c029e32d51b348fa998a046d086f1
Yeah, can confirm that update/dnsmasq does seem stable; I've been running it for at least a day.
@vavrusa If you're curious and want to see a query which breaks - see attached pcap. This is taken from Stubby configured to use 1.1.1.1 DoT.
@madpsy thanks, that's super helpful! I'll see if I can reproduce and find a workaround until the fixed dnsmasq package is released.
@madpsy any chance you could try now and see if you still experience crashes?
@vavrusa Looking good!
@DL6ER I just reviewed this issue after getting tagged by you, and checked out my setup again.
Shortly after I closed #645, I actually went back to my entire original config that was causing the crashes - DNSSEC, dnscrypt-proxy, using Cloudflare alternating with ventricle.us and even doh-crypto-sx (the originally problematic server) on occasion, and have had absolutely no problems over the last couple of days. The one thing that's different is that I switched to a new ISP, which doesn't have IPv6 - so I removed ::1 from the list of custom DNS servers in the Pi-hole settings.
pihole -d gives no issues (other than IPv6 not working),
journalctl -u dnscrypt-proxy shows no recent problems,
grep "RESPONSE_ERROR" /var/log/dnscrypt-proxy/query.log gives no response errors.
So perhaps the error is now being triggered by Cloudflare in some different way (IPv6?) that isn't affecting me. Anyway, hope the work you did way back when is paying off for people now.
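For anyone who wants to repeat the grep check above on a schedule, here is a minimal Python sketch that counts lines containing a marker string. The function name count_marker_lines is made up for illustration, and the only things taken from the comment above are the RESPONSE_ERROR marker and the log path.

```python
def count_marker_lines(lines, marker="RESPONSE_ERROR"):
    """Count the lines that contain `marker` (same idea as `grep -c`).

    `lines` can be any iterable of strings, for example an open file.
    """
    return sum(1 for line in lines if marker in line)


def count_in_file(path, marker="RESPONSE_ERROR"):
    """Open `path` as text and count the lines containing `marker`."""
    # errors="replace" keeps the scan going if the log holds odd bytes.
    with open(path, errors="replace") as handle:
        return count_marker_lines(handle, marker)
```

Calling count_in_file("/var/log/dnscrypt-proxy/query.log") would then give the number of matching lines, assuming the log lives at the path quoted above and is readable by the calling user.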
The common thing between cloudflare, ventricle.us and doh-crypto-sx is that they support padding.
Also, not only do they support padding, but they also return padded responses for queries over DoH even if the query wasn't padded (some people may qualify that behavior as "not right", but it can only improve security).
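To check whether a captured response actually carries padding, a rough Python sketch along these lines may help. It assumes you have already extracted the RDATA of the OPT record from the response (that extraction is not shown), and it relies on the EDNS(0) Padding option code 12 from RFC 7830. The function names are hypothetical.

```python
import struct

EDNS_PADDING_OPTION = 12  # option code assigned by RFC 7830


def iter_edns_options(rdata):
    """Yield (code, payload) pairs from the RDATA of an OPT record."""
    offset = 0
    while offset < len(rdata):
        if offset + 4 > len(rdata):
            raise ValueError("truncated EDNS option header")
        # Each option is: 16-bit code, 16-bit length, then the payload.
        code, length = struct.unpack_from("!HH", rdata, offset)
        offset += 4
        if offset + length > len(rdata):
            raise ValueError("truncated EDNS option payload")
        yield code, rdata[offset:offset + length]
        offset += length


def padding_length(rdata):
    """Return the padding payload length, or None if there is no padding option."""
    for code, payload in iter_edns_options(rdata):
        if code == EDNS_PADDING_OPTION:
            return len(payload)
    return None
```

Comparing padding_length for responses from different upstreams would show the padding size difference mentioned earlier in the thread, without saying anything about whether padding is what triggers the crash.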
@jedisct1 I assumed it was padding as well, as that's the only difference, but couldn't reproduce it. It looks like dnsmasq crashes upon receiving a REFUSED response (I've managed to reproduce that). I did some digging, and some portion of frequent DNSKEY queries could have been throttled as part of abuse traffic in some PoPs for the last few days, particularly if it's coming from shared prefixes. I've added an exception, so this shouldn't be happening anymore. It'd be great if more people could confirm.
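For anyone sifting through captures for the REFUSED responses mentioned above, this is a small Python sketch that reads the response code from a raw DNS message. It only uses the fixed 12-byte header layout from RFC 1035, where the RCODE is the low four bits of the flags word and REFUSED is 5. The function names are made up for illustration, and this says nothing about how dnsmasq itself handles the response.

```python
import struct

RCODE_REFUSED = 5  # RFC 1035, section 4.1.1


def response_code(message):
    """Return the 4-bit RCODE from the header of a raw DNS message."""
    if len(message) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    # The header starts with a 16-bit ID followed by a 16-bit flags word.
    _query_id, flags = struct.unpack_from("!HH", message, 0)
    return flags & 0x000F


def is_refused(message):
    """True if the message header carries RCODE 5 (REFUSED)."""
    return response_code(message) == RCODE_REFUSED
```

The input is the DNS payload itself (for example the UDP payload from a pcap), not the whole packet with IP and UDP headers.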
@vavrusa Sorry for the delay, my box that was running 4 needed to be rebuilt. It's up and looks stable so far.
@vavrusa Has been stable for me since you made the change at cloudflare's side too. Thanks.
I can confirm it also. I re-enabled DNSSEC about 4 hours ago and I didn't have any issues since.
great if more people could confirm.
FTL v4.3.1 + dnscrypt-proxy 2.0.39 + Cloudflare upstream here.
@vavrusa I re-enabled DNSSEC in Pi-hole this morning and it ran just fine for ~2h before I left for work. I'll consider the bug fixed if it lasts 24h without a crash (though I don't see why it shouldn't.) Thanks so much!
Re-enabled DNSSEC about 2 hours ago. My 21 clients are happy again. Thank you all.
@madpsy any chance you could try now and see if you still experience crashes?
Seems to be working again. Thanks!
Everything's been working now for over 24 hours, so I'd say this issue is resolved.
Is this back on the main 5.0 branch or the feature branch noted above?
See my comment before the one you're replying to :)
Thanks for all your input. This bug seems to have been fixed in two independent ways:
Cloudflare added an exception on their side (see @vavrusa's comment above).
The update/dnsmasq branch got the fix on our side.
We'll absorb our fix in the main code as soon as dnsmasq v2.81 is released. This will hopefully not take long; they have already set up a release candidate. It fails to compile on FreeBSD (a platform we don't support). A patch has already been worked out and submitted, so we can expect a second release candidate soon.
Still occurring. Just had 2 pi-holes crash within 15 mins of each other. v5. Where exactly is the fix?
@biship Still working on v5 over here with dnscrypt-proxy and CloudFlare upstream. What's your upstream?
FWIW the fix was made on CloudFlare's side over 2 months ago.
1.1.1.2 1.0.0.2
I'm not using dnscrypt-proxy, whatever that is.
OK so we both use CloudFlare. Still up and running here ...
It's happened twice in the last week. If it becomes regular, I'll have to turn on debug logs and open a new issue.
I just experienced this issue when enabling DNSSEC on an otherwise working setup. After some brief debugging, it seems that FTL does not gracefully transition between certain upstream DNS configurations. Leaving DNSSEC enabled and restarting my Docker container seems to have handled it. I also tested switching between some other combinations of DNS configurations, and it really struggles. It seems like FTL just shuts down and doesn't bother trying to come back up...
[2022-03-23 00:23:33.042 871M] Shutting down...
[2022-03-23 00:23:33.305 871M] Finished final database update (stored 55 queries)
[2022-03-23 00:23:33.305 871M] Waiting for threads to join
[2022-03-23 00:23:33.305 871M] Thread telnet-IPv4 (0) is idle, terminating it.
[2022-03-23 00:23:33.305 871M] Thread telnet-socket (2) is idle, terminating it.
[2022-03-23 00:23:33.305 871M] Thread database (3) is idle, terminating it.
[2022-03-23 00:23:33.305 871M] Thread housekeeper (4) is idle, terminating it.
[2022-03-23 00:23:33.305 871M] Thread DNS client (5) is idle, terminating it.
[2022-03-23 00:23:33.305 871M] All threads joined
[2022-03-23 00:23:33.306 871M] ########## FTL terminated after 6m 24s (code 0)! ##########
And nothing thereafter, but saving again (even with the same configuration which it just failed to start) will start FTL again.
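To spot shutdowns like the one above without reading the whole log, a minimal Python sketch could scan for the "FTL terminated" line and pull out the uptime and exit code. The regular expression is written only against the single log line shown in this comment, so treat the format as an assumption; other FTL versions may log differently. The name find_terminations is hypothetical.

```python
import re

# Pattern based solely on the log line quoted above, e.g.
# [2022-03-23 00:23:33.306 871M] ########## FTL terminated after 6m 24s (code 0)! ##########
TERMINATED = re.compile(
    r"^\[(?P<prefix>[^\]]+)\]\s+#+\s+"
    r"FTL terminated after (?P<uptime>.+?) \(code (?P<code>-?\d+)\)!"
)


def find_terminations(lines):
    """Return (prefix, uptime, exit_code) for every termination line found."""
    events = []
    for line in lines:
        match = TERMINATED.match(line.strip())
        if match:
            events.append(
                (match.group("prefix"), match.group("uptime"), int(match.group("code")))
            )
    return events
```

A termination with code 0 and no later start-up lines in the log would match what is described here: a clean shutdown with no restart.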
How familiar are you with the codebase?: 1
[BUG | ISSUE] Expected Behaviour:
[BUG | ISSUE] Actual Behaviour: Pi-hole on my 2 completely different systems crashed at nearly the same time.
[BUG | ISSUE] Steps to reproduce:
Log file output [if available] RPi4 Log:
PC Log:
Device specifics
Hardware Type: RPi4 4GB and a PC
OS: newest Raspbian on RPi4 and Ubuntu Server on PC