FTL Crash issues? Read this thread first!

WiredLife commented 4 years ago

In raising this issue, I confirm the following (please check boxes, eg [X]) Failure to fill the template will close your issue:

[X] I have read and understood the contributors guide.
[X] The issue I am reporting can be replicated
[X] The issue I am reporting isn't a duplicate

How familiar are you with the codebase?:

1

[BUG | ISSUE] Expected Behaviour:

[BUG | ISSUE] Actual Behaviour: Pi-hole on my 2 completely different systems crashed at nearly the same time.

[BUG | ISSUE] Steps to reproduce:

-

Log file output [if available] RPi4 Log:

[2020-03-04 00:18:39.277 4096] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[2020-03-04 00:18:39.277 4096] ---------------------------->  FTL crashed!  <----------------------------
[2020-03-04 00:18:39.277 4096] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[2020-03-04 00:18:39.277 4096] Please report a bug at https://github.com/pi-hole/FTL/issues
[2020-03-04 00:18:39.278 4096] and include in your report already the following details:
[2020-03-04 00:18:39.278 4096] FTL has been running for 106087 seconds
[2020-03-04 00:18:39.278 4096] FTL branch: master
[2020-03-04 00:18:39.278 4096] FTL version: v4.3.1
[2020-03-04 00:18:39.278 4096] FTL commit: b60d63f
[2020-03-04 00:18:39.278 4096] FTL date: 2019-05-25 21:37:26 +0200
[2020-03-04 00:18:39.278 4096] FTL user: started as pihole, ended as pihole
[2020-03-04 00:18:39.278 4096] Received signal: Segmentation fault
[2020-03-04 00:18:39.278 4096]      at address: 0
[2020-03-04 00:18:39.278 4096]      with code: SEGV_MAPERR (Address not mapped to object)
[2020-03-04 00:18:39.279 4096] Backtrace:
[2020-03-04 00:18:39.279 4096] B[0000]: /usr/bin/pihole-FTL(+0x1a25c) [0x47125c]
[2020-03-04 00:18:39.279 4096] B[0001]: /lib/arm-linux-gnueabihf/libc.so.6(__default_rt_sa_restorer+0) [0xb6d4c130]
[2020-03-04 00:18:39.279 4096] B[0002]: /usr/bin/pihole-FTL(+0x32798) [0x489798]
[2020-03-04 00:18:39.279 4096] B[0003]: /usr/bin/pihole-FTL(receive_query+0x5d1) [0x48a4ce]
[2020-03-04 00:18:39.279 4096] B[0004]: /usr/bin/pihole-FTL(+0x40ed6) [0x497ed6]
[2020-03-04 00:18:39.279 4096] B[0005]: /usr/bin/pihole-FTL(main_dnsmasq+0xa3f) [0x49913c]
[2020-03-04 00:18:39.279 4096] B[0006]: /usr/bin/pihole-FTL(main+0x87) [0x46fe18]
[2020-03-04 00:18:39.279 4096] B[0007]: /lib/arm-linux-gnueabihf/libc.so.6(__libc_start_main+0x10c) [0xb6d36718]
[2020-03-04 00:18:39.279 4096] Thank you for helping us to improve our FTL engine!
[2020-03-04 00:18:39.279 4096] FTL terminated!

PC Log:

[2020-03-04 00:29:13.585 22510] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[2020-03-04 00:29:13.585 22510] ---------------------------->  FTL crashed!  <----------------------------
[2020-03-04 00:29:13.585 22510] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[2020-03-04 00:29:13.585 22510] Please report a bug at https://github.com/pi-hole/FTL/issues
[2020-03-04 00:29:13.585 22510] and include in your report already the following details:
[2020-03-04 00:29:13.585 22510] FTL has been running for 104188 seconds
[2020-03-04 00:29:13.585 22510] FTL branch: master
[2020-03-04 00:29:13.585 22510] FTL version: v4.3.1
[2020-03-04 00:29:13.585 22510] FTL commit: b60d63f
[2020-03-04 00:29:13.585 22510] FTL date: 2019-05-25 21:37:26 +0200
[2020-03-04 00:29:13.585 22510] FTL user: started as pihole, ended as pihole
[2020-03-04 00:29:13.585 22510] Received signal: Segmentation fault
[2020-03-04 00:29:13.585 22510]      at address: 0
[2020-03-04 00:29:13.585 22510]      with code: SEGV_MAPERR (Address not mapped to object)
[2020-03-04 00:29:13.586 22510] Backtrace:
[2020-03-04 00:29:13.586 22510] B[0000]: /usr/bin/pihole-FTL(+0x255e5) [0x55d1fce755e5]
[2020-03-04 00:29:13.586 22510] B[0001]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890) [0x7fd2fa55c890]
[2020-03-04 00:29:13.586 22510] B[0002]: /usr/bin/pihole-FTL(+0x47a9a) [0x55d1fce97a9a]
[2020-03-04 00:29:13.586 22510] B[0003]: /usr/bin/pihole-FTL(receive_query+0x905) [0x55d1fce98e05]
[2020-03-04 00:29:13.586 22510] B[0004]: /usr/bin/pihole-FTL(+0x5db5b) [0x55d1fceadb5b]
[2020-03-04 00:29:13.586 22510] B[0005]: /usr/bin/pihole-FTL(main_dnsmasq+0xfdc) [0x55d1fceaf67c]
[2020-03-04 00:29:13.586 22510] B[0006]: /usr/bin/pihole-FTL(main+0xbc) [0x55d1fce73acc]
[2020-03-04 00:29:13.586 22510] B[0007]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fd2fa17ab97]
[2020-03-04 00:29:13.586 22510] B[0008]: /usr/bin/pihole-FTL(_start+0x2a) [0x55d1fce73bfa]
[2020-03-04 00:29:13.586 22510] Thank you for helping us to improve our FTL engine!
[2020-03-04 00:29:13.586 22510] FTL terminated!
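
Both backtraces above pass through receive_query and main_dnsmasq, which is easier to see once the frames are extracted. The following is a minimal sketch for pulling the frames out of such a log, assuming the "B[NNNN]: module(symbol+offset) [address]" layout shown in these two reports; the helper names parse_frames and named_symbols are hypothetical and not part of FTL.

```python
import re

# Matches one backtrace frame as printed in the logs above, e.g.
# "B[0003]: /usr/bin/pihole-FTL(receive_query+0x5d1) [0x48a4ce]"
# The symbol may be empty, as in "(+0x1a25c)".
FRAME_RE = re.compile(
    r"B\[(?P<index>\d+)\]:\s+(?P<module>\S+?)"
    r"\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+|\d+)\)\s+"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]"
)


def parse_frames(log_text):
    """Return one dict per backtrace frame found in log_text."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a backtrace line
        frames.append({
            "index": int(match.group("index")),
            "module": match.group("module"),
            "symbol": match.group("symbol") or None,
            "offset": int(match.group("offset"), 0),
            "address": int(match.group("address"), 16),
        })
    return frames


def named_symbols(frames):
    """Return the symbol names of all frames that have one, in order."""
    return [frame["symbol"] for frame in frames if frame["symbol"]]
```

Running named_symbols(parse_frames(log)) on each log gives two lists of function names that can be compared directly, even though the offsets and addresses differ between the ARM and x86_64 builds.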

Device specifics

Hardware Type: RPi4 4GB and a PC
OS: newest Raspbian on RPi4 and Ubuntu Server on PC

This template was created based on the work of udemy-dl.

MattLParker commented 4 years ago

I have tried with Cloudflared and stubby, Both will crash for me, but I can't get any useful logs from them... I may install dnscrypt in a bit and try.

dschaper This may be an externally triggered issue, but we'd like to get as much info as we can to prevent others' mistakes from causing us faults.

DL6ER commented 4 years ago

Thanks for all your reports. Pi-hole's FTL daemon still runs the last released version of dnsmasq, namely v2.80. If you are participating in the public Pi-hole v5.0 beta testing, I may have a solution for you.

If you're already running the beta (or have just checked it out), simply run

pihole checkout ftl update/dnsmasq

to get the most-recent dnsmasq code featuring a great bunch of DNSSEC-related tweaks and fixes.

Note that, as this code is based on the bleeding-edge development version of dnsmasq, it may have its own problems. However, I'd be very interested in whether it would resolve this issue because, in this case, we do not need to debug the dnsmasq code for an unknown bug when it may be already fixed. So if someone is willing to try it out, I'd highly appreciate any reports.

Thanks for using the Pi-hole and helping us make Pi-hole a better software for us all! You are a great community and we would be doing much worse without all your continued input!

MattLParker commented 4 years ago

I am running 5 on one of mine, ran the checkout... ATM it seems stable, will report back shortly.

pralor-bot commented 4 years ago

This issue has been mentioned on Pi-hole Userspace. There might be relevant details there:

https://discourse.pi-hole.net/t/pihole-dns-service-crashing/29098/2

Twanislas commented 4 years ago

@DL6ER Oh wait ! I remember having kinda the same issue (dnsmasq segfaulting) with dnssec enabled on my Unifi Gateway. I ended up disabling it since it forwards requests to the PiHoles anyways...

I'd bet it's the culprit !

darkameba commented 4 years ago

I don't see any updates on the Cloudflare twitter thread. Has anyone checked to see if this has been addressed? I'm afraid to re-enable DNSSEC remotely from work.

networkRob commented 4 years ago

I just enabled dnssec on my backup pihole again, and it is still crashing.

derekslenk commented 4 years ago

pihole checkout ftl update/dnsmasq

This is so far working on my instance. No crashing.

This is with DNSSEC enabled

jdrch commented 4 years ago

I don't see any updates on the Cloudflare twitter thread.

Cloudflare is located in the same metro area that has a current COVID-19 state of emergency, a Google employee tested positive, and both Amazon and Microsoft have told their employees to work from home. Combined with the fact that this issue seems to affect only a very small fraction of Cloudflare's paying users, I suspect it's not a front burner issue.

I think it's most likely the fix will have to come from the 5.0 beta that folks are currently having success with. The Cloudflare fix might be a (much) longer way off.

vavrusa commented 4 years ago

I have tried to reproduce with dnsmasq 2.80 http://www.thekelleys.org.uk/dnsmasq/dnsmasq-2.80.tar.gz 9e4a58f816ce0033ce383c549b7d4058ad9b823968d352d2b76614f83ea39adc and yesterday fixed an issue that caused some retries due to the CD bit handling, but couldn't get it to crash the same way as described here. The only difference between 8.8.8.8 and 1.1.1.1 I see is padding size, so I wonder if that's possibly the problem and dnsmasq has trouble handling it. It would be helpful to capture queries causing this.

DL6ER commented 4 years ago

As has been reported before, the crash happens at

https://github.com/pi-hole/FTL/blob/b60d63f448179cb139755eb8dcfd9d5335df3fd4/dnsmasq/forward.c#L313

See @antila's comment here for further details on why it crashed. I vaguely remember having seen the exact same crash before and, if I'm not mistaken, it might have even been me reporting this to dnsmasq upstream. This might have been months ago.

Does it crash for anyone running the experimental update/dnsmasq branch? It looks like the dnsmasq release might be close so we may be able to get the most recent dnsmasq version released just in time to ship it with Pi-hole v5.0. All users will then benefit from getting the most recent version likely months before the other sources update to it.

DL6ER commented 4 years ago

It seems I was right in my assumption of having seen the exact same crash before. Together with @TC1977 and Simon Kelley we worked out the fix, which is included in update/dnsmasq through https://github.com/pi-hole/FTL/commit/f6aff056775c029e32d51b348fa998a046d086f1

MattLParker commented 4 years ago

Yea, can confirm that update/dnsmasq does seem stable, been running it at least a day.

madpsy commented 4 years ago

@vavrusa If you're curious and want to see a query which breaks - see attached pcap. This is taken from Stubby configured to use 1.1.1.1 DoT.

dns.out.zip

vavrusa commented 4 years ago

@madpsy thanks, that's super helpful! I'll see if I can reproduce and find a workaround until the fixed dnsmasq package is released.

vavrusa commented 4 years ago

@madpsy any chance you could try now and see if you still experience crashes?

madpsy commented 4 years ago

@vavrusa Looking good!

TC1977 commented 4 years ago

@DL6ER I just reviewed this issue after getting tagged by you, and checked out my setup again.

Shortly after I closed #645, I actually went back to my entire original config that was causing the crashes - DNSSEC, dnscrypt-proxy, using Cloudflare alternating with ventricle.us and even doh-crypto-sx (the originally problematic server) on occasion, and have had absolutely no problems over the last couple of days. The one thing that's different is that I switched to a new ISP, which doesn't have IPv6 - so I removed ::1 from the list of custom DNS servers in the Pi-hole settings.

pihole -d gives no issues (other than IPv6 not working),
journalctl -u dnscrypt-proxy shows no recent problems,
grep "RESPONSE_ERROR" /var/log/dnscrypt-proxy/query.log gives no response errors,
Pihole logs over the last couple of days don't show any unusual drop in queries.
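
The grep check in the list above can also be scripted, for example to run it periodically. This is only a rough sketch that counts lines containing a marker string; the function names are hypothetical, and it assumes nothing about the query.log format beyond the "RESPONSE_ERROR" text used in the grep command.

```python
def count_matches(lines, needle="RESPONSE_ERROR"):
    """Count how many lines contain the marker string."""
    return sum(1 for line in lines if needle in line)


def count_in_file(path, needle="RESPONSE_ERROR"):
    """Count matching lines in a log file, tolerating undecodable bytes."""
    with open(path, "r", errors="replace") as handle:
        return count_matches(handle, needle)
```

For instance, count_in_file("/var/log/dnscrypt-proxy/query.log") returning 0 corresponds to the grep giving no response errors.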

So perhaps the error is now being triggered by Cloudflare in some different way (IPv6?), that isn't affecting me. Anyway, hope the work you did way back when is paying off for people now.

jedisct1 commented 4 years ago

The common thing between cloudflare, ventricle.us and doh-crypto-sx is that they support padding.

jedisct1 commented 4 years ago

Also, not only do they support padding, but they also return padded responses for queries over DoH even if the query wasn't padded (some people may qualify that behavior as "not right", but it can only improve security).
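
For readers unfamiliar with padding: it is carried as an EDNS(0) option (option code 12, RFC 7830) whose payload is zero bytes. Below is a rough sketch of how such an option could be sized. The 468-byte block size is the response recommendation from RFC 8467; whether the servers named above use it is an assumption, and the function name padding_option is hypothetical.

```python
import struct

PADDING_OPTION_CODE = 12  # EDNS(0) Padding option, RFC 7830


def padding_option(current_len, block_size=468):
    """Build a Padding option that rounds a message up to a block boundary.

    current_len is the message length in bytes before the option is added.
    The caller still has to append the option to the OPT record and update
    that record's RDLENGTH; this sketch does not do that.
    """
    header_len = 4  # 2 bytes option code + 2 bytes option length
    pad = (-(current_len + header_len)) % block_size
    return struct.pack("!HH", PADDING_OPTION_CODE, pad) + b"\x00" * pad
```

A 100-byte response would get a 368-byte option (4 bytes of option header plus 364 zero bytes), bringing the total to 468 bytes.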

vavrusa commented 4 years ago

@jedisct1 I assumed it was padding as well, as that's the only difference, but couldn't reproduce it. It looks like dnsmasq crashes upon receiving a REFUSED response (I've managed to reproduce that). I did some digging and some portion of frequent DNSKEY queries could have been throttled as part of abuse traffic in some PoPs for the last few days, particularly if it's coming from shared prefixes. I've added an exception, so this shouldn't be happening anymore, so it'd be great if more people could confirm.
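
To check a captured upstream reply for the REFUSED case described above, the response code can be read straight from the 12-byte DNS header (RFC 1035; REFUSED is RCODE 5). This is a minimal sketch that only looks at the header bits; it takes the raw DNS message bytes (without any TCP length prefix) and ignores the extended RCODE bits that EDNS stores in the OPT record. The function names are hypothetical.

```python
import struct

RCODE_REFUSED = 5  # RFC 1035


def parse_header_flags(message):
    """Decode the ID and a few flag bits from a raw DNS message."""
    if len(message) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    ident, flags = struct.unpack("!HH", message[:4])
    return {
        "id": ident,
        "is_response": bool(flags & 0x8000),  # QR bit
        "ad": bool(flags & 0x0020),           # Authenticated Data
        "cd": bool(flags & 0x0010),           # Checking Disabled
        "rcode": flags & 0x000F,              # lower 4 bits only
    }


def is_refused(message):
    """True if the message is a response with RCODE REFUSED."""
    header = parse_header_flags(message)
    return header["is_response"] and header["rcode"] == RCODE_REFUSED
```

Feeding each DNS payload from a pcap through is_refused would show whether the crashes line up with REFUSED answers to DNSKEY queries.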

MattLParker commented 4 years ago

@vavrusa Sorry for the late reply, my box that was running 4 needed to be rebuilt. It's up and looks stable so far.

madpsy commented 4 years ago

@vavrusa Has been stable for me since you made the change at cloudflare's side too. Thanks.

ntomka commented 4 years ago

I can confirm it also. I re-enabled DNSSEC about 4 hours ago and I haven't had any issues since.

jdrch commented 4 years ago

great if more people could confirm.

FTL v4.3.1 + dnscrypt-proxy 2.0.39 + Cloudflare upstream here.

@vavrusa I re-enabled DNSSEC in Pi-hole this morning and it ran just fine for ~2h before I left for work. I'll consider the bug fixed if it lasts 24h without a crash (though I don't see why it shouldn't.) Thanks so much!

cheesedasher commented 4 years ago

Re-enabled DNSSEC about 2 hours ago. My 21 clients are happy again. Thank you all.

darkameba commented 4 years ago

@madpsy any chance you could try now and see if you still experience crashes?

Seems to be working again. Thanks!

jdrch commented 4 years ago

Everything's been working now for over 24 hours, so I'd say this issue is resolved.

derekslenk commented 4 years ago

Everything's been working now for over 24 hours, so I'd say this issue is resolved.

Is this back on the main 5.0 branch or the feature branch noted above?

jdrch commented 4 years ago

See my comment before the one you're replying to :)

DL6ER commented 4 years ago

Thanks for all your input. This bug seems to have been fixed in two independent ways:

Everyone on our update/dnsmasq branch got the fix on our side
Everyone else got the issue fixed externally (Cloudflare fixed their issue)

We'll absorb our fix in the main code as soon as dnsmasq v2.81 is released. This will hopefully not take long, they have already set up a release candidate. It fails to compile on FreeBSD (a platform we don't support). A patch has already been worked out and submitted so we can expect a second release candidate, soon.

pralor-bot commented 4 years ago

This issue has been mentioned on Pi-hole Userspace. There might be relevant details there:

https://discourse.pi-hole.net/t/ftl-crash/29699/8

biship commented 4 years ago

Still occurring. Just had 2 pi-holes crash within 15 mins of each other. v5. Where exactly is the fix?

jdrch commented 4 years ago

@biship Still working on v5 over here with dnscrypt-proxy and CloudFlare upstream. What's your upstream?

FWIW the fix was made on CloudFlare's side over 2 months ago.

biship commented 4 years ago

1.1.1.2 1.0.0.2

biship commented 4 years ago

I'm not using dnscrypt-proxy, whatever that is.

jdrch commented 4 years ago

OK so we both use CloudFlare. Still up and running here ...

biship commented 4 years ago

it's happened twice in the last week. if it becomes regular i'll have to turn on debug logs and open a new issue.

cmjordan42 commented 2 years ago

I just experienced this issue when enabling DNSSEC on an otherwise working setup. After some brief debugging, it seems that FTL does not gracefully transition between certain upstream DNS configurations. Leaving DNSSEC and restarting my docker container seems to have handled it. I also tested switching between some other combinations of DNS configurations and it really struggles. Seems like FTL just shuts down and doesn't bother trying to come up...

[2022-03-23 00:23:33.042 871M] Shutting down...
[2022-03-23 00:23:33.305 871M] Finished final database update (stored 55 queries)
[2022-03-23 00:23:33.305 871M] Waiting for threads to join
[2022-03-23 00:23:33.305 871M] Thread telnet-IPv4 (0) is idle, terminating it.
[2022-03-23 00:23:33.305 871M] Thread telnet-socket (2) is idle, terminating it.
[2022-03-23 00:23:33.305 871M] Thread database (3) is idle, terminating it.
[2022-03-23 00:23:33.305 871M] Thread housekeeper (4) is idle, terminating it.
[2022-03-23 00:23:33.305 871M] Thread DNS client (5) is idle, terminating it.
[2022-03-23 00:23:33.305 871M] All threads joined
[2022-03-23 00:23:33.306 871M] ########## FTL terminated after 6m 24s (code 0)! ##########

And nothing thereafter, but saving again (even with the same configuration which it just failed to start) will start FTL again.

pi-hole / FTL

FTL Crash issues? Read this thread first! #705
