pi-hole / FTL

The Pi-hole FTL engine
https://pi-hole.net

pihole-FTL hangs / DNS unresponsive, daemon times out during restart #1081

Closed takerukoushirou closed 3 years ago

takerukoushirou commented 3 years ago

Versions

Platform

Issue

DNS randomly stops working. Devices can no longer make a connection to pihole-FTL; connection attempts hang.

Last pihole-FTL log messages:

[2021-03-04 15:15:07.893 26913M] Resizing "FTL-dns-cache" from 159744 to (10240 * 16) == 163840 (/dev/shm: 6.4MB used, 2.0GB total, FTL uses 6.4MB)
[2021-03-04 15:56:04.649 26913M] Resizing "FTL-dns-cache" from 163840 to (10496 * 16) == 167936 (/dev/shm: 6.4MB used, 2.0GB total, FTL uses 6.4MB)
[2021-03-04 16:00:01.267 26913/T26917] Notice: Database size is 4345.02 MB, deleted 6532 rows

pihole restartdns fails as pihole-FTL does not stop within the time limit. systemd journal:

Mar 04 16:17:04 gateway-hub systemd[1]: Stopping LSB: pihole-FTL daemon...
Mar 04 16:17:13 gateway-hub pihole-FTL[9257]: .....
Mar 04 16:17:13 gateway-hub pihole-FTL[9257]: Not stopped; may still be shutting down or shutdown may have failed, killing now
Mar 04 16:17:09 gateway-hub systemd[1]: pihole-FTL.service: Control process exited, code=exited, status=1/FAILURE
Mar 04 16:17:09 gateway-hub systemd[1]: pihole-FTL.service: Failed with result 'exit-code'.
Mar 04 16:17:09 gateway-hub systemd[1]: Stopped LSB: pihole-FTL daemon.
Mar 04 16:17:09 gateway-hub systemd[1]: Starting LSB: pihole-FTL daemon...
Mar 04 16:17:15 gateway-hub pihole-FTL[9274]: .....
Mar 04 16:17:15 gateway-hub pihole-FTL[9274]: Not stopped; may still be shutting down or shutdown may have failed, killing now
Mar 04 16:17:15 gateway-hub systemd[1]: pihole-FTL.service: Control process exited, code=exited, status=1/FAILURE
Mar 04 16:17:15 gateway-hub systemd[1]: pihole-FTL.service: Failed with result 'exit-code'.
Mar 04 16:17:15 gateway-hub systemd[1]: Failed to start LSB: pihole-FTL daemon.

Manually restarting pihole-FTL via systemctl eventually succeeds.

Steps to reproduce

So far no pattern observed, happens randomly, sometimes within hours, sometimes within weeks.

Debug Token

https://tricorder.pi-hole.net/nkv45bn6cf

bershanskiy commented 3 years ago

ensure we never inherit any database connection in a fork.

Are you talking about the underlying file descriptor? or pointer to sqlite3 struct?

DL6ER commented 3 years ago

The sqlite3 struct and all the obscure heap-allocated memory contained within. Closing the inherited pointer in the fork would be safe to do.
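
The "never inherit a database connection in a fork" rule can be illustrated with a minimal sketch. It is written in Python rather than FTL's C, it is not FTL's actual code, and the `domains` table and all function names are made up for the example. The SQLite documentation likewise warns against opening a connection, calling fork(), and then using that connection in the child. Here the parent closes its connection before forking, and the child opens its own:

```python
import os
import sqlite3
import tempfile


def make_demo_db(domains: list[str]) -> str:
    """Create a throwaway database (hypothetical schema) and return its path."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE domains (name TEXT)")
    conn.executemany("INSERT INTO domains VALUES (?)", [(d,) for d in domains])
    conn.commit()
    conn.close()  # closed before any fork(), so no handle can leak into a child
    return path


def count_rows(db_path: str) -> int:
    """Open a fresh connection, run one query, and close the connection again."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT COUNT(*) FROM domains").fetchone()[0]
    finally:
        conn.close()


def query_in_fork(db_path: str) -> int:
    """Fork a worker that opens its own connection and reports back via a pipe."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()  # POSIX only
    if pid == 0:
        # Child process: no connection existed at fork time, so it opens its own.
        status = 1
        try:
            os.close(read_fd)
            os.write(write_fd, str(count_rows(db_path)).encode())
            status = 0
        finally:
            os._exit(status)  # never fall back into the parent's code path
    # Parent process: read the child's answer, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        result = int(reader.read())
    os.waitpid(pid, 0)
    return result
```

The design point is only the ordering: nothing is open across `os.fork()`, and every process closes what it opened itself.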

DL6ER commented 3 years ago

Okay, not a thousand(s) line change, but still 21 files changed, 624 insertions(+), 656 deletions(-). This kept me busy for almost two hours, including some testing. I'm fairly optimistic that I haven't made any severe errors, but any further testing on your side would again be appreciated. Whatever comes of it, the code is now cleaner and maybe even a bit more performant in one spot or another. So it was worth it.

Pectojin commented 3 years ago

My system was producing these errors 8-10 times over 6 hours earlier today using FTL v5.7. I applied tweak/memory vDev-0795cf6 and it's been running smoothly for ~2 hours now.

binary-person commented 3 years ago

of course. thanks for putting your time into this. going to test it later and report back

Pectojin commented 3 years ago

It continues to work excellently on my raspberry pi B+.

It's being monitored externally every 15 seconds, and there are very clear indicators when the problem happens: 5-10 minutes of requests timing out after 5 seconds.

I applied the changes around 22:20 and the only timeout since then was at 6 am when the computer rebooted on schedule.
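
The post does not say which monitoring tool is used. As a rough sketch of such an external check, the hypothetical `probe()` below sends one DNS query over UDP and treats a missing answer within 5 seconds as a failure. The function names and the queried name (`pi-hole.net`) are arbitrary choices for the example:

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Encode a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (only "recursion desired" set), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def probe(server: str, port: int = 53, name: str = "pi-hole.net",
          timeout: float = 5.0) -> bool:
    """Return True if the server answers the query before the timeout expires."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts as well as network errors
            return False
    # A valid reply echoes the 16-bit query ID in its first two bytes.
    return len(reply) >= 2 and reply[:2] == query[:2]
```

Calling `probe()` with the Pi-hole's address from a scheduler every 15 seconds and recording the result would produce the kind of timeline described above.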

binary-person commented 3 years ago

this update works great. ran for 6 hours and accumulated more than 700,000 queries without a single problem. I think it's safe to say the stability issues are all fixed

binary-person commented 3 years ago

now to wait for the next official release

Pectojin commented 3 years ago

After 24 hours I have no complaints. Completely fixed my problem.

DL6ER commented 3 years ago

now to wait for the next official release

The next release is already getting closer. We're currently waiting on dnsmasq v2.85, which is in release-candidate state. There is a reported issue with compiling dnsmasq on Debian Buster. However, as the issue is about a missing linking dependency and none of the related code changed at all, this is very likely just a user error.

takerukoushirou commented 3 years ago

I finally ran into the issue again after enabling debug logging (thus still on release FTL v5.7); it took a while this time. As there has been lots of development, this is mainly for reference to check whether this looks like the same cause; it's also lock-related:

[2021-03-26 11:37:00.896 6669/F27456] gravityDB_open(): Setting busy timeout to 1000
[2021-03-26 11:37:00.896 6669/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.896 6669/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.897 6669/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.897 6669/F27456] gravityDB_open(): Setting busy timeout to zero
[2021-03-26 11:37:00.897 6669/F27456] gravityDB_open(): Successfully opened gravity.db
[2021-03-26 11:37:00.898 6669/F27456] Waiting for lock in _FTL_new_query() (/root/project/src/dnsmasq_interface.c:571)
[2021-03-26 11:37:00.898 6669/F27456] Obtained lock for _FTL_new_query() (/root/project/src/dnsmasq_interface.c:571)
[2021-03-26 11:37:00.898 27456M] Waiting for lock in _FTL_CNAME() (/root/project/src/dnsmasq_interface.c:342)
[2021-03-26 11:37:00.898 6669/F27456] **** new TCP query[A] query "r1---sn-mn4vg5aa-5hn6.googlevideo.com" from eth0:fd00::2435:e2c4:b150:d9a0 (ID 2858779, FTL 117191, /root/project/src/dnsmasq/forward.c:2048)
[2021-03-26 11:37:00.898 6669/F27456] getOverTimeID(1616754900): 141
[2021-03-26 11:37:00.899 6672/F27456] TCP worker forked for client fd00::2435:e2c4:b150:d9a0 on interface eth0 with IP fd00::ae6f:333e:41b1:f689
[2021-03-26 11:37:00.899 6672/F27456] gravityDB_open(): Trying to open /etc/pihole/gravity.db in read-only mode
[2021-03-26 11:37:00.901 6672/F27456] gravityDB_open(): Setting location for temporary object to MEMORY
[2021-03-26 11:37:00.901 6669/F27456] r1---sn-mn4vg5aa-5hn6.googlevideo.com is not known
[2021-03-26 11:37:00.901 6669/F27456] Getting sqlite3_stmt** 0x1c489c0[234] --> (nil)
[2021-03-26 11:37:00.901 6669/F27456] Initializing gravity statements for fd00::2435:e2c4:b150:d9a0
[2021-03-26 11:37:00.901 6672/F27456] gravityDB_open(): Preparing audit query
[2021-03-26 11:37:00.901 6669/F27456] Querying gravity database for client with IP fd00::2435:e2c4:b150:d9a0...
[2021-03-26 11:37:00.902 6669/F27456] --> No record for fd00::2435:e2c4:b150:d9a0 in the client table
[2021-03-26 11:37:00.902 6669/F27456] Querying gravity database for MAC address of fd00::2435:e2c4:b150:d9a0...
[2021-03-26 11:37:00.904 6671/F27456] gravityDB_open(): Setting busy timeout to 1000
[2021-03-26 11:37:00.904 6671/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.904 6671/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.904 6671/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.904 6671/F27456] gravityDB_open(): Setting busy timeout to zero
[2021-03-26 11:37:00.904 6671/F27456] gravityDB_open(): Successfully opened gravity.db
[2021-03-26 11:37:00.905 6671/F27456] Waiting for lock in _FTL_new_query() (/root/project/src/dnsmasq_interface.c:571)
[2021-03-26 11:37:00.906 27456/T27460]          ---> OK
[2021-03-26 11:37:00.906 27456/T27460] Waiting for lock in parse_neighbor_cache() (/root/project/src/database/network-table.c:1107)
[2021-03-26 11:37:00.910 6672/F27456] gravityDB_open(): Setting busy timeout to 1000
[2021-03-26 11:37:00.911 6672/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.911 6672/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.911 6672/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.911 6672/F27456] gravityDB_open(): Setting busy timeout to zero
[2021-03-26 11:37:00.911 6672/F27456] gravityDB_open(): Successfully opened gravity.db
[2021-03-26 11:37:00.912 6672/F27456] Waiting for lock in _FTL_new_query() (/root/project/src/dnsmasq_interface.c:571)
[2021-03-26 11:38:00.063 27456/T27461] Waiting for lock in GC_thread() (/root/project/src/gc.c:50)

Is it by the way normal that there are about 30 pihole-FTL processes running (this was the case when FTL froze today)?
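
To see at a glance which processes are stuck in a debug log like the one above, the "Waiting for lock" lines can be paired with the "Obtained lock" lines. The sketch below is a hypothetical helper, not part of Pi-hole; it assumes the line format shown in the excerpt and that every successful acquisition is logged:

```python
import re
from collections import Counter

# Matches e.g. "[2021-03-26 11:37:00.898 6669/F27456] Waiting for lock in _FTL_new_query() (...)"
LOCK_LINE = re.compile(
    r"^\[\S+ \S+ (?P<proc>[^\]]+)\] "
    r"(?P<event>Waiting for lock in|Obtained lock for) (?P<func>\w+)\(\)"
)


def pending_lock_waits(lines):
    """Return {(process tag, function): count} for waits never followed by 'Obtained lock'."""
    pending = Counter()
    for line in lines:
        match = LOCK_LINE.match(line)
        if not match:
            continue  # not a lock-related line
        key = (match["proc"], match["func"])
        if match["event"] == "Waiting for lock in":
            pending[key] += 1
        elif pending[key] > 0:
            pending[key] -= 1  # this wait was satisfied
    return {key: count for key, count in pending.items() if count > 0}
```

Passing the lines of the log to `pending_lock_waits()` lists every process tag that asked for the lock without a matching acquisition being logged for it.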

binary-person commented 3 years ago

that's the hang problem. it'll get fixed in the next release. but if you want to fix it now, running pihole checkout ftl tweak/memory will do

DL6ER commented 3 years ago

@takerukoushirou Even better, try

pihole checkout ftl development

Is it by the way normal that there are about 30 pihole-FTL processes running (this was the case when FTL froze today)?

Yes. Internet standards (RFCs) mandate that DNS must be answered not only over UDP but also over TCP. In the latter case, persistent connections are kept open to reduce the protocol overhead. For each of these connections, an individual "fork" is created. On Linux, forks are shown as individual processes even though they are just dependent copies of the original process. Tools like htop show the dependency quite nicely.

@binary-person is right that this is an issue which manifests under extreme TCP load. Only a few devices actually do TCP lookups; however, some push that really hard. It strongly depends on the particular devices in your network and can be perfectly normal. It is just somewhat uncommon, and FTL wasn't prepared for this in all kinds of complex multi-tasking scenarios (it should be now).
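
As background on why one TCP connection can stay open and carry many lookups: over TCP, each DNS message is prefixed with a two-byte length field (RFC 1035, section 4.2.2), so several messages can follow each other on the same stream. The sketch below only illustrates that framing; the function names are made up and it is unrelated to FTL's or dnsmasq's implementation:

```python
import struct


def frame(message: bytes) -> bytes:
    """Prefix a DNS message with its 16-bit big-endian length for sending over TCP."""
    if len(message) > 0xFFFF:
        raise ValueError("DNS message too large for TCP framing")
    return struct.pack("!H", len(message)) + message


def split_stream(buffer: bytes):
    """Split received bytes into (complete messages, leftover bytes of a partial message)."""
    messages = []
    offset = 0
    while len(buffer) - offset >= 2:
        (length,) = struct.unpack_from("!H", buffer, offset)
        if len(buffer) - offset - 2 < length:
            break  # the rest of this message has not arrived yet
        start = offset + 2
        messages.append(buffer[start:start + length])
        offset = start + length
    return messages, buffer[offset:]
```

A worker handling such a connection keeps reading, splits off complete messages, and holds on to the leftover bytes until more data arrives.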

takerukoushirou commented 3 years ago

@DL6ER thank you very much for the detailed explanation. Never saw that many forks before, maybe devices switched from UDP to TCP when they couldn't get a response via UDP.

I switched to the development branch this morning, as the issue re-appeared continuously this time. Running all fine since then 😃

dvdvideo1234 commented 3 years ago

I will wait for the next release

  Please note that changing branches severely alters your Pi-hole subsystems
  Features that work on the master branch, may not on a development branch
  This feature is NOT supported unless a Pi-hole developer explicitly asks!
  Have you read and understood this? [y/N] ^C

DL6ER commented 3 years ago

maybe devices switched from UDP to TCP when they couldn't get a response via UDP

No, the issue comes from too many TCP workers combined with a concurrency bug. Maybe this is typical in your network and just goes away quickly, so you never noticed. Well, this time it didn't go away because FTL froze. Or this was the first time you've had so many TCP workers at once and, hence, this triggered the bug.

For the next release we're basically waiting on two more things:

  1. The SQLite3 engine is experiencing a hard time. They are already at their third bugfix release in a few days. We want to have this settle down somewhat before releasing the next version.
  2. dnsmasq v2.85 is also already in release-candidate state. There hasn't been much movement; however, it also seems no bugs have been reported so far.

derekcentrico commented 3 years ago

To confirm: we can check out master to revert once it's released, right?

DL6ER commented 3 years ago

Absolutely, at any time before and after the release. Just note that the issue may reappear when you do it too early ;-)

readall commented 3 years ago

The development branch FTL crashes with the following log:

[2021-03-27 12:28:00.246 1188/T1192] SQLite3 message: API call with invalid database connection pointer (21)
[2021-03-27 12:28:00.246 1188/T1192] SQLite3 message: misuse at line 125209 of [ea80f3002f] (21)
[2021-03-27 12:28:00.246 1188/T1192] ERROR: SQL query "END TRANSACTION" failed: bad parameter or other API misuse
[2021-03-27 12:28:00.246 1188/T1192] SQLite3 message: API call with invalid database connection pointer (21)
[2021-03-27 12:28:00.246 1188/T1192] SQLite3 message: misuse at line 165161 of [ea80f3002f] (21)
[2021-03-27 12:28:00.246 1188/T1192] Error while trying to close database: bad parameter or other API misuse
[2021-03-27 12:28:00.246 1188/T1192] ERROR: Storing devices in network table failed: bad parameter or other API misuse
[2021-03-27 12:28:00.246 1188/T1192] SQLite3 message: API call with invalid database connection pointer (21)
[2021-03-27 12:28:00.246 1188/T1192] SQLite3 message: misuse at line 165161 of [ea80f3002f] (21)
[2021-03-27 12:28:00.246 1188/T1192] Error while trying to close database: bad parameter or other API misuse
[2021-03-27 12:29:00.329 1188/T1192] SQLite3 message: no such column: name in "UPDATE network_addresses SET name = NULL WHERE nameUpdated < 1616655540;" (1)
[2021-03-27 12:29:00.329 1188/T1192] ERROR: SQL query "UPDATE network_addresses SET name = NULL WHERE nameUpdated < 1616655540;" failed: SQL logic error

My environment is a VPS with 2 vCPUs and 2 GB of RAM running Ubuntu 18.04 LTS. I have now switched to the tweak/memory branch, and the system seems to work. I tried this because I was also facing the original issue and had been handling it by monitoring the network packets pending to be read: any buildup there for the pihole-FTL process indicates some kind of lock-up.

Just for reference, I have moved the pihole-FTL.db to tmpfs (which means the db is sitting in RAM all the time).
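
The post does not say how the pending packets were monitored. One possible approach on Linux, sketched here as an assumption, is to read the receive-queue column of `/proc/net/udp` (or `/proc/net/tcp`) for sockets bound to port 53. The helper name is hypothetical:

```python
def rx_queue_for_port(proc_net_text: str, port: int = 53) -> list[int]:
    """Return the receive-queue values of all sockets bound to the given local port.

    Expects text in the format of /proc/net/udp or /proc/net/tcp, where the
    second column is "hex address:hex port" and the fifth is "tx_queue:rx_queue".
    """
    queues = []
    for line in proc_net_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port:
            rx_hex = fields[4].split(":")[1]
            queues.append(int(rx_hex, 16))
    return queues
```

Feeding it the contents of `/proc/net/udp` at regular intervals and alerting when the value stays above zero for a while would approximate the buildup check described above.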

DL6ER commented 3 years ago

@readall Can you say whether this was a one-time issue or whether it is reproducible? That is, does the issue happen again if you switch from tweak/memory back to development?

I checked the code again. Are you sure the lines you posted are complete? I'm asking because there is a database action immediately in front of the END TRANSACTION and it doesn't make sense that the database pointer becomes incorrect throughout the process. There should be more messages.

readall commented 3 years ago

Hi @DL6ER, there may have been more messages, but I run the system with all logs disabled. I do this everywhere, as I want to run zero-log systems. I have just switched back to the development branch and will monitor it for a few hours. Last time the crash happened within just a few minutes; it has now run longer than that and looks good so far. If it crashes, I will try to post the logs here.

Update 1: It has now been running for almost 20 hours without issues.

Update 2: It is now running smoothly, so nothing to report.

bershanskiy commented 3 years ago

I have been on the development branch ever since the memory fixes were merged (a week ago, I think) and have not experienced any issues since then.

I'm looking forward to the next release. :)

derekcentrico commented 3 years ago

Agreed, same here, all is well. In fact, my LAN services (Plex, Emby, etc.) are actually functioning better; the issues connecting to local servers are gone.

DL6ER commented 3 years ago

Thanks for the feedback. We're waiting for dnsmasq v2.85, which is currently at rc2, but there will also be an rc3.

Pectojin commented 3 years ago

I also have no problems to report after 9 days on tweak/memory vDev-0795cf6 on my RPI B+.

0schr0eder commented 3 years ago

Good Day!

I was hoping for a build by now, but I understand why you guys are waiting. Can someone please point me to a build of the FTL dev-branch, or to high-level instructions on how to build my own release from the FTL dev-branch?

Thanks!

bershanskiy commented 3 years ago

@0schr0eder You can use this command: sudo pihole checkout ftl [branch]

If you are on Ubuntu, you can follow these instructions: https://docs.pi-hole.net/ftldns/compile/

Some pre-built binaries are uploaded here: https://ftl.pi-hole.net/

Also, FTL v5.8 nears release already.

0schr0eder commented 3 years ago

I would need to compile this on the Raspberry but this is a start and I can give it a shot. Thank you very much for the link!

dvdvideo1234 commented 3 years ago

How can I roll back to the previous version, though? Is it possible?

bershanskiy commented 3 years ago

@dvdvideo1234 There are multiple ways to install different versions: you can build from source, you can use the install script, or you can download pre-built binaries. You'll have better luck asking on https://discourse.pi-hole.net/

DL6ER commented 3 years ago

The pihole checkout ftl command should be preferred over compiling yourself or downloading binaries manually. It makes sure you only get tested binaries, which are validated to ensure no download error happened. You can even go back, but if you do, it is not guaranteed that everything still works (hint: usually, it does).

0schr0eder commented 3 years ago

Sweet! Would I choose the "development" branch?

bershanskiy commented 3 years ago

Sweet! Would I choose the "development" branch?

Since release/v5.8 is slightly ahead of development, wouldn't it make sense to use it instead? https://github.com/pi-hole/FTL/compare/development...release/v5.8

Edit: You might want to wait for v5.8.

DL6ER commented 3 years ago

Speaking of which, is there an ETA for 5.8?

Now.

DL6ER commented 3 years ago

The next version of FTL has been released. Please update and run

pihole checkout master

to get back on-track if you switched to a custom branch. The fix/feature branch you switched to will not receive any further updates.

Thanks for helping us to make Pi-hole better for us all!

If you have any issues, please either reopen this ticket or (preferably) create a new ticket describing the issues in further detail and only reference this ticket. This will help us to help you best.

dvdvideo1234 commented 3 years ago

It still does not blink for updates. I will wait a bit more.

Wait, now it says there is a new version... Installing... It cycles on "[i] Testing man page installation" and appears to be stuck, but maybe it needs more time... Done!

  Current Pi-hole version is v5.3.1.
  Current AdminLTE version is v5.5.
  Current FTL version is v5.8.

dvdvideo1234 commented 3 years ago

The whole system went down. Restarting it does nothing and I cannot SSH in. I will reinstall Ubuntu on the weekend.

binary-person commented 3 years ago

just spun up a fresh vps. ftl v5.8.1 works perfectly; 158,441 queries without any lagging or any out-of-the-blue hangs. thanks DL6ER for the big bug fix :pray:

dvdvideo1234 commented 3 years ago

@binary-person @DL6ER

VPS... You mean I can get it to boot again without reinstalling?

Thanks very much, guys. I am going to back up the memory card right now. I want to export the adlists and the allowed and blocked entries from the old Pi-hole and put them into the new one.

  1. What is the most efficient way to do that?
  2. Do I have to run a data conversion of some sort?

This will not work for me as Ubuntu does not boot anymore, so I can only copy files from the memory card to restore it.

Thank you !