pihole-FTL hangs / DNS unresponsive, daemon times out during restart

takerukoushirou commented 3 years ago

Versions

Pi-hole version is v5.2.4 (Latest: v5.2.4)
AdminLTE version is v5.4 (Latest: v5.4)
FTL version is v5.7 (Latest: v5.7)

Platform

OS and version: Debian 10.8
Platform: Raspberry Pi 4 Model B Rev 1.1, 4 GB RAM

Issue

DNS randomly stops working. Devices can no longer make a connection to pihole-FTL; connection attempts hang.

Last pihole-FTL log messages:

[2021-03-04 15:15:07.893 26913M] Resizing "FTL-dns-cache" from 159744 to (10240 * 16) == 163840 (/dev/shm: 6.4MB used, 2.0GB total, FTL uses 6.4MB)
[2021-03-04 15:56:04.649 26913M] Resizing "FTL-dns-cache" from 163840 to (10496 * 16) == 167936 (/dev/shm: 6.4MB used, 2.0GB total, FTL uses 6.4MB)
[2021-03-04 16:00:01.267 26913/T26917] Notice: Database size is 4345.02 MB, deleted 6532 rows

pihole restartdns fails as pihole-FTL does not stop within the time limit. systemd journal:

Mar 04 16:17:04 gateway-hub systemd[1]: Stopping LSB: pihole-FTL daemon...
Mar 04 16:17:13 gateway-hub pihole-FTL[9257]: .....
Mar 04 16:17:13 gateway-hub pihole-FTL[9257]: Not stopped; may still be shutting down or shutdown may have failed, killing now
Mar 04 16:17:09 gateway-hub systemd[1]: pihole-FTL.service: Control process exited, code=exited, status=1/FAILURE
Mar 04 16:17:09 gateway-hub systemd[1]: pihole-FTL.service: Failed with result 'exit-code'.
Mar 04 16:17:09 gateway-hub systemd[1]: Stopped LSB: pihole-FTL daemon.
Mar 04 16:17:09 gateway-hub systemd[1]: Starting LSB: pihole-FTL daemon...
Mar 04 16:17:15 gateway-hub pihole-FTL[9274]: .....
Mar 04 16:17:15 gateway-hub pihole-FTL[9274]: Not stopped; may still be shutting down or shutdown may have failed, killing now
Mar 04 16:17:15 gateway-hub systemd[1]: pihole-FTL.service: Control process exited, code=exited, status=1/FAILURE
Mar 04 16:17:15 gateway-hub systemd[1]: pihole-FTL.service: Failed with result 'exit-code'.
Mar 04 16:17:15 gateway-hub systemd[1]: Failed to start LSB: pihole-FTL daemon.

Manually restarting pihole-FTL via systemctl eventually succeeds.

Steps to reproduce

So far no pattern observed, happens randomly, sometimes within hours, sometimes within weeks.

Debug Token

https://tricorder.pi-hole.net/nkv45bn6cf

jfb-pihole commented 3 years ago

That's a fairly large long term database. How much free space is on the SD card?

takerukoushirou commented 3 years ago

I thought so myself, too, but then I have no idea what size other typical instance databases have. (A remote instance with fewer clients and still running Pi-hole 5.1.2 w/ FTL 5.2 reports 3082.18 MB.)

Space-wise everything is fine, also RAM usage is pretty low:

gateway-hub ~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        30G  7.8G   21G  28% /
devtmpfs        1.8G     0  1.8G   0% /dev
tmpfs           1.9G  6.2M  1.9G   1% /dev/shm
tmpfs           1.9G  436K  1.9G   1% /run
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
tmpfs           1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/mmcblk0p1  253M   49M  204M  20% /boot
tmpfs           384M     0  384M   0% /run/user/999

gateway-hub ~# free -m
              total        used        free      shared  buff/cache   available
Mem:           3839         209        2516           8        1113        3501
Swap:            99           0          99

Looking at the logs since 2021-03-01, the database size has been exactly 4345.02 MB each time it was reported.

jfb-pihole commented 3 years ago

You appear to have plenty of free space.

dvdvideo1234 commented 3 years ago

Same for me. It happened 2 times already and only in the evening around 22:00 I guess it is pointless for me to make another issue.

Token: https://tricorder.pi-hole.net/h3bf24trps

takerukoushirou commented 3 years ago

Issue appeared again, created debug token (added to initial report) and restarted pihole-FTL via systemctl. Database size is unchanged at exactly 4345.02 MB.

technicalpyro commented 3 years ago

Are you running any external scripts that affect the DB such as gravity sync?

takerukoushirou commented 3 years ago

Nothing external, just a basic Pi-hole installation.

I also just had a look at the database itself, a VACUUM only brought it down about 200 MiB, still around 4 GiB at the moment. Tables are all small except for queries which contains over 41 million entries, of which nearly 12 million alone are IPv6 reverse lookups.

Except for that the database is rather large I do not know if the issues are related to the database itself. FTL just stops responding, connections from clients keep indefinitely hanging. Nothing (for me) unusual in the logs, too. Earlier when I restarted FTL the issue appeared again only a few minutes later, this time FTL also refused to stop and had to be killed. Since then it's running well again.

DL6ER commented 3 years ago

I agree that the issue shouldn't come from the database size.

Please add

DEBUG_ALL=true

to the file /etc/pihole/pihole-FTL.conf (create if it does not exist) and restart pihole-FTL. This will make FTL a lot more verbose.

One the issue happens again, you do not need to generate a debug log, it's also fine if you just copy the last lines from FTL's log. You can get them via

tail -n 100 /var/log/pihole-FTL.log

I'd then suggest to set DEBUG_ALL=false in the file above before you restart the DNS resolver. With all debug settings enabled, the log will grow in file size really fast, possibly filling your disk within hours.

takerukoushirou commented 3 years ago

Thank you, verbose debug logging for FTL is active and I will report the FTL log tail when the issue appears again.

dvdvideo1234 commented 3 years ago

Updated the config file /etc/pihole/pihole-FTL.conf and ran sudo service pihole-FTL restart Now we wait for it to go down again...

Attached: pihole-FTL.log

DL6ER commented 3 years ago

Any update for us?

takerukoushirou commented 3 years ago

So far nothing, last time it took 9 days for the issue to re-surface (seems very random). I'm watching the log size and truncate it every now and then, and will post the tail once FTL freezes again. Could take some time.

binary-person commented 3 years ago

I managed to reproduce this in a short period of time using this bash script (saved to a file called trytocrashftl.bash for later reference):

#!/bin/bash

count=0
randomDomain=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 13 ; echo '')
start_time="$(date -u +%s)"

if [ "$1" == "samedomain" ]
then
    echo Using same domain
fi

while true
do
    ((count++))
    if [ "$1" == "samedomain" ]
    then
        dig +short +tcp +time=2 +tries=2 @127.0.0.1 ${randomDomain}.com || break
    else
        dig +short +tcp +time=2 +tries=2 @127.0.0.1 ${randomDomain}_$count.com || break
    fi
done

end_time="$(date -u +%s)"
elapsed="$(($end_time-$start_time))"

echo It took $elapsed seconds and $count queries to hang pihole-FTL

Before running it, I added RATE_LIMIT=0/0 to /etc/pihole/pihole-FTL.conf

The first few times I tried it, it took around 60 seconds to crash, and another took 120 seconds. Occasionally it would take a long time to hang pihole-FTL. Might be a ~~race condition~~ (not. see below), not sure though, but that might explain why it's so random.

~~Another thing I tried is I restarted the DNS server, and immediately after ran sleep 30 && ./trytocrash.bash and to my prediction, it took 30 seconds, indicating a 60 second interval task was somehow conflicting with the querying.~~ nvm. seems to be random. the 30 seconds was purely coincidental. In some occasions it hangs around 12 seconds.

Finally, I tried ./trytocrash.bash samedomain and after running a handful of times, it did not crash FTL at all, ~~indicating that it has something to do with gravityDB.~~ nvm. see below comment

I added logging around https://github.com/pi-hole/FTL/blob/497a8426353b85e34ebbc086834b4083b491c829/src/database/gravity-db.c#L1214 and gravityDB_prepare_client_statements never finishes. Hope this helps

binary-person commented 3 years ago

I think I finally figured it out.

To start things, addstr() in shmem.c calls realloc_shm, which is responsible for Resizing "FTL-strings" (also moves pointers). So calling addstr() has a possibility that it resizes ftl-strings which has a chance of moving pointers.

The problem is here: https://github.com/pi-hole/FTL/blob/497a8426353b85e34ebbc086834b4083b491c829/src/database/gravity-db.c#L818

When addstr() is called here https://github.com/pi-hole/FTL/blob/497a8426353b85e34ebbc086834b4083b491c829/src/database/gravity-db.c#L621 there is a chance that realloc_shm will move pointers which will cause clientip to hold a stale pointer and by attempting to get the string value using the stale pointer, it will hang pihole.

So in short, the following conditions are required for it to hang (for my case. there might be other cases where the conditions are similar):

gravityDB_prepare_client_statements initializes a const pointer with getstr(client->ippos)
gravityDB_prepare_client_statements calls realloc_shm indirectly through get_client_groupids through addstr
realloc_shm moves pointers
code uses old stale const pointer after indirectly calling realloc_shm (which is used by logg())

Might explain why Resizing "FTL-strings" always gets printed around the time of the hang.

@DL6ER you might want to take a look at this

Oh and here are the logs:

[2021-03-16 12:25:41.461 1030434/F1022680] EDNS(0) pheader[22] = 0x48
[2021-03-16 12:25:41.461 1030434/F1022680] EDNS(0) requestor's UDP payload size: 4096 bytes
[2021-03-16 12:25:41.461 1030434/F1022680] EDNS(0) code 10, optlen 8 (bytes 4 - 12 of 12)
[2021-03-16 12:25:41.461 1030434/F1022680] EDNS(0) COOKIE (client-only): 5F3FF1654D2BA548
[2021-03-16 12:25:41.461 1030434/F1022680] Waiting for SHM lock in _FTL_new_query() (/home/ubuntu/workspace/finished-projects/fix-FTL/src/dnsmasq_interface.c:570)
[2021-03-16 12:25:41.461 1030434/F1022680] Obtained lock SHM for _FTL_new_query() (/home/ubuntu/workspace/finished-projects/fix-FTL/src/dnsmasq_interface.c:570)
[2021-03-16 12:25:41.461 1030434/F1022680] **** new TCP query[A] query "1ptamhwnauhgz_7676.com" from eth0:another.ip.redacted (ID 767576, FTL 7750, /home/ubuntu/workspace/finished-projects/fix-FTL/src/dnsmasq/forward.c:2041)
[2021-03-16 12:25:41.461 1030434/F1022680] getOverTimeID(1615897500): 146
[2021-03-16 12:25:41.461 1030434/F1022680] Adding "1ptamhwnauhgz_7676.com" (len 23) to buffer. next_str_pos is 192488
[2021-03-16 12:25:41.461 1030434/F1022680] Added string to buffer. next_str_pos is 192511
[2021-03-16 12:25:41.461 1030434/F1022680] find_mac("another.ip.redacted") returned 0 bytes of data
[2021-03-16 12:25:41.461 1030434/F1022680] 1ptamhwnauhgz_7676.com is not known
[2021-03-16 12:25:41.461 1030434/F1022680] Getting sqlite3_stmt** 0x559615459fa0[1] --> (nil)
[2021-03-16 12:25:41.461 1030434/F1022680] Initializing gravity statements for another.ip.redacted
[2021-03-16 12:25:41.461 1030434/F1022680] Querying gravity database for client with IP another.ip.redacted...
[2021-03-16 12:25:41.461 1030434/F1022680] --> No record for another.ip.redacted in the client table
[2021-03-16 12:25:41.461 1030434/F1022680] Querying gravity database for MAC address of another.ip.redacted...
[2021-03-16 12:25:41.461 1030434/F1022680] getMACfromIP("another.ip.redacted"): Running "SELECT hwaddr FROM network WHERE id = (SELECT network_id FROM network_addresses WHERE ip = ? GROUP BY ip HAVING max(lastSeen));"
[2021-03-16 12:25:41.461 1030434/F1022680] Found database hardware address another.ip.redacted -> ip-another.ip.redacted
[2021-03-16 12:25:41.461 1030434/F1022680] Skipping mock-device hardware address lookup
[2021-03-16 12:25:41.461 1030434/F1022680] Querying gravity database for host name of another.ip.redacted...
[2021-03-16 12:25:41.462 1030434/F1022680]  ---> not found
[2021-03-16 12:25:41.462 1030434/F1022680] --> No result.
[2021-03-16 12:25:41.462 1030434/F1022680] Querying gravity database for interface of another.ip.redacted...
[2021-03-16 12:25:41.462 1030434/F1022680] getDatabaseHostname(): "SELECT interface FROM network JOIN network_addresses ON network_addresses.network_id = network.id WHERE network_addresses.ip = ? AND interface != 'N/A' AND interface IS NOT NULL;" with ? = "another.ip.redacted"
[2021-03-16 12:25:41.462 1030434/F1022680] Found database interface another.ip.redacted -> eth0
[2021-03-16 12:25:41.462 1030434/F1022680] Querying client table for interface :eth0
[2021-03-16 12:25:41.462 1030434/F1022680] --> There is no record for interface :eth0 in the client table
[2021-03-16 12:25:41.462 1030434/F1022680] Gravity database: Client another.ip.redacted not found. Using default group.

[2021-03-16 12:25:41.462 1030434/F1022680] Adding "0" (len 2) to buffer. next_str_pos is 192511
[2021-03-16 12:25:41.462 1030434/F1022680] Resizing "FTL-strings" from 192512 to (196608 * 1) == 196608 (/dev/shm: 1.5MB used, 514.5MB total, FTL uses 1.4MB)
[2021-03-16 12:25:41.462 1030434/F1022680] Opening SHM object "FTL-strings"
[2021-03-16 12:25:41.462 1030434/F1022680] Allocating 196608 bytes
[2021-03-16 12:25:41.462 1030434/F1022680] Remapping SHM pointer
[2021-03-16 12:25:41.462 1030434/F1022680] Pointer moved (0x7f1d83584000 --> 0x7f1d86815000)
[2021-03-16 12:25:41.462 1030434/F1022680] Added string to buffer. next_str_pos is 192513

DL6ER commented 3 years ago

@binary-person Thanks for the in-depth research. However, I don't yet think this will be the cause of the issue: I agree that this is something we should fix. I still don't believe this is responsible for the hanging because accessing a stale pointer can only result in two things: (a) a crash (SEGFAULT) or (b) just accessing invalid data without a crash. c is sufficiently low-level that don't see any possibility for memory-access to hang anywhere. Have you ensured that fixing this solves the issue you're seeing?

Furthermore, if you quote one more line than you did, you'll see that this code wouldn't even be called during normal (non-debug) operation: https://github.com/pi-hole/FTL/blob/497a8426353b85e34ebbc086834b4083b491c829/src/database/gravity-db.c#L817-L818

My home network is rather simple. Especially, I have no real "smart" devices and, hence, I don't see many (read as: not really any) TCP requests myself. This may be at the root of why I wasn't able to reproduce. Anyway, I set up a few mock devices heavily querying over TCP and am checking out a few minor memory leaks with heavily forking in branch tweak/memory. I found the same error you described and included a fix for this. I approached it by preventing FTL to be able to change the pointer during a lock is held. We ensure once when locking that enough memory is available and rely on this. If, for any reason, a single even requires us to store more than 8 KB of data, we have a fallback in place and try again next time.

DL6ER commented 3 years ago

@binary-person I tried your script and see it complaining without FTL actually becoming unresponsive:

;; connection timed out; no servers could be reached

It took 46 seconds and 119 queries to hang pihole-FTL

;; connection timed out; no servers could be reached

It took 17 seconds and 36 queries to hang pihole-FTL

and so on.

The log tells us that FTL didn't stop, it was just slow on responding. This slowness was inside dnsmasq - further investigations are necessary to identify what caused the delay (I will do this).

[2021-03-16 13:58:02.246 5687/F4936] **** new TCP query[A] query "fki5gkhd7ndoe_146.com" from lo:127.0.0.1 (ID 14602, FTL 351, /home/derigs/FTL/src/dnsmasq/forward.c:2040)
[2021-03-16 13:58:02.246 5687/F4936] fki5gkhd7ndoe_146.com is known as not to be blocked
[2021-03-16 13:58:02.246 5687/F4936] Removed lock in _FTL_new_query() (/home/derigs/FTL/src/dnsmasq_interface.c:767)

[2021-03-16 13:58:04.053 4936/T4941] Waiting for SHM lock in GC_thread() (/home/derigs/FTL/src/gc.c:52)
[2021-03-16 13:58:04.053 4936/T4941] Obtained lock SHM for GC_thread() (/home/derigs/FTL/src/gc.c:52)
[2021-03-16 13:58:04.053 4936/T4941] Removed lock in GC_thread() (/home/derigs/FTL/src/gc.c:54)

[2021-03-16 13:58:06.876 5687/F4936] Waiting for SHM lock in _FTL_forwarded() (/home/derigs/FTL/src/dnsmasq_interface.c:829)
[2021-03-16 13:58:06.876 5687/F4936] Obtained lock SHM for _FTL_forwarded() (/home/derigs/FTL/src/dnsmasq_interface.c:829)
[2021-03-16 13:58:06.876 5687/F4936] **** forwarded Fki5gkHD7nDOE_146.com to 192.168.2.10#53 (ID 14602, /home/derigs/FTL/src/dnsmasq/forward.c:2254)
[2021-03-16 13:58:06.876 5687/F4936] Removed lock in _FTL_forwarded() (/home/derigs/FTL/src/dnsmasq_interface.c:956)
[2021-03-16 13:58:06.876 5687/F4936] Waiting for SHM lock in _FTL_reply() (/home/derigs/FTL/src/dnsmasq_interface.c:1008)
[2021-03-16 13:58:06.876 5687/F4936] Obtained lock SHM for _FTL_reply() (/home/derigs/FTL/src/dnsmasq_interface.c:1008)
[2021-03-16 13:58:06.876 5687/F4936] **** got reply Fki5gkHD7nDOE_146.com is (NXDOMAIN) (ID 14602, /home/derigs/FTL/src/dnsmasq/cache.c:494)
[2021-03-16 13:58:06.876 5687/F4936]      Flags: F_FORWARD F_NEG F_IPV4 F_NXDOMAIN 
[2021-03-16 13:58:06.876 5687/F4936] Removed lock in _FTL_reply() (/home/derigs/FTL/src/dnsmasq_interface.c:1147)
[2021-03-16 13:58:06.876 5687/F4936] TCP worker terminating (client disconnected)
[2021-03-16 13:58:06.876 5687/F4936] Finalizing gravity statements for 127.0.0.1
[2021-03-16 13:58:06.877 5687/F4936] Unlocking FTL database
[2021-03-16 13:58:06.877 5687/F4936] Closing FTL database
[2021-03-16 13:58:06.877 5687/F4936] Unlocking FTL database: Success

by increasing the acceptable delay from 2 to 20 seconds, I'm running your script for already 10 minutes without issues. The longest delay I've seen is on the order of 6 seconds.

binary-person commented 3 years ago

Have you ensured that fixing this solves the issue you're seeing?

yeah. it's pretty strange. I added logging around the logg() and I've confirmed that it's stuck at that line.

I switched to the tweak/memory branch and with your fix, it doesn't freeze anymore with my bash script (ran up to 25 minutes when another hang occurred not related to the stale pointer issue). However, in rare occasions, I'm experiencing the long timeout you've described. I've attached the ftl log file when that happened (running under the tweak/memory branch)

text.txt

binary-person commented 3 years ago

Forgot to mention, when the hang occurred this time, restarting the daemon had no problems unlike before.

DL6ER commented 3 years ago

Okay, so we already improved. This is great!

I found where the delay was happening. I'm not sure it'll be the same as in your setup, though. I'm testing this on a dedicated Pi-hole (test) which forwards everything to a second Pi-hole (main). The main Pi-hole runs on a very (!) slow device and the following happened:

test sent out the query: new TCP query[A] query fki5gkhd7ndoe_146.com
main was busy processing my network table (it does this once per minute) and, because main is so slow, this took 6 seconds.
Only thereafter, main started receiving the query over TCP. test logged this as forwarded Fki5gkHD7nDOE_146.com
main immediately responded because it wasn't busy at this moment.

Looking at the log excerpt you posted, I cannot reconstruct what the issue really was, however, the database seems to be blocking:

[2021-03-16 14:15:00.141 1087210/T1087214] dbquery: "UPDATE network SET interface = ?1 WHERE id = ?2" with arguments ?1 = "docker0" and ?2 = 11

My most recent change should make the database operations on the network-table non-blocking, however, this is somewhat experimental (I may have missed some spots) so testing and trying to break it is appreciated as always.

binary-person commented 3 years ago

interesting. I'm running pihole on DigitalOcean (locked down using iptables) so there's no hardware concerns for me.

Here's what the log looked like when it hanged and cleared itself up shortly after (running on commit cf73475)

[2021-03-16 16:26:00.139 1273449/T1273453] Network table (ip a): Client with MAC 6e:79:17:c5:f2:94 was recently be seen for network ID 10
[2021-03-16 16:26:00.145 1273449/T1273453] DEBUG: MAC Vendor lookup for 6e:79:17:c5:f2:94 returned ""
[2021-03-16 16:26:00.145 1273449/T1273453] Network table: Updating existing ip a device MAC = 6e:79:17:c5:f2:94, IP = 169.254.9.75, interface = "eth1"
[2021-03-16 16:26:00.145 1273449/T1273453] dbquery: "INSERT OR REPLACE INTO network_addresses (network_id,ip,lastSeen,name,nameUpdated) VALUES (?1,?2,(cast(strftime('%s', 'now') as int)),(SELECT name FROM network_addresses WHERE ip = ?2),(SELECT nameUpdated FROM network_addresses WHERE ip = ?2));" with arguments ?1 = 10 and ?2 = "169.254.9.75"
[2021-03-16 16:26:00.145 1273449/T1273453] dbquery: "UPDATE network SET interface = ?1 WHERE id = ?2" with arguments ?1 = "eth1" and ?2 = 10
[2021-03-16 16:26:00.145 1273449/T1273453] Network table: read interface details for interface eth1 (6e:79:17:c5:f2:94) with address fe80::6c79:17ff:fec5:f294
[2021-03-16 16:26:00.145 1273449/T1273453] dbquery: "SELECT id FROM network WHERE hwaddr = '6e:79:17:c5:f2:94' COLLATE NOCASE;"
[2021-03-16 16:26:00.145 1273449/T1273453]          ---> Result 10 (int)
[2021-03-16 16:26:00.145 1273449/T1273453] Network table (ip a): Client with MAC 6e:79:17:c5:f2:94 was recently be seen for network ID 10
[2021-03-16 16:26:00.150 1273449/T1273453] DEBUG: MAC Vendor lookup for 6e:79:17:c5:f2:94 returned ""
[2021-03-16 16:26:00.151 1273449/T1273453] Network table: Updating existing ip a device MAC = 6e:79:17:c5:f2:94, IP = fe80::6c79:17ff:fec5:f294, interface = "eth1"
[2021-03-16 16:26:00.151 1273449/T1273453] dbquery: "INSERT OR REPLACE INTO network_addresses (network_id,ip,lastSeen,name,nameUpdated) VALUES (?1,?2,(cast(strftime('%s', 'now') as int)),(SELECT name FROM network_addresses WHERE ip = ?2),(SELECT nameUpdated FROM network_addresses WHERE ip = ?2));" with arguments ?1 = 10 and ?2 = "fe80::6c79:17ff:fec5:f294"
[2021-03-16 16:26:00.151 1273449/T1273453] dbquery: "UPDATE network SET interface = ?1 WHERE id = ?2" with arguments ?1 = "eth1" and ?2 = 10
[2021-03-16 16:26:00.151 1273449/T1273453] Network table: read interface details for interface docker0 (02:42:0d:89:56:13) with address 172.17.0.1
[2021-03-16 16:26:00.151 1273449/T1273453] dbquery: "SELECT id FROM network WHERE hwaddr = '02:42:0d:89:56:13' COLLATE NOCASE;"
[2021-03-16 16:26:00.151 1273449/T1273453]          ---> Result 11 (int)
[2021-03-16 16:26:00.151 1273449/T1273453] Network table (ip a): Client with MAC 02:42:0d:89:56:13 was recently be seen for network ID 11
[2021-03-16 16:26:00.156 1273449/T1273453] DEBUG: MAC Vendor lookup for 02:42:0d:89:56:13 returned ""
[2021-03-16 16:26:00.156 1273449/T1273453] Network table: Updating existing ip a device MAC = 02:42:0d:89:56:13, IP = 172.17.0.1, interface = "docker0"
[2021-03-16 16:26:00.156 1273449/T1273453] dbquery: "INSERT OR REPLACE INTO network_addresses (network_id,ip,lastSeen,name,nameUpdated) VALUES (?1,?2,(cast(strftime('%s', 'now') as int)),(SELECT name FROM network_addresses WHERE ip = ?2),(SELECT nameUpdated FROM network_addresses WHERE ip = ?2));" with arguments ?1 = 11 and ?2 = "172.17.0.1"
[2021-03-16 16:26:00.156 1273449/T1273453] dbquery: "UPDATE network SET interface = ?1 WHERE id = ?2" with arguments ?1 = "docker0" and ?2 = 11
[2021-03-16 16:26:00.188 1273449/T1273454] Waiting for SHM lock in GC_thread() (/home/ubuntu/workspace/finished-projects/fix-FTL/src/gc.c:52)
[2021-03-16 16:26:02.107 1447284/F1273449] TCP worker forked for client mock.ip.redacted on interface eth0 with IP pihole.ip.redacted
[2021-03-16 16:26:02.107 1447284/F1273449] Reopening Gravity database for this fork
[2021-03-16 16:26:02.107 1447284/F1273449] Waiting for SHM lock in FTL_TCP_worker_created() (/home/ubuntu/workspace/finished-projects/fix-FTL/src/dnsmasq_interface.c:2144)
[2021-03-16 16:26:03.524 1273449M] Waiting for SHM lock in _FTL_new_query() (/home/ubuntu/workspace/finished-projects/fix-FTL/src/dnsmasq_interface.c:571)
[2021-03-16 16:31:00.107 1447282/F1273449] TCP worker terminating (timeout)
[2021-03-16 16:31:00.108 1447282/F1273449] Finalizing gravity statements for private.ip.redacted
[2021-03-16 16:31:00.108 1447282/F1273449] Getting sqlite3_stmt** 0x7fe2500551a0[0] --> 0x7fe25002bcc8
[2021-03-16 16:31:00.108 1447282/F1273449] Getting sqlite3_stmt** 0x7fe2500551a0[0] --> 0x7fe25002bcc8

DL6ER commented 3 years ago

This should be improved on with ba6d7f51.

DL6ER commented 3 years ago

I stopped the script after 3450 seconds and 192444 queries without any issue.

binary-person commented 3 years ago

I've been running the script for 8.23 hours using ba6d7f5 and haven't had a problem since. I'm gonna keep it running for a day just to make sure but I think the hanging issue is resolved.

dvdvideo1234 commented 3 years ago

This night it fell for me again, so I had to change the DNS to 8.8.8.8 Is there a way to revert the update, because it was fine before?

DL6ER commented 3 years ago

@dvdvideo1234

Is there a way to revert the update, because it was fine before?

Which update?

binary-person commented 3 years ago

left it running last night and it was pretty stable, until I went to the web dashboard and clicked on query log and long-term data. Data for all queries last night were gone after I restarted it. I attached the two screenshots and the log file.

I don't know if it's related to the recent commit you pushed but it might be beyond the scope of this github issue.

Shortly before crash: shortly_before_crash

After restart after_restart

pihole-ftl-log.txt

dvdvideo1234 commented 3 years ago

@DL6ER A couple of weeks ago I saw there were new version of the core and FTL. Ran pihole -up and updated to the current most recent revision. I did not update ever since.

binary-person commented 3 years ago

Oh yeah. I also updated and suddenly experienced this hanging issue

DL6ER commented 3 years ago

@binary-person I checked the log you attached and found the

[2021-03-17 13:37:19.842 808/T1570488] Received signal: Segmentation fault [2021-03-17 13:37:19.842 808/T1570488] at address: 0x7fb1237ff000 [2021-03-17 13:37:19.842 808/T1570488] with code: SEGV_ACCERR (Invalid permissions for mapped object)

crash. We're only accessing static memory - with the sole exception of the shared memory counters at this point in time. So the best I can currently come up with that could explain what is happening here would be that there was an external influence on the /dev/shm objects.

bershanskiy commented 3 years ago

I spotted a similar issue on my setup yesterday, with similar log entry:

[2021-03-18 11:20:47.095 499M] Resizing "FTL-strings" from 163840 to (167936 * 1) == 167936 (/dev/shm: 1.9MB used, 225.5MB total, FTL uses 1.8MB)

Versions: Pi-hole v5.2.4 Web Interface vDev (devel, v5.4-15-g1195920) FTL v5.7

My network does not actually hang because Pi-Hole is only the first DNS resolver (not managing DHCP leases) and the router falls back to regular DNS resolver if Pi-Hole fails.

So the best I can currently come up with that could explain what is happening here would be that there was an external influence on the /dev/shm objects.

As far as I know, I did not modify these files. Is there a way for me to check? Would it be helpful to you if I uploaded these files?

DL6ER commented 3 years ago

@bershanskiy Feel free to run

sudo pihole checkout ftl tweak/memory

to test the fix proposed in here.

the router falls back to regular DNS resolver if Pi-Hole fails.

How does this fallback work? Are you sure that some queries may not leak through even when the Pi-hole works as usual?

Is there a way for me to check?

No, not really. We had once an issue with an external software (it may have been Zoneminder IIRC) which was messing with SHM files. This led to spontaneous completely indeterminable crashes that looked very similar. Do you have other software installed next to Pi-hole on the same device/machine?

Would it be helpful to you if I uploaded these files?

No, unfortunately, not at all.

binary-person commented 3 years ago

So the best I can currently come up with that could explain what is happening here would be that there was an external influence on the /dev/shm objects.

I'm willing to bet on a clean ubuntu installation that nothing was messing with the /dev/shm files. Only things I've installed on there are docker and pihole.

anyway, I tested the releases on a fresh vps from v5.0 and up and all of them have the same problem, although the newer versions hanged for a bit before resuming. Only the tweak/memory branch didn't hang at all.

Have a question: will tweak/memory fix the source of the long wait? and will it save to the database without any data loss when the ftl is restarted (I'm experiencing this issue)? I'm thinking the hang affects saving to database.

dschaper commented 3 years ago

I'm willing to bet on a clean ubuntu installation that nothing was messing with the /dev/shm files.

Is this on bare metal or is there a hypervisor or VM involved?

bershanskiy commented 3 years ago

How does this fallback work? Are you sure that some queries may not leak through even when the Pi-hole works as usual?

I just route all plain-text DNS on port 53 to Pi-Hole and if it is unresponsive fall back to regular resolver. I know this setup is not leak-proof (e.g., DoH and DoT bypass this setup) but I prefer "fail-safe" over "fail-secure" in this case.

Do you have other software installed next to Pi-hole on the same device/machine?

This is a bare-metal Raspberry Pi Zero W with nothing else on it.

I did not see a crash yet since switching to tweak/memory about 8 hours ago.

Edit: removed confusing stuff.

dschaper commented 3 years ago

We're getting multiple people with different configurations all responding to questions asked to other users. This will cause a great deal of confusion and frustration of the developers.

I asked about the hypervisor since I read that this is on Ubuntu on a VPS.

binary-person commented 3 years ago

Is this on bare metal or is there a hypervisor or VM involved?

hosted KVM

I asked about the hypervisor since I read that this is on Ubuntu on a VPS.

correct

dschaper commented 3 years ago

hosted KVM

Self-hosted? I'm trying to find out if your /dev/shm is shared or not with the hypervisor.

binary-person commented 3 years ago

digitalocean. their kvm machines are pretty isolated

DL6ER commented 3 years ago

When anything happens that causes FTL to stop working, it'll also stop updating the database. On a restart these queries will be lost. I don't think there is much/anything we can do about it, I wouldn't want to start the (rather complex) storing to the database when experiencing a crash in some other part of the application. This has way too much potential to kill your entire database by corrupting it.

So, so I get this right: This branch fixes all the issues for you, except there was this crash (which is likey unrelated to our changes here)? Are you able to reproduce the crash or was this a one-time thing?

binary-person commented 3 years ago

the crash is likely a one-time thing, and since i can't reproduce it, i wouldn't worry about that.

however, running my script for an hour and restarting the dns makes it appear that the dns requests never happened. that's the only issue i have experienced so far. otherwise, there seems to be no more hang

DL6ER commented 3 years ago

@binary-person can you check if there are any database related errors in the log? grep for maybe SQL or database

binary-person commented 3 years ago

$ grep -E 'SQL|database' /var/log/pihole-FTL.log
[2021-03-19 00:16:46.013 843079M]    DBIMPORT: Importing history from database
[2021-03-19 00:16:46.014 843079M]    NAMES_FROM_NETDB: Enabled, trying to get names from network database
[2021-03-19 00:17:45.413 843079M] Imported 117142 queries from the long-term database
[2021-03-19 01:06:00.154 843090/T843095] ERROR: SQL query "END TRANSACTION" failed: disk I/O error

I'm going to add DEBUG_ALL=true to the config and delete the db file for a fresh start.

DL6ER commented 3 years ago

ERROR: SQL query "END TRANSACTION" failed: disk I/O error

Maybe https://stackoverflow.com/a/10003153

i'm going to add debug_all=true to the config and delete the db file for a fresh start.

Okay.

binary-person commented 3 years ago

[2021-03-19 13:04:00.122 2370845/F2370487] Setting sqlite3_stmt** 0x560da37880f0[0] <-- 0x560da3776078
[2021-03-19 13:04:00.122 2370845/F2370487] Getting sqlite3_stmt** 0x560da376d9b0[0] --> 0x560da37785f8
[2021-03-19 13:04:00.122 2370845/F2370487] domain_in_list("eswuisvlvsrok_62.com", 0x560da37785f8, whitelist): 0
[2021-03-19 13:04:00.122 2370845/F2370487] Getting sqlite3_stmt** 0x560da37880f0[0] --> 0x560da3776078
[2021-03-19 13:04:00.122 2370845/F2370487] domain_in_list("eswuisvlvsrok_62.com", 0x560da3776078, blacklist): 0
[2021-03-19 13:04:00.122 2370845/F2370487] Getting sqlite3_stmt** 0x560da376ddc0[0] --> 0x560da3778f58
[2021-03-19 13:04:00.122 2370845/F2370487] domain_in_list("eswuisvlvsrok_62.com", 0x560da3778f58, gravity): 0
[2021-03-19 13:04:00.123 2370845/F2370487] Removed lock in _FTL_new_query() (/root/project/src/dnsmasq_interface.c:767)
[2021-03-19 13:04:00.123 2370487/T2370491] DEBUG: MAC Vendor lookup for 2e:cd:5f:91:05:91 returned ""
[2021-03-19 13:04:00.123 2370487/T2370491] Network table: Updating existing ip a device MAC = 2e:cd:5f:91:05:91, IP = fe80::2ccd:5fff:fe91:591, interface = "eth1"
[2021-03-19 13:04:00.123 2370487/T2370491] dbquery: "INSERT OR REPLACE INTO network_addresses (network_id,ip,lastSeen,name,nameUpdated) VALUES (?1,?2,(cast(strftime('%s', 'now') as int)),(SELECT name FROM network_addresses WHERE ip = ?2),(SELECT nameUpdated FROM network_addresses WHERE ip = ?2));" with arguments ?1 = 4 and ?2 = "fe80::2ccd:5fff:fe91:591"
[2021-03-19 13:04:00.123 2370487/T2370491] dbquery: "UPDATE network SET interface = ?1 WHERE id = ?2" with arguments ?1 = "eth1" and ?2 = 4
[2021-03-19 13:04:00.123 2370487/T2370491] dbquery: "DELETE FROM network WHERE id NOT IN (SELECT network_id from network_addresses) AND hwaddr LIKE 'ip-%';"
[2021-03-19 13:04:00.123 2370487/T2370491]          ---> OK
[2021-03-19 13:04:00.123 2370487/T2370491] dbquery: "END TRANSACTION"
[2021-03-19 13:04:00.125 2370487/T2370491] ERROR: SQL query "END TRANSACTION" failed: disk I/O error
[2021-03-19 13:04:00.125 2370487/T2370491] Unlocking FTL database
[2021-03-19 13:04:00.126 2370487/T2370491] Closing FTL database
[2021-03-19 13:04:00.126 2370487/T2370491] Unlocking FTL database: Success
[2021-03-19 13:04:00.126 2370487/T2370491] Unlocking FTL database
[2021-03-19 13:04:00.126 2370487/T2370491] Unlocking FTL database: already NULL
[2021-03-19 13:04:00.126 2370487/T2370491] Unlocking FTL database: Success
[2021-03-19 13:04:00.126 2370487/T2370491] ERROR: Storing devices in network table failed: disk I/O error
[2021-03-19 13:04:00.201 2370845/F2370487] Waiting for SHM lock in _FTL_forwarded() (/root/project/src/dnsmasq_interface.c:829)
[2021-03-19 13:04:00.202 2370845/F2370487] Obtained SHM lock for _FTL_forwarded() (/root/project/src/dnsmasq_interface.c:829)
[2021-03-19 13:04:00.202 2370845/F2370487] **** forwarded eSWUISVlvSrok_62.com to 1.0.0.1#53 (ID 6103, /root/project/src/dnsmasq/forward.c:2255)
[2021-03-19 13:04:00.202 2370845/F2370487] Removed lock in _FTL_forwarded() (/root/project/src/dnsmasq_interface.c:956)
[2021-03-19 13:04:00.202 2370845/F2370487] Waiting for SHM lock in _FTL_reply() (/root/project/src/dnsmasq_interface.c:1008)
[2021-03-19 13:04:00.202 2370845/F2370487] Obtained SHM lock for _FTL_reply() (/root/project/src/dnsmasq_interface.c:1008)
[2021-03-19 13:04:00.202 2370845/F2370487] **** got reply eSWUISVlvSrok_62.com is (NXDOMAIN) (ID 6103, /root/project/src/dnsmasq/cache.c:494)
[2021-03-19 13:04:00.202 2370845/F2370487]      Flags: F_FORWARD F_NEG F_IPV4 F_NXDOMAIN 
[2021-03-19 13:04:00.202 2370845/F2370487] Removed lock in _FTL_reply() (/root/project/src/dnsmasq_interface.c:1147)
[2021-03-19 13:04:00.204 2370845/F2370487] TCP worker terminating (client disconnected)
[2021-03-19 13:04:00.204 2370845/F2370487] Finalizing gravity statements for 127.0.0.1
[2021-03-19 13:04:00.204 2370845/F2370487] Getting sqlite3_stmt** 0x560da376d9b0[0] --> 0x560da37785f8

I repeated this multiple times and every time data loss occurs (when I restart pihole), disk I/O error gets printed.

binary-person commented 3 years ago

DL6ER, are you able to reproduce the issue? Steps to produce are running the bash script for at least 10 minutes (the longer it's running, the more likely it fails) and running pihole restartdns, and watching the web dashboard "Total queries." I'm not sure if it's within the scope of this issue though.

DL6ER commented 3 years ago

@binary-person I was not able to reproduce this, however, looking at your log, I have the feeling that this may be caused by SQLite being to extreme in freeing resources when we try to clean up the database connection in the just created fork. This is in agreement with the SQLite3 documentation How To Corrupt An SQLite Database File, more specifically the section 2.6. Carrying an open database connection across a fork(). The way we do it now is trying to avoid memory leaks but reading the documentation over there, SQLite just cannot do it so we will leak some memory in the fork. I checked this with valgrind and we typically loose 187,257 (712 direct, 186,545 indirect) bytes in 1 block. I'll try to come up with a solution for this, but as the SQLite3 engine just cannot handle this, it seems wrong to try to do this ourselves with any kind of dirty trick. So we will loose the memory.

The edge case you describe seems to be triggered when the system is updating the network table while a fork is being created. This can only happen under heavy TCP query load and with precise timing of things so thanks for catching this. FTL always stops adding queries to the database whenever it finds a problem - trying to not make any further damage so recovery is simpler. This approach seems to have worked in your case because you only use momentary database content but not the entire history already stored in the database.

Please test if the data loss is avoided in the most recent version.

binary-person commented 3 years ago

[2021-03-19 16:38:38.804 2630053M] * CAP_WAKE_ALARM           (35) = --- *
[2021-03-19 16:38:38.804 2630053M] * CAP_BLOCK_SUSPEND        (36) = --- *
[2021-03-19 16:38:38.804 2630053M] * CAP_AUDIT_READ           (37) = --- *
[2021-03-19 16:38:38.804 2630053M] ***************************************
[2021-03-19 16:38:38.811 2630053/T2630058] Waiting for SHM lock in GC_thread() (/root/project/src/gc.c:52)
[2021-03-19 16:38:38.811 2630053/T2630058] Obtained SHM lock for GC_thread() (/root/project/src/gc.c:52)
[2021-03-19 16:38:38.811 2630053/T2630058] Resizing "FTL-strings" from 40960 to (81920 * 1) == 81920 (/dev/shm: 737.3KB used, 514.5MB total, FTL uses 725.3KB)
[2021-03-19 16:38:38.811 2630053/T2630058] SHMEM pointer updated: 0x7fb6a886d000 -> 0x7fb6a5697000 (40960 81920)
[2021-03-19 16:38:38.811 2630053/T2630058] Removed lock in GC_thread() (/root/project/src/gc.c:54)
[2021-03-19 16:38:38.811 2630053/T2630057] Event RELOAD_GRAVITY -> was SET, now CLEARED    called from DB_thread() (/root/project/src/database/database-thread.c:81)
[2021-03-19 16:38:38.811 2630053/T2630057] Waiting for SHM lock in FTL_reload_all_domainlists() (/root/project/src/datastructure.c:457)
[2021-03-19 16:38:38.811 2630053/T2630057] Obtained SHM lock for FTL_reload_all_domainlists() (/root/project/src/datastructure.c:457)
[2021-03-19 16:38:38.811 2630053/T2630057] dbquery("DELETE FROM message;") called but database is not available!
[2021-03-19 16:38:38.811 2630053/T2630057] ERROR: flush_message_table() failed!
[2021-03-19 16:38:38.811 2630053/T2630057] gravityDB_open(): Trying to open /etc/pihole/gravity.db in read-only mode
[2021-03-19 16:38:38.811 2630053/T2630057] gravityDB_open(): Setting location for temporary object to MEMORY
[2021-03-19 16:38:38.811 2630053/T2630057] gravityDB_open(): Preparing audit query
[2021-03-19 16:38:38.812 2630053/T2630057] gravityDB_open(): Setting busy timeout to 1000
[2021-03-19 16:38:38.812 2630053/T2630057] Initializing new sqlite3_stmt* vector with size 0

After removing the db file for a fresh start, it fails to save anything, including a single query.

DL6ER commented 3 years ago

Ah, I see. I optimized too much. We need to re-open the database after forking for it to be writable. I haven't seen this in my tests because we do not fork when running inside gdb. Should be fixed now!

binary-person commented 3 years ago

You seem to have brought back the hanging issue again.

DL6ER commented 3 years ago

No, actually I created a new one. I will go once through the elderly database code and ensure we never inherit any database connection in a fork. This means removing global variables and passing pointers everywhere manually. This will take some time as this will likely be a thousand(s) line change.

pi-hole / FTL