pi-hole / FTL

The Pi-hole FTL engine
https://pi-hole.net

DNS resolving stops during the night #1082

Closed zod1988 closed 3 years ago

zod1988 commented 3 years ago

Versions

Platform

Expected behavior

Pi-hole runs constantly on my Synology and was very reliable up until the last version, meaning: no reboots, no maintenance, nothing.

Actual behavior / bug

Sometimes during the night, Pi-hole stops resolving DNS entirely. Webpages no longer load and devices can no longer connect.

If you take a look at the query log for that time, you can see that no queries were processed after 03:49:45.

After restarting the resolver or rebooting Pi-hole completely, everything works again for about a week.

Steps to reproduce

Steps to reproduce the behavior:

  1. Set up Pi-hole inside an Ubuntu VM on a Synology DiskStation
  2. Leave it running

Debug Token

Screenshots

(screenshot attachment: 8A18D970-65E2-402D-8DDF-C0AC2A519670)

Additional context

MichaIng commented 3 years ago

If I remember correctly, FTL does not have any issues when logs are cleared ~atomically. We had some special handling for the RAM log clearing script in place some years ago, but it turned out (discussed here somewhere) that this was no longer required, so we removed it. Logs are cleared via a simple empty redirect, > pihole-FTL.log, which should be close to atomic on a tmpfs.

I saw one DietPi script cleaning the log at the same time Pi-hole was rotating the log.

We can try to verify this:

cp -a /var/log/pihole-FTL_full.log /var/log/pihole-FTL.log
cp -a /var/log/pihole_full.log var/log/pihole.log
pihole flush once quiet &
> /var/log/pihole-FTL.log &
> /var/log/pihole.log

Run this in a loop, probably with a minimal delay after the pihole call to allow for shell startup and the steps leading up to the actual log file clearing. It is probably better to invoke the final logrotate call directly, to skip the Pi-hole scripts wrapping it:

logrotate --force /etc/pihole/logrotate
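
Such a stress loop could look roughly like this (an untested sketch; it assumes the *_full.log copies from above exist and that a short sleep lets the flush script reach its clearing step):

for i in $(seq 1 100); do
    cp -a /var/log/pihole-FTL_full.log /var/log/pihole-FTL.log
    cp -a /var/log/pihole_full.log /var/log/pihole.log
    pihole flush once quiet &     # clears logs via the Pi-hole script
    sleep 0.1                     # give the script time to reach the clearing step
    > /var/log/pihole-FTL.log     # concurrent plain truncation
    > /var/log/pihole.log
    wait                          # let the background flush finish before the next round
done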

Another approach: as Pi-hole's cron job clears logs exactly at midnight, on DietPi one can use dietpi-cron to change the hourly cron job time to something like :15, so that collisions are ruled out. The default is :17, by the way, so it would need to be changed manually to have a chance that those run concurrently.

But actually, I do not see a chance that this causes an FTL issue:

I'm not sure how logrotate behaves on write collisions, but the worst that can happen is a malformed log file. While pihole -t might then fail to parse affected lines in /var/log/pihole.log, I'm not sure whether FTL would mind in any way, as it just goes on writing its logs without reading them anyway. EDIT: Ah, actually logrotate should not be installed as long as RAMlog is. It is not by default on DietPi, at least, as it makes no sense to run two log clearing systems, and to rotate logs in a RAMlog 😉. In this case the Pi-hole log clearing is done via:

cp -p /var/log/pihole.log /var/log/pihole.log.1
echo " " > /var/log/pihole.log
chmod 644 /var/log/pihole.log

So logrotate can be further ruled out, and I'm pretty sure that those shell commands handle concurrency gracefully; at least I have never seen any issue that would indicate otherwise.

Since the pihole flush call does not create any kind of lock that could be respected/checked by FTL, I also do not think that FTL has any issues when trying to log something while either logrotate or > /var/log/pihole-FTL.log writes to the files.

Looking at pihole updatechecker local (running every 10 minutes; actually, what for?), it does nearly nothing aside from a few git calls to check the local branch and version. The remote version check (running once a day) does three curl calls, nothing else. I do not believe that those two cron jobs have any chance to crash anything, and again, it's a completely independent process, not related to the FTL service. Those also do not write to the Pi-hole log files, only to the system log/journal via the cron service.

I'd have another look into journalctl, which combines all system and service logs, to check entries around the time when DNS stopped working.

stphn17 commented 3 years ago

Hello, I have the same problem. For a few days now, Pi-hole has been stopping at random times during the night. Tonight it was 2 am, yesterday it was 4 am, etc.

Here's the debug log: https://tricorder.pi-hole.net/eXPr7fn3/

MichaIng commented 3 years ago

Please additionally check, and if applicable paste, the system logs from those times:

journalctl
stphn17 commented 3 years ago

It's very strange, but journalctl starts and ends exactly between two times I had to restart FTL. DNS stopped working on Sep 4th at around 04:00 and I restarted the DNS server at around 06:30. Then it stopped working again on Sep 5th at around 23:00; however, the log only shows data until 17:36.

Seems that at least with journalctl I cannot fetch the logs between the outages.

Is there something else I can do? journalctl.txt

MichaIng commented 3 years ago

Sometimes the journalctl output is cut off when the journal is rotated. I also didn't get the logic behind that 😄. Please try it with reversed output:

journalctl -r
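
If the output is still cut off, the journal can also be queried for the exact window directly (standard journalctl options; adjust the timestamps to the outage in question):

journalctl --since "2021-09-05 22:00:00" --until "2021-09-06 02:00:00"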
stphn17 commented 3 years ago

ok, that indeed gave the correct timeframe from today: journalctl2.txt

DNS stopped working somewhere around 01:30 and I restarted it at around 06:30. (screenshot attached)

MichaIng commented 3 years ago

This looks strange. EDIT: Or is this some normal sandboxing behaviour, triggered by su? At least I've never seen such before.

Sep 07 01:34:17 pihole-rpi systemd[1]: Removed slice User Slice of UID 999.
Sep 07 01:34:17 pihole-rpi systemd[1]: Stopped User Runtime Directory /run/user/999.
Sep 07 01:34:17 pihole-rpi systemd[1]: user-runtime-dir@999.service: Succeeded.
Sep 07 01:34:17 pihole-rpi systemd[1]: run-user-999.mount: Succeeded.
Sep 07 01:34:16 pihole-rpi systemd[1]: Stopping User Runtime Directory /run/user/999...
Sep 07 01:34:16 pihole-rpi systemd[1]: Stopped User Manager for UID 999.
Sep 07 01:34:16 pihole-rpi systemd[1]: user@999.service: Succeeded.
Sep 07 01:34:16 pihole-rpi systemd[3035]: Reached target Exit the Session.
Sep 07 01:34:16 pihole-rpi systemd[3035]: Started Exit the Session.
Sep 07 01:34:16 pihole-rpi systemd[3035]: systemd-exit.service: Succeeded.
Sep 07 01:34:16 pihole-rpi systemd[3035]: Reached target Shutdown.
Sep 07 01:34:16 pihole-rpi systemd[3035]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Sep 07 01:34:16 pihole-rpi systemd[3035]: gpg-agent-extra.socket: Succeeded.
Sep 07 01:34:16 pihole-rpi systemd[3035]: Stopped target Timers.
Sep 07 01:34:16 pihole-rpi systemd[3035]: Stopped target Paths.
Sep 07 01:34:16 pihole-rpi systemd[3035]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Sep 07 01:34:16 pihole-rpi systemd[3035]: gpg-agent-ssh.socket: Succeeded.
Sep 07 01:34:16 pihole-rpi systemd[3035]: Closed GnuPG cryptographic agent and passphrase cache.
Sep 07 01:34:16 pihole-rpi systemd[3035]: gpg-agent.socket: Succeeded.
Sep 07 01:34:16 pihole-rpi systemd[3035]: Closed GnuPG network certificate management daemon.
Sep 07 01:34:16 pihole-rpi systemd[3035]: dirmngr.socket: Succeeded.
Sep 07 01:34:16 pihole-rpi systemd[3035]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
Sep 07 01:34:16 pihole-rpi systemd[3035]: gpg-agent-browser.socket: Succeeded.
Sep 07 01:34:16 pihole-rpi systemd[3035]: Stopped target Sockets.
Sep 07 01:34:16 pihole-rpi systemd[3035]: Stopped target Basic System.
Sep 07 01:34:16 pihole-rpi systemd[3035]: Stopped target Default.
Sep 07 01:34:16 pihole-rpi systemd[1]: Stopping User Manager for UID 999...
Sep 07 01:34:03 pihole-rpi systemd[1]: session-c61.scope: Succeeded.

Does this Pi-hole run in a container? If so, it has decided to shut down: not the host system, but the container system, running with a separate systemd instance, systemd[3035].

At ~06:30, a new instance of such a systemd sub-process starts up for the pihole user:

Sep 07 06:33:53 pihole-rpi pihole-FTL[12705]: FTL started!
Sep 07 06:33:45 pihole-rpi systemd[1]: Started Session c66 of user pihole.
Sep 07 06:33:45 pihole-rpi systemd[12728]: Startup finished in 721ms.
Sep 07 06:33:45 pihole-rpi systemd[12728]: Reached target Default.
Sep 07 06:33:45 pihole-rpi systemd[1]: Started User Manager for UID 999.
Sep 07 06:33:45 pihole-rpi systemd[12728]: Reached target Basic System.
Sep 07 06:33:45 pihole-rpi systemd[12728]: Reached target Sockets.
Sep 07 06:33:45 pihole-rpi systemd[12728]: Reached target Timers.
Sep 07 06:33:45 pihole-rpi systemd[12728]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
Sep 07 06:33:45 pihole-rpi systemd[12728]: Listening on GnuPG network certificate management daemon.
Sep 07 06:33:45 pihole-rpi systemd[12728]: Listening on GnuPG cryptographic agent and passphrase cache.
Sep 07 06:33:45 pihole-rpi systemd[12728]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Sep 07 06:33:45 pihole-rpi systemd[12728]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Sep 07 06:33:45 pihole-rpi systemd[12728]: Reached target Paths.
Sep 07 06:33:45 pihole-rpi systemd[12728]: gpgconf: fatal error (exit status 1)
Sep 07 06:33:45 pihole-rpi systemd[12728]: gpgconf: running /usr/bin/gpg-agent failed (exitcode=2): General error
Sep 07 06:33:44 pihole-rpi systemd[12728]: pam_unix(systemd-user:session): session opened for user pihole by (uid=0)
Sep 07 06:33:44 pihole-rpi systemd[1]: Starting User Manager for UID 999...
Sep 07 06:33:44 pihole-rpi systemd[1]: Started User Runtime Directory /run/user/999.
Sep 07 06:33:44 pihole-rpi systemd[1]: Starting User Runtime Directory /run/user/999...
Sep 07 06:33:44 pihole-rpi systemd[1]: Created slice User Slice of UID 999.
Sep 07 06:33:44 pihole-rpi su[12723]: pam_unix(su:session): session opened for user pihole by (uid=0)

If this were a Docker container, I'd expect the Docker and containerd services to show up in the system logs, so this one doesn't look like one of the official appliances.


EDIT2: Okay, this seems to be common when running a systemd unit in user mode. But this shouldn't happen with the regular Pi-hole SysV service, should it?

And another question is why the related session scope stopped (session-c61.scope: Succeeded.). Please check the Pi-hole FTL logs as well:

cat /var/log/pihole-FTL.log

The dnsmasq logs may not be related, but it doesn't hurt to have a look into them as well:

cat /var/log/pihole.log
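
For a long log, it may help to filter for the minutes around the outage rather than reading the whole file, for example (adjust the date/hour pattern to the night in question):

grep 'Sep  7 01:3' /var/log/pihole.log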
stphn17 commented 3 years ago

No, I am not using Docker or anything. The Pi-hole instance is running on a Raspberry Pi 1B, which runs Raspbian; I don't know if it's the full or the lite version, most likely "lite".

stphn17 commented 3 years ago

the FTL log is here: pihole-FTL.txt

in the dnsmasq log I didn't find anything "suspicious"; it seems to simply stop working and then begin again:

Sep  7 01:33:53 dnsmasq[3051]: gravity blocked device-metrics-us.amazon.com is 0.0.0.0
Sep  7 01:34:00 dnsmasq[3051]: query[A] daypass.api-glb-fra.smoot.apple.com from 192.168.178.54
Sep  7 01:34:02 dnsmasq[3051]: forwarded daypass.api-glb-fra.smoot.apple.com to 2001:4dd0:200:304:53:2:0:3
Sep  7 01:34:02 dnsmasq[3051]: query[type=65] doh.dns.apple.com from 192.168.178.54
Sep  7 01:34:02 dnsmasq[3051]: forwarded doh.dns.apple.com to 2001:4dd0:200:304:53:2:0:3
Sep  7 01:34:02 dnsmasq[3051]: query[AAAA] doh.dns.apple.com from 192.168.178.54
Sep  7 01:34:02 dnsmasq[3051]: forwarded doh.dns.apple.com to 2001:4dd0:200:304:53:2:0:3
Sep  7 01:34:02 dnsmasq[3051]: query[A] doh.dns.apple.com from 192.168.178.54
Sep  7 01:34:02 dnsmasq[3051]: forwarded doh.dns.apple.com to 2001:4dd0:200:304:53:2:0:3
Sep  7 01:34:02 dnsmasq[3051]: query[type=65] p101-caldav.icloud.com from 192.168.178.54
Sep  7 01:34:02 dnsmasq[3051]: forwarded p101-caldav.icloud.com to 2001:4dd0:200:304:53:2:0:3
Sep  7 01:34:02 dnsmasq[3051]: query[AAAA] p101-caldav.icloud.com from 192.168.178.54
Sep  7 01:34:02 dnsmasq[3051]: forwarded p101-caldav.icloud.com to 2001:4dd0:200:304:53:2:0:3
Sep  7 01:34:02 dnsmasq[3051]: reply daypass.api-glb-fra.smoot.apple.com is NXDOMAIN
Sep  7 01:34:02 dnsmasq[3051]: query[A] p101-caldav.icloud.com from 192.168.178.54
Sep  7 01:34:02 dnsmasq[3051]: forwarded p101-caldav.icloud.com to 2001:4dd0:200:304:53:2:0:3
Sep  7 01:34:02 dnsmasq[3051]: query[type=65] apple.com from 192.168.178.54
Sep  7 01:34:02 dnsmasq[3051]: forwarded apple.com to 2001:4dd0:200:304:53:2:0:3
Sep  7 01:34:02 dnsmasq[3051]: reply doh.dns.apple.com is <CNAME>
Sep  7 01:34:02 dnsmasq[3051]: reply doh.dns.apple.com.v.aaplimg.com is 2a01:b740:a04:4000::1c2
Sep  7 01:34:02 dnsmasq[3051]: reply doh.dns.apple.com.v.aaplimg.com is 2a01:b740:a10:3000::1c2
Sep  7 01:34:02 dnsmasq[3051]: reply doh.dns.apple.com.v.aaplimg.com is 2a01:b740:a10:3000::b2
Sep  7 01:34:02 dnsmasq[3051]: reply doh.dns.apple.com.v.aaplimg.com is 2a01:b740:a10:4000::b2
Sep  7 01:34:02 dnsmasq[3051]: reply doh.dns.apple.com.v.aaplimg.com is 2620:171:80c::1
Sep  7 01:34:02 dnsmasq[3051]: reply doh.dns.apple.com.v.aaplimg.com is 2620:171:80d::1
Sep  7 01:34:02 dnsmasq[3051]: query[AAAA] apple.com from 192.168.178.54
Sep  7 01:34:02 dnsmasq[3051]: forwarded apple.com to 2001:4dd0:200:304:53:2:0:3
Sep  7 06:33:53 dnsmasq[12744]: started, version pi-hole-2.85 cachesize 10000
Sep  7 06:33:53 dnsmasq[12744]: compile time options: IPv6 GNU-getopt no-DBus no-UBus no-i18n IDN DHCP DHCPv6 Lua TFTP no-conntrack ipset auth cryptohash DNSSEC loop-detect inotify dumpfile
Sep  7 06:33:53 dnsmasq[12744]: using only locally-known addresses for domain use-application-dns.net
Sep  7 06:33:53 dnsmasq[12744]: using nameserver 192.168.178.1#53 for domain fritz.box
Sep  7 06:33:53 dnsmasq[12744]: using nameserver 192.168.178.1#53 for domain 178.168.192.in-addr.arpa
Sep  7 06:33:53 dnsmasq[12744]: using nameserver 2001:4dd0:100:1020:53:2:0:2#53
Sep  7 06:33:53 dnsmasq[12744]: using nameserver 2001:4dd0:200:304:53:2:0:3#53
Sep  7 06:33:53 dnsmasq[12744]: using nameserver 213.168.112.60#53
Sep  7 06:33:53 dnsmasq[12744]: using nameserver 81.173.194.68#53
Sep  7 06:33:53 dnsmasq[12744]: read /etc/hosts - 5 addresses
Sep  7 06:33:53 dnsmasq[12744]: read /etc/pihole/custom.list - 5 addresses
Sep  7 06:33:53 dnsmasq[12744]: read /etc/pihole/local.list - 4 addresses
Sep  7 06:33:53 dnsmasq[12744]: query[AAAA] login.live.com.fritz.box from 192.168.178.21
Sep  7 06:33:53 dnsmasq[12744]: forwarded login.live.com.fritz.box to 192.168.178.1
Sep  7 06:33:53 dnsmasq[12744]: query[AAAA] diagnostics.meethue.com from 192.168.178.43
Sep  7 06:33:53 dnsmasq[12744]: forwarded diagnostics.meethue.com to 2001:4dd0:100:1020:53:2:0:2
Sep  7 06:33:53 dnsmasq[12744]: forwarded diagnostics.meethue.com to 2001:4dd0:200:304:53:2:0:3
Sep  7 06:33:53 dnsmasq[12744]: forwarded diagnostics.meethue.com to 213.168.112.60
Sep  7 06:33:53 dnsmasq[12744]: forwarded diagnostics.meethue.com to 81.173.194.68
Sep  7 06:33:53 dnsmasq[12744]: reply login.live.com.fritz.box is NXDOMAIN
Sep  7 06:33:53 dnsmasq[12744]: query[A] login.live.com.fritz.box from fd00::1c15:9e1a:70be:7626
Sep  7 06:33:53 dnsmasq[12744]: forwarded login.live.com.fritz.box to 192.168.178.1
Sep  7 06:33:53 dnsmasq[12744]: reply diagnostics.meethue.com is <CNAME>
Sep  7 06:33:53 dnsmasq[12744]: reply cluster-web.meethue.com is NODATA-IPv6
Sep  7 06:33:53 dnsmasq[12744]: reply login.live.com.fritz.box is NXDOMAIN
Sep  7 06:33:53 dnsmasq[12744]: query[A] diagnostics.meethue.com from 192.168.178.43
Sep  7 06:33:53 dnsmasq[12744]: cached diagnostics.meethue.com is <CNAME>
Sep  7 06:33:53 dnsmasq[12744]: forwarded diagnostics.meethue.com to 2001:4dd0:200:304:53:2:0:3
Sep  7 06:33:53 dnsmasq[12744]: query[AAAA] userlocation.googleapis.com.fritz.box from fd00::1c15:9e1a:70be:7626
Sep  7 06:33:53 dnsmasq[12744]: forwarded userlocation.googleapis.com.fritz.box to 192.168.178.1
Sep  7 06:33:53 dnsmasq[12744]: reply userlocation.googleapis.com.fritz.box is NXDOMAIN
Sep  7 06:33:53 dnsmasq[12744]: reply diagnostics.meethue.com is <CNAME>
MichaIng commented 3 years ago

Before the "shutdown", FTL logged repeatedly:

[2021-09-07 01:32:03.143 3051/T3056] WARN: Command "ip neigh show" failed!
[2021-09-07 01:32:03.144 3051/T3056]       Message: Success

Can you try to run this manually on the console:

ip n s
sudo -u pihole ip n s
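
Capturing the exit status as well might show how the command fails for the pihole user (a simple check; a non-zero status or an error on stderr would mirror FTL's warning):

sudo -u pihole ip neigh show; echo "exit status: $?"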

Since it's the armv6hf binary, it is an RPi 1 or another RPi (I'm not sure whether the installer pulls armv6hf on all Raspbian systems). Is this the official Raspberry Pi OS, installed via the official installer?

Since I'm curious how this service got started with that user manager wrapper, can you also show:

systemctl status pihole-FTL
stphn17 commented 3 years ago

ok, I ran both commands; the output is identical, though:

pi@pihole-rpi:~ $ ip n s
192.168.178.26 dev eth0 lladdr bc:5f:f4:9b:28:66 STALE
192.168.178.56 dev eth0 lladdr f2:5b:0c:50:d9:b1 STALE
192.168.178.40 dev eth0 lladdr 00:24:e4:28:fe:6e STALE
192.168.178.54 dev eth0 lladdr 3a:2c:63:23:71:23 REACHABLE
192.168.178.1 dev eth0 lladdr e0:28:6d:3e:6b:8e STALE
192.168.178.76 dev eth0 lladdr 08:12:a5:71:14:e2 DELAY
192.168.178.45 dev eth0 lladdr 98:da:c4:08:70:5b STALE
192.168.178.36 dev eth0 lladdr 32:3a:fd:a9:12:ba STALE
192.168.178.43 dev eth0 lladdr 00:17:88:70:5d:da STALE
192.168.178.20 dev eth0 lladdr 50:56:bf:8c:fd:7b STALE
192.168.178.27 dev eth0 lladdr b0:05:94:7b:02:cd STALE
192.168.178.4 dev eth0 lladdr 02:42:c0:a8:b2:04 STALE
192.168.178.41 dev eth0 lladdr 98:da:c4:08:6b:9f STALE
192.168.178.48 dev eth0 lladdr 66:8a:1f:9c:5e:a7 REACHABLE
192.168.178.100 dev eth0 lladdr 00:11:32:b4:e7:a7 STALE
192.168.178.55 dev eth0 lladdr f8:54:b8:77:17:c0 STALE
192.168.178.46 dev eth0 lladdr 1c:3b:f3:fb:3f:f1 STALE
192.168.178.53 dev eth0 lladdr b2:17:fc:a5:1b:81 STALE
192.168.178.23 dev eth0 lladdr 40:b4:cd:6b:e6:79 STALE
192.168.178.30 dev eth0 lladdr b0:5a:da:f6:13:ba STALE
192.168.178.21 dev eth0 lladdr a8:db:03:89:2c:d3 STALE
192.168.178.28 dev eth0 lladdr 00:e1:8c:a3:5d:f9 REACHABLE
fd00::e228:6dff:fe3e:6b8e dev eth0 lladdr e0:28:6d:3e:6b:8e router STALE
2001:4dd6:9d12:0:e228:6dff:fe3e:6b8e dev eth0 lladdr e0:28:6d:3e:6b:8e router STALE
fd00::9d6c:875d:156:d04b dev eth0 lladdr f2:5b:0c:50:d9:b1 STALE
fe80::303a:fdff:fea9:12ba dev eth0  FAILED
fd00::c18e:6fc4:7c70:483e dev eth0  FAILED
fe80::d102:6336:4940:c7ac dev eth0 lladdr a8:db:03:89:2c:d3 STALE
2001:4dd1:78f8:0:e228:6dff:fe3e:6b8e dev eth0 lladdr e0:28:6d:3e:6b:8e router STALE
fe80::106d:17c2:b450:d726 dev eth0 lladdr 3a:2c:63:23:71:23 STALE
fe80::211:32ff:feb4:e7a7 dev eth0 lladdr 00:11:32:b4:e7:a7 STALE
fd00::ac53:8cf6:cbda:75cb dev eth0 lladdr f2:5b:0c:50:d9:b1 STALE
fd00::f9:3d71:ea15:5048 dev eth0 lladdr f2:5b:0c:50:d9:b1 STALE
fd00::b0c7:c205:9067:f2c2 dev eth0 lladdr 00:e1:8c:a3:5d:f9 REACHABLE
fd00::1c9b:85fe:4be3:2dbb dev eth0 lladdr 3a:2c:63:23:71:23 STALE
fd00::303a:fdff:fea9:12ba dev eth0  FAILED
fd00::84d6:3e52:2073:c8f9 dev eth0 lladdr b2:17:fc:a5:1b:81 STALE
fe80::1cf5:14f0:172f:799c dev eth0 lladdr f2:5b:0c:50:d9:b1 STALE
fd00::9a:7d6a:c0d9:7425 dev eth0 lladdr 66:8a:1f:9c:5e:a7 STALE
fd00::cbc:e985:c62c:4c02 dev eth0 lladdr bc:5f:f4:9b:28:66 DELAY
fd00::91a4:a9c7:632b:4513 dev eth0 lladdr 00:e1:8c:a3:5d:f9 STALE
fd00::211:32ff:feb4:e7a7 dev eth0 lladdr 00:11:32:b4:e7:a7 STALE
fd00::e08a:b6af:968d:6144 dev eth0 lladdr f2:5b:0c:50:d9:b1 STALE
fe80::8417:32d1:25f1:92e9 dev eth0 lladdr bc:5f:f4:9b:28:66 STALE
fe80::e228:6dff:fe3e:6b8e dev eth0 lladdr e0:28:6d:3e:6b:8e router REACHABLE
fe80::14f7:6a9:8f91:5831 dev eth0 lladdr 66:8a:1f:9c:5e:a7 STALE
fe80::8d0:c126:43c0:451a dev eth0 lladdr b2:17:fc:a5:1b:81 STALE
fd00::bce5:d801:2885:fabe dev eth0 lladdr f2:5b:0c:50:d9:b1 STALE
fd00::1154:7961:dfd:927c dev eth0  FAILED
fd00::1c15:9e1a:70be:7626 dev eth0 lladdr a8:db:03:89:2c:d3 STALE
fd00::dd7d:d728:7c33:2098 dev eth0 lladdr f2:5b:0c:50:d9:b1 STALE
fd00::30f7:18b2:84e8:7c18 dev eth0 lladdr f2:5b:0c:50:d9:b1 STALE
fe80::5558:d212:f24d:db6d dev eth0 lladdr 00:e1:8c:a3:5d:f9 STALE
pi@pihole-rpi:~ $ sudo -u pihole ip n s
192.168.178.26 dev eth0 lladdr bc:5f:f4:9b:28:66 STALE
192.168.178.56 dev eth0 lladdr f2:5b:0c:50:d9:b1 REACHABLE
192.168.178.40 dev eth0 lladdr 00:24:e4:28:fe:6e STALE
192.168.178.54 dev eth0 lladdr 3a:2c:63:23:71:23 STALE
192.168.178.1 dev eth0 lladdr e0:28:6d:3e:6b:8e REACHABLE
192.168.178.76 dev eth0 lladdr 08:12:a5:71:14:e2 STALE
192.168.178.45 dev eth0 lladdr 98:da:c4:08:70:5b STALE
192.168.178.36 dev eth0 lladdr 32:3a:fd:a9:12:ba STALE
192.168.178.43 dev eth0 lladdr 00:17:88:70:5d:da STALE
192.168.178.20 dev eth0 lladdr 50:56:bf:8c:fd:7b REACHABLE
192.168.178.27 dev eth0 lladdr b0:05:94:7b:02:cd STALE
192.168.178.4 dev eth0 lladdr 02:42:c0:a8:b2:04 STALE
192.168.178.41 dev eth0 lladdr 98:da:c4:08:6b:9f STALE
192.168.178.48 dev eth0 lladdr 66:8a:1f:9c:5e:a7 REACHABLE
192.168.178.100 dev eth0 lladdr 00:11:32:b4:e7:a7 STALE
192.168.178.55 dev eth0 lladdr f8:54:b8:77:17:c0 STALE
192.168.178.46 dev eth0 lladdr 1c:3b:f3:fb:3f:f1 STALE
192.168.178.53 dev eth0 lladdr b2:17:fc:a5:1b:81 DELAY
192.168.178.23 dev eth0 lladdr 40:b4:cd:6b:e6:79 STALE
192.168.178.30 dev eth0 lladdr b0:5a:da:f6:13:ba STALE
192.168.178.21 dev eth0 lladdr a8:db:03:89:2c:d3 STALE
192.168.178.28 dev eth0 lladdr 00:e1:8c:a3:5d:f9 REACHABLE
fd00::e228:6dff:fe3e:6b8e dev eth0 lladdr e0:28:6d:3e:6b:8e router STALE
2001:4dd6:9d12:0:e228:6dff:fe3e:6b8e dev eth0 lladdr e0:28:6d:3e:6b:8e router STALE
fd00::9d6c:875d:156:d04b dev eth0 lladdr f2:5b:0c:50:d9:b1 STALE
fe80::303a:fdff:fea9:12ba dev eth0  FAILED
fd00::c18e:6fc4:7c70:483e dev eth0  FAILED
fe80::d102:6336:4940:c7ac dev eth0 lladdr a8:db:03:89:2c:d3 STALE
2001:4dd1:78f8:0:e228:6dff:fe3e:6b8e dev eth0 lladdr e0:28:6d:3e:6b:8e router STALE
fe80::106d:17c2:b450:d726 dev eth0 lladdr 3a:2c:63:23:71:23 STALE
fe80::211:32ff:feb4:e7a7 dev eth0 lladdr 00:11:32:b4:e7:a7 STALE
fd00::ac53:8cf6:cbda:75cb dev eth0 lladdr f2:5b:0c:50:d9:b1 STALE
fd00::f9:3d71:ea15:5048 dev eth0 lladdr f2:5b:0c:50:d9:b1 STALE
fd00::b0c7:c205:9067:f2c2 dev eth0 lladdr 00:e1:8c:a3:5d:f9 DELAY
fd00::1c9b:85fe:4be3:2dbb dev eth0 lladdr 3a:2c:63:23:71:23 REACHABLE
fd00::303a:fdff:fea9:12ba dev eth0  FAILED
fd00::84d6:3e52:2073:c8f9 dev eth0 lladdr b2:17:fc:a5:1b:81 STALE
fe80::1cf5:14f0:172f:799c dev eth0 lladdr f2:5b:0c:50:d9:b1 STALE
fd00::9a:7d6a:c0d9:7425 dev eth0 lladdr 66:8a:1f:9c:5e:a7 STALE
fd00::cbc:e985:c62c:4c02 dev eth0 lladdr bc:5f:f4:9b:28:66 REACHABLE
fd00::91a4:a9c7:632b:4513 dev eth0 lladdr 00:e1:8c:a3:5d:f9 STALE
fd00::211:32ff:feb4:e7a7 dev eth0 lladdr 00:11:32:b4:e7:a7 STALE
fd00::e08a:b6af:968d:6144 dev eth0 lladdr f2:5b:0c:50:d9:b1 STALE
fe80::8417:32d1:25f1:92e9 dev eth0 lladdr bc:5f:f4:9b:28:66 REACHABLE
fe80::e228:6dff:fe3e:6b8e dev eth0 lladdr e0:28:6d:3e:6b:8e router REACHABLE
fe80::14f7:6a9:8f91:5831 dev eth0 lladdr 66:8a:1f:9c:5e:a7 STALE
fe80::8d0:c126:43c0:451a dev eth0 lladdr b2:17:fc:a5:1b:81 STALE
fd00::bce5:d801:2885:fabe dev eth0 lladdr f2:5b:0c:50:d9:b1 STALE
fd00::1154:7961:dfd:927c dev eth0  FAILED
fd00::1c15:9e1a:70be:7626 dev eth0 lladdr a8:db:03:89:2c:d3 STALE
fd00::dd7d:d728:7c33:2098 dev eth0 lladdr f2:5b:0c:50:d9:b1 STALE
fd00::30f7:18b2:84e8:7c18 dev eth0 lladdr f2:5b:0c:50:d9:b1 STALE
fe80::5558:d212:f24d:db6d dev eth0 lladdr 00:e1:8c:a3:5d:f9 STALE

and here is the other:

pi@pihole-rpi:~ $ systemctl status pihole-FTL
● pihole-FTL.service - LSB: pihole-FTL daemon
   Loaded: loaded (/etc/init.d/pihole-FTL; generated)
   Active: active (exited) since Tue 2021-09-07 06:33:53 CEST; 14h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 12705 ExecStart=/etc/init.d/pihole-FTL start (code=exited, status=0/SUCCESS)

Sep 07 06:33:43 pihole-rpi systemd[1]: Starting LSB: pihole-FTL daemon...
Sep 07 06:33:43 pihole-rpi pihole-FTL[12705]: Not running
Sep 07 06:33:44 pihole-rpi su[12723]: (to pihole) root on none
Sep 07 06:33:44 pihole-rpi su[12723]: pam_unix(su:session): session opened for user pihole by (uid=0)
Sep 07 06:33:53 pihole-rpi pihole-FTL[12705]: FTL started!
Sep 07 06:33:53 pihole-rpi systemd[1]: Started LSB: pihole-FTL daemon.

About your question:

Since it's the armv6hf binary, it is an RPi 1 or another RPi (I'm not sure whether the installer pulls armv6hf on all Raspbian systems). Is this the official Raspberry Pi OS, installed via the official installer?

Yes, like I stated before, it's an RPi 1B, installed from the official website sometime in 2020.

I mean, if it helps, I can set it up fresh. That's not a problem. I should be able to back up the Pi-hole config and reload it into the new install, right?
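
For reference, Pi-hole's Teleporter can export the settings from the shell, assuming your version supports it (a sketch; the archive name varies per version):

pihole -a -t    # writes a Teleporter backup archive to the current directory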

MichaIng commented 3 years ago

Okay, first of all, all this user session stuff seems to be expected when running su from a systemd unit like that, with PAM + systemd-logind registration and all that active. The question remains why FTL suddenly stopped (or got stopped by the parent shell) without any directly related error message. I'm not sure whether the failing neighbour discoveries are the reason; they don't show up after your restart, so this is a difference before the abort, but they already started a while earlier. I have no good idea how to debug this now.

But I totally forgot the filesystem errors I had already noticed in the system logs, and the likely related corrupted-database logs in FTL. That may even be the reason for the failure in the first place:

Sep 07 01:18:15 pihole-rpi kernel: EXT4-fs (mmcblk0p2): last error at time 1630386007: ext4_mb_generate_buddy:747
Sep 07 01:18:15 pihole-rpi kernel: EXT4-fs (mmcblk0p2): initial error at time 1628719230: ext4_mb_release_inode_pa:3859
Sep 07 01:18:12 pihole-rpi kernel: EXT4-fs (mmcblk0p2): error count since last fsck: 5
...
[2021-09-07 06:39:47.444 12744M] SQLite3 message: database corruption at line 66843 of [5d4c65779d] (11)
[2021-09-07 06:39:47.446 12744M] SQLite3 message: database corruption at line 66993 of [5d4c65779d] (11)
[2021-09-07 06:39:47.447 12744M] SQLite3 message: statement aborts at 33: [SELECT name FROM network_addresses WHERE name IS NOT NULL AND network_id = (SELECT network_id FROM network_addresses WHERE ip = ?) ORDER BY lastSeen DESC LIMIT 1] database disk image i (11)

If you don't require the long-term DNS query database, please move it out of the way (to have a fresh one created) and do an fsck on reboot, like this:

systemctl stop pihole-FTL
mv /etc/pihole/pihole-FTL.db{,_bak}
> /forcefsck    # flag file in the filesystem root triggers fsck on next boot
reboot
# after reboot, check fsck log
journalctl -t systemd-fsck
# and then latest FTL logs
tail -20 /var/log/pihole-FTL.log
# if everything works well, you may remove the database backup
rm /etc/pihole/pihole-FTL.db_bak
yubiuser commented 3 years ago

[2021-09-07 01:32:03.143 3051/T3056] WARN: Command "ip neigh show" failed!
[2021-09-07 01:32:03.144 3051/T3056]       Message: Success

I've seen this twice before; it always occurred together with a corrupted database and/or a malformed file system:

https://discourse.pi-hole.net/t/lost-connection-to-api/47051
https://discourse.pi-hole.net/t/pi-hole-dns-not-running/49050

MichaIng commented 3 years ago

Okay, indeed that all looks very similar, and the solution as well: nuking the database. I still don't understand how this leads to neighbour discovery failures, but probably it's only misleadingly printed like that by FTL when it fails to write the related info to the database.

DL6ER commented 3 years ago

I just checked the code and fixed the bug that caused an incorrect "Success" message to be printed. If it prints that the command failed, popen for this command returned NULL.

RETURN VALUE
       popen(): on success, returns a pointer to an open stream that can be
       used to read or write to the pipe; if the fork(2) or pipe(2) calls
       fail, or if the function cannot allocate memory, NULL is returned.

Edit: fixed in the beta branch.

DL6ER commented 3 years ago

I'd still be very interested in tracking down where exactly FTL hangs with a malformed database. I've never been able to reproduce this myself locally, and user-provided logs were inconclusive so far: they just stop, but at different places. I checked the code in all these places and FTL should just continue to work as expected in all these cases.

pablopoo commented 3 years ago

I could call it a feature; it happens every day in the same timeframe:

[image: image.png]


MichaIng commented 3 years ago

@pablopoo In your case, is the database okay: no related FTL logs and no filesystem-related system logs? Or did you try starting with a clean one, just to rule that out?

DL6ER commented 3 years ago

@pablopoo In case you wanted to attach an image - it did not work.

stphn17 commented 3 years ago

today it happened again. Here's the FTL log before the crash: pihole-FTL2.txt, and here is journalctl: journalctl3.txt. Maybe this helps to identify the "why" better...

Anyhow, I just executed the mentioned steps, and here's the output:

pi@pihole-rpi:~ $ journalctl -t systemd-fsck
-- Logs begin at Wed 2021-09-08 07:24:34 CEST, end at Wed 2021-09-08 07:28:10 CEST. --
Sep 08 07:24:35 pihole-rpi systemd-fsck[92]: e2fsck 1.44.5 (15-Dec-2018)
Sep 08 07:24:36 pihole-rpi systemd-fsck[92]: rootfs: clean, 46911/1920000 files, 706630/7858944 blocks
Sep 08 07:24:46 pihole-rpi systemd-fsck[155]: fsck.fat 4.1 (2017-01-24)
Sep 08 07:24:46 pihole-rpi systemd-fsck[155]: /dev/mmcblk0p1: 244 files, 109209/516190 clusters
pi@pihole-rpi:~ $ tail -20 /var/log/pihole-FTL.log
[2021-09-08 07:25:09.412 459M]  -> Unique domains: 0
[2021-09-08 07:25:09.413 459M]  -> Unique clients: 0
[2021-09-08 07:25:09.413 459M]  -> Known forward destinations: 0
[2021-09-08 07:25:09.413 459M] Successfully accessed setupVars.conf
[2021-09-08 07:25:09.511 461M] PID of FTL process: 461
[2021-09-08 07:25:09.537 461/T463] Listening on port 4711 for incoming IPv6 telnet connections
[2021-09-08 07:25:09.545 461/T462] Listening on port 4711 for incoming IPv4 telnet connections
[2021-09-08 07:25:09.546 461/T464] Listening on Unix socket
[2021-09-08 07:25:09.549 461M] INFO: FTL is running as user pihole (UID 999)
[2021-09-08 07:25:09.558 461M] Reloading DNS cache
[2021-09-08 07:25:09.575 461M] Blocking status is enabled
[2021-09-08 07:25:10.362 461M] Resizing "FTL-strings" from 40960 to (81920 * 1) == 81920 (/dev/shm: 1.0MB used, 93.6MB total, FTL uses 1.0MB)
[2021-09-08 07:25:10.372 461M] New upstream server: 2001:4dd0:100:1020:53:2:0:2:53 (0/1024)
[2021-09-08 07:25:10.661 461/T465] Compiled 3 whitelist and 0 blacklist regex filters for 1 clients in 37.3 msec
[2021-09-08 07:25:10.967 461M] New upstream server: 192.168.178.1:53 (1/1024)
[2021-09-08 07:25:10.974 461M] New upstream server: 2001:4dd0:200:304:53:2:0:3:53 (2/1024)
[2021-09-08 07:25:13.201 461M] New upstream server: 213.168.112.60:53 (3/1024)
[2021-09-08 07:27:18.685 461M] Resizing "FTL-dns-cache" from 4096 to (512 * 16) == 8192 (/dev/shm: 1.1MB used, 93.6MB total, FTL uses 1.0MB)
[2021-09-08 07:30:09.920 461M] Resizing "FTL-dns-cache" from 8192 to (768 * 16) == 12288 (/dev/shm: 1.1MB used, 93.6MB total, FTL uses 1.1MB)
[2021-09-08 07:31:07.375 461M] Resizing "FTL-domains" from 4096 to (512 * 16) == 8192 (/dev/shm: 1.1MB used, 93.6MB total, FTL uses 1.1MB)

Looks good to me. No SQL errors so far. I'll keep monitoring and let you know how it works out! Thank you so much, @MichaIng!

MichaIng commented 3 years ago

Let's hope the database nuke resolves it persistently 🙂. Do you still have the corrupted database? Either you or the developers might be able to use it for further debugging of the underlying issue, if you would be open to providing it (keeping in mind the sensitive DNS query data it contains, hence only via a secured channel to an official Pi-hole engineer). But @DL6ER knows best if/how to proceed with this.

The logs sadly show nothing new.

If the issue still appears, or when re-using the corrupted database, a native systemd service might give some more helpful info, as it gets a direct exit code from FTL. The SysV service does not "monitor" the FTL process any further as soon as it goes into the background. I wanted to commit a native systemd unit anyway; now there is another reason 🙂. Alternatively, FTL could be started in the foreground on a dedicated console:

systemctl stop pihole-FTL
su -s /bin/sh -c '/usr/bin/pihole-FTL -f; echo "FTL exit code: $?"' pihole
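
Such a native unit could look roughly like this (a minimal sketch, not the official one; FTL needs network capabilities to bind port 53 as the pihole user, and paths may differ per install):

# /etc/systemd/system/pihole-FTL.service (hypothetical sketch)
[Unit]
Description=Pi-hole FTL DNS resolver
After=network.target

[Service]
User=pihole
# -f keeps FTL in the foreground so systemd tracks the process and its exit code
ExecStart=/usr/bin/pihole-FTL -f
AmbientCapabilities=CAP_NET_BIND_SERVICE CAP_NET_RAW CAP_NET_ADMIN
Restart=on-failure

[Install]
WantedBy=multi-user.target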

There are also some debug flags, so the SysV service could surely be tuned as well to provide more output.

DL6ER commented 3 years ago

I will corrupt my own database on my main Pi-hole and monitor what happens, hoping for an eventual stop in the same situation. If I can reproduce this locally, I can do whatever is needed to fix it.

Similarly, it will be interesting to see what

sudo su -s /bin/sh -c "ip neigh show" pihole

returns - hopefully there are some hints for why it fails/refuses to provide the information.

MichaIng commented 3 years ago

I will corrupt my own database

Hopefully it does not require a very specific corruption 🙂.

Similarly, it will be interesting to see what sudo su -s /bin/sh -c "ip neigh show" pihole returns

Above is the output done via sudo -u pihole. It would indeed have been clever to use su to match the SysV service as closely as possible; however, it is not an issue with plain user permissions, at least: https://github.com/pi-hole/FTL/issues/1082#issuecomment-914544254

DL6ER commented 3 years ago

Oh yes, I missed that. Interesting... With the message now fixed (in the beta and the next release of Pi-hole), we'll see what the actual reason is.

DL6ER commented 3 years ago

The next version of FTL has been released. Please update and run

pihole checkout master

to get back on track. The fix/feature branch you switched to will not receive any further updates.

Thanks for helping us to make Pi-hole better for us all!

If you have any issues, please either reopen this ticket or (preferably) create a new ticket describing the issues in further detail and only reference this ticket. This will help us to help you best.