pi-hole / FTL

The Pi-hole FTL engine
https://pi-hole.net

FTL high CPU and unresponsive after large amount of unresolvable address requests #1493

Closed stheine closed 1 year ago

stheine commented 1 year ago

Versions

Platform

Expected behavior

Work fine long term, even when a lot of non-resolvable addresses are requested: neither rejecting DNS requests (via rate_limit) nor going into an unresponsive state (pihole-FTL at high CPU).

Actual behavior / bug

I use a VPN connected to my office network. There is an office-internal DNS server that resolves office-internal hostname requests, and while the VPN is connected, it is the only DNS server used on the computer. I have a number of applications that make regular requests to servers in the office network, and therefore issue regular DNS requests. The computer running the VPN and the apps runs permanently, unmonitored for some time (at night / on the weekend).

At intermittent times the VPN may go down without me realizing it for several hours. The next time I get to the computer, I find that DNS requests are no longer being resolved (not only office-internal ones, but also requests for public addresses) and instead run into timeouts.

In the past, after the original situation had happened (VPN goes down, causing a lot of unresolvable DNS requests), I found the computer's DNS requests being rejected by Pi-hole and determined this was caused by the rate limit, so in the meantime I disabled it by setting rate_limit=0/0.
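For reference, disabling the limiter is a one-line change in FTL's config file; a minimal sketch, assuming a standard Pi-hole v5 layout (in a docker setup the same file sits inside the container's /etc/pihole volume):

```shell
# Disable FTL's per-client rate limiting (the default is 1000 queries
# per 60 seconds). RATE_LIMIT=0/0 switches the limiter off entirely.
echo "RATE_LIMIT=0/0" | sudo tee -a /etc/pihole/pihole-FTL.conf
sudo pihole restartdns   # reload FTL so the new setting takes effect
```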

Investigating that situation, I find that pihole is not properly running any more:

Steps to reproduce

I guess this should be reproducible by sending tens of thousands of unresolvable DNS requests per hour for several hours.
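A rough reproducer sketch along those lines, assuming `dig` is installed and that `192.168.1.2` stands in for the Pi-hole address (both are assumptions, not details from this report):

```shell
#!/bin/sh
# Hammer Pi-hole with queries for names that can never resolve, roughly
# mimicking apps retrying office-internal hostnames after a VPN drop.
PIHOLE=192.168.1.2   # assumed Pi-hole address; adjust for your network
while true; do
  # A unique name per query avoids cache hits; +time/+tries keep each
  # failed lookup short so the loop sustains a high query rate.
  dig @"$PIHOLE" +time=1 +tries=1 \
    "host-$(date +%s%N).internal.invalid" >/dev/null 2>&1
done
```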

Debug Token

Note: this debug data was generated after stopping Pi-hole (docker), renaming pihole-FTL.db, pulling the latest docker image (updating FTL from 5.19.1 to 5.19.2), and restarting Pi-hole. The configuration is unchanged and the VPN hasn't been restarted, so the debug data again contains a number of those unresolvable requests (just not the amount that had caused the reported issue after being in that situation for ~18 hours).

Screenshots

image


DL6ER commented 1 year ago

Thanks for your report and sorry for the inconvenience!

I guess this should be reproducible by sending tens of thousands of unresolvable DNS requests per hour for several hours.

Well, that's not necessarily "easy" to do but, yeah, I get what you mean.

Do you, by any chance, still have the file /var/log/pihole/pihole.log and /var/log/pihole/FTL.log (possibly rotated away to suffixes .1, etc. in the same directory)?

Can you give us some of the lines (there may be none in FTL.log, but we don't know yet) from these files *around the time of the VPN failure* so we can check what is going on?

stheine commented 1 year ago

Well, that's not necessarily "easy" to do but, yeah, I get what you mean.

I can make the situation appear on demand by just killing the VPN client manually while leaving the apps running, and then waiting some hours for FTL to 'queue up'. This is just not something I would want to do during my normal work hours, but if you want, I can trigger it tonight.

Do you, by any chance, still have the file /var/log/pihole/pihole.log and /var/log/pihole/FTL.log (possibly rotated away to suffixes .1, etc. in the same directory)?

I don't have them from yesterday, when the situation was triggered, as they are not in the docker persistent data, so they were removed when I restarted (since I also pulled and created a new docker container). But I have them from today, after I restarted Pi-hole while the VPN was still disconnected, so it's pretty much the same situation as if the VPN had just failed and pihole-FTL is starting to build up the bad situation: FTL.log.gz pihole.log.gz

And now that I look at these logfiles, I realize I forgot to mention one more special config I added after the initial issue (the computer getting rejected by the rate limit) that might be relevant for understanding the situation: not only did I set rate_limit=0/0, I also added local DNS records into Pi-hole, resolving the office-internal hostnames to 127.0.0.1. I did this to let Pi-hole answer those queries quickly (instead of trying to resolve them via upstream DNS, which would fail anyhow).

DL6ER commented 1 year ago

A little less than 20,000 queries per 10 minutes (from your graph screenshot above) is something like 32 queries per second. This number is confirmed in your pihole.log (e.g., exactly 32 queries at Nov 21 07:24:23). This should still be doable, even on Raspberry Pi hardware. So much for the load - it should work.

There is no rate-limiting going on and no queries are being forwarded, as you have configured the local DNS record. Still, this one client of yours is querying like hell; aside from what we find here, it might be worth trying to reduce the querying amount in case of VPN failure. It seems to me that 127.0.0.1 isn't acceptable as an answer, so the client immediately retries DNS instead of even trying to connect upstream (or it has a really low timeout of less than 30 msec). Maybe providing an answer with an otherwise unresolvable address (like 192.168.255.250 or something in the 10.x range) would help. You could similarly provide a link-local (fe80::1234) or, better, a ULA (like fd12:3456:789a:1::1) address for IPv6.
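In Pi-hole v5, the "Local DNS Records" live in a hosts-style file, so the suggested change might look like this sketch (the office hostnames are placeholders, not names from this report):

```shell
# /etc/pihole/custom.list holds Pi-hole v5 "Local DNS Records" in
# hosts-file format. Point the (placeholder) office names at an
# unroutable RFC1918 address instead of 127.0.0.1, so the client
# attempts a connection and backs off rather than retrying DNS.
sudo tee -a /etc/pihole/custom.list <<'EOF'
192.168.255.250        app1.office.example
fd12:3456:789a:1::1    app1.office.example
192.168.255.250        app2.office.example
fd12:3456:789a:1::1    app2.office.example
EOF
sudo pihole restartdns
```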

is not properly running any more

I'm not sure at this point if this means "it stops working" (we know it does) or if it even means "it worked in an earlier version". What we can try here is to narrow down why the CPU starts spinning. I have two suggestions that can be done separately or even at the same time (but see the caveat at the very end):

1. Find out which process is actually spinning

You may already know all this, but I'm leaving it here anyway for reference. Using htop you can enable custom thread names and tree view (F2 is the settings menu): Screenshot from 2022-11-21 10-39-40

Thereafter, you will see pihole-FTL's components separated into threads with proper CPU attribution: Screenshot from 2022-11-21 10-41-31

You can quickly find pihole-FTL using htop's search mechanism accessible via the key /.
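If htop isn't handy, a one-shot per-thread snapshot can also be taken from the command line; a sketch assuming a Linux system with procps `ps`:

```shell
# One line per thread of pihole-FTL: thread ID, the thread name FTL
# assigns (database, housekeeper, DNS client, ...), and its CPU share.
ps -L -o tid,comm,pcpu -p "$(pidof pihole-FTL)"
```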

2. Enable FTL's debug logging

FTL's internal logging does, in contrast to the dnsmasq log in pihole.log, include millisecond timestamps and will provide some more insight into internal timing. For this, please add

DEBUG_QUERIES=true
DEBUG_DATABASE=true

to the file /etc/pihole/pihole-FTL.conf (create if it does not exist) and run

pihole restartdns

Next time you see the issue, upload again some log lines around the time of the CPU spinning at 100%. It is worth noting that debug logging causes additional work and delays depending on the speed of your disk/SD card. I hope it will not be extreme enough to influence FTL's timing; if not, it will provide important additional information. Note that debug logging also causes additional disk usage. At a rate of ~120,000 queries per hour, this can quickly escalate to a few gigabytes of log, so better make sure you have enough disk space available, or maybe plug in a USB thumbdrive with sufficient free space (you can move the FTL log file by setting a custom LOGFILE= in /etc/pihole/pihole-FTL.conf).
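Putting those pieces together, enabling the debug flags and relocating the log might look like this sketch (`/mnt/usb` is an assumed mount point for the thumbdrive):

```shell
# Enable verbose query/database logging and move the (soon very large)
# FTL log off the SD card. /mnt/usb is an assumed USB mount point.
sudo tee -a /etc/pihole/pihole-FTL.conf <<'EOF'
DEBUG_QUERIES=true
DEBUG_DATABASE=true
LOGFILE=/mnt/usb/FTL.log
EOF
sudo pihole restartdns   # restart FTL so the settings take effect
```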

stheine commented 1 year ago

Here is a one-hour extract (06:00 - 06:59) of the FTL.log with DEBUG enabled, showing the state after I disabled the VPN last night. See if that gives you some indication of what the process was doing. FTL_06.log.gz It had not been in the completely-unresponsive state this morning, though; that might be because my computer was not in exactly the same state as earlier, but the system was still as busy as in the past: image

investigating that situation, I find that pihole is not properly running any more:

When I stated that earlier, I meant that the current instance of Pi-hole had been working fine for days or weeks, until my environment caused the issue by triggering the 'VPN goes down and some processes start making a lot of DNS requests for unresolvable addresses' situation. The issue is not related to a specific version of Pi-hole; I know I experienced it in earlier versions as well. That's when I started playing with the rate limit and local DNS entries, trying to overcome this situation.

it might be worth trying to reduce the querying amount in case of VPN failure.

I can try looking at that, as far as I have these tools under control...

DL6ER commented 1 year ago

Looking at your log, I see no overwhelming of FTL (but you mentioned this already). Queries are replied to pretty fast (typically 1-2 ms), and there was always a period of a few msec of complete inactivity before the next query came in and was immediately processed. I guess we have to wait for the

completely-unresponsive state

and also what your htop says at that time to see where we have to look next.

stheine commented 1 year ago

It happened again. Here's some more information I gathered now. Note: in the screenshot, FTL is showing 100% CPU, but in reality it oscillates between 6% and 100%.

image docker logs showing these latest entries: image image

The pihole-FTL.db is 140M, the FTL.log 787M, and the pihole.log 190M, and that's just today's logs. I'm giving you extracts of the last ~90 minutes. FTL_14.log.gz pihole_14.log.gz

In the FTL.log, I find Error while trying to close database: database is locked. The first appearance of that message is at 12:36, so I'm including another extract covering that timeframe. FTL_12_3.log.gz

Please see if you can find something useful in these logs.

DL6ER commented 1 year ago

I wonder why you are not seeing any of FTL's threads. Do you have the "Hide userland process threads" option enabled in htop? (The recommended default is to have only Hide kernel threads enabled.)

At first I was confused why the log contains so many FTL restarts, but then I discovered quite a few No space left on device errors ("quite a few" = 104 in the log file). "device" might be misleading here (it is a system-provided error message): what is meant is memory, specifically shared memory in /dev/shm.

We see warnings before the issue happens in the log (they have been there also without debugging before):

[2022-11-26 15:42:00.385 32683M] WARNING: RAM shortage (/dev/shm) ahead: 99% is used (/dev/shm: 67.0MB used, 67.1MB total, FTL uses 67.0MB)

When FTL runs out of memory, the docker container restarts it. When this happens, FTL re-imports all queries from the database, almost immediately hits the memory limit again, quits, and is restarted. This seems to repeat in an endless loop on your system. At least that's my current working theory. The log does not otherwise show anything odd, but this endless restarting is very likely the cause of the intermittent to completely stopped DNS resolution you've seen. This reminded me of https://github.com/pi-hole/docker-pi-hole/issues/571, where a similar issue was triggered by a similar flood of queries.
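The restart loop can be observed from outside the container; a sketch, assuming the container is named `pihole` (adjust to your setup):

```shell
# Watch FTL's shared-memory usage climb toward the 64 MB ceiling and
# catch the container status resetting when FTL is restarted.
watch -n 10 'docker exec pihole df -h /dev/shm; \
             docker ps --filter name=pihole --format "{{.Status}}"'
```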


I'd have two possibilities to offer:

  1. Increase the shared memory size for the container (like `shm_size: '1gb'` if you are using `docker compose`)
  2. Disable the database by setting MAXDBDAYS=0 in /etc/pihole/pihole-FTL.conf. This will not prevent the periodic restarting, but it will, at the very least, prevent the immediate re-restart, because no history will be reloaded after FTL is restarted.

Obviously, no. 1 is the correct solution (it prevents the forced restart in the first place), but if you happen not to have enough memory available on your docker host system (for whatever reason), you can still try option no. 2. If you don't really need any data older than 24 hours (and don't mind losing history on FTL restarts), you could also apply both changes at the same time.
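For option 1 with plain `docker run`, a sketch (container name, ports, volume path, and image tag are examples, not taken from this setup):

```shell
# Recreate the container with a larger shared-memory allocation.
# With docker compose, the equivalent is "shm_size: '1gb'" on the service.
docker run -d --name pihole \
  --shm-size=1g \
  -p 53:53/tcp -p 53:53/udp -p 80:80/tcp \
  -v ./etc-pihole:/etc/pihole \
  pihole/pihole:latest
```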

You should also disable debug logging; it turned out we didn't need it to solve this mystery.

stheine commented 1 year ago

Thanks for your analysis. I've made the MAXDBDAYS change, expecting that the automatic restart will then not keep the system down for long. I don't want to extend the shared memory, since the docker host is running other apps as well and, if I get it right, it would just extend the time until FTL could go down again.

I hope, this way the environment will remain stable now. again, thanks for your inputs!

DL6ER commented 1 year ago

it would just extend the time until FTL could go down again.

Yes and No. But actually more of the No than of the Yes.

FTL keeps queries in memory for 24 hours. The more queries it has in memory, the more memory it consumes. From your logs, we have seen that FTL had about 1.4 million queries in memory when it used up the available 64 MB. Given your query rate of about 120,000 queries per hour, memory is used up after about 12 hours. If you double the available memory (say, 128 MB), there should be enough room for all queries (remember FTL only keeps 24 hours of history in memory). If you want to be on the safe side, you could go for 200 MB and never see a restart again.
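As a back-of-envelope sketch of that sizing, using the ~67 MB / ~1.4 million queries figure from the warning in the logs and an assumed steady rate of 120,000 queries per hour:

```shell
# Estimate /dev/shm needed for 24 h of query history.
# 67e6 bytes held ~1.4e6 queries (from the RAM-shortage warning);
# 120000 q/h is the rate observed earlier in the thread.
awk 'BEGIN {
  bytes_per_query = 67e6 / 1.4e6          # ~48 bytes per query
  queries_24h     = 120000 * 24           # 24 h of retained history
  mb_needed       = queries_24h * bytes_per_query / 1e6
  printf "~%.0f bytes/query, ~%.0f MB for 24 h of history\n",
         bytes_per_query, mb_needed
}'
# prints: ~48 bytes/query, ~138 MB for 24 h of history
```

So 200 MB leaves comfortable headroom over the estimated steady-state need.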

I do see what you mean when you say that the docker host is running multiple applications. However, shared memory is a particular method of inter-process communication not used by many applications out there, and increasing the (maximum!) shared memory allocation shouldn't be an issue. Compare to, e.g., Firefox in docker containers: they frequently ask for setting shm_size: 2gb just to be able to open more than three tabs.

TL;DR: Increasing memory will stop the restarting. It is likely that nothing on your host is even using shared memory besides Pi-hole (check with df -h /dev/shm).

stheine commented 1 year ago

OK, I get your point. So I increased the shared memory to 200MB and removed the MAXDBDAYS setting for now. And I'll keep an eye on the memory usage on the raspi; there is not that much memory in the system.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Please comment or update this issue or it will be closed in 5 days.