Details

Note: All of this started out of the blue. No updates were made that triggered it. My linux router running this does not self-update. I didn't update the pihole using docker-compose, and I didn't change any configuration options.

The pihole stops responding to DNS requests, as well as DHCP requests, on a seemingly random basis. The web interface also won't load fully. The stalls happen at various times throughout the day, in both low and high usage. There is not unusual levels of usage - since starting the service with fresh data 5 hours ago, there have been 5300 requests, ~20% blocked, and one lock-up. When stopping, the docker container sometimes seems to start a second pihole-FTL --no-daemon process by itself. Sometimes it self-resolves, sometimes it doesn't. A docker-compose restart fixes it every time, for a seemingly arbitrary period. The pihole-FTL.log file always ends with DB/GC thread locks waiting. See details section for more info. I'm on the :latest tagged docker image. I'm about to try the v5.6 tag to see if I can still reproduce. I'll report my findings back here and update.

EDIT: v5.6 also has the issue for me

Related Issues

[x] I have searched this repository/Pi-hole forums for existing issues and pull requests that look similar

How to reproduce the issue

Environment data
- Operating System: Alpine
- Hardware: x86 Router (PCEngines APU2, 4GB Ram, SSD)
- Kernel Architecture: x86/amd64
- Docker Install Info and version:
- Software source: OS provided package
- Supplimentary Software: N/A
- Hardware architecture: x86
docker-compose.yml contents, docker run shell command, or paste a screenshot of any UI based configuration of containers here
```
version: "3"
```

More info at https://github.com/pi-hole/docker-pi-hole/ and https://docs.pi-hole.net/

services: pihole: container_name: pihole image: pihole/pihole:latest environment: TZ: 'Australia/Perth' WEBPASSWORD: 'ought immerse uninvited promotion' DNS1: 1.1.1.1 DNS2: 1.0.0.1 DNSSEC: 'true' ports:

"53:53/tcp"
"53:53/udp"
"43080:80/tcp" dns:
127.0.0.1
1.1.1.1 volumes:
./data/pihole:/etc/pihole/
./data/dnsmasq.d/:/etc/dnsmasq.d/ cap_add:
NET_ADMIN restart: unless-stopped networks: pihole: ipv4_address: 172.20.0.10 networks: pihole: ipam: config:
- subnet: 172.20.0.0/16

any additional info to help reproduce When the pihole stops responding, the following also occurs:
- The FTL Log file is always ended with:
```
[2021-02-28 10:27:00.027 621/T640] Waiting for lock in GC_thread() (/root/project/src/gc.c:50)
[2021-02-28 10:27:00.045 621/T639] Waiting for lock in DB_thread() (/root/project/src/database/database-thread.c:52
```
- The issue with sometimes self resolve after 3-4 minutes with the message TCP worker terminating (timeout), other times it will hard-lock for ever and need restarting
- Within the docker container, a new process for pihole-FTL will sometimes start alongside the first one. This happens after the lock up
- I have completely wiped all data with docker-compose down, docker-compose rm -sfv, rm -rf ./data (to kill all configs), then docker-compose up --build --force-rebuild. The same issue persisted after some hours
- Sometimes it will run for hours before failing, sometimes minutes. A docker-restart always fixes the issue rapidly.
- The db files have been at 2GB when failing, as well as at 11MB, as well as at 500KB. There seems to be no link with how much usage it's getting - sometimes it fails with minimal usage on the network, sometimes with lots
- The lock-ups seem to have happened consistently on the minute, rather than, say 17:42:36. I'm assuming this is because it's writing to the database every minute, and stalling attempting to get a lock
- I'm running both DHCP and DNC on the pihole, and both services stop working, and the web service stops responding, all at the same time. Other services on the same machine (unifi controller) continue working during this time. I assume the web service attempts to access the DB when loading and can't get access to it, which is why it's failing
- After completely wiping everything and building fresh, I only go and re-enable DHCP and set the range. No other changes are made.
- There are no custom adlists or anything of the sort - all stock.
- I saw a few reports about DNSSEC with Cloudflare being an issue a while ago - I tried disabling DNSSEC with no change. The stalls continued with exactly the same symptoms.
- I tried using gdb to debug. When the pihole stalled, nothing changed in the process. There is no crash, so I wasn't able to get any useful information. The process stays alive, but stops responding entirely. I assume this is because it's waiting for locks.

These common fixes didn't work for my issue

[x] I have tried removing/destroying my container, and re-creating a new container
[x] I have tried fresh volume data by backing up and moving/removing the old volume data
[ ] I have tried running the stock docker run example(s) in the readme (removing any customizations I added)
[x] I have tried a newer or older version of Docker Pi-hole (depending what version the issue started in for me)
[x] I have tried running without my volume data mounts to eliminate volumes as the cause

If the above debugging / fixes revealed any new information note it here.

Volumes make no difference - it fails with or without them
:v5.6 and :latest both present the same issues
multiple full rebuilds, including docker prune -a and re-pulling literally everything, still failing after each attempt.

Add any other debugging steps you've taken or theories on root cause that may help.

The messages seem to point to https://github.com/pi-hole/FTL/blob/2999e2b57c62b4455187ee9b77840d49df0a8e2e/src/database/database-thread.c#L52 and then to https://github.com/pi-hole/FTL/blob/2999e2b57c62b4455187ee9b77840d49df0a8e2e/src/shmem.c#L347 as the message before it prints, but the message after it does not. Seems like it's trying to get a lock on some shared memory, but it never returns. My knowledge of C is a little rusty at this point, but it seems like it's trying to lock the "FTL-lock" shared memory and it never returns. According to https://linux.die.net/man/3/pthread_mutex_unlock the mutex will just block until it becomes available. So I support something is failing to release a lock somewhere?

it looks like it's always stopping after Querying gravity database for MAC address of... which is not showing the next log, expected to be a result like "No record found". This happens at https://github.com/pi-hole/FTL/blob/fb9b2744b411cc1a04527b909675447fdab8a471/src/database/gravity-db.c#L333

Still following the thread after this. Running the query in getMACfromIP doesn't seem to result in any issues in sqlite3

alexmaras commented 3 years ago

Based on this debug output, it seems like the following line locks, but doesn't remove (it never logs Removed lock in _FTL_new_query) - https://github.com/pi-hole/FTL/blob/2999e2b57c62b4455187ee9b77840d49df0a8e2e/src/dnsmasq_interface.c#L571

This particular instance resolved after 4 minutes (see the TCP worker terminating (timeout) line) but exactly the same thing happened on the one that never recovered:

I'm going through the dnsmasq_interface.c code to see if I can see it either somehow returning without unlocking, or maybe stalling somewhere and never getting to an unlock line

alexmaras commented 3 years ago

Debugging with DEBUG_DATABASE now to see if it maybe doesn't manage to transact in sqlite. I've realised that it actually doesn't lock up on the minute - that's just when a scheduled thing will occur (the writing to the DB) and the rest is blocked as _FTL_new_query waits. So it seems arbitrary when it stops.

alexmaras commented 3 years ago

I've also realised that my docker-compose and iptables combination allowed external traffic into my pihole on port 53. I've now bound the port to the internal IP of the router in order to prevent this. Oddly, all of the failures I can see were preceded by a request from an external IP. Obviously not ideal.

I've been able to now check that I can't dig to my static IP from outside my network, so I'll see if the problem persists.

EDIT: So far, I haven't had any further drops. I'm going to continue monitoring, but that might be it. If this is the case, I'd be intrigued to know what kinds of DNS queries these external IPs were making that could possibly cause the pihole to stall. There seems to be no particular pattern to it, and there were only about 8 separate IPs that had connected, with the most having around 500 queries. Not a huge amount.

I would also suggest that the readme get updated to suggest users bind to an IP specifically when attaching port 53. https://docs.docker.com/network/iptables/#setting-the-default-bind-address-for-containers Docker is pretty heavy-handed with the way it changes iptables rules, and it had bypassed my existing rules for external access by allowing all to forward to that port, therefore having my external IP's port 53 universally available. My stupid mistake obviously, but this has taught me another valuable lesson about docker and iptables, so that's nice.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 30 days with no activity. Please comment or update this issue or it will be closed in 5 days.

PromoFaux commented 2 years ago

Is this still an issue you are seeing in more recent releases? Feel free to re-open if so

As a side note:

When stopping, the docker container sometimes seems to start a second pihole-FTL --no-daemon process by itself.

I think that may be fixed in the :dev tag, I added some guarding to the service wrapper recently - not sure if it is related, mind!

pi-hole / docker-pi-hole

Stops responding to DNS, DHCP, Web Requests - Database appears to be unable to gain lock #810

Details

Related Issues

How to reproduce the issue

More info at https://github.com/pi-hole/docker-pi-hole/ and https://docs.pi-hole.net/

These common fixes didn't work for my issue