pi-hole / web

Pi-hole Dashboard for stats and more
https://pi-hole.net
Other
2.06k stars 563 forks source link

Adding/Deleting multiple Custom DNS Records in a short amount of time triggers systemd service StartLimitBurst leading to disable pihole-FTL service #2606

Open mnlhfr opened 1 year ago

mnlhfr commented 1 year ago

Versions

Platform

Expected behavior

Adding or Deleting custom DNS records through webadmin GUI or respectively directly through POST requests to the API endpoint at /admin/scripts/pi-hole/php/customdns.php causes pihole-FTL to just be reloaded.

Actual behavior / bug

repeatedly adding or deleting DNS records through the web interface (/admin/scripts/pi-hole/php/customdns.php) causes pihole-FTL to be restarted (not reloaded) with each and every DNS record added. This leads to systemd hitting the StartLimitBurst=5 configured in /etc/systemd/system/pihole-FTL.service leading to consecutive restarts of the service to fail.

systemctl status pihole-FTL:

× pihole-FTL.service - Pi-hole FTL
     Loaded: loaded (/etc/systemd/system/pihole-FTL.service; enabled; vendor preset: enabled)
     Active: failed (Result: start-limit-hit) since Fri 2023-06-09 19:26:53 UTC; 17s ago
    Process: 3209 ExecStartPre=/opt/pihole/pihole-FTL-prestart.sh (code=exited, status=0/SUCCESS)
    Process: 3222 ExecStart=/usr/bin/pihole-FTL -f (code=exited, status=0/SUCCESS)
    Process: 3245 ExecStopPost=/opt/pihole/pihole-FTL-poststop.sh (code=exited, status=0/SUCCESS)
   Main PID: 3222 (code=exited, status=0/SUCCESS)
        CPU: 165ms

Jun 09 19:26:53 pihole01 pihole-FTL[3222]: [2023-06-09 19:26:52.992 3222M] Resizing "FTL-queries" from 4587520 to (86016 * 56) == 4816896 (/dev/shm: 5.1MB used, 4.2GB total, FTL uses 5.1MB)
Jun 09 19:26:53 pihole01 pihole-FTL[3222]: [2023-06-09 19:26:52.997 3222M] Resizing "FTL-queries" from 4816896 to (90112 * 56) == 5046272 (/dev/shm: 5.4MB used, 4.2GB total, FTL uses 5.3MB)
Jun 09 19:26:53 pihole01 pihole-FTL[3222]: [2023-06-09 19:26:52.001 3222M] Resizing "FTL-queries" from 5046272 to (94208 * 56) == 5275648 (/dev/shm: 5.6MB used, 4.2GB total, FTL uses 5.6MB)
Jun 09 19:26:53 pihole01 pihole-FTL[3222]: [2023-06-09 19:26:53.007 3222M] Resizing "FTL-queries" from 5275648 to (98304 * 56) == 5505024 (/dev/shm: 5.8MB used, 4.2GB total, FTL u
Jun 09 19:26:53 pihole01 systemd[1]: Stopping Pi-hole FTL...
Jun 09 19:26:53 pihole01 systemd[1]: pihole-FTL.service: Deactivated successfully.
Jun 09 19:26:53 pihole01 systemd[1]: Stopped Pi-hole FTL.
Jun 09 19:26:53 pihole01 systemd[1]: pihole-FTL.service: Start request repeated too quickly.
Jun 09 19:26:53 pihole01 systemd[1]: pihole-FTL.service: Failed with result 'start-limit-hit'.
Jun 09 19:26:53 pihole01 systemd[1]: Failed to start Pi-hole FTL.

Steps to reproduce

Steps to reproduce the behavior:

Shell

  1. run this shell script
    /usr/local/bin/pihole -a addcustomdns 127.0.0.1 test1.local
    /usr/local/bin/pihole -a addcustomdns 127.0.0.1 test2.local
    /usr/local/bin/pihole -a addcustomdns 127.0.0.1 test3.local
    /usr/local/bin/pihole -a addcustomdns 127.0.0.1 test4.local
    /usr/local/bin/pihole -a addcustomdns 127.0.0.1 test5.local
    /usr/local/bin/pihole -a addcustomdns 127.0.0.1 test6.local
  2. systemctl status pihole-FTL will now be killed and fail to start with Failed with result 'start-limit-hit'

Webadmin

  1. Add or delete 6 DNS records in under 60 seconds.
  2. systemctl status pihole-FTL will now be killed and fail to start with Failed with result 'start-limit-hit'

Debug Token

I don't think this should be necessary here.

Additional context

I am using the pihole provider from kubernetes-sigs/external-dns in combination with borchero/switchboard to automatically add DNS records for services in my cluster when they get exposed. This worked fine while setting everything up.

However, after exposing more services for external-dns to manage DNS for, pihole-FTL started to behave oddly/refused to start. The way kubernetes-sigs/external-dns adds the dns records, is by just simply sending POST requests to the same endpoint the webadmin GUI uses (/admin/scripts/pi-hole/php/customdns.php). Due to this API endpoint not offering any update functionality and also the fact that no TXT records are possible, this results in a little bit of a "spammy" behaviour from external-dns. As far as i understand the implementation of the pihole provider on external-dns, there are a couple workarounds in place, due to the API limitations of pihole-FTL. There might also be another bug in the pihole provider for external-dns, but I have not yet spent the time to dig into that side any deeper and I believe that even if this was the case, it would not change validity of this bug report.

For a quick and dirty workaround I adjusted StartLimitBurst in /etc/systemd/system/pihole-FTL.service.

When digging through the source code of pihole I noticed a couple things and I am not exactly sure where this should be fixed.

  1. https://github.com/pi-hole/AdminLTE/blob/3a11976ee8ecc50e2dd9efd76caad1ad41894dd5/scripts/pi-hole/php/func.php#L223 empty string as default value for the reload argument.

  2. https://github.com/pi-hole/AdminLTE/blob/3a11976ee8ecc50e2dd9efd76caad1ad41894dd5/scripts/pi-hole/php/func.php#L280 effectively results in pihole -a addcustomdns 127.0.0.1 test1.local (notice neither true nor false are part of the argument for restart)

  3. https://github.com/pi-hole/pi-hole/blob/6a45c6a8e027e1ac30d4556a88f31684bc80ccf1/pihole#L579 pihole shell script defers to AddCustomDNSAddress in webpage.sh

  4. https://github.com/pi-hole/pi-hole/blob/6a45c6a8e027e1ac30d4556a88f31684bc80ccf1/advanced/Scripts/webpage.sh#L719-L743 RestartDNS command will be issued due to missing empty reload argument

  5. https://github.com/pi-hole/AdminLTE/blob/3a11976ee8ecc50e2dd9efd76caad1ad41894dd5/scripts/pi-hole/php/func.php#L284 RestartDNS triggered again. Should this be restartdns reload-lists instead of just restartdns here?

Conclusion

The issue here seems to affect both the AdminLTE as well as the pihole repository, so I am not entirely sure how the best or preferred way of fixing this would look like.

yubiuser commented 1 year ago

Thanks for your detailed analysis, you included all relevant aspects. However, I'm not sure if this is really a bug or not. Pivot is the

StartLimitBurst in /etc/systemd/system/pihole-FTL.service

This was added to prevent endless-start-stop cycles in case of an error. Adding a custom DNS record needs a full FTL restart to take effect. However, if you know you're going to add a lot in a row, $reload can help to postpone the restart until everything is added. This is what we do when we import a (teleporter) backup (see https://github.com/pi-hole/AdminLTE/pull/2519). I'm not familiar with the tools you use to add the records, but as the access the API endpoints directly they always trigger the restart immediately. I have no good idea how to solve (as in balance) between preventing to-many-restarts and being able to (manually) adding a lot of custom DNS records (other than increasing the limit). I think no one of us thought about adding >5 records manually within 60 seconds.

@DL6ER I guess the same issue will happen in v6?

mnlhfr commented 1 year ago

thanks for your quick reply!

I initially thought that a pihole restartdns reload-lists would be enough, but I seem to have missed that during my testing earlier.

However, as far as I can tell it is enough to issue a systemctl reload, which does not cause any issues with the service, even if triggered many times in quick succession.

  1. try to resolv custom DNS record that does not exist
    
    # nslookup testing-01.local localhost
    Server:         localhost
    Address:        127.0.0.1#53

** server can't find testing-01.local: NXDOMAIN

2. add the record using cmd utility with reload=false

pihole -a addcustomdns 127.0.0.1 testing-01.local false

[✓] Adding custom DNS entry...

3. record still not in effect

nslookup testing-01.local localhost

Server: localhost Address: 127.0.0.1#53

** server can't find testing-01.local: NXDOMAIN

4. i initially thought that restartdns reload-lists would do the job, but:

pihole restartdns reload-lists

[✓] Reloading DNS lists

nslookup testing-01.local localhost

Server: localhost Address: 127.0.0.1#53

** server can't find testing-01.local: NXDOMAIN

5. However, if we reload the service using systemctl:

systemctl reload pihole-FTL

6. the record is active now

nslookup testing-01.local localhost

Server: localhost Address: 127.0.0.1#53

Name: testing-01.local Address: 127.0.0.1


So, just reloading the service with systemctl seems to be enough and doesn't seem to trigger any issues with the service:
1. StartLimitBurst is set to the default value of 5

systemctl cat pihole-FTL | grep StartLimitBurst

StartLimitBurst=5

2. reload the service 100 times

for i in $(seq 1 100); do systemctl reload pihole-FTL; done

4. service still in active state

systemctl status pihole-FTL

● pihole-FTL.service - Pi-hole FTL Loaded: loaded (/etc/systemd/system/pihole-FTL.service; enabled; vendor preset: enabled) Active: active (running) since Fri 2023-06-09 21:19:01 UTC; 1min 43s ago Process: 7317 ExecStartPre=/opt/pihole/pihole-FTL-prestart.sh (code=exited, status=0/SUCCESS) Process: 7450 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS) Main PID: 7330 (pihole-FTL) Tasks: 19 (limit: 9401) Memory: 10.4M CPU: 279ms CGroup: /system.slice/pihole-FTL.service └─7330 /usr/bin/pihole-FTL -f

Jun 09 21:20:39 pihole01 systemd[1]: Reloading Pi-hole FTL... Jun 09 21:20:39 pihole01 systemd[1]: Reloaded Pi-hole FTL. Jun 09 21:20:39 pihole01 systemd[1]: Reloading Pi-hole FTL... Jun 09 21:20:39 pihole01 systemd[1]: Reloaded Pi-hole FTL. Jun 09 21:20:39 pihole01 systemd[1]: Reloading Pi-hole FTL... Jun 09 21:20:39 pihole01 systemd[1]: Reloaded Pi-hole FTL. Jun 09 21:20:39 pihole01 systemd[1]: Reloading Pi-hole FTL... Jun 09 21:20:39 pihole01 systemd[1]: Reloaded Pi-hole FTL. Jun 09 21:20:39 pihole01 systemd[1]: Reloading Pi-hole FTL... Jun 09 21:20:39 pihole01 systemd[1]: Reloaded Pi-hole FTL.

rdwebdesign commented 1 year ago

Your analysis in not correct here:

https://github.com/pi-hole/AdminLTE/blob/3a11976ee8ecc50e2dd9efd76caad1ad41894dd5/scripts/pi-hole/php/func.php#L280 effectively results in pihole -a addcustomdns 127.0.0.1 test1.local (notice neither true nor false are part of the argument for restart)

When needed, $reload will be set to false on the lines just above the code you posted:

https://github.com/pi-hole/AdminLTE/blob/3a11976ee8ecc50e2dd9efd76caad1ad41894dd5/scripts/pi-hole/php/func.php#L274-L281

Resulting in: pihole -a addcustomdns 127.0.0.1 test1.local false

mnlhfr commented 1 year ago

Your analysis in not correct here:

https://github.com/pi-hole/AdminLTE/blob/3a11976ee8ecc50e2dd9efd76caad1ad41894dd5/scripts/pi-hole/php/func.php#L280

effectively results in pihole -a addcustomdns 127.0.0.1 test1.local (notice neither true nor false are part of the argument for restart)

When needed, $reload will be set to false on the lines just above the code you posted:

https://github.com/pi-hole/AdminLTE/blob/3a11976ee8ecc50e2dd9efd76caad1ad41894dd5/scripts/pi-hole/php/func.php#L274-L281

Resulting in: pihole -a addcustomdns 127.0.0.1 test1.local false

you are correct, thanks for pointing that out! and sorry for the confusing!

yubiuser commented 1 year ago

However, as far as I can tell it is enough to issue a systemctl reload, which does not cause any issues with the service, even if triggered many times in quick succession.

I think you are correct. The reload triggers the read of custom.list by dnsmasq

Jun  9 23:35:39 dnsmasq[3578109]: read /etc/hosts - 7 names
Jun  9 23:35:39 dnsmasq[3578109]: read /etc/pihole/custom.list - 22 names
Jun  9 23:35:39 dnsmasq[3578109]: read /etc/pihole/local.list - 0 names

Note: this will only work for the custom DNS records, not custom CNAME records (as files in /etc/dnsmasq.d/ are not read by reload)

yubiuser commented 1 year ago

Please try if

pihole checkout core no_reload fixes the issue for you.

mnlhfr commented 1 year ago

Please try if

pihole checkout core no_reload fixes the issue for you.

not quite, but when I additionally add the reload to the func.php as well, it seems to be working: https://github.com/pi-hole/AdminLTE/blob/3a11976ee8ecc50e2dd9efd76caad1ad41894dd5/scripts/pi-hole/php/func.php#L284

thanks!

yubiuser commented 1 year ago

Good catch.

We did not plan to release any new v5 version and focus on v6 - however the changes necessary here are trivial after the bug was dissected. We'll discuss internally how to proceed.

DL6ER commented 1 year ago

I guess the same issue will happen in v6?

No. v6 tries to be "more clever" than many of the elements we have in v5. This involves trying to minimize restarting of FTL to as seldom as possible. Custom DNS records are a prime example. When I coded this part of the v6 interface, I very much disliked that the DNS cache is completely flushed by a restart/reload so I changed two things: Firstly, this file is now in a watched directors (hostsdir) so FTL finds itself out when the file is changed (no need to send a signal at all). Secondly, I submitted a patch upstream into dnsmasq (it is already included in the current release of dnsmasq) that ensures the DNS cache is not completely flushed but only those elements from the updated list are removed (and then repopulated).

yubiuser commented 1 year ago

After internal discussion we decided to not release a new v5 version. Reasons are

BoKKeR commented 10 months ago

I have posted a workaround for v5 in this thread https://discourse.pi-hole.net/t/ftl-crashes-repeatedly-when-updating-dns-records-through-external-dns/66867