pihole doesn't respond to DNS queries without container restart

eiddor commented 3 years ago

After a full restart of my RPi3 (shutdown and power off/on), pihole doesn't respond to DNS queries until I restart the pihole container again. Logs/testing output below:

Logs from power-on:

20.02.21 12:50:19 (-0600) Supervisor starting
20.02.21 12:49:55 (-0600)  pihole  [s6-init] making user provided files available at /var/run/s6/etc...exited 0.
20.02.21 12:49:56 (-0600)  pihole  [s6-init] ensuring user provided files have correct perms...exited 0.
20.02.21 12:49:56 (-0600)  pihole  [fix-attrs.d] applying ownership & permissions fixes...
20.02.21 12:49:56 (-0600)  pihole  [fix-attrs.d] 01-resolver-resolv: applying... 
20.02.21 12:49:56 (-0600)  pihole  [fix-attrs.d] 01-resolver-resolv: exited 0.
20.02.21 12:49:56 (-0600)  pihole  [fix-attrs.d] done.
20.02.21 12:49:56 (-0600)  pihole  [cont-init.d] executing container initialization scripts...
20.02.21 12:49:56 (-0600)  pihole  [cont-init.d] 20-start.sh: executing... 
20.02.21 12:49:57 (-0600)  pihole   ::: Starting docker specific checks & setup for docker pihole/pihole
20.02.21 12:49:57 (-0600)  pihole  
20.02.21 12:49:57 (-0600)  pihole    [i] Installing configs from /etc/.pihole...
20.02.21 12:49:58 (-0600)  pihole    [i] Existing dnsmasq.conf found... it is not a Pi-hole file, leaving alone!
20.02.21 12:49:58 (-0600)  pihole    [i] Copying 01-pi  [â] Copying 01-pihole.conf to /etc/dnsmasq.d/01-pihole.conf
20.02.21 12:49:59 (-0600)  pihole  Converting DNS1 to PIHOLE_DNS_
20.02.21 12:49:59 (-0600)  pihole  Converting DNS2 to PIHOLE_DNS_
20.02.21 12:49:59 (-0600)  pihole  Setting DNS servers based on PIHOLE_DNS_ variable
20.02.21 12:49:59 (-0600)  pihole  ::: Pre existing WEBPASSWORD found
20.02.21 12:49:59 (-0600)  pihole  DNSMasq binding to default interface: eth0
20.02.21 12:50:00 (-0600)  pihole  Added ENV to php:
20.02.21 12:50:00 (-0600)  pihole                    "PHP_ERROR_LOG" => "/var/log/lighttpd/error.log",
20.02.21 12:50:00 (-0600)  pihole                    "ServerIP" => "0.0.0.0",
20.02.21 12:50:00 (-0600)  pihole                    "VIRTUAL_HOST" => "0.0.0.0",
20.02.21 12:50:00 (-0600)  pihole  Using IPv4 and IPv6
20.02.21 12:50:00 (-0600)  pihole  ::: Preexisting ad list /etc/pihole/adlists.list detected ((exiting setup_blocklists early))
20.02.21 12:50:00 (-0600)  pihole  https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts
20.02.21 12:50:00 (-0600)  pihole  ::: Testing pihole-FTL DNS: sudo: unable to resolve host 7ed1bcc: Temporary failure in name resolution
20.02.21 12:50:06 (-0600)  pihole  FTL started!
20.02.21 12:50:07 (-0600)  pihole  ::: Testing lighttpd config: Syntax OK
20.02.21 12:50:07 (-0600)  pihole  ::: All config checks passed, cleared for startup ...
20.02.21 12:50:07 (-0600)  pihole  ::: Enabling Query Logging
20.02.21 12:50:07 (-0600)  pihole    [i] Enabling logging...
20.02.21 12:50:07 (-0600)  pihole    [i] Restarting DN  [â] Restarting DNS server
  [â] Logging has been enabled!
20.02.21 12:50:07 (-0600)  pihole   ::: Docker start setup complete
20.02.21 12:50:08 (-0600)  pihole    [i] Neutrino emissions detected...
  [â] Pulling blocklist source list into range
20.02.21 12:50:08 (-0600)  pihole  
20.02.21 12:50:08 (-0600)  pihole    [i] Preparing new  [â] Preparing new gravity database
20.02.21 12:50:08 (-0600)  pihole    [i] Using libz compression
20.02.21 12:50:08 (-0600)  pihole  
20.02.21 12:50:08 (-0600)  pihole    [i] Target: https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts
20.02.21 12:50:08 (-0600)  pihole    [i] Status: Pendi  [â] Status: Connection Refused
20.02.21 12:50:08 (-0600)  pihole    [â] List download failed: using previously cached list
20.02.21 12:50:10 (-0600)  pihole    [i] Received 62904 domains
20.02.21 12:50:10 (-0600)  pihole  
20.02.21 12:50:10 (-0600)  pihole    [i] Target: https://mirror1.malwaredomains.com/files/justdomains
20.02.21 12:50:10 (-0600)  pihole    [i] Status: Pendi  [â] Status: Connection Refused
20.02.21 12:50:10 (-0600)  pihole    [â] List download failed: no cached list available
20.02.21 12:50:10 (-0600)  pihole  
20.02.21 12:50:10 (-0600)  pihole    [i] Target: http://sysctl.org/cameleon/hosts
20.02.21 12:50:10 (-0600)  pihole    [i] Status: Pendi  [â] Status: Connection Refused
20.02.21 12:50:10 (-0600)  pihole    [â] List download failed: using previously cached list
20.02.21 12:50:11 (-0600)  pihole    [i] Received 20566 domains
20.02.21 12:50:11 (-0600)  pihole  
20.02.21 12:50:11 (-0600)  pihole    [i] Target: https://zeustracker.abuse.ch/blocklist.php?download=domainblocklist
20.02.21 12:50:11 (-0600)  pihole    [i] Status: Pendi  [â] Status: Connection Refused
20.02.21 12:50:11 (-0600)  pihole    [â] List download failed: using previously cached list
20.02.21 12:50:11 (-0600)  pihole    [i] Received 0 domains
20.02.21 12:50:11 (-0600)  pihole  
20.02.21 12:50:11 (-0600)  pihole    [i] Target: https://s3.amazonaws.com/lists.disconnect.me/simple_tracking.txt
20.02.21 12:50:12 (-0600)  pihole    [i] Status: Pendi  [â] Status: Connection Refused
20.02.21 12:50:12 (-0600)  pihole    [â] List download failed: using previously cached list
20.02.21 12:50:12 (-0600)  pihole    [i] Received 34 domains
20.02.21 12:50:12 (-0600)  pihole  
20.02.21 12:50:12 (-0600)  pihole    [i] Target: https://s3.amazonaws.com/lists.disconnect.me/simple_ad.txt
20.02.21 12:50:12 (-0600)  pihole    [i] Status: Pendi  [â] Status: Connection Refused
20.02.21 12:50:12 (-0600)  pihole    [â] List download failed: using previously cached list
20.02.21 12:50:12 (-0600)  pihole    [i] Received 2701 domains
20.02.21 12:50:12 (-0600)  pihole  
20.02.21 12:50:12 (-0600)  pihole    [i] Target: https://hosts-file.net/ad_servers.txt
20.02.21 12:50:13 (-0600)  pihole    [i] Status: Pendi  [â] Status: Connection Refused
20.02.21 12:50:13 (-0600)  pihole    [â] List download failed: no cached list available
20.02.21 12:50:13 (-0600)  pihole  
20.02.21 12:50:14 (-0600)  pihole    [i] Storing downl  [â] Storing downloaded domains in new gravity database
20.02.21 12:50:15 (-0600)  pihole    [i] Building tree  [â] Building tree
20.02.21 12:50:15 (-0600)  pihole    [i] Swapping data  [â] Swapping databases
20.02.21 12:50:17 (-0600)  pihole    [i] Number of gravity domains: 86205 (74489 unique domains)
20.02.21 12:50:17 (-0600)  pihole    [i] Number of exact blacklisted domains: 1
20.02.21 12:50:17 (-0600)  pihole    [i] Number of regex blacklist filters: 0
20.02.21 12:50:17 (-0600)  pihole    [i] Number of exact whitelisted domains: 108
20.02.21 12:50:17 (-0600)  pihole    [i] Number of regex whitelist filters: 0
20.02.21 12:50:17 (-0600)  pihole    [i] Flushing DNS   [â] Flushing DNS cache
20.02.21 12:50:17 (-0600)  pihole    [i] Cleaning up s  [â] Cleaning up stray matter
20.02.21 12:50:17 (-0600)  pihole  
20.02.21 12:50:18 (-0600)  pihole    [â] DNS service is listening
20.02.21 12:50:18 (-0600)  pihole       [â] UDP (IPv4)
20.02.21 12:50:18 (-0600)  pihole       [â] TCP (IPv4)
20.02.21 12:50:18 (-0600)  pihole       [â] UDP (IPv6)
20.02.21 12:50:18 (-0600)  pihole       [â] TCP (IPv6)
20.02.21 12:50:18 (-0600)  pihole  
20.02.21 12:50:18 (-0600)  pihole    [â] Pi-hole blocking is enabled
20.02.21 12:50:18 (-0600)  pihole    Pi-hole version is v5.2.4 (Latest: v5.2.4)
20.02.21 12:50:18 (-0600)  pihole    AdminLTE version is v5.4 (Latest: v5.4)
20.02.21 12:50:18 (-0600)  pihole    FTL version is v5.7 (Latest: v5.7)
20.02.21 12:50:18 (-0600)  pihole  [cont-init.d] 20-start.sh: exited 0.
20.02.21 12:50:18 (-0600)  pihole  [cont-init.d] done.
20.02.21 12:50:18 (-0600)  pihole  [services.d] starting services
20.02.21 12:50:18 (-0600)  pihole  Starting crond
20.02.21 12:50:18 (-0600)  pihole  Starting lighttpd
20.02.21 12:50:18 (-0600)  pihole  Starting PADD
20.02.21 12:50:18 (-0600)  pihole  Starting pihole-FTL (no-daemon) as root
20.02.21 12:50:18 (-0600)  pihole  [services.d] done.
20.02.21 12:49:59 (-0600)  unbound  [1613846999] unbound[1:0] notice: init module 0: validator
20.02.21 12:49:59 (-0600)  unbound  [1613846999] unbound[1:0] notice: init module 1: iterator
20.02.21 12:49:59 (-0600)  unbound  [1613846999] unbound[1:0] info: start of service (unbound 1.13.0).
20.02.21 12:50:21 (-0600) Starting service 'dbus sha256:ca6de21b82870c1b7278e017977d90be0fcf1c1bdb61ac7e059a6b194881cea5'
20.02.21 12:58:54 (-0600) Started service 'dbus sha256:ca6de21b82870c1b7278e017977d90be0fcf1c1bdb61ac7e059a6b194881cea5'
20.02.21 12:58:54 (-0600)  dbus  method return time=1613847534.437242 sender=:1.0 -> destination=:1.89 serial=585 reply_serial=2
20.02.21 12:58:54 (-0600)  dbus     object path "/org/freedesktop/systemd1/job/707"
20.02.21 12:58:54 (-0600)  dbus  method return time=1613847534.437242 sender=:1.0 -> destination=:1.89 serial=585 reply_serial=2
20.02.21 12:58:54 (-0600)  dbus     object path "/org/freedesktop/systemd1/job/707"
20.02.21 12:58:56 (-0600) Service exited 'dbus sha256:ca6de21b82870c1b7278e017977d90be0fcf1c1bdb61ac7e059a6b194881cea5'

DNS query:

bash-3.2$ nslookup
> server 10.10.10.10
Default server: 10.10.10.10
Address: 10.10.10.10#53
> google.com
;; connection timed out; no servers could be reached
>

Logs following pihole container restart:

20.02.21 13:02:07 (-0600) Killing service 'pihole sha256:4f00e2485051978fffdfb56d069f40da2b1e83ad1ef399926cc5ac157cc48264'
20.02.21 13:02:07 (-0600)  pihole  Stopping PADD
20.02.21 13:02:07 (-0600)  pihole  Stopping lighttpd
20.02.21 13:02:07 (-0600)  pihole  Stopping pihole-FTL
20.02.21 13:02:07 (-0600)  pihole  padd.sh: no process found
20.02.21 13:02:07 (-0600)  pihole  Stopping cron
20.02.21 13:02:07 (-0600)  pihole  [cont-finish.d] executing container finish scripts...
20.02.21 13:02:07 (-0600)  pihole  [cont-finish.d] done.
20.02.21 13:02:07 (-0600)  pihole  [s6-finish] waiting for services.
20.02.21 13:02:08 (-0600)  pihole  [s6-finish] sending all processes the TERM signal.
20.02.21 13:02:11 (-0600)  pihole  [s6-finish] sending all processes the KILL signal and exiting.
20.02.21 13:02:12 (-0600) Service exited 'pihole sha256:4f00e2485051978fffdfb56d069f40da2b1e83ad1ef399926cc5ac157cc48264'
20.02.21 13:02:12 (-0600) Killed service 'pihole sha256:4f00e2485051978fffdfb56d069f40da2b1e83ad1ef399926cc5ac157cc48264'
20.02.21 13:02:13 (-0600) Installing service 'pihole sha256:4f00e2485051978fffdfb56d069f40da2b1e83ad1ef399926cc5ac157cc48264'
20.02.21 13:02:13 (-0600) Installed service 'pihole sha256:4f00e2485051978fffdfb56d069f40da2b1e83ad1ef399926cc5ac157cc48264'
20.02.21 13:02:13 (-0600) Starting service 'pihole sha256:4f00e2485051978fffdfb56d069f40da2b1e83ad1ef399926cc5ac157cc48264'
20.02.21 13:02:15 (-0600) Started service 'pihole sha256:4f00e2485051978fffdfb56d069f40da2b1e83ad1ef399926cc5ac157cc48264'
20.02.21 13:02:15 (-0600)  pihole  [s6-init] making user provided files available at /var/run/s6/etc...exited 0.
20.02.21 13:02:15 (-0600)  pihole  [s6-init] ensuring user provided files have correct perms...exited 0.
20.02.21 13:02:15 (-0600)  pihole  [s6-init] making user provided files available at /var/run/s6/etc...exited 0.
20.02.21 13:02:15 (-0600)  pihole  [s6-init] ensuring user provided files have correct perms...exited 0.
20.02.21 13:02:15 (-0600)  pihole  [fix-attrs.d] applying ownership & permissions fixes...
20.02.21 13:02:15 (-0600)  pihole  [fix-attrs.d] 01-resolver-resolv: applying... 
20.02.21 13:02:15 (-0600)  pihole  [fix-attrs.d] 01-resolver-resolv: exited 0.
20.02.21 13:02:15 (-0600)  pihole  [fix-attrs.d] done.
20.02.21 13:02:15 (-0600)  pihole  [cont-init.d] executing container initialization scripts...
20.02.21 13:02:15 (-0600)  pihole  [cont-init.d] 20-start.sh: executing... 
20.02.21 13:02:15 (-0600)  pihole  [fix-attrs.d] applying ownership & permissions fixes...
20.02.21 13:02:15 (-0600)  pihole  [fix-attrs.d] 01-resolver-resolv: applying... 
20.02.21 13:02:15 (-0600)  pihole  [fix-attrs.d] 01-resolver-resolv: exited 0.
20.02.21 13:02:15 (-0600)  pihole  [fix-attrs.d] done.
20.02.21 13:02:15 (-0600)  pihole  [cont-init.d] executing container initialization scripts...
20.02.21 13:02:15 (-0600)  pihole  [cont-init.d] 20-start.sh: executing... 
20.02.21 13:02:15 (-0600)  pihole   ::: Starting docker specific checks & setup for docker pihole/pihole
20.02.21 13:02:16 (-0600)  pihole  
20.02.21 13:02:16 (-0600)  pihole    [i] Installing configs from /etc/.pihole...
20.02.21 13:02:16 (-0600)  pihole    [i] Existing dnsmasq.conf found... it is not a Pi-hole file, leaving alone!
20.02.21 13:02:16 (-0600)  pihole    [i] Copying 01-pihole.  [â] Copying 01-pihole.conf to /etc/dnsmasq.d/01-pihole.conf
20.02.21 13:02:17 (-0600)  pihole  Converting DNS1 to PIHOLE_DNS_
20.02.21 13:02:17 (-0600)  pihole  Converting DNS2 to PIHOLE_DNS_
20.02.21 13:02:17 (-0600)  pihole  Setting DNS servers based on PIHOLE_DNS_ variable
20.02.21 13:02:17 (-0600)  pihole  ::: Pre existing WEBPASSWORD found
20.02.21 13:02:17 (-0600)  pihole  DNSMasq binding to default interface: eth0
20.02.21 13:02:17 (-0600)  pihole  Added ENV to php:
20.02.21 13:02:17 (-0600)  pihole                       "PHP_ERROR_LOG" => "/var/log/lighttpd/error.log",
20.02.21 13:02:17 (-0600)  pihole                       "ServerIP" => "0.0.0.0",
20.02.21 13:02:17 (-0600)  pihole                       "VIRTUAL_HOST" => "0.0.0.0",
20.02.21 13:02:18 (-0600)  pihole  Using IPv4 and IPv6
20.02.21 13:02:18 (-0600)  pihole  ::: Preexisting ad list /etc/pihole/adlists.list detected ((exiting setup_blocklists early))
20.02.21 13:02:18 (-0600)  pihole  https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts
20.02.21 13:02:18 (-0600)  pihole  ::: Testing pihole-FTL DNS: sudo: unable to resolve host 7ed1bcc: Name or service not known
20.02.21 13:02:19 (-0600)  pihole  FTL started!
20.02.21 13:02:19 (-0600)  pihole  ::: Testing lighttpd config: Syntax OK
20.02.21 13:02:19 (-0600)  pihole  ::: All config checks passed, cleared for startup ...
20.02.21 13:02:19 (-0600)  pihole  ::: Enabling Query Logging
20.02.21 13:02:19 (-0600)  pihole    [i] Enabling logging...
20.02.21 13:02:19 (-0600)  pihole    [i] Restarting DNS se  [â] Restarting DNS server
  [â] Logging has been enabled!
20.02.21 13:02:19 (-0600)  pihole   ::: Docker start setup complete
20.02.21 13:02:20 (-0600)  pihole    [i] Neutrino emissions detected...
  [â] Pulling blocklist source list into range
20.02.21 13:02:20 (-0600)  pihole  
20.02.21 13:02:20 (-0600)  pihole    [i] Preparing new gra  [â] Preparing new gravity database
20.02.21 13:02:20 (-0600)  pihole    [i] Using libz compression
20.02.21 13:02:20 (-0600)  pihole  
20.02.21 13:02:20 (-0600)  pihole    [i] Target: https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts
20.02.21 13:02:20 (-0600)  pihole    [i] Status: Pending..  [â] Status: Retrieval successful
20.02.21 13:02:21 (-0600)  pihole    [i] Received 62904 domains
20.02.21 13:02:21 (-0600)  pihole  
20.02.21 13:02:21 (-0600)  pihole    [i] Target: https://mirror1.malwaredomains.com/files/justdomains
20.02.21 13:02:21 (-0600)  pihole    [i] Status: Pending..  [â] Status: Not found
20.02.21 13:02:21 (-0600)  pihole    [â] List download failed: no cached list available
20.02.21 13:02:21 (-0600)  pihole  
20.02.21 13:02:21 (-0600)  pihole    [i] Target: http://sysctl.org/cameleon/hosts
20.02.21 13:02:22 (-0600)  pihole    [i] Status: Pending..  [â] Status: No changes detected
20.02.21 13:02:22 (-0600)  pihole    [i] Received 20566 domains
20.02.21 13:02:22 (-0600)  pihole  
20.02.21 13:02:22 (-0600)  pihole    [i] Target: https://zeustracker.abuse.ch/blocklist.php?download=domainblocklist
20.02.21 13:02:23 (-0600)  pihole    [i] Status: Pending..  [â] Status: Retrieval successful
20.02.21 13:02:23 (-0600)  pihole    [i] Received 0 domains
20.02.21 13:02:23 (-0600)  pihole  
20.02.21 13:02:23 (-0600)  pihole    [i] Target: https://s3.amazonaws.com/lists.disconnect.me/simple_tracking.txt
20.02.21 13:02:23 (-0600)  pihole    [i] Status: Pending..  [â] Status: No changes detected
20.02.21 13:02:23 (-0600)  pihole    [i] Received 34 domains
20.02.21 13:02:23 (-0600)  pihole  
20.02.21 13:02:23 (-0600)  pihole    [i] Target: https://s3.amazonaws.com/lists.disconnect.me/simple_ad.txt
20.02.21 13:02:24 (-0600)  pihole    [i] Status: Pending..  [â] Status: No changes detected
20.02.21 13:02:24 (-0600)  pihole    [i] Received 2701 domains
20.02.21 13:02:24 (-0600)  pihole  
20.02.21 13:02:24 (-0600)  pihole    [i] Target: https://hosts-file.net/ad_servers.txt
20.02.21 13:02:24 (-0600)  pihole    [i] Status: Pending..  [â] Status: Not found
20.02.21 13:02:24 (-0600)  pihole    [â] List download failed: no cached list available
20.02.21 13:02:24 (-0600)  pihole  
20.02.21 13:02:25 (-0600)  pihole    [i] Storing downloade  [â] Storing downloaded domains in new gravity database
  [â] Building tree
20.02.21 13:02:26 (-0600)  pihole    [i] Swapping database  [â] Swapping databases
20.02.21 13:02:27 (-0600)  pihole    [i] Number of gravity domains: 86205 (74489 unique domains)
20.02.21 13:02:27 (-0600)  pihole    [i] Number of exact blacklisted domains: 1
20.02.21 13:02:27 (-0600)  pihole    [i] Number of regex blacklist filters: 0
20.02.21 13:02:27 (-0600)  pihole    [i] Number of exact whitelisted domains: 108
20.02.21 13:02:27 (-0600)  pihole    [i] Number of regex whitelist filters: 0
20.02.21 13:02:27 (-0600)  pihole    [i] Flushing DNS cach  [â] Flushing DNS cache
20.02.21 13:02:27 (-0600)  pihole    [i] Cleaning up stray  [â] Cleaning up stray matter
20.02.21 13:02:27 (-0600)  pihole  
20.02.21 13:02:28 (-0600)  pihole    [â] DNS service is listening
20.02.21 13:02:28 (-0600)  pihole       [â] UDP (IPv4)
20.02.21 13:02:28 (-0600)  pihole       [â] TCP (IPv4)
20.02.21 13:02:28 (-0600)  pihole       [â] UDP (IPv6)
20.02.21 13:02:28 (-0600)  pihole       [â] TCP (IPv6)
20.02.21 13:02:28 (-0600)  pihole  
20.02.21 13:02:28 (-0600)  pihole    [â] Pi-hole blocking is enabled
20.02.21 13:02:28 (-0600)  pihole    Pi-hole version is v5.2.4 (Latest: v5.2.4)
20.02.21 13:02:28 (-0600)  pihole    AdminLTE version is v5.4 (Latest: v5.4)
20.02.21 13:02:29 (-0600)  pihole    FTL version is v5.7 (Latest: v5.7)
20.02.21 13:02:29 (-0600)  pihole  [cont-init.d] 20-start.sh: exited 0.
20.02.21 13:02:29 (-0600)  pihole  [cont-init.d] done.
20.02.21 13:02:29 (-0600)  pihole  [services.d] starting services
20.02.21 13:02:29 (-0600)  pihole  Starting lighttpd
20.02.21 13:02:29 (-0600)  pihole  Starting pihole-FTL (no-daemon) as root
20.02.21 13:02:29 (-0600)  pihole  Starting PADD
20.02.21 13:02:29 (-0600)  pihole  Starting crond
20.02.21 13:02:29 (-0600)  pihole  [services.d] done.

DNS query following pihole container restart:

bash-3.2$ nslookup
> server 10.10.10.10
Default server: 10.10.10.10
Address: 10.10.10.10#53
> google.com
Server:     10.10.10.10
Address:    10.10.10.10#53

Non-authoritative answer:
Name:   google.com
Address: 172.217.9.142
>

klutchell commented 3 years ago

Thanks for the logs, that's super helpful! I see that on first boot it is unable to update any of the blocklists (Status: Connection Refused) but on restart the updates work (Status: No changes detected).

On it's own this wouldn't be a problem but it indicates that on first start the pihole container is having connection issues. Have you modified the docker-compose file at all? Can you share some of your env vars?

DNSMASQ_LISTENING
INTERFACE
DNS1
DNS2
others? (not webpassword)

Also could you try setting the ServerIP variable to the LAN IP of your device and see if the behavior changes?

eiddor commented 3 years ago

Wow, good catch - I'm not sure how I missed that difference. I saw some "Connection Refused" messages but didn't notice that it was for every blocklist and kind of got used to seeing them.

No modification to docker-compose, especially while testing these branches for you. Once I get a stable version, I do change some stuff in pihole/Dockerfile to pull in a hosts file and change the font size, but again I've been careful not to do that while testing.

Environment variables:

DNS1 208.67.222.222 DNS2 208.67.220.220 DNSMASQ_LISTENING eth0 INTERFACE eth0 TZ America/Chicago

Just added ServerIP and I'm testing now. Is that a recent addition?

eiddor commented 3 years ago

No luck after adding ServerIP - I was really hoping for it to be that simple :-)

eiddor commented 3 years ago

Just to add - there's nothing special about my network:

The RPi3b is wired directly into a Cisco 9200 switch with no auth configured, so the port comes up instantly.

klutchell commented 3 years ago

ServerIP has been required since v4.2.2. I'll remove the optional label from the README. I also see a bunch of other env vars that are being deprecated so I'll have to clean that up soon.

https://github.com/pi-hole/docker-pi-hole#deprecated-environment-variables

Starting with the v4.1.1 release your Pi-hole container may encounter issues starting the DNS service unless ran with the following setting: --dns=127.0.0.1 --dns=1.1.1.1 The second server can be any DNS IP of your choosing, but the first dns must be 127.0.0.1

https://github.com/pi-hole/docker-pi-hole#docker-pi-hole-v411

You're not blocking 1.1.1.1 on your network in any way?

eiddor commented 3 years ago

Oh, man - Their Docker hub docs are a bit out of sync with the GutHub docs - https://hub.docker.com/r/pihole/pihole. I need to revisit my "new" pihole device now (even though it's working fine).

Also, I'm reading this as ServerIP not being required unless we're running host mode. Does Balena run host mode by default? (Should it?)

Docker Pi-Hole v4.2.2 ServerIP no longer a required environment variable unless you run network 'host' mode! Feel free to remove it unless you need it to customize lighttpd

As for the DNS requirement - This is where my knowledge (or lack of) of Balena breaks down. Looking at docker-compose.yml it seems that we're covered:

dns:
      - "127.0.0.1"
      - "1.1.1.1"
    environment:
      DNSMASQ_LISTENING: eth0
      INTERFACE: eth0
      DNS1: 127.0.0.1#5053
      DNS2: 127.0.0.1#5053

Now, I'm guessing that the environment variables in the GUI override these settings (where does that magic happen?).

The thing is, I want to run OpenDNS as my upstream DNS (I work for Cisco). How do we reconcile that with the 127.0.0.1 requirement?

With the new method, I guess it would be like this:

PIHOLE_DNS_: 127.0.0.1#5053;208.67.222.222;208.67.220.220

Now, do I have to wait until Balena Pihole gets an update before I can use that?

(1.1.1.1 is fully accessible in my network, I use it all the time to test web redirect portals)

eiddor commented 3 years ago

I decided to try to set DNS1 to 127.0.0.1 and am still having the problem, so it wasn't that.

FYI - on my dedicated RPi running pihole, this syntax worked for me and pihole was even "smart" enough to recognize OpenDNS as the upstream

PIHOLE_DNS_ = 127.0.0.1;208.67.222.222;208.67.220.220

klutchell commented 3 years ago

Also, I'm reading this as ServerIP not being required unless we're running host mode. Does Balena run host mode by default? (Should it?)

Balena does not run in host mode by default, but this Pi-Hole stack does in order to avoid port conflicts.

https://github.com/klutchell/balena-pihole/blob/2c028ecc409e42431d4f57c75bf5066bbed7fac7/docker-compose.yml#L23

^^ So as you can see from the line above, ServerIP should be provided for this stack.

https://github.com/klutchell/balena-pihole/blob/2c028ecc409e42431d4f57c75bf5066bbed7fac7/docker-compose.yml#L15-L17

^^ These lines specify the container DNS, and aren't directly related to Pi-Hole, but according to their docs we want values similar to this to make sure Pi-Hole can start correctly (needs to resolve before it can resolve!).

In fact, since these are only used at container startup and not to perform actual runtime queries, you could replace the 1.1.1.1 with 208.67.222.222 to use OpenDNS there as well. As long as 127.0.0.1 is first.

https://github.com/klutchell/balena-pihole/blob/2c028ecc409e42431d4f57c75bf5066bbed7fac7/docker-compose.yml#L21-L22

^^ These DNS lines are overwritten if you provide alternate values in the dashboard. So by providing 208.67.222.222 and 1208.67.220.220 you are effectively using OpenDNS for your upstream (not related to dns: above).

Now, I'm guessing that the environment variables in the GUI override these settings (where does that magic happen?).

The balena supervisor will apply dashboard environment variables to the services and override existing ones.

You can go ahead and start using PIHOLE_DNS_=208.67.222.222;208.67.220.220 and remove the two DNS lines from your compose file and dashboard now if you want. I already have for my setup.

eiddor commented 3 years ago

Thanks for the thorough explanation, @klutchell !

You can go ahead and start using PIHOLEDNS=208.67.222.222;208.67.220.220 and remove the two DNS lines from your compose file and dashboard now if you want. I already have for my setup.

Done.

So it looks like after all of this we've:

a) taught Roddie a lot about how Balena works

b) figured out that we need to update some environment variables and docs in the project

Not a waste for me at all, but it seems like I'm still at square one and you ended up working this weekend anyway.

klutchell commented 3 years ago

Nah dude, I still got lots of gaming and dog walks in this weekend. We're all good.

https://github.com/klutchell/balena-pihole/blob/2c028ecc409e42431d4f57c75bf5066bbed7fac7/docker-compose.yml#L15-L17

What if you changed the second entry here to be something else, like google or opendns? Any difference?

Once I get a stable version, I do change some stuff in pihole/Dockerfile to pull in a hosts file and change the font size

Can you clarify these steps a bit? Anything that would impact the host OS that you would know of? I wonder if this would happen with a fresh image, or another device, to isolate the issue to the environment/network or the device.

Pro Tip: if you want to learn a ton more about how balena works, these masterclasses are killer and actually pretty fun. https://www.balena.io/docs/learn/more/masterclasses/overview/

eiddor commented 3 years ago

Nah dude, I still got lots of gaming and dog walks in this weekend. We're all good.

Ha! As long as the wife isn't blaming me for gaming, we're ok.

What if you changed the second entry here to be something else, like google or opendns? Any difference?

No difference.

Can you clarify these steps a bit? Anything that would impact the host OS that you would know of? I wonder if this would happen with a fresh image, or another device, to isolate the issue to the environment/network or the device.

Harmless, I believe:

RUN echo "addn-hosts=/etc/pihole/lan.list" | sudo tee /etc/dnsmasq.d/02-lan.conf
RUN curl -sSL https://x.x.x.x/roddie/hosts -o /etc/pihole/lan.list

ENV FONTFACE Terminus
ENV FONTSIZE 10x20

Before I even opened the initial display issue I started a new application with a new device to make sure it wasn't something left-over from the last year or three. Happy to try again and only start with the dtoverlay branch. It'll take me a half hour or so.

Pro Tip: if you want to learn a ton more about how balena works, these masterclasses are killer and actually pretty fun. https://www.balena.io/docs/learn/more/masterclasses/overview/

Excellent! Can't wait to go through them!

eiddor commented 3 years ago

@klutchell Fresh application and device, using the dtoverlay branch with no modifications. Same problem.

klutchell commented 3 years ago

Okay, thanks for testing. You're correct that your production modifications are harmless as well. Though if you wanted you could also set FONTFACE=Terminus and FONTSIZE=10x20 in the balena dashboard and skip changing the Dockerfile for those. The supervisor will use the dashboard values above all else.

For your lan hosts those can be added manually via the Pi-Hole dashboard under Local DNS -> DNS Records but your way is faster since you already have the list hosted somewhere.

Let me think about your container startup issue some more, nothing obvious is coming to me right now.

But remember that we aren't at square one cause we fixed your PADD display (and mine) and we noticed that some Pi-Hole env vars are being deprecated so we are tracking that. Also a learning experience isn't a waste in this line of work.

eiddor commented 3 years ago

Though if you wanted you could also set FONTFACE=Terminus and FONTSIZE=10x20 in the balena dashboard and skip changing the Dockerfile for those. The supervisor will use the dashboard values above all else.

Good to know! Is that documented somewhere? I only noticed those settings in pihole/Dockerfile a couple of months ago. It's nice to finally use all of my screen real estate :-)

For the LAN hosts there was a whole discussion about it in the comments section of the original balena-pihole blogpost from a few years ago and a few different options were thrown around. It's my entire network, which is about 90 hosts and this is just a hosts file that I have always maintained on my "main" server so that I can quickly find something if needed. It keeps things semi-automated, at least.

Related: I learned what Docker was due to that blog post and the related comment thread, and ended up writing a whole blog series about it.

Let me think about your container startup issue some more, nothing obvious is coming to me right now.

I appreciate it - Let me know if you want me to test anything. Two things are nagging at me:

1) I don't like host mode - I'm sure it's not causing this problem, but I really don't like host mode.

2) Something in the back of my mind keeps telling me that I see this connectivity issue whenever we have display-related changes. I'll play with some of them to see if I can get it to "unbreak."

But remember that we aren't at square one cause we fixed your PADD display (and mine) and we noticed that some Pi-Hole env vars are being deprecated so we are tracking that. Also a learning experience isn't a waste in this line of work.

True true - And I got my 3rd or 4th balena project PR submitted.

klutchell commented 3 years ago

If you'd like to try it without host mode you can replace the following line...

https://github.com/klutchell/balena-pihole/blob/c2dd4c64fc0f48727ab5fbf96808e4896e06a300/docker-compose.yml#L22

... with the below lines using your device's LAN IP.

    ports:
      - "LAN_IP:53:53/tcp"
      - "LAN_IP:53:53/upd"

This is to prevent the balena engine from binding to 53 on ALL interfaces, because 53 is already in use on the balena VPN interface.

Then you should also remove the following lines from the pihole Dockerfile:

https://github.com/klutchell/balena-pihole/blob/c2dd4c64fc0f48727ab5fbf96808e4896e06a300/pihole/Dockerfile#L17-L20

After that you'll no longer be running in host mode and we can see if the behaviour changes.

You can also try resolving this error ...

::: Testing pihole-FTL DNS: sudo: unable to resolve host 7ed1bcc: Temporary failure in name resolution

... by adding 127.0.0.1 7ed1bcc to your hosts file or in the Pi-hole dashboard. Normally that error doesn't seem to have any impact but that's how I cleared it on my devices.

eiddor commented 3 years ago

After that you'll no longer be running in host mode and we can see if the behaviour changes.

So, I actually ended up trying this last night (I added 80/443 as well for the GUI), and I couldn't get it past this state:

Just redid it using only the 53s and got the same results. I actually had to do a "Purge data" to get it to take another push and reload again. I don't want to spend too much time on the host mode thing. If anything it should make the network connectivity from the container better, not worse. Just a sticking-point with me since networking is actually my happy place.

::: Testing pihole-FTL DNS: sudo: unable to resolve host 7ed1bcc: Temporary failure in name resolution

Got rid of that last night, too - No change. That one has always irritated me.

I just noticed that I didn't have dtoverlay defined (it's usually set to rpi-display), but the display has been working the whole time. Not really meaningful, just interesting.

klutchell commented 3 years ago

When you tried it did you specify the LAN IP in your ports? This part is very important to not conflict with balena services running on port 53 and may be why you ended up in a bad state?

    ports:
      - "80:80/tcp"
      - "LAN_IP:53:53/tcp"
      - "LAN_IP:53:53/upd"

eiddor commented 3 years ago

I tried both ways with every combination - Maybe there's another port that balena is expecting? I may come back to the host mode thing later once we figure out the problem, but in my mind host mode should make things a bit more open here and I don't want to get too distracted with it.

I'm currently in a non-working state and here's what I'm seeing:

the GUI works from the outside (?!)
local DNS queries from the pihole container work
DNS queries from outside the pihole container do not work (as in the OP)
from the host I can do a DNS query to localhost but not to the ServerIP

root@57e7e72:~# nslookup google.com
Server:    127.0.0.2
Address 1: 127.0.0.2 57e7e72

Name:      google.com
Address 1: 172.217.1.238 lax17s02-in-f14.1e100.net
Address 2: 2607:f8b0:4000:814::200e dfw25s27-in-x0e.1e100.net

root@57e7e72:~# nslookup google.com 10.10.10.10
Server:    10.10.10.10
Address 1: 10.10.10.10 57e7e72

nslookup: can't resolve 'google.com'

root@57e7e72:~# nslookup google.com 127.0.0.1 
Server:    127.0.0.1
Address 1: 127.0.0.1 57e7e72

Name:      google.com
Address 1: 172.217.9.174 dfw25s27-in-f14.1e100.net
Address 2: 2607:f8b0:4000:814::200e dfw25s27-in-x0e.1e100.net

Docker doesn't seem to think 53 is bound anywhere else:

root@57e7e72:~# balena ps -a --format="table {{.ID}}\t{{.Names}}\t{{.Ports}}\t{{.Status}}"
CONTAINER ID        NAMES                     PORTS                                            STATUS
71fa5916cc18        pihole_3310990_1706164                                                     Up 35 minutes (healthy)
6221682a7f82        dbus_3310993_1706164                                                       Exited (0) 34 minutes ago
da3f01c7f402        fbcp_3310992_1706164                                                       Up 35 minutes
d6762440786d        unbound_3310991_1706164   0.0.0.0:5053->5053/tcp, 0.0.0.0:5053->5053/udp   Up 35 minutes
f108ef13311c        resin_supervisor                                                           Up 34 minutes (healthy)

Simply restarting the pihole container makes everything work.

klutchell commented 3 years ago

Very odd, sorry if it feels like we are spinning our wheels here but I really want to find out what changes in the container between restarts. Can you check the content of /etc/dnsmasq.conf before and after the restart?

Also, I'm not sure what state you are in but if we are still using host networking what happens if you just remove the following line from your Dockerfile which will allow binding to all interfaces?

https://github.com/klutchell/balena-pihole/blob/c2dd4c64fc0f48727ab5fbf96808e4896e06a300/pihole/Dockerfile#L17-L20

eiddor commented 3 years ago

Oh no, I absolutely appreciate this and since it does seem to be network-adjacent, I am enjoying it. Plus I only use two balena apps, so the more I get to tinker with them the better. I think we're getting closer.

Check my logic here:

ServerIP:53 isn't open to the container. Docker doesn't seem to see a problem, but because we're using host networking, we don't see the explicit port as open.

ServerIP:80 works fine, so it's not an IP-binding issue. 127.0.0.1:53 works fine from the host, so it's not a DNS process issue.

Reproducing this is just a matter of doing a shutdown and power-on, so it's easy for me to test whatever can think of.

/etc/dnsmasq.conf content is the same regardless of state:

conf-dir=/etc/dnsmasq.d
bind-interfaces

user=root

Commenting out that line puts the pihole container into a very unhappy loop:

22.02.21 13:25:10 (-0600)  pihole  dnsmasq: failed to create listening socket for port 53: Address already in use
22.02.21 13:25:10 (-0600)  pihole  [cont-init.d] 20-start.sh: exited 1.
22.02.21 13:25:10 (-0600)  pihole  [cont-finish.d] executing container finish scripts...
22.02.21 13:25:10 (-0600)  pihole  [cont-finish.d] done.
22.02.21 13:25:10 (-0600)  pihole  [s6-finish] waiting for services.
22.02.21 13:25:10 (-0600)  pihole  [s6-finish] sending all processes the TERM signal.
22.02.21 13:25:13 (-0600)  pihole  [s6-finish] sending all processes the KILL signal and exiting.
22.02.21 13:25:15 (-0600) Service exited 'pihole sha256:761dcf61b473bc430c0f2c473ea4ed0303a8ee8da9f56b33a65c3b0b527f5b4d'

klutchell commented 3 years ago

Okay, that's the behaviour I was expecting and why that line exists. Can you confirm that the interface provided for INTERFACE and DNSMASQ_LISTENING matches the interface name on the host for your wired connection?

It seems almost like dnsmasq is binding to the wrong interface on first startup (like localhost only), but finding the correct one on a container restart. That would be very weird though, and somehow related to a race condition on startup or a delayed network event? I don't suppose you have a secondary network where you could connect one of these devices? Or even try wifi to see if the behaviour changes? I know I'm grasping at straws now.

eiddor commented 3 years ago

Confirmed they're both eth0

You read my mind on dnsmasq (I think we got there at the same time) - I'm collecting some lsof output at the moment. Give me a few minutes.

eiddor commented 3 years ago

lsof output in a "broken" state:

root@57e7e72:~# lsof -iTCP -sTCP:LISTEN
COMMAND    PID     USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
systemd      1     root   53u  IPv6  16019      0t0  TCP *:22222 (LISTEN)
dnsmasq   1531   nobody    5u  IPv4  20594      0t0  TCP 57e7e72:domain (LISTEN)
dnsmasq   1531   nobody    7u  IPv4  20596      0t0  TCP 57e7e72:domain (LISTEN)
balena-en 1849     root    6u  IPv6  23559      0t0  TCP *:5053 (LISTEN)
node      2798     root   24u  IPv6  27443      0t0  TCP *:48484 (LISTEN)
lighttpd  2938 www-data    4u  IPv4  27285      0t0  TCP *:http (LISTEN)
lighttpd  2938 www-data    5u  IPv6  27286      0t0  TCP *:http (LISTEN)
pihole-FT 2948     root    5u  IPv4  25471      0t0  TCP 57e7e72:domain (LISTEN)
pihole-FT 2948     root    7u  IPv6  25473      0t0  TCP 57e7e72:domain (LISTEN)
pihole-FT 2948     root    9u  IPv6  25475      0t0  TCP localhost:domain (LISTEN)
pihole-FT 2948     root   12u  IPv6  26154      0t0  TCP localhost:4711 (LISTEN)
pihole-FT 2948     root   17u  IPv4  25481      0t0  TCP 57e7e72:4711 (LISTEN)
root@57e7e72:~# lsof -iUDP -P -n | egrep -v '(127|::1)'
COMMAND    PID   USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
chronyd   1424   root    5u  IPv4  19470      0t0  UDP *:1234 
chronyd   1424   root    6u  IPv6  19471      0t0  UDP *:1234 
avahi-dae 1513  avahi   11u  IPv4  19104      0t0  UDP *:5353 
avahi-dae 1513  avahi   12u  IPv6  19105      0t0  UDP *:5353 
avahi-dae 1513  avahi   13u  IPv4  19106      0t0  UDP *:62747 
avahi-dae 1513  avahi   14u  IPv6  19107      0t0  UDP *:49811 
NetworkMa 1523   root   18u  IPv4  28875      0t0  UDP 10.10.10.10:68 
dnsmasq   1531 nobody    4u  IPv4  20593      0t0  UDP 10.114.102.1:53 
balena-en 1864   root    6u  IPv6  21263      0t0  UDP *:5053 
pihole-FT 2948   root    6u  IPv6  25472      0t0  UDP [fe80::84ae:1c5f:e361:59d2]:53

lsof output in a "working" state:

root@57e7e72:~# lsof -iTCP -sTCP:LISTEN
COMMAND    PID     USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
systemd      1     root   53u  IPv6  16019      0t0  TCP *:22222 (LISTEN)
dnsmasq   1531   nobody    5u  IPv4  20594      0t0  TCP 57e7e72:domain (LISTEN)
dnsmasq   1531   nobody    7u  IPv4  20596      0t0  TCP 57e7e72:domain (LISTEN)
balena-en 1849     root    6u  IPv6  23559      0t0  TCP *:5053 (LISTEN)
node      2798     root   24u  IPv6  27443      0t0  TCP *:48484 (LISTEN)
lighttpd  7246 www-data    4u  IPv4  41012      0t0  TCP *:http (LISTEN)
lighttpd  7246 www-data    5u  IPv6  41013      0t0  TCP *:http (LISTEN)
pihole-FT 7254     root    5u  IPv4  40101      0t0  TCP 57e7e72:domain (LISTEN)
pihole-FT 7254     root    7u  IPv4  40103      0t0  TCP 57e7e72:domain (LISTEN)
pihole-FT 7254     root    9u  IPv6  40105      0t0  TCP 57e7e72:domain (LISTEN)
pihole-FT 7254     root   11u  IPv6  40107      0t0  TCP localhost:domain (LISTEN)
pihole-FT 7254     root   14u  IPv4  39862      0t0  TCP 57e7e72:4711 (LISTEN)
pihole-FT 7254     root   17u  IPv6  39865      0t0  TCP localhost:4711 (LISTEN)
root@57e7e72:~# lsof -iUDP -P -n | egrep -v '(127|::1)'
COMMAND    PID   USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
chronyd   1424   root    5u  IPv4  19470      0t0  UDP *:1234 
chronyd   1424   root    6u  IPv6  19471      0t0  UDP *:1234 
avahi-dae 1513  avahi   11u  IPv4  19104      0t0  UDP *:5353 
avahi-dae 1513  avahi   12u  IPv6  19105      0t0  UDP *:5353 
avahi-dae 1513  avahi   13u  IPv4  19106      0t0  UDP *:62747 
avahi-dae 1513  avahi   14u  IPv6  19107      0t0  UDP *:49811 
NetworkMa 1523   root   18u  IPv4  28875      0t0  UDP 10.10.10.10:68 
dnsmasq   1531 nobody    4u  IPv4  20593      0t0  UDP 10.114.102.1:53 
balena-en 1864   root    6u  IPv6  21263      0t0  UDP *:5053 
pihole-FT 7254   root    4u  IPv4  40100      0t0  UDP 10.10.10.10:53 
pihole-FT 7254   root    8u  IPv6  40104      0t0  UDP [fe80::84ae:1c5f:e361:59d2]:53

To save you some squinting, there are two differences here that I'm seeing (you may already be ahead of me with your thinking).

One of these IPv4 TCP:53 listeners is missing when things are broken (I don't know why there are two):

pihole-FT 7254     root    5u  IPv4  40101      0t0  TCP 57e7e72:domain (LISTEN)
pihole-FT 7254     root    7u  IPv4  40103      0t0  TCP 57e7e72:domain (LISTEN)

More importantly, this UDP:53 listener is missing when things are broken:

pihole-FT 7254   root    4u  IPv4  40100      0t0  UDP 10.10.10.10:53

In both cases, dnsmasq is listening on the containerip:53but that doesn't change and shouldn't matter.

If dnsmasq is getting in the way at some point, it's gone by the time I have access to the console.

I also don't like that whatever is failing is doing so silently. Are there any other logs that I can collect that might help?

klutchell commented 3 years ago

Good findings, I think we are narrowing it down.

I've never tried this before now, but there seems to be a lot of info in the debug logs tool in the Pi-hole dashboard.

https://discourse.pi-hole.net/t/how-do-i-debug-my-pi-hole-installation/3104

Might be worth grabbing these before and after a restart and comparing them. The farther we get into FTL and dnsmasq territory the less help I'll be, but I'm happy to learn!

eiddor commented 3 years ago

Yeah, I think we're close. I've been trying various things while on calls, but I don't think I've learned anything.

That page looks to be more about secure transport of the logs /var/log than anything else - Unless I'm missing something. I'll have to dig some more and see exactly where I need to submit a debug token.

FWIW I have looked at /var/log/pihole.log and /var/log/pihole-FTL.log - nothing interesting.

Also nothing in dmesg of the pihole container - nothing.

Unsurprisingly, host networking is still nagging at me - I wonder if we can eliminate that, if we'd be able to see the binding in Docker (or an error)?

The fact that we can see a difference between lsof outputs is encouraging, I just can't help but think that somewhere somehow there's a Docker log that is waiting for us to look at it.

klutchell commented 3 years ago

No need to submit with a token, just click the button to generate them and a surprising amount of information is collected. Not just logs but simple diagnostic tests. Then you can copy them somewhere and compare.

klutchell commented 3 years ago

The balena engine logs can be viewed with journalctl -u balena -a.

eiddor commented 3 years ago

The balena engine logs can be viewed with journalctl -u balena -a.

I bet that's covered in the Masterclass which is tab number 28 in my browser :-)

Nothing interesting there - Everyone thinks everyone else is happy.

No need to submit with a token, just click the button to generate them and a surprising amount of information is collected. Not just logs but simple diagnostic tests. Then you can copy them somewhere and compare.

Duh - That makes a whole lot more sense than what I was thinking. Ok, I'll grab them and go through them tonight/tomorrow and let you know if anything jumps out.

Thanks again for all of your help here!

BTW - This is from within the container:

Not working:

root@57e7e72:/usr/src/app#  lsof -iUDP -P -n | egrep -v '(127|::1)'
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
pihole-FT 526 root    6u  IPv6  26222      0t0  UDP [fe80::84ae:1c5f:e361:59d2]:53

Working:

root@57e7e72:/usr/src/app#  lsof -iUDP -P -n | egrep -v '(127|::1)'
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
pihole-FT 544 root    4u  IPv4  89624      0t0  UDP 10.10.10.10:53 
pihole-FT 544 root    8u  IPv6  89628      0t0  UDP [fe80::84ae:1c5f:e361:59d2]:53

eiddor commented 3 years ago

So, a lot of good info in these logs - Nothing too in the weeds and everything is pretty self-explanatory.

The only difference is kind of disappointing, though. An extra port "in use" which we already know about.

Not working:

*** [ DIAGNOSING ]: Ports in use
[80] is in use by lighttpd
[80] is in use by lighttpd
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL

Working:

*** [ DIAGNOSING ]: Ports in use
[80] is in use by lighttpd
[80] is in use by lighttpd
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL

The rest of the debug output is virtually identical. I'll try again tomorrow to make sure I'm not just really tired and missing something, but it seems like whatever is happening is being ignored by everything.

A few ideas --

1) Is there a way to delay the pihole container's start and manually start it to see if it gives an error? If it works, we'll know that it's more of a timing/sequence issue.

2) I figure out how to disable host mode and troubleshoot directly with the ports via docker.

3) I strip all of the display-related stuff from my config and eliminate that as a variable.

klutchell commented 3 years ago

Is there a way to delay the pihole container's start and manually start it to see if it gives an error? If it works, we'll know that it's more of a timing/sequence issue.

Adding the following line to the bottom of your pihole service definition in the docker-compose file will not delay the container start, but it will delay any services running in the container.

    entrypoint: /bin/bash
    command:
      - -c
      - "sleep 60 && /s6-init"

Delaying the container start is a bit more tricky but we will see if that is even required.

I figure out how to disable host mode and troubleshoot directly with the ports via docker.

The following changes worked for my device:

Set the ports to only bind to the lan IP (160 in my case)

# network_mode: host
ports:
  - 192.168.1.160:53:53/tcp
  - 192.168.1.160:53:53/udp
  - 192.168.1.160:80:80/tcp

Set DNSMASQ_LISTENING and INTERFACE to all in the dashboard

I strip all of the display-related stuff from my config and eliminate that as a variable.

Just remove the fbcp and dbus services from docker-compose and delete pihole/services/padd/run. You can also remove the unbound service definition since you aren't using it.

eiddor commented 3 years ago

Oh, good - A few things to try today. Thank you! I'll re-run the pihole debugs first to make sure I didn't miss anything, and then I'll decide which order to try the rest in.

~~Did I read the last main commit correctly? Is it all caught up and safe for me to use now?~~

Nope, hasn't been merged yet - I'll keep using dtoverlay.

klutchell commented 3 years ago

Yeah, I'm waiting on the balenablocks/fbcp PR to be merged but I'd like someone else to take a look at it first.

eiddor commented 3 years ago

Here are the debugs in case you want to check my work - I'm actually impressed with how these are laid-out and annotated. There's some really cool logic here that even I understand.

pihole_debug_not_working.txt pihole_debug_working.txt

I'm going to tackle this in this order:

1) Pare back services (PADD, unbound, fbcp, dbus) one at a time

2) Delay the services in the pihole container to see if it works. If it doesn't, I'll attempt to run them manually and see if anything complains.

3) Disable host mode and try to collect some Docker output.

eiddor commented 3 years ago

So - Good news, I suppose. It's not related to any of the other services. I'm only running pihole right now and it's still breaking.

I really hope it's nothing in my environment, but I can't see how it would be. Connectivity is there, the GUI works even when DNS doesn't.

Next step is to delay the services in the container itself.

eiddor commented 3 years ago

More "good" news, only because it gets us closer. The sleep works.

So now we have a timing issue somewhere... All of the other containers are gone, so what's left? The Supervisor and the pihole container?

The problem I have right now is that I can't get into the container to manually run s6-init before whatever condition goes away.

I'm currently messing with the sleep timer to see if I can find the "breaking" point, but that's probably not going to help much.

On the bright side - host mode is in the clear (for now).

klutchell commented 3 years ago

I would still try without host mode if possible. If something in the network stack or environment is changing in the first few seconds of container start, it will be nice to isolate that a little more.

Knowing how long of a delay will workaround the issue is interesting but not super helpful as you said.

eiddor commented 3 years ago

Knowing how long of a delay will workaround the issue is interesting but not super helpful as you said.

~11 seconds heh. That took me way too long to narrow down, but that's about where it is.

Going for non-host mode now.

eiddor commented 3 years ago

So, remember that nagging thought that I've mentioned once or twice?

Disabling host mode gets rid of the issue. I'm on my tenth test just to make sure, but yeah... everything is working fine following a shutdown/restart. I started with a fresh archive and only disabled host mode, so everything else is in-tact.

Now the question is: what can/should we do about it? Whatever is happening should show up in the Docker/balena logs, right?

eiddor commented 3 years ago

Also, just to eliminate my environment a bit more - I took my other RPi3B running Pi-hole in Docker (non-Balena), and switched it to host networking. No problems.

klutchell commented 3 years ago

Yup, that tells me that some other interface is being created or binding at the same time the Pi-hole container is starting on boot. Maybe something related to the supervisor and/or balena vpn but I'll have to ask internally to get a hint of which direction to look.

I wish we could push the non-host network option to everyone but manually entering the ServerIP in the docker-compose.yml is a lot of friction for a new user. We also don't have any way to substitute environment variables in the compose file at runtime either.

Perhaps we should try some tests with non-host mode as we have now, but don't specify the ServerIP in the port binding?

    # network_mode: host
    ports:
      - 53:53/tcp
      - 53:53/udp
      - 80:80/tcp

In my experience that usually results in one of two things:

balena engine complaining that port 53 is in use on one of the interfaces and refusing to start the pihole service
binding succeeds but the device stops communicating with the balena API

So if this is indeed a race condition, and we can trigger the 2nd outcome, I would expect more useful logs in the balena supervisor.

journalctl -u resin-supervisor

I can try this test as well to try to reproduce the 2nd case. We should also remove the sleep for this test.

eiddor commented 3 years ago

Maybe something related to the supervisor and/or balena vpn but I'll have to ask internally to get a hint of which direction to look.

Right - That was my next question: does belena do anything special with UDP 53 at startup?

I also don't like the fact that I seem to be the only one having a problem.

I wish we could push the non-host network option to everyone but manually entering the ServerIP in the docker-compose.yml is a lot of friction for a new user. We also don't have any way to substitute environment variables in the compose file at runtime either.

I've been thinking about this one. Host networking effectively binds to , right? Earlier you had me specify LAN_IP so we wouldn't step on the 53 binding that Balena VPN needed. Wouldn't they be fighting over it with host networking enable? (maybe leading to my problem?)

I agree that we can't push it to the user, but there's gotta be an answer here somewhere.

Perhaps we should try some tests with non-host mode as we have now, but don't specify the ServerIP in the port binding?

Yep, that makes sense - I'll work on this tomorrow.

Sleep is long-gone - I'm currently sitting at dtoverlay with host networking disabled.

Thanks again for your help! I think we're there and just need a good answer :-)

klutchell commented 3 years ago

I've been thinking about this one. Host networking effectively binds to , right? Earlier you had me specify LAN_IP so we wouldn't step on the 53 binding that Balena VPN needed. Wouldn't they be fighting over it with host networking enable? (maybe leading to my problem?)

I think the issue is in this area, yes. The dnsmasq conf entry we are adding is supposed to ensure dnsmasq only binds to the interfaces we specify (eth0 in this case) and up until now no one has reported any issue with that not working as intended.

I pinged the whole balena team internally so I may have some suggestions by tomorrow. I'm also hoping that if we can get the container to bind to 53 on all interfaces the device logs will give us a clue as to where the conflict is.

klutchell commented 3 years ago

One idea that was floated was to run lsof -nP -iTCP:53 -sTCP:LISTEN in a tight loop on startup. Preferably on the host but maybe easier in the pihole container (or a new service) initially.

I'll leave it to you to replace the pihole init command with a little bash script, or you can wait for me to give it a try tomorrow and I can provide my steps.

eiddor commented 3 years ago

One idea that was floated was to run lsof -nP -iTCP:53 -sTCP:LISTEN in a tight loop on startup. Preferably on the host but maybe easier in the pihole container (or a new service) initially.

I thought of that a couple of days ago, but I was hoping for a way to get into the device to run it manually before things settle. Your idea is much better, but I don't believe lsof inside of a container will see ports on the host.

Let me flip to host networking and test that first.

eiddor commented 3 years ago

Yeah, no good inside of a container -

Host:

root@57e7e72:~# lsof -nP -iTCP:5053 -sTCP:LISTEN
COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
balena-en 1840 root    6u  IPv6  22086      0t0  TCP *:5053 (LISTEN)

Container:

root@57e7e72:/usr/src/app# lsof -nP -iTCP:5053 -sTCP:LISTEN
root@57e7e72:/usr/src/app#

So now we're getting a bit above my pay grade, but I'll try to read through this thread and this thread to see what I can figure out/learn.

Running the script is one challenge, but also... where will the output go? I guess I could redirect it to /tmp/something.log

I also need to figure out how to fork said script.

I have a headache.

eiddor commented 3 years ago

I got as far as "build your own balena image" and stopped. We might need a better mousetrap.

klutchell commented 3 years ago

I think I have a better idea of what's happening now. The dnsmasq on the host is only supposed to bind to the VPN interface and a special loopback, so there shouldn't be an issue there. I did consider that the host is not using the bind-interfaces option and we could try adding that option your device only as a test if we want. I can provide quick instructions.

root@2cac2ee:~# lsof -nP -iTCP:53 -sTCP:LISTEN
COMMAND  PID   USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
dnsmasq 1366 nobody    5u  IPv4  15269      0t0  TCP 10.114.102.1:53 (LISTEN)
dnsmasq 1366 nobody    7u  IPv4  15271      0t0  TCP 127.0.0.2:53 (LISTEN)

On my devices there is a delay in the logs between using the default balena upstream DNS (8.8.8.8) and my DHCP assigned DNS (192.168.1.55). This delay probably explains the race condition, since your container must be starting before an updated DHCP ack is received.

root@2cac2ee:~# journalctl -u dnsmasq
-- Logs begin at Tue 2020-08-04 14:03:36 UTC, end at Wed 2021-02-24 14:00:32 UTC. --
Feb 24 10:14:05 2cac2ee dnsmasq[1361]: dnsmasq: syntax check OK.
Feb 24 10:14:05 2cac2ee dnsmasq[1366]: dnsmasq[1366]: started, version 2.78 cachesize 150
Feb 24 10:14:05 2cac2ee dnsmasq[1366]: dnsmasq[1366]: compile time options: IPv6 GNU-getopt DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
Feb 24 10:14:05 2cac2ee dnsmasq[1366]: dnsmasq[1366]: DBus support enabled: connected to system bus
Feb 24 10:14:05 2cac2ee dnsmasq[1366]: dnsmasq[1366]: reading /etc/resolv.dnsmasq
Feb 24 10:14:05 2cac2ee dnsmasq[1366]: dnsmasq[1366]: using nameserver 8.8.8.8#53
Feb 24 10:14:05 2cac2ee dnsmasq[1366]: dnsmasq[1366]: read /etc/hosts - 6 addresses
Feb 24 10:14:05 2cac2ee dnsmasq[1366]: dnsmasq[1366]: using nameserver 8.8.8.8#53
Feb 24 10:14:14 2cac2ee dnsmasq[1366]: dnsmasq[1366]: reading /etc/resolv.dnsmasq
Feb 24 10:14:14 2cac2ee dnsmasq[1366]: dnsmasq[1366]: using nameserver 8.8.8.8#53
Feb 24 10:14:14 2cac2ee dnsmasq[1366]: dnsmasq[1366]: using nameserver 192.168.1.55#53

I'd like to see your dnsmasq logs before anything else, since this may indicate why your devices behave differently. It crossed my mind that you are using a 10.x.x.x subnet but I sure hope that's not related or we have much bigger problems.

klutchell commented 3 years ago

If you want to try adding the bind-interfaces option to the host you can do the following. Note that this is just a test and it's not recommended to modify the host OS as all changes are lost on an OS upgrade.

mount -o remount,rw /
cp /etc/dnsmasq.conf /etc/dnsmasq.conf.orig
echo bind-interfaces >> /etc/dnsmasq.conf
sync
reboot

eiddor commented 3 years ago

My dnsmasq logs:

-- Logs begin at Wed 2021-02-24 15:51:59 UTC, end at Wed 2021-02-24 16:30:01 UTC. --
Feb 24 15:52:05 57e7e72 dnsmasq[1524]: dnsmasq: syntax check OK.
Feb 24 15:52:05 57e7e72 dnsmasq[1530]: dnsmasq[1530]: started, version 2.78 cachesize 150
Feb 24 15:52:05 57e7e72 dnsmasq[1530]: dnsmasq[1530]: compile time options: IPv6 GNU-getopt DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
Feb 24 15:52:05 57e7e72 dnsmasq[1530]: dnsmasq[1530]: DBus support enabled: connected to system bus
Feb 24 15:52:05 57e7e72 dnsmasq[1530]: dnsmasq[1530]: reading /etc/resolv.dnsmasq
Feb 24 15:52:05 57e7e72 dnsmasq[1530]: dnsmasq[1530]: using nameserver 8.8.8.8#53
Feb 24 15:52:05 57e7e72 dnsmasq[1530]: dnsmasq[1530]: read /etc/hosts - 6 addresses
Feb 24 15:52:05 57e7e72 dnsmasq[1530]: dnsmasq[1530]: using nameserver 8.8.8.8#53
Feb 24 15:52:42 57e7e72 dnsmasq[1530]: dnsmasq[1530]: reading /etc/resolv.dnsmasq
Feb 24 15:52:42 57e7e72 dnsmasq[1530]: dnsmasq[1530]: using nameserver 8.8.8.8#53
Feb 24 15:52:42 57e7e72 dnsmasq[1530]: dnsmasq[1530]: using nameserver 10.10.1010#53
Feb 24 15:52:42 57e7e72 dnsmasq[1530]: dnsmasq[1530]: using nameserver 10.10.10.9#53

klutchell commented 3 years ago

using nameserver 10.10.1010#53

Is this a typo on your DHCP server or just bad parsing of the journalctl logs? If your device is trying to use 10.10.1010#53 as the primary DNS server on boot, and eventually falling back to 10.10.10.9#53 that would explain the delay and the race condition.

Your device took 37s to get an updated DNS server via DHCP and my devices took 4s and 9s.

klutchell / balena-pihole

pihole doesn't respond to DNS queries without container restart #66