Open eiddor opened 3 years ago
Thanks for the logs, that's super helpful! I see that on first boot it is unable to update any of the blocklists (Status: Connection Refused
) but on restart the updates work (Status: No changes detected
).
On it's own this wouldn't be a problem but it indicates that on first start the pihole container is having connection issues. Have you modified the docker-compose file at all? Can you share some of your env vars?
DNSMASQ_LISTENING
INTERFACE
DNS1
DNS2
Also could you try setting the ServerIP
variable to the LAN IP of your device and see if the behavior changes?
Wow, good catch - I'm not sure how I missed that difference. I saw some "Connection Refused" messages but didn't notice that it was for every blocklist and kind of got used to seeing them.
No modification to docker-compose, especially while testing these branches for you. Once I get a stable version, I do change some stuff in pihole/Dockerfile to pull in a hosts file and change the font size, but again I've been careful not to do that while testing.
Environment variables:
DNS1 208.67.222.222
DNS2 208.67.220.220
DNSMASQ_LISTENING eth0
INTERFACE eth0
TZ America/Chicago
Just added ServerIP
and I'm testing now. Is that a recent addition?
No luck after adding ServerIP
- I was really hoping for it to be that simple :-)
Just to add - there's nothing special about my network:
The RPi3b is wired directly into a Cisco 9200 switch with no auth configured, so the port comes up instantly.
ServerIP
has been required since v4.2.2. I'll remove the optional label from the README. I also see a bunch of other env vars that are being deprecated so I'll have to clean that up soon.
https://github.com/pi-hole/docker-pi-hole#deprecated-environment-variables
Starting with the v4.1.1 release your Pi-hole container may encounter issues starting the DNS service unless ran with the following setting:
--dns=127.0.0.1 --dns=1.1.1.1
The second server can be any DNS IP of your choosing, but the first dns must be 127.0.0.1
https://github.com/pi-hole/docker-pi-hole#docker-pi-hole-v411
You're not blocking 1.1.1.1
on your network in any way?
Oh, man - Their Docker hub docs are a bit out of sync with the GutHub docs - https://hub.docker.com/r/pihole/pihole. I need to revisit my "new" pihole device now (even though it's working fine).
Also, I'm reading this as ServerIP
not being required unless we're running host mode. Does Balena run host mode by default? (Should it?)
Docker Pi-Hole v4.2.2 ServerIP no longer a required environment variable unless you run network 'host' mode! Feel free to remove it unless you need it to customize lighttpd
As for the DNS requirement - This is where my knowledge (or lack of) of Balena breaks down. Looking at docker-compose.yml
it seems that we're covered:
dns:
- "127.0.0.1"
- "1.1.1.1"
environment:
DNSMASQ_LISTENING: eth0
INTERFACE: eth0
DNS1: 127.0.0.1#5053
DNS2: 127.0.0.1#5053
Now, I'm guessing that the environment variables in the GUI override these settings (where does that magic happen?).
The thing is, I want to run OpenDNS as my upstream DNS (I work for Cisco). How do we reconcile that with the 127.0.0.1 requirement?
With the new method, I guess it would be like this:
PIHOLE_DNS_:
127.0.0.1#5053;208.67.222.222;208.67.220.220
Now, do I have to wait until Balena Pihole gets an update before I can use that?
(1.1.1.1 is fully accessible in my network, I use it all the time to test web redirect portals)
I decided to try to set DNS1
to 127.0.0.1
and am still having the problem, so it wasn't that.
FYI - on my dedicated RPi running pihole, this syntax worked for me and pihole was even "smart" enough to recognize OpenDNS as the upstream
PIHOLE_DNS_
= 127.0.0.1;208.67.222.222;208.67.220.220
Also, I'm reading this as ServerIP not being required unless we're running host mode. Does Balena run host mode by default? (Should it?)
Balena does not run in host mode by default, but this Pi-Hole stack does in order to avoid port conflicts.
^^ So as you can see from the line above, ServerIP
should be provided for this stack.
^^ These lines specify the container DNS, and aren't directly related to Pi-Hole, but according to their docs we want values similar to this to make sure Pi-Hole can start correctly (needs to resolve before it can resolve!).
In fact, since these are only used at container startup and not to perform actual runtime queries, you could replace the 1.1.1.1
with 208.67.222.222
to use OpenDNS there as well. As long as 127.0.0.1
is first.
^^ These DNS
lines are overwritten if you provide alternate values in the dashboard. So by providing 208.67.222.222
and 1208.67.220.220
you are effectively using OpenDNS for your upstream (not related to dns:
above).
Now, I'm guessing that the environment variables in the GUI override these settings (where does that magic happen?).
The balena supervisor will apply dashboard environment variables to the services and override existing ones.
You can go ahead and start using PIHOLE_DNS_=208.67.222.222;208.67.220.220
and remove the two DNS
lines from your compose file and dashboard now if you want. I already have for my setup.
Thanks for the thorough explanation, @klutchell !
You can go ahead and start using PIHOLEDNS=208.67.222.222;208.67.220.220 and remove the two DNS lines from your compose file and dashboard now if you want. I already have for my setup.
Done.
So it looks like after all of this we've:
a) taught Roddie a lot about how Balena works
b) figured out that we need to update some environment variables and docs in the project
Not a waste for me at all, but it seems like I'm still at square one and you ended up working this weekend anyway.
Nah dude, I still got lots of gaming and dog walks in this weekend. We're all good.
What if you changed the second entry here to be something else, like google or opendns? Any difference?
Once I get a stable version, I do change some stuff in pihole/Dockerfile to pull in a hosts file and change the font size
Can you clarify these steps a bit? Anything that would impact the host OS that you would know of? I wonder if this would happen with a fresh image, or another device, to isolate the issue to the environment/network or the device.
Pro Tip: if you want to learn a ton more about how balena works, these masterclasses are killer and actually pretty fun. https://www.balena.io/docs/learn/more/masterclasses/overview/
Nah dude, I still got lots of gaming and dog walks in this weekend. We're all good.
Ha! As long as the wife isn't blaming me for gaming, we're ok.
What if you changed the second entry here to be something else, like google or opendns? Any difference?
No difference.
Can you clarify these steps a bit? Anything that would impact the host OS that you would know of? I wonder if this would happen with a fresh image, or another device, to isolate the issue to the environment/network or the device.
Harmless, I believe:
RUN echo "addn-hosts=/etc/pihole/lan.list" | sudo tee /etc/dnsmasq.d/02-lan.conf
RUN curl -sSL https://x.x.x.x/roddie/hosts -o /etc/pihole/lan.list
ENV FONTFACE Terminus
ENV FONTSIZE 10x20
Before I even opened the initial display issue I started a new application with a new device to make sure it wasn't something left-over from the last year or three. Happy to try again and only start with the dtoverlay branch. It'll take me a half hour or so.
Pro Tip: if you want to learn a ton more about how balena works, these masterclasses are killer and actually pretty fun. https://www.balena.io/docs/learn/more/masterclasses/overview/
Excellent! Can't wait to go through them!
@klutchell Fresh application and device, using the dtoverlay
branch with no modifications. Same problem.
Okay, thanks for testing. You're correct that your production modifications are harmless as well. Though if you wanted you could also set FONTFACE=Terminus
and FONTSIZE=10x20
in the balena dashboard and skip changing the Dockerfile for those. The supervisor will use the dashboard values above all else.
For your lan hosts those can be added manually via the Pi-Hole dashboard under Local DNS -> DNS Records but your way is faster since you already have the list hosted somewhere.
Let me think about your container startup issue some more, nothing obvious is coming to me right now.
But remember that we aren't at square one cause we fixed your PADD display (and mine) and we noticed that some Pi-Hole env vars are being deprecated so we are tracking that. Also a learning experience isn't a waste in this line of work.
Though if you wanted you could also set FONTFACE=Terminus and FONTSIZE=10x20 in the balena dashboard and skip changing the Dockerfile for those. The supervisor will use the dashboard values above all else.
Good to know! Is that documented somewhere? I only noticed those settings in pihole/Dockerfile
a couple of months ago. It's nice to finally use all of my screen real estate :-)
For the LAN hosts there was a whole discussion about it in the comments section of the original balena-pihole blogpost from a few years ago and a few different options were thrown around. It's my entire network, which is about 90 hosts and this is just a hosts file that I have always maintained on my "main" server so that I can quickly find something if needed. It keeps things semi-automated, at least.
Related: I learned what Docker was due to that blog post and the related comment thread, and ended up writing a whole blog series about it.
Let me think about your container startup issue some more, nothing obvious is coming to me right now.
I appreciate it - Let me know if you want me to test anything. Two things are nagging at me:
1) I don't like host mode - I'm sure it's not causing this problem, but I really don't like host mode.
2) Something in the back of my mind keeps telling me that I see this connectivity issue whenever we have display-related changes. I'll play with some of them to see if I can get it to "unbreak."
But remember that we aren't at square one cause we fixed your PADD display (and mine) and we noticed that some Pi-Hole env vars are being deprecated so we are tracking that. Also a learning experience isn't a waste in this line of work.
True true - And I got my 3rd or 4th balena project PR submitted.
If you'd like to try it without host mode you can replace the following line...
... with the below lines using your device's LAN IP.
ports:
- "LAN_IP:53:53/tcp"
- "LAN_IP:53:53/upd"
This is to prevent the balena engine from binding to 53 on ALL interfaces, because 53 is already in use on the balena VPN interface.
Then you should also remove the following lines from the pihole Dockerfile:
After that you'll no longer be running in host mode and we can see if the behaviour changes.
You can also try resolving this error ...
::: Testing pihole-FTL DNS: sudo: unable to resolve host 7ed1bcc: Temporary failure in name resolution
... by adding 127.0.0.1 7ed1bcc
to your hosts file or in the Pi-hole dashboard. Normally that error doesn't seem to have any impact but that's how I cleared it on my devices.
After that you'll no longer be running in host mode and we can see if the behaviour changes.
So, I actually ended up trying this last night (I added 80/443 as well for the GUI), and I couldn't get it past this state:
Just redid it using only the 53s and got the same results. I actually had to do a "Purge data" to get it to take another push and reload again. I don't want to spend too much time on the host mode thing. If anything it should make the network connectivity from the container better, not worse. Just a sticking-point with me since networking is actually my happy place.
::: Testing pihole-FTL DNS: sudo: unable to resolve host 7ed1bcc: Temporary failure in name resolution
Got rid of that last night, too - No change. That one has always irritated me.
I just noticed that I didn't have dtoverlay defined (it's usually set to rpi-display
), but the display has been working the whole time. Not really meaningful, just interesting.
When you tried it did you specify the LAN IP in your ports? This part is very important to not conflict with balena services running on port 53 and may be why you ended up in a bad state?
ports:
- "80:80/tcp"
- "LAN_IP:53:53/tcp"
- "LAN_IP:53:53/upd"
I tried both ways with every combination - Maybe there's another port that balena is expecting? I may come back to the host mode thing later once we figure out the problem, but in my mind host mode should make things a bit more open here and I don't want to get too distracted with it.
I'm currently in a non-working state and here's what I'm seeing:
ServerIP
root@57e7e72:~# nslookup google.com
Server: 127.0.0.2
Address 1: 127.0.0.2 57e7e72
Name: google.com
Address 1: 172.217.1.238 lax17s02-in-f14.1e100.net
Address 2: 2607:f8b0:4000:814::200e dfw25s27-in-x0e.1e100.net
root@57e7e72:~# nslookup google.com 10.10.10.10
Server: 10.10.10.10
Address 1: 10.10.10.10 57e7e72
nslookup: can't resolve 'google.com'
root@57e7e72:~# nslookup google.com 127.0.0.1
Server: 127.0.0.1
Address 1: 127.0.0.1 57e7e72
Name: google.com
Address 1: 172.217.9.174 dfw25s27-in-f14.1e100.net
Address 2: 2607:f8b0:4000:814::200e dfw25s27-in-x0e.1e100.net
Docker doesn't seem to think 53 is bound anywhere else:
root@57e7e72:~# balena ps -a --format="table {{.ID}}\t{{.Names}}\t{{.Ports}}\t{{.Status}}"
CONTAINER ID NAMES PORTS STATUS
71fa5916cc18 pihole_3310990_1706164 Up 35 minutes (healthy)
6221682a7f82 dbus_3310993_1706164 Exited (0) 34 minutes ago
da3f01c7f402 fbcp_3310992_1706164 Up 35 minutes
d6762440786d unbound_3310991_1706164 0.0.0.0:5053->5053/tcp, 0.0.0.0:5053->5053/udp Up 35 minutes
f108ef13311c resin_supervisor Up 34 minutes (healthy)
Simply restarting the pihole container makes everything work.
Very odd, sorry if it feels like we are spinning our wheels here but I really want to find out what changes in the container between restarts. Can you check the content of /etc/dnsmasq.conf
before and after the restart?
Also, I'm not sure what state you are in but if we are still using host networking what happens if you just remove the following line from your Dockerfile which will allow binding to all interfaces?
Oh no, I absolutely appreciate this and since it does seem to be network-adjacent, I am enjoying it. Plus I only use two balena apps, so the more I get to tinker with them the better. I think we're getting closer.
Check my logic here:
ServerIP:53 isn't open to the container. Docker doesn't seem to see a problem, but because we're using host networking, we don't see the explicit port as open.
ServerIP:80 works fine, so it's not an IP-binding issue. 127.0.0.1:53 works fine from the host, so it's not a DNS process issue.
Reproducing this is just a matter of doing a shutdown and power-on, so it's easy for me to test whatever can think of.
/etc/dnsmasq.conf
content is the same regardless of state:
conf-dir=/etc/dnsmasq.d
bind-interfaces
user=root
Commenting out that line puts the pihole container into a very unhappy loop:
22.02.21 13:25:10 (-0600) pihole dnsmasq: failed to create listening socket for port 53: Address already in use
22.02.21 13:25:10 (-0600) pihole [cont-init.d] 20-start.sh: exited 1.
22.02.21 13:25:10 (-0600) pihole [cont-finish.d] executing container finish scripts...
22.02.21 13:25:10 (-0600) pihole [cont-finish.d] done.
22.02.21 13:25:10 (-0600) pihole [s6-finish] waiting for services.
22.02.21 13:25:10 (-0600) pihole [s6-finish] sending all processes the TERM signal.
22.02.21 13:25:13 (-0600) pihole [s6-finish] sending all processes the KILL signal and exiting.
22.02.21 13:25:15 (-0600) Service exited 'pihole sha256:761dcf61b473bc430c0f2c473ea4ed0303a8ee8da9f56b33a65c3b0b527f5b4d'
Okay, that's the behaviour I was expecting and why that line exists. Can you confirm that the interface provided for INTERFACE
and DNSMASQ_LISTENING
matches the interface name on the host for your wired connection?
It seems almost like dnsmasq is binding to the wrong interface on first startup (like localhost only), but finding the correct one on a container restart. That would be very weird though, and somehow related to a race condition on startup or a delayed network event? I don't suppose you have a secondary network where you could connect one of these devices? Or even try wifi to see if the behaviour changes? I know I'm grasping at straws now.
Confirmed they're both eth0
You read my mind on dnsmasq (I think we got there at the same time) - I'm collecting some lsof
output at the moment. Give me a few minutes.
lsof
output in a "broken" state:
root@57e7e72:~# lsof -iTCP -sTCP:LISTEN
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
systemd 1 root 53u IPv6 16019 0t0 TCP *:22222 (LISTEN)
dnsmasq 1531 nobody 5u IPv4 20594 0t0 TCP 57e7e72:domain (LISTEN)
dnsmasq 1531 nobody 7u IPv4 20596 0t0 TCP 57e7e72:domain (LISTEN)
balena-en 1849 root 6u IPv6 23559 0t0 TCP *:5053 (LISTEN)
node 2798 root 24u IPv6 27443 0t0 TCP *:48484 (LISTEN)
lighttpd 2938 www-data 4u IPv4 27285 0t0 TCP *:http (LISTEN)
lighttpd 2938 www-data 5u IPv6 27286 0t0 TCP *:http (LISTEN)
pihole-FT 2948 root 5u IPv4 25471 0t0 TCP 57e7e72:domain (LISTEN)
pihole-FT 2948 root 7u IPv6 25473 0t0 TCP 57e7e72:domain (LISTEN)
pihole-FT 2948 root 9u IPv6 25475 0t0 TCP localhost:domain (LISTEN)
pihole-FT 2948 root 12u IPv6 26154 0t0 TCP localhost:4711 (LISTEN)
pihole-FT 2948 root 17u IPv4 25481 0t0 TCP 57e7e72:4711 (LISTEN)
root@57e7e72:~# lsof -iUDP -P -n | egrep -v '(127|::1)'
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
chronyd 1424 root 5u IPv4 19470 0t0 UDP *:1234
chronyd 1424 root 6u IPv6 19471 0t0 UDP *:1234
avahi-dae 1513 avahi 11u IPv4 19104 0t0 UDP *:5353
avahi-dae 1513 avahi 12u IPv6 19105 0t0 UDP *:5353
avahi-dae 1513 avahi 13u IPv4 19106 0t0 UDP *:62747
avahi-dae 1513 avahi 14u IPv6 19107 0t0 UDP *:49811
NetworkMa 1523 root 18u IPv4 28875 0t0 UDP 10.10.10.10:68
dnsmasq 1531 nobody 4u IPv4 20593 0t0 UDP 10.114.102.1:53
balena-en 1864 root 6u IPv6 21263 0t0 UDP *:5053
pihole-FT 2948 root 6u IPv6 25472 0t0 UDP [fe80::84ae:1c5f:e361:59d2]:53
lsof
output in a "working" state:
root@57e7e72:~# lsof -iTCP -sTCP:LISTEN
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
systemd 1 root 53u IPv6 16019 0t0 TCP *:22222 (LISTEN)
dnsmasq 1531 nobody 5u IPv4 20594 0t0 TCP 57e7e72:domain (LISTEN)
dnsmasq 1531 nobody 7u IPv4 20596 0t0 TCP 57e7e72:domain (LISTEN)
balena-en 1849 root 6u IPv6 23559 0t0 TCP *:5053 (LISTEN)
node 2798 root 24u IPv6 27443 0t0 TCP *:48484 (LISTEN)
lighttpd 7246 www-data 4u IPv4 41012 0t0 TCP *:http (LISTEN)
lighttpd 7246 www-data 5u IPv6 41013 0t0 TCP *:http (LISTEN)
pihole-FT 7254 root 5u IPv4 40101 0t0 TCP 57e7e72:domain (LISTEN)
pihole-FT 7254 root 7u IPv4 40103 0t0 TCP 57e7e72:domain (LISTEN)
pihole-FT 7254 root 9u IPv6 40105 0t0 TCP 57e7e72:domain (LISTEN)
pihole-FT 7254 root 11u IPv6 40107 0t0 TCP localhost:domain (LISTEN)
pihole-FT 7254 root 14u IPv4 39862 0t0 TCP 57e7e72:4711 (LISTEN)
pihole-FT 7254 root 17u IPv6 39865 0t0 TCP localhost:4711 (LISTEN)
root@57e7e72:~# lsof -iUDP -P -n | egrep -v '(127|::1)'
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
chronyd 1424 root 5u IPv4 19470 0t0 UDP *:1234
chronyd 1424 root 6u IPv6 19471 0t0 UDP *:1234
avahi-dae 1513 avahi 11u IPv4 19104 0t0 UDP *:5353
avahi-dae 1513 avahi 12u IPv6 19105 0t0 UDP *:5353
avahi-dae 1513 avahi 13u IPv4 19106 0t0 UDP *:62747
avahi-dae 1513 avahi 14u IPv6 19107 0t0 UDP *:49811
NetworkMa 1523 root 18u IPv4 28875 0t0 UDP 10.10.10.10:68
dnsmasq 1531 nobody 4u IPv4 20593 0t0 UDP 10.114.102.1:53
balena-en 1864 root 6u IPv6 21263 0t0 UDP *:5053
pihole-FT 7254 root 4u IPv4 40100 0t0 UDP 10.10.10.10:53
pihole-FT 7254 root 8u IPv6 40104 0t0 UDP [fe80::84ae:1c5f:e361:59d2]:53
To save you some squinting, there are two differences here that I'm seeing (you may already be ahead of me with your thinking).
One of these IPv4 TCP:53 listeners is missing when things are broken (I don't know why there are two):
pihole-FT 7254 root 5u IPv4 40101 0t0 TCP 57e7e72:domain (LISTEN)
pihole-FT 7254 root 7u IPv4 40103 0t0 TCP 57e7e72:domain (LISTEN)
More importantly, this UDP:53 listener is missing when things are broken:
pihole-FT 7254 root 4u IPv4 40100 0t0 UDP 10.10.10.10:53
In both cases, dnsmasq is listening on the containerip:53
but that doesn't change and shouldn't matter.
If dnsmasq is getting in the way at some point, it's gone by the time I have access to the console.
I also don't like that whatever is failing is doing so silently. Are there any other logs that I can collect that might help?
Good findings, I think we are narrowing it down.
I've never tried this before now, but there seems to be a lot of info in the debug logs tool in the Pi-hole dashboard.
https://discourse.pi-hole.net/t/how-do-i-debug-my-pi-hole-installation/3104
Might be worth grabbing these before and after a restart and comparing them. The farther we get into FTL and dnsmasq territory the less help I'll be, but I'm happy to learn!
Yeah, I think we're close. I've been trying various things while on calls, but I don't think I've learned anything.
That page looks to be more about secure transport of the logs /var/log
than anything else - Unless I'm missing something. I'll have to dig some more and see exactly where I need to submit a debug token.
FWIW I have looked at /var/log/pihole.log
and /var/log/pihole-FTL.log
- nothing interesting.
Also nothing in dmesg
of the pihole container - nothing.
Unsurprisingly, host networking is still nagging at me - I wonder if we can eliminate that, if we'd be able to see the binding in Docker (or an error)?
The fact that we can see a difference between lsof
outputs is encouraging, I just can't help but think that somewhere somehow there's a Docker log that is waiting for us to look at it.
No need to submit with a token, just click the button to generate them and a surprising amount of information is collected. Not just logs but simple diagnostic tests. Then you can copy them somewhere and compare.
The balena engine logs can be viewed with journalctl -u balena -a
.
The balena engine logs can be viewed with journalctl -u balena -a.
I bet that's covered in the Masterclass which is tab number 28 in my browser :-)
Nothing interesting there - Everyone thinks everyone else is happy.
No need to submit with a token, just click the button to generate them and a surprising amount of information is collected. Not just logs but simple diagnostic tests. Then you can copy them somewhere and compare.
Duh - That makes a whole lot more sense than what I was thinking. Ok, I'll grab them and go through them tonight/tomorrow and let you know if anything jumps out.
Thanks again for all of your help here!
BTW - This is from within the container:
Not working:
root@57e7e72:/usr/src/app# lsof -iUDP -P -n | egrep -v '(127|::1)'
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
pihole-FT 526 root 6u IPv6 26222 0t0 UDP [fe80::84ae:1c5f:e361:59d2]:53
Working:
root@57e7e72:/usr/src/app# lsof -iUDP -P -n | egrep -v '(127|::1)'
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
pihole-FT 544 root 4u IPv4 89624 0t0 UDP 10.10.10.10:53
pihole-FT 544 root 8u IPv6 89628 0t0 UDP [fe80::84ae:1c5f:e361:59d2]:53
So, a lot of good info in these logs - Nothing too in the weeds and everything is pretty self-explanatory.
The only difference is kind of disappointing, though. An extra port "in use" which we already know about.
Not working:
*** [ DIAGNOSING ]: Ports in use
[80] is in use by lighttpd
[80] is in use by lighttpd
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
Working:
*** [ DIAGNOSING ]: Ports in use
[80] is in use by lighttpd
[80] is in use by lighttpd
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
The rest of the debug output is virtually identical. I'll try again tomorrow to make sure I'm not just really tired and missing something, but it seems like whatever is happening is being ignored by everything.
A few ideas --
1) Is there a way to delay the pihole container's start and manually start it to see if it gives an error? If it works, we'll know that it's more of a timing/sequence issue.
2) I figure out how to disable host mode and troubleshoot directly with the ports via docker.
3) I strip all of the display-related stuff from my config and eliminate that as a variable.
Is there a way to delay the pihole container's start and manually start it to see if it gives an error? If it works, we'll know that it's more of a timing/sequence issue.
Adding the following line to the bottom of your pihole service definition in the docker-compose file will not delay the container start, but it will delay any services running in the container.
entrypoint: /bin/bash
command:
- -c
- "sleep 60 && /s6-init"
Delaying the container start is a bit more tricky but we will see if that is even required.
I figure out how to disable host mode and troubleshoot directly with the ports via docker.
The following changes worked for my device:
# network_mode: host
ports:
- 192.168.1.160:53:53/tcp
- 192.168.1.160:53:53/udp
- 192.168.1.160:80:80/tcp
DNSMASQ_LISTENING
and INTERFACE
to all
in the dashboardI strip all of the display-related stuff from my config and eliminate that as a variable.
Just remove the fbcp
and dbus
services from docker-compose and delete pihole/services/padd/run
. You can also remove the unbound
service definition since you aren't using it.
Oh, good - A few things to try today. Thank you! I'll re-run the pihole debugs first to make sure I didn't miss anything, and then I'll decide which order to try the rest in.
Did I read the last main
commit correctly? Is it all caught up and safe for me to use now?
Nope, hasn't been merged yet - I'll keep using dtoverlay
.
Yeah, I'm waiting on the balenablocks/fbcp PR to be merged but I'd like someone else to take a look at it first.
Here are the debugs in case you want to check my work - I'm actually impressed with how these are laid-out and annotated. There's some really cool logic here that even I understand.
pihole_debug_not_working.txt pihole_debug_working.txt
I'm going to tackle this in this order:
1) Pare back services (PADD, unbound, fbcp, dbus) one at a time
2) Delay the services in the pihole container to see if it works. If it doesn't, I'll attempt to run them manually and see if anything complains.
3) Disable host mode and try to collect some Docker output.
So - Good news, I suppose. It's not related to any of the other services. I'm only running pihole right now and it's still breaking.
I really hope it's nothing in my environment, but I can't see how it would be. Connectivity is there, the GUI works even when DNS doesn't.
Next step is to delay the services in the container itself.
More "good" news, only because it gets us closer. The sleep works.
So now we have a timing issue somewhere... All of the other containers are gone, so what's left? The Supervisor and the pihole container?
The problem I have right now is that I can't get into the container to manually run s6-init
before whatever condition goes away.
I'm currently messing with the sleep timer to see if I can find the "breaking" point, but that's probably not going to help much.
On the bright side - host mode is in the clear (for now).
I would still try without host mode if possible. If something in the network stack or environment is changing in the first few seconds of container start, it will be nice to isolate that a little more.
Knowing how long of a delay will workaround the issue is interesting but not super helpful as you said.
Knowing how long of a delay will workaround the issue is interesting but not super helpful as you said.
~11 seconds heh. That took me way too long to narrow down, but that's about where it is.
Going for non-host mode now.
So, remember that nagging thought that I've mentioned once or twice?
Disabling host mode gets rid of the issue. I'm on my tenth test just to make sure, but yeah... everything is working fine following a shutdown/restart. I started with a fresh archive and only disabled host mode, so everything else is in-tact.
Now the question is: what can/should we do about it? Whatever is happening should show up in the Docker/balena logs, right?
Also, just to eliminate my environment a bit more - I took my other RPi3B running Pi-hole in Docker (non-Balena), and switched it to host networking. No problems.
Yup, that tells me that some other interface is being created or binding at the same time the Pi-hole container is starting on boot. Maybe something related to the supervisor and/or balena vpn but I'll have to ask internally to get a hint of which direction to look.
I wish we could push the non-host network option to everyone but manually entering the ServerIP in the docker-compose.yml is a lot of friction for a new user. We also don't have any way to substitute environment variables in the compose file at runtime either.
Perhaps we should try some tests with non-host mode as we have now, but don't specify the ServerIP in the port binding?
# network_mode: host
ports:
- 53:53/tcp
- 53:53/udp
- 80:80/tcp
In my experience that usually results in one of two things:
So if this is indeed a race condition, and we can trigger the 2nd outcome, I would expect more useful logs in the balena supervisor.
journalctl -u resin-supervisor
I can try this test as well to try to reproduce the 2nd case. We should also remove the sleep for this test.
Maybe something related to the supervisor and/or balena vpn but I'll have to ask internally to get a hint of which direction to look.
Right - That was my next question: does belena do anything special with UDP 53 at startup?
I also don't like the fact that I seem to be the only one having a problem.
I wish we could push the non-host network option to everyone but manually entering the ServerIP in the docker-compose.yml is a lot of friction for a new user. We also don't have any way to substitute environment variables in the compose file at runtime either.
I've been thinking about this one. Host networking effectively binds to , right? Earlier you had me specify LAN_IP so we wouldn't step on the 53 binding that Balena VPN needed. Wouldn't they be fighting over it with host networking enable? (maybe leading to my problem?)
I agree that we can't push it to the user, but there's gotta be an answer here somewhere.
Perhaps we should try some tests with non-host mode as we have now, but don't specify the ServerIP in the port binding?
Yep, that makes sense - I'll work on this tomorrow.
Sleep is long-gone - I'm currently sitting at dtoverlay
with host networking disabled.
Thanks again for your help! I think we're there and just need a good answer :-)
I've been thinking about this one. Host networking effectively binds to , right? Earlier you had me specify LAN_IP so we wouldn't step on the 53 binding that Balena VPN needed. Wouldn't they be fighting over it with host networking enable? (maybe leading to my problem?)
I think the issue is in this area, yes. The dnsmasq conf entry we are adding is supposed to ensure dnsmasq only binds to the interfaces we specify (eth0
in this case) and up until now no one has reported any issue with that not working as intended.
I pinged the whole balena team internally so I may have some suggestions by tomorrow. I'm also hoping that if we can get the container to bind to 53 on all interfaces the device logs will give us a clue as to where the conflict is.
One idea that was floated was to run lsof -nP -iTCP:53 -sTCP:LISTEN
in a tight loop on startup. Preferably on the host but maybe easier in the pihole container (or a new service) initially.
I'll leave it to you to replace the pihole init command with a little bash script, or you can wait for me to give it a try tomorrow and I can provide my steps.
One idea that was floated was to run lsof -nP -iTCP:53 -sTCP:LISTEN in a tight loop on startup. Preferably on the host but maybe easier in the pihole container (or a new service) initially.
I thought of that a couple of days ago, but I was hoping for a way to get into the device to run it manually before things settle. Your idea is much better, but I don't believe lsof inside of a container will see ports on the host.
Let me flip to host networking and test that first.
Yeah, no good inside of a container -
Host:
root@57e7e72:~# lsof -nP -iTCP:5053 -sTCP:LISTEN
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
balena-en 1840 root 6u IPv6 22086 0t0 TCP *:5053 (LISTEN)
Container:
root@57e7e72:/usr/src/app# lsof -nP -iTCP:5053 -sTCP:LISTEN
root@57e7e72:/usr/src/app#
So now we're getting a bit above my pay grade, but I'll try to read through this thread and this thread to see what I can figure out/learn.
Running the script is one challenge, but also... where will the output go? I guess I could redirect it to /tmp/something.log
I also need to figure out how to fork said script.
I have a headache.
I got as far as "build your own balena image" and stopped. We might need a better mousetrap.
I think I have a better idea of what's happening now. The dnsmasq on the host is only supposed to bind to the VPN interface and a special loopback, so there shouldn't be an issue there. I did consider that the host is not using the bind-interfaces
option and we could try adding that option your device only as a test if we want. I can provide quick instructions.
root@2cac2ee:~# lsof -nP -iTCP:53 -sTCP:LISTEN
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dnsmasq 1366 nobody 5u IPv4 15269 0t0 TCP 10.114.102.1:53 (LISTEN)
dnsmasq 1366 nobody 7u IPv4 15271 0t0 TCP 127.0.0.2:53 (LISTEN)
On my devices there is a delay in the logs between using the default balena upstream DNS (8.8.8.8
) and my DHCP assigned DNS (192.168.1.55
). This delay probably explains the race condition, since your container must be starting before an updated DHCP ack is received.
root@2cac2ee:~# journalctl -u dnsmasq
-- Logs begin at Tue 2020-08-04 14:03:36 UTC, end at Wed 2021-02-24 14:00:32 UTC. --
Feb 24 10:14:05 2cac2ee dnsmasq[1361]: dnsmasq: syntax check OK.
Feb 24 10:14:05 2cac2ee dnsmasq[1366]: dnsmasq[1366]: started, version 2.78 cachesize 150
Feb 24 10:14:05 2cac2ee dnsmasq[1366]: dnsmasq[1366]: compile time options: IPv6 GNU-getopt DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
Feb 24 10:14:05 2cac2ee dnsmasq[1366]: dnsmasq[1366]: DBus support enabled: connected to system bus
Feb 24 10:14:05 2cac2ee dnsmasq[1366]: dnsmasq[1366]: reading /etc/resolv.dnsmasq
Feb 24 10:14:05 2cac2ee dnsmasq[1366]: dnsmasq[1366]: using nameserver 8.8.8.8#53
Feb 24 10:14:05 2cac2ee dnsmasq[1366]: dnsmasq[1366]: read /etc/hosts - 6 addresses
Feb 24 10:14:05 2cac2ee dnsmasq[1366]: dnsmasq[1366]: using nameserver 8.8.8.8#53
Feb 24 10:14:14 2cac2ee dnsmasq[1366]: dnsmasq[1366]: reading /etc/resolv.dnsmasq
Feb 24 10:14:14 2cac2ee dnsmasq[1366]: dnsmasq[1366]: using nameserver 8.8.8.8#53
Feb 24 10:14:14 2cac2ee dnsmasq[1366]: dnsmasq[1366]: using nameserver 192.168.1.55#53
I'd like to see your dnsmasq logs before anything else, since this may indicate why your devices behave differently. It crossed my mind that you are using a 10.x.x.x
subnet but I sure hope that's not related or we have much bigger problems.
If you want to try adding the bind-interfaces
option to the host you can do the following. Note that this is just a test and it's not recommended to modify the host OS as all changes are lost on an OS upgrade.
mount -o remount,rw /
cp /etc/dnsmasq.conf /etc/dnsmasq.conf.orig
echo bind-interfaces >> /etc/dnsmasq.conf
sync
reboot
My dnsmasq logs:
-- Logs begin at Wed 2021-02-24 15:51:59 UTC, end at Wed 2021-02-24 16:30:01 UTC. --
Feb 24 15:52:05 57e7e72 dnsmasq[1524]: dnsmasq: syntax check OK.
Feb 24 15:52:05 57e7e72 dnsmasq[1530]: dnsmasq[1530]: started, version 2.78 cachesize 150
Feb 24 15:52:05 57e7e72 dnsmasq[1530]: dnsmasq[1530]: compile time options: IPv6 GNU-getopt DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
Feb 24 15:52:05 57e7e72 dnsmasq[1530]: dnsmasq[1530]: DBus support enabled: connected to system bus
Feb 24 15:52:05 57e7e72 dnsmasq[1530]: dnsmasq[1530]: reading /etc/resolv.dnsmasq
Feb 24 15:52:05 57e7e72 dnsmasq[1530]: dnsmasq[1530]: using nameserver 8.8.8.8#53
Feb 24 15:52:05 57e7e72 dnsmasq[1530]: dnsmasq[1530]: read /etc/hosts - 6 addresses
Feb 24 15:52:05 57e7e72 dnsmasq[1530]: dnsmasq[1530]: using nameserver 8.8.8.8#53
Feb 24 15:52:42 57e7e72 dnsmasq[1530]: dnsmasq[1530]: reading /etc/resolv.dnsmasq
Feb 24 15:52:42 57e7e72 dnsmasq[1530]: dnsmasq[1530]: using nameserver 8.8.8.8#53
Feb 24 15:52:42 57e7e72 dnsmasq[1530]: dnsmasq[1530]: using nameserver 10.10.1010#53
Feb 24 15:52:42 57e7e72 dnsmasq[1530]: dnsmasq[1530]: using nameserver 10.10.10.9#53
using nameserver 10.10.1010#53
Is this a typo on your DHCP server or just bad parsing of the journalctl logs? If your device is trying to use 10.10.1010#53
as the primary DNS server on boot, and eventually falling back to 10.10.10.9#53
that would explain the delay and the race condition.
Your device took 37s to get an updated DNS server via DHCP and my devices took 4s and 9s.
After a full restart of my RPi3 (shutdown and power off/on), pihole doesn't respond to DNS queries until I restart the pihole container again. Logs/testing output below:
Logs from power-on:
DNS query:
Logs following pihole container restart:
DNS query following pihole container restart: