unbound: logging flood on wan down

dermoth commented 4 years ago

Maintainer: Eric Luehrsen ericluehrsen@gmail.com Environment: OpenWrt 18.06.1 (arm_cortex-a9_vfpv3 on WRT3200ACM), unbound-1.9.6-1

Description:

Today I had a strange network glitch where pppd and unbound restarted without any obvious reason (normally I see something about pppd exiting before it starts reconnecting.... then unbound (re?)started shortly after (7 seconds after the first pppd entry and 1 sec after the first pppd failure and retry).

While this could be an issue of its own, I wouldn't normally mind if it wasn't for the fact after this sequence of events OpenWRT started streaming 30mbps of syslog messages back at my workstation. Between 13:15:00 and 13:18:59 I have received 178MB of logs from my router, peaking at 68MB/minute when I grep 100% affected timeframe.

If that was an extended outage where I'm not aware of the issue that logging rate would have eventually filled up my workstation.

~~I'm not sure if this bug should be sent upspream, but at the very least I think OpenWRT should default to lowest verbosity level to avoid excessive logging by the server.~~ I reported this bug upstream, https://github.com/NLnetLabs/unbound/issues/224, yet pending resolution or rejection of the upstream bug we might consider lowering the verbosity level.

Here is an excerpt of the log, including all lines matching kernel, pppd, netifd and unbound:

Apr 20 13:14:45 OpenWrt kernel: [1459277.637701] ieee80211 phy1: staid 4 deleted
Apr 20 13:15:18 OpenWrt pppd[2734]: Plugin rp-pppoe.so loaded.
Apr 20 13:15:18 OpenWrt pppd[2734]: RP-PPPoE plugin version 3.8p compiled against pppd 2.4.7
Apr 20 13:15:18 OpenWrt pppd[2734]: pppd 2.4.7 started by root, uid 0
Apr 20 13:15:18 OpenWrt pppd[2734]: PPP session is 7263
Apr 20 13:15:18 OpenWrt pppd[2734]: Connected to 8c:90:d3:1a:26:a3 via interface eth1
Apr 20 13:15:18 OpenWrt kernel: [   19.951689] pppoe-wan: renamed from ppp0
Apr 20 13:15:18 OpenWrt pppd[2734]: Using interface pppoe-wan
Apr 20 13:15:18 OpenWrt pppd[2734]: Connect: pppoe-wan <--> eth1
Apr 20 13:15:18 OpenWrt netifd: radio2 (2628): command failed: Not supported (-95)
Apr 20 13:15:18 OpenWrt netifd: radio0 (2626): command failed: Not supported (-95)
Apr 20 13:15:18 OpenWrt netifd: radio1 (2627): command failed: Not supported (-95)
Apr 20 13:15:19 OpenWrt kernel: [   20.619058] ieee80211 phy1: change: 0xffffffff
Apr 20 13:15:19 OpenWrt kernel: [   20.708291] IPv6: ADDRCONF(NETDEV_UP): wlan1: link is not ready
Apr 20 13:15:19 OpenWrt kernel: [   20.898028] ieee80211 phy0: change: 0xffffffff
Apr 20 13:15:19 OpenWrt kernel: [   21.003830] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
Apr 20 13:15:19 OpenWrt kernel: [   21.009847] IPv6: ADDRCONF(NETDEV_CHANGE): br-lan: link becomes ready
Apr 20 13:15:19 OpenWrt kernel: [   21.017772] br-lan: port 2(wlan1) entered blocking state
Apr 20 13:15:19 OpenWrt kernel: [   21.023124] br-lan: port 2(wlan1) entered disabled state
Apr 20 13:15:19 OpenWrt kernel: [   21.028545] device wlan1 entered promiscuous mode
Apr 20 13:15:19 OpenWrt kernel: [   21.034080] br-lan: port 3(wlan0) entered blocking state
Apr 20 13:15:19 OpenWrt kernel: [   21.039427] br-lan: port 3(wlan0) entered disabled state
Apr 20 13:15:19 OpenWrt kernel: [   21.044897] device wlan0 entered promiscuous mode
Apr 20 13:15:24 OpenWrt pppd[2734]: Remote message: ^M^JYou are already logged in - access denied^M^J^J
Apr 20 13:15:24 OpenWrt pppd[2734]: PAP authentication failed
Apr 20 13:15:24 OpenWrt pppd[2734]: Connection terminated.
Apr 20 13:15:24 OpenWrt pppd[2734]: Sent PADT
Apr 20 13:15:24 OpenWrt pppd[2734]: Exit.
Apr 20 13:15:24 OpenWrt netifd: Interface 'wan' is now down
Apr 20 13:15:24 OpenWrt netifd: Interface 'wan' is setting up now
Apr 20 13:15:24 OpenWrt pppd[4615]: Plugin rp-pppoe.so loaded.
Apr 20 13:15:24 OpenWrt pppd[4615]: RP-PPPoE plugin version 3.8p compiled against pppd 2.4.7
Apr 20 13:15:24 OpenWrt pppd[4615]: pppd 2.4.7 started by root, uid 0
Apr 20 13:15:24 OpenWrt kernel: [   26.051673] ieee80211 phy1: change: 0x100
Apr 20 13:15:24 OpenWrt kernel: [   26.064740] ieee80211 phy1: change: 0x40
Apr 20 13:15:24 OpenWrt kernel: [   26.064990] ieee80211 phy0: change: 0x100
Apr 20 13:15:24 OpenWrt kernel: [   26.074024] ieee80211 phy0: change: 0x60
Apr 20 13:15:24 OpenWrt pppd[4615]: PPP session is 7264
Apr 20 13:15:24 OpenWrt pppd[4615]: Connected to 8c:90:d3:1a:26:a3 via interface eth1
Apr 20 13:15:24 OpenWrt kernel: [   26.084285] pppoe-wan: renamed from ppp0
Apr 20 13:15:24 OpenWrt pppd[4615]: Using interface pppoe-wan
Apr 20 13:15:24 OpenWrt pppd[4615]: Connect: pppoe-wan <--> eth1
Apr 20 13:15:24 OpenWrt kernel: [   26.269983] ieee80211 phy1: change: 0x40
Apr 20 13:15:24 OpenWrt kernel: [   26.289980] ieee80211 phy0: change: 0x40
Apr 20 13:15:25 OpenWrt kernel: [   26.459973] ieee80211 phy1: change: 0x40
Apr 20 13:15:25 OpenWrt kernel: [   26.519974] ieee80211 phy0: change: 0x60
Apr 20 13:15:25 OpenWrt kernel: [   26.618006] ieee80211 phy0: change: 0x100
Apr 20 13:15:25 OpenWrt kernel: [   26.649998] ieee80211 phy1: change: 0x40
Apr 20 13:15:25 OpenWrt kernel: [   26.683856] ieee80211 phy0: change: 0x100
Apr 20 13:15:25 OpenWrt kernel: [   26.696912] ieee80211 phy0: change: 0x62
Apr 20 13:15:25 OpenWrt kernel: [   26.849976] ieee80211 phy1: change: 0x40
Apr 20 13:15:25 OpenWrt kernel: [   26.865974] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
Apr 20 13:15:25 OpenWrt kernel: [   26.872433] br-lan: port 3(wlan0) entered blocking state
Apr 20 13:15:25 OpenWrt kernel: [   26.877778] br-lan: port 3(wlan0) entered forwarding state
Apr 20 13:15:25 OpenWrt kernel: [   26.917266] br-loc: port 2(wlan0-1) entered blocking state
Apr 20 13:15:25 OpenWrt kernel: [   26.922831] br-loc: port 2(wlan0-1) entered disabled state
Apr 20 13:15:25 OpenWrt kernel: [   26.928466] device wlan0-1 entered promiscuous mode
Apr 20 13:15:25 OpenWrt kernel: [   27.005301] IPv6: ADDRCONF(NETDEV_UP): wlan0-1: link is not ready
Apr 20 13:15:25 OpenWrt kernel: [   27.011504] br-loc: port 2(wlan0-1) entered blocking state
Apr 20 13:15:25 OpenWrt kernel: [   27.017022] br-loc: port 2(wlan0-1) entered forwarding state
Apr 20 13:15:25 OpenWrt kernel: [   27.050238] ieee80211 phy1: change: 0x40
Apr 20 13:15:25 OpenWrt unbound: default root hints (built in rootservers.net)
Apr 20 13:15:25 OpenWrt unbound: default protocol configuration
Apr 20 13:15:25 OpenWrt unbound: default memory configuration
Apr 20 13:15:25 OpenWrt unbound: default recursion configuration
Apr 20 13:15:25 OpenWrt kernel: [   27.186081] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0-1: link becomes ready
Apr 20 13:15:25 OpenWrt kernel: [   27.250089] ieee80211 phy1: change: 0x40
Apr 20 13:15:25 OpenWrt netifd: Network device 'wlan0' link is up
Apr 20 13:15:25 OpenWrt netifd: Network device 'wlan0-1' link is up
Apr 20 13:15:26 OpenWrt kernel: [   27.459982] ieee80211 phy1: change: 0x40
Apr 20 13:15:26 OpenWrt kernel: [   27.649978] ieee80211 phy1: change: 0x40
Apr 20 13:15:26 OpenWrt kernel: [   27.839975] ieee80211 phy1: change: 0x40
Apr 20 13:15:26 OpenWrt kernel: [   28.029975] ieee80211 phy1: change: 0x40
Apr 20 13:15:26 OpenWrt kernel: [   28.219985] ieee80211 phy1: change: 0x40
Apr 20 13:15:26 OpenWrt kernel: [   28.298011] ieee80211 phy1: change: 0x100
Apr 20 13:15:27 OpenWrt kernel: [   28.355338] ieee80211 phy1: change: 0x100
Apr 20 13:15:27 OpenWrt kernel: [   28.368381] ieee80211 phy1: change: 0x42
Apr 20 13:15:27 OpenWrt kernel: [   28.517000] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
Apr 20 13:15:27 OpenWrt kernel: [   28.523463] br-lan: port 2(wlan1) entered blocking state
Apr 20 13:15:27 OpenWrt kernel: [   28.528800] br-lan: port 2(wlan1) entered forwarding state
Apr 20 13:15:27 OpenWrt kernel: [   28.570329] br-loc: port 3(wlan1-1) entered blocking state
Apr 20 13:15:27 OpenWrt kernel: [   28.575865] br-loc: port 3(wlan1-1) entered disabled state
Apr 20 13:15:27 OpenWrt kernel: [   28.581529] device wlan1-1 entered promiscuous mode
Apr 20 13:15:27 OpenWrt kernel: [   28.658319] IPv6: ADDRCONF(NETDEV_UP): wlan1-1: link is not ready
Apr 20 13:15:27 OpenWrt kernel: [   28.664474] br-loc: port 3(wlan1-1) entered blocking state
Apr 20 13:15:27 OpenWrt kernel: [   28.669992] br-loc: port 3(wlan1-1) entered forwarding state
Apr 20 13:15:27 OpenWrt kernel: [   28.797000] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1-1: link becomes ready
Apr 20 13:15:27 OpenWrt netifd: Network device 'wlan1' link is up
Apr 20 13:15:27 OpenWrt netifd: Network device 'wlan1-1' link is up
Apr 20 13:15:31 OpenWrt unbound: [5278:0] notice: init module 0: iterator
Apr 20 13:15:31 OpenWrt unbound: default root hints (built in rootservers.net)
Apr 20 13:15:31 OpenWrt unbound: default protocol configuration
Apr 20 13:15:31 OpenWrt unbound: default memory configuration
Apr 20 13:15:31 OpenWrt unbound: default recursion configuration
Apr 20 13:15:31 OpenWrt unbound: [5278:0] info: start of service (unbound 1.9.6).
Apr 20 13:15:31 OpenWrt unbound: [5278:0] notice: sendto failed: Permission denied
Apr 20 13:15:31 OpenWrt unbound: [5278:0] notice: remote address is 2001:7fd::1 port 53
Apr 20 13:15:31 OpenWrt unbound: [5278:0] notice: sendto failed: Permission denied
Apr 20 13:15:31 OpenWrt unbound: [5278:0] notice: remote address is 2001:500:a8::e port 53
Apr 20 13:15:31 OpenWrt unbound: [5278:0] notice: sendto failed: Permission denied
Apr 20 13:15:31 OpenWrt unbound: [5278:0] notice: remote address is 2001:503:ba3e::2:30 port 53
Apr 20 13:15:31 OpenWrt unbound: [5278:0] notice: sendto failed: Permission denied
Apr 20 13:15:31 OpenWrt unbound: [5278:0] notice: remote address is 2001:500:12::d0d port 53
Apr 20 13:15:31 OpenWrt unbound: [5278:0] notice: sendto failed: Permission denied
Apr 20 13:15:31 OpenWrt unbound: [5278:0] notice: remote address is 2001:500:2f::f port 53
Apr 20 13:15:31 OpenWrt unbound: [5278:0] notice: sendto failed: Permission denied
Apr 20 13:15:31 OpenWrt unbound: [5278:0] notice: remote address is 2001:dc3::35 port 53
Apr 20 13:15:31 OpenWrt unbound: [5278:0] notice: sendto failed: Permission denied

After that it kept going on while pppd was trying to reconnect... There is a small gap while I manually restarted unbound, then I stopped it to stop the spam (so I could see what was going on).

pppd reconnected shortly after the event and then I rebooted the router to make sure everything gets back to normal.

hnyman commented 4 years ago

cc @EricLuehrsen

tobiasmcnulty commented 4 years ago

I'm seeing this in 19.07.2 as well, for example just for ~5 minutes of wan6 being down:

root@OpenWrt:/tmp/log# cat messages |grep unbound|wc -l
740628
root@OpenWrt:/tmp/log# cat messages |grep unbound|head
May  9 13:41:12 OpenWrt unbound: default protocol configuration
May  9 13:41:12 OpenWrt unbound: default memory configuration
May  9 13:41:12 OpenWrt unbound: default recursion configuration
May  9 13:41:17 OpenWrt unbound: [3752:0] notice: init module 0: iterator
May  9 13:41:17 OpenWrt unbound: [3752:0] info: start of service (unbound 1.9.6).
May  9 13:41:17 OpenWrt unbound: [3752:0] notice: sendto failed: Permission denied
May  9 13:41:17 OpenWrt unbound: [3752:0] notice: remote address is 2001:500:2f::f port 53
May  9 13:41:17 OpenWrt unbound: [3752:0] notice: sendto failed: Permission denied
May  9 13:41:17 OpenWrt unbound: [3752:0] notice: remote address is 2001:500:2::c port 53
May  9 13:41:17 OpenWrt unbound: [3752:0] notice: sendto failed: Permission denied
root@OpenWrt:/tmp/log# cat messages |grep unbound|tail
May  9 13:46:52 OpenWrt unbound: [4620:0] notice: sendto failed: Permission denied
May  9 13:46:52 OpenWrt unbound: [4620:0] notice: remote address is 2600:9000:5303:2b00::1 port 53
May  9 13:46:52 OpenWrt unbound: [4620:0] notice: sendto failed: Permission denied
May  9 13:46:52 OpenWrt unbound: [4620:0] notice: remote address is 2600:9000:5306:f000::1 port 53
May  9 13:46:52 OpenWrt unbound: [4620:0] notice: sendto failed: Permission denied
May  9 13:46:52 OpenWrt unbound: [4620:0] notice: remote address is 2600:9000:5306:f000::1 port 53
May  9 13:46:52 OpenWrt unbound: [4620:0] notice: sendto failed: Permission denied
May  9 13:46:52 OpenWrt unbound: [4620:0] notice: remote address is 2600:9000:5302:d300::1 port 53
root@OpenWrt:/tmp/log# du -ms messages 
62  messages

tobiasmcnulty commented 4 years ago

It looks like this might have been fixed upstream in 1.10; see https://github.com/NLnetLabs/unbound/issues/35 / https://github.com/NLnetLabs/unbound/commit/474afc9016d34a98537a97cc94e14d329c7d8aeb.

EricLuehrsen commented 4 years ago

Yes it has been there some time. Unbound has burped out these IP6 connection events when initializing root server connections. Any Unbound restart reason and any root server infrastructure refresh can appear to pop these. It doesn't require the auth-zone: clause either.

You may also be experiencing interface restart flood from netifd. It may help to use UCI triggers to restrict which networks restart Unbound, list trigger_interface...

tobiasmcnulty commented 4 years ago

In my case I get plenty of these just after a reboot; around 5MB in total before wan6 comes online.

Is it possible to backport this fix and/or upgrade the Unbound package to 1.10 locally?

I've not contributed before but I'm happy to try to help if I can.

EricLuehrsen commented 4 years ago

See if #12141 has resolved this.

tobiasmcnulty commented 4 years ago

Unfortunately the issue still seems to persist. I upgraded all the unbound packages (opkg list-upgradable|grep unbound|cut -d' ' -f1|xargs opkg upgrade), verified 1.10 was installed, restarted unbound a couple times, but still when I issue a service network restart the logs are filled with the same "Permission denied" messages until wan6 comes up.

dermoth commented 4 years ago

I haven't had time to upgrade to 19.07 and check on my end... Could you confirm the unbound version though? try:

/proc/$(pgrep unbound)/exe -V

This will l make sure you're checking the currently running version in case it's somehow starting an older binary.

Regards

dermoth commented 4 years ago

e been fixed upstream in 1.10; see NLnetLabs/unbound#35 / NLnetLabs/unbound@474afc9.

FYI That commit was already in the 1.9.6 release... https://github.com/NLnetLabs/unbound/blob/release-1.9.6/doc/Changelog#L449

in4mer commented 4 years ago

I just experienced this on 18.06.. an update to unbound to 1.10.0 AND updating libunbound eventually fixed my problem... once I realized the old unbound was still running and printing the same spam to the logs. Make sure you only have a single PID running after the upgrade (or none, if you stop/start it).

Also worth noting uci set unbound.@unbound[0].protocol='ip4_only' appears necessary to stop ipv6 usage. ipv6 is not enabled on my network.

dermoth commented 4 years ago

I was under the impression this issue is only affecting ipv6-enabled resolvers - is it possible your old process also didn't have ipv4_only set?

tobiasmcnulty commented 4 years ago

Unfortunately it looks like I am running the new process, but I'm still seeing the issue. I could try disabling ipv6 later to see if that helps, which yes I assume it would...

# ps|grep unbound
 3952 root      1080 S    grep unbound
18094 unbound  31024 S    /usr/sbin/unbound -d -c /var/lib/unbound/unbound.conf
# /proc/18094/exe -V
Version 1.10.0

Configure line: --target=x86_64-openwrt-linux --host=x86_64-openwrt-linux --build=x86_64-pc-linux-gnu --program-prefix= --program-suffix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --libexecdir=/usr/lib --sysconfdir=/etc --datadir=/usr/share --localstatedir=/var --mandir=/usr/man --infodir=/usr/info --disable-nls --disable-dsa --disable-gost --enable-allsymbols --enable-ecdsa --enable-tfo-client --enable-tfo-server --with-libexpat=/builder/shared-workdir/build/sdk/staging_dir/target-x86_64_musl/usr --with-ssl=/builder/shared-workdir/build/sdk/staging_dir/target-x86_64_musl/usr --with-user=unbound --with-run-dir=/var/lib/unbound --with-conf-file=/var/lib/unbound/unbound.conf --with-pidfile=/var/run/unbound.pid --with-pthreads --with-libevent=/builder/shared-workdir/build/sdk/staging_dir/target-x86_64_musl/usr --enable-event-api
Linked libs: pluggable-libevent 2.1.11-stable (it uses epoll), OpenSSL 1.1.1g  21 Apr 2020
Linked modules: dns64 respip validator iterator
TCP Fastopen feature available

BSD licensed, see LICENSE in source package for details.
Report bugs to unbound-bugs@nlnetlabs.nl or https://github.com/NLnetLabs/unbound/issues

tobiasmcnulty commented 4 years ago

Just confirming that setting protocol='ip4_only' does avoid the log spam for me.

EricLuehrsen commented 4 years ago

@dermoth, would you consider your issue closed then?

dermoth commented 4 years ago

From what I can tell, no. I disabled logging for now as I didn't have time to upgrade. The upstream patch that was reportedly fixing this is already in the current release I'm using, so it's unlikely the upgrade will help.

One option would be disabling logging by default. Or I could try compiling latest package on 18.06 and try reproducing it. FWIW the upstream bug is still open...

tobiasmcnulty commented 4 years ago

I would agree, it is not fixed. Ideally the DNS resolver would not need to be prohibited from using IPv6. Not that I have any idea what the issue is now. :slightly_frowning_face:

EricLuehrsen commented 4 years ago

(1) if you configure the server for IPv6, and it doesn't have an IPv6 route, then its attempts to make IPv6 connections rebound and get logged. If WAN and WAN6 have unusual delays between them, then this can naturally happen. If you don't have IPv6, then you need to disable it in Unbound. (2) Unbound was attempting to contact root servers anyway possible regardless of config when it started, and this at least it seems is cleaned up. Conclusion: as a derivative package, there is nothing more to be done in within OpenWrt. This needs a verified upstream fix. Its more than an easy patch. Unbound fundamentally uses all interfaces provided, and this issue stems from there.

Note. My tip way up was incomplete. _"You may also be experiencing interface restart flood from netifd. It may help to use UCI triggers to restrict which networks restart Unbound, list trigger_interface... "_ Something was done a few years ago to ifup events between netifd and procd. IPv6 minor data refresh from your ISP will cause procd to issue an interface up trigger to applications it controls, even every 2 minutes. You need to remove WAN6 from Unbound triggers. You can search core OpenWrt fly spray for more.

dermoth commented 4 years ago

If Unbound flooded the logs between the time ipv4 and ipv6 went up it probably went unnoticed. I noticed the issue during an outage and if ti lasted without any fix my disk would've probably filled up. FWIW I use 6in4 tunneling.

If you don't want to keep the openwrt bug open while there is an open upstream bug (I opened both...) I will reopen one when the issue is fixed upstream.

openwrt / packages

unbound: logging flood on wan down #11904