100% CPU usage after some time, cannot query bandwidth usage

hsgg commented 5 years ago

nlbwmon seems to go into an infinite loop, using ~50% cpu, refusing to service the frontend. This happens after a few days, or sometimes a few weeks. I'm on MIPS, R6220. I see the same sort of strace as mentioned in a forum thread: lots of

recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=0x000001}, msg_namelen=12, msg_iov=[{iov_base={{len=164, type=0x100 /* NLMSG_??? */, flags=0x600 /* NLM_F_??? */, seq=0, pid=0}, "\x02\x00\x00\x00\x3c\x00\x01\x80\x14\x00\x01\x80\x08\x00\x01\x00\xc0\xa8\x01\xf6\x08\x00\x02\x00\xc0\xa8\x01\x01\x24\x00\x02\x80"...}, iov_len=16384}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 164
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)

There are many more of the EAGAIN messages than the ones with more information. The command nlbw -c list just hangs:

root@OpenWrt:~# nlbw -c list

jow- commented 5 years ago

This should be fixed with 4574e6e80e3a5ab8de65886baa0d563dfee589ff now - please reopen if it still occurs.

hsgg commented 5 years ago

Thank you! I updated my router, and will report back in case it occurs again.

PieterBreugelmans commented 4 years ago

This week I proceed to upgrade my R7800 with the latest master build from hnyman and I'm facing the same problem reported in this issue.

OpenWrt SNAPSHOT r11954-7936cb94a9 / LuCI Master git-20.010.59544-ff12aa3

root@jupiter:~# top -b Mem: 146220K used, 330680K free, 2128K shrd, 6352K buff, 22876K cached CPU: 0% usr 28% sys 28% nic 42% idle 0% io 0% irq 0% sirq Load average: 1.01 1.02 1.00 2/85 6912 PID PPID USER STAT VSZ %VSZ %CPU COMMAND 2561 1 root RN 1124 0% 52% /usr/sbin/nlbwmon -o /var/lib/nlbwmon -i 24h -r 30s -p /usr/share/nlbwmon/protocols -G 10 -I 1 -L 10000 -Z -s 192.168.0.0/16 -s 172.16.0.0/12 -s 10.0.0.0/8 -s 192.168.1.1/24 -s fd56:53f0:3524:10::1/60 6912 5816 root R 1092 0% 5% top -b 2927 1 root SN 4372 1% 0% /usr/sbin/collectd -C /tmp/collectd.conf -f

root@jupiter:/sys/devices/system/cpu/cpufreq/policy0# opkg list-installed |grep nlbwmon luci-app-nlbwmon - git-20.010.59544-ff12aa3-1 nlbwmon - 2019-06-06-4574e6e8-2

...
11:34:37.904833 recvmsg(7, {msg_namelen=12}, 0) = -1 EAGAIN (Resource temporarily unavailable)
11:34:37.905715 recvmsg(7, {msg_namelen=12}, 0) = -1 EAGAIN (Resource temporarily unavailable)
11:34:37.906228 recvmsg(7, {msg_namelen=12}, 0) = -1 EAGAIN (Resource temporarily unavailable)
...

Attaching the strace log for a few seconds as a reference. strace_nlbwmon.log

I saw that you @jow- put in https://github.com/jow-/nlbwmon/commit/4574e6e80e3a5ab8de65886baa0d563dfee589ff on the same date (June 6) as the version of nlbwmon I'm supposedly am running on my R7800. Does that suggest the issue is not fully resolved with this commit?

Edit: I noticed this after 2 days (uptime: 2d 15h 33m 0s)

jow- commented 4 years ago

Thanks for the log, I'll investigate later. Unfortunately there is no reliable reproducer for this issue so it might take a while.

digitalcircuit commented 4 years ago

I've run into this issue as well on a ZyXEL NBG6817 with OpenWrt 19.07.0 r10860-a3ffeb413b and nlbwmon 2019-06-06-4574e6e8-1. Unfortunately, I have yet to find a reliable reproducer, either. Most recently it happened after 8d 2h of uptime on a network with WAN, LAN, and two additional VLANs.

If there's any investigation that'd help, I'd be happy to try.

Meanwhile, for those who run into this issue searching for a solution, the forum thread linked above also has a workaround:

Since nlbwmon is frequently going wild and consuming all available processing power from at least one core I've put together some workarounds to be placed in /etc/rc.local file before exit 0 line.

[…]
Automatic restart of nlbwmon process:
(while true; do (NLB_T=45; sleep 15m; NLB_A=$(top -b -n 1 | grep nlbwmon | grep -v grep | awk '{print (int($7))}' | head -1); [ $NLB_A -ge $NLB_T ] && ( sleep 5m; NLB_B=$(top -b -n 1 | grep nlbwmon | grep -v grep | awk '{print (int($7))}' | head -1); [ $NLB_B -ge $NLB_T ] && ( /etc/init.d/nlbwmon restart; logger "WARNING: nlbwmon restarted due to process runaway, proc% was $NLB_A then $NLB_B" ))); done) &

greekstreet commented 4 years ago

I encountered the same issue. Found this line in syslog. Hope this will help. Sat Feb 15 06:54:23 2020 daemon.err nlbwmon[9132]: Unable to dump conntrack: No buffer space available

I restarted nlbwmon service and then CPU% became normal.

davidl500 commented 4 years ago

I'm encountering the same, including the conntrack error; e.g. Wed Apr 8 22:47:41 2020 daemon.err nlbwmon[1646]: Unable to dump conntrack: No buffer space available

Also a MIPS, but the Lantiq VRX268 (BT Homehub 5A), so xRX200 rev 1.2, running OpenWRT 19.07.2

As above, restarting nlbwmon service restored CPU and nlbw functionality.

jow- commented 4 years ago

Thanks, that helped a lot to track down a potential issue. I hope that https://github.com/jow-/nlbwmon/commit/33c77cb2aa5a323b22d7ba5f01b2650fdcccd00a finally addressed the 100% CPU load issue.

pdfruth commented 4 years ago

My experience so far... I was seeing 100% CPU by nlbwmon pretty regularly. Now running the latest commit, I have 15 days of uptime without seeing the problem.

facboy commented 3 years ago

i'm getting these randomly:

Thu Jul 23 11:47:06 2020 daemon.err nlbwmon[3017]: Netlink receive failure: Out of memory
Thu Jul 23 11:47:06 2020 daemon.err nlbwmon[3017]: Unable to dump conntrack: No buffer space available

jow- commented 3 years ago

@facboy - if you see this error, you need to increase option netlink_buffer_size in /etc/config/nlbwmon. The default is 524288 which appears to be too low for your use case.

May I ask how many conntrack entries you have? wc -l /proc/net/nf_conntrack

facboy commented 3 years ago

$ wc -l /proc/net/nf_conntrack
305 /proc/net/nf_conntrack

jow- commented 3 years ago

That is not much, but maybe you have a very high fluctuation of connections, that is conntrack entries appearing and vanishing frequently causing a lot of events to be reported.

martin8801 commented 3 years ago

I've started getting the messages above as well.

wc -l /proc/net/nf_conntrack
395 /proc/net/nf_conntrack

jow- / nlbwmon

100% CPU usage after some time, cannot query bandwidth usage #30