Closed hsgg closed 5 years ago
This should be fixed with 4574e6e80e3a5ab8de65886baa0d563dfee589ff now - please reopen if it still occurs.
Thank you! I updated my router, and will report back in case it occurs again.
This week I proceed to upgrade my R7800 with the latest master build from hnyman and I'm facing the same problem reported in this issue.
OpenWrt SNAPSHOT r11954-7936cb94a9 / LuCI Master git-20.010.59544-ff12aa3
root@jupiter:~# top -b Mem: 146220K used, 330680K free, 2128K shrd, 6352K buff, 22876K cached CPU: 0% usr 28% sys 28% nic 42% idle 0% io 0% irq 0% sirq Load average: 1.01 1.02 1.00 2/85 6912 PID PPID USER STAT VSZ %VSZ %CPU COMMAND 2561 1 root RN 1124 0% 52% /usr/sbin/nlbwmon -o /var/lib/nlbwmon -i 24h -r 30s -p /usr/share/nlbwmon/protocols -G 10 -I 1 -L 10000 -Z -s 192.168.0.0/16 -s 172.16.0.0/12 -s 10.0.0.0/8 -s 192.168.1.1/24 -s fd56:53f0:3524:10::1/60 6912 5816 root R 1092 0% 5% top -b 2927 1 root SN 4372 1% 0% /usr/sbin/collectd -C /tmp/collectd.conf -f
root@jupiter:/sys/devices/system/cpu/cpufreq/policy0# opkg list-installed |grep nlbwmon luci-app-nlbwmon - git-20.010.59544-ff12aa3-1 nlbwmon - 2019-06-06-4574e6e8-2
...
11:34:37.904833 recvmsg(7, {msg_namelen=12}, 0) = -1 EAGAIN (Resource temporarily unavailable)
11:34:37.905715 recvmsg(7, {msg_namelen=12}, 0) = -1 EAGAIN (Resource temporarily unavailable)
11:34:37.906228 recvmsg(7, {msg_namelen=12}, 0) = -1 EAGAIN (Resource temporarily unavailable)
...
Attaching the strace log for a few seconds as a reference. strace_nlbwmon.log
I saw that you @jow- put in https://github.com/jow-/nlbwmon/commit/4574e6e80e3a5ab8de65886baa0d563dfee589ff on the same date (June 6) as the version of nlbwmon I'm supposedly am running on my R7800. Does that suggest the issue is not fully resolved with this commit?
Edit: I noticed this after 2 days (uptime: 2d 15h 33m 0s)
Thanks for the log, I'll investigate later. Unfortunately there is no reliable reproducer for this issue so it might take a while.
I've run into this issue as well on a ZyXEL NBG6817 with OpenWrt 19.07.0 r10860-a3ffeb413b
and nlbwmon 2019-06-06-4574e6e8-1
. Unfortunately, I have yet to find a reliable reproducer, either. Most recently it happened after 8d 2h
of uptime on a network with WAN, LAN, and two additional VLANs.
If there's any investigation that'd help, I'd be happy to try.
Meanwhile, for those who run into this issue searching for a solution, the forum thread linked above also has a workaround:
Since nlbwmon is frequently going wild and consuming all available processing power from at least one core I've put together some workarounds to be placed in /etc/rc.local file before exit 0 line.
[…]
- Automatic restart of nlbwmon process:
(while true; do (NLB_T=45; sleep 15m; NLB_A=$(top -b -n 1 | grep nlbwmon | grep -v grep | awk '{print (int($7))}' | head -1); [ $NLB_A -ge $NLB_T ] && ( sleep 5m; NLB_B=$(top -b -n 1 | grep nlbwmon | grep -v grep | awk '{print (int($7))}' | head -1); [ $NLB_B -ge $NLB_T ] && ( /etc/init.d/nlbwmon restart; logger "WARNING: nlbwmon restarted due to process runaway, proc% was $NLB_A then $NLB_B" ))); done) &
I encountered the same issue. Found this line in syslog. Hope this will help. Sat Feb 15 06:54:23 2020 daemon.err nlbwmon[9132]: Unable to dump conntrack: No buffer space available
I restarted nlbwmon service and then CPU% became normal.
I'm encountering the same, including the conntrack error; e.g. Wed Apr 8 22:47:41 2020 daemon.err nlbwmon[1646]: Unable to dump conntrack: No buffer space available
Also a MIPS, but the Lantiq VRX268 (BT Homehub 5A), so xRX200 rev 1.2, running OpenWRT 19.07.2
As above, restarting nlbwmon service restored CPU and nlbw functionality.
Thanks, that helped a lot to track down a potential issue. I hope that https://github.com/jow-/nlbwmon/commit/33c77cb2aa5a323b22d7ba5f01b2650fdcccd00a finally addressed the 100% CPU load issue.
My experience so far... I was seeing 100% CPU by nlbwmon pretty regularly. Now running the latest commit, I have 15 days of uptime without seeing the problem.
i'm getting these randomly:
Thu Jul 23 11:47:06 2020 daemon.err nlbwmon[3017]: Netlink receive failure: Out of memory
Thu Jul 23 11:47:06 2020 daemon.err nlbwmon[3017]: Unable to dump conntrack: No buffer space available
@facboy - if you see this error, you need to increase option netlink_buffer_size
in /etc/config/nlbwmon
. The default is 524288
which appears to be too low for your use case.
May I ask how many conntrack entries you have? wc -l /proc/net/nf_conntrack
$ wc -l /proc/net/nf_conntrack
305 /proc/net/nf_conntrack
That is not much, but maybe you have a very high fluctuation of connections, that is conntrack entries appearing and vanishing frequently causing a lot of events to be reported.
I've started getting the messages above as well.
wc -l /proc/net/nf_conntrack
395 /proc/net/nf_conntrack
nlbwmon
seems to go into an infinite loop, using ~50% cpu, refusing to service the frontend. This happens after a few days, or sometimes a few weeks. I'm on MIPS, R6220. I see the same sort of strace as mentioned in a forum thread: lots ofThere are many more of the
EAGAIN
messages than the ones with more information. The commandnlbw -c list
just hangs: