networkupstools / nut

The Network UPS Tools repository. UPS management protocol Informational RFC 9271 published by IETF at https://www.rfc-editor.org/info/rfc9271 Please star NUT on GitHub, this helps with sponsorships!
https://networkupstools.org/
Other
1.91k stars 344 forks source link

design bug: upsmon spams the logs when cannot talk to driver or when driver cannot talk to ups #2207

Closed mailinglists35 closed 9 months ago

mailinglists35 commented 9 months ago

When nut-monitor cannot talk to /lib/nut/$whatever-driver, or when the communication between the driver and the ups is unavailable, it repeats this message in logs:

nov 29 16:06:01 router upsmon[486203]: Poll UPS [nutdev1@localhost] failed - Driver not connected

I wish it allowed a configuration directive or a command line argument to only print this message once. the polite behaviour is already exhibited by upsd, that only prints an equivalent message once:

nov 29 16:10:44 router upsd[476370]: Can't connect to UPS [nutdev1] (nutdrv_qx-serioux): No such file or directory

the repeating of message every 5 seconds in logs provides no use to me, and only annoyance instead.

this is literally the equivalent of ups emiting audible beeps - that makes sense in physical world as having the role of catching a human's attention that might not be near it all the time, so it makes perfect sense to repeat the beep, because in physical world you cannot roll back in time to listen to sounds in a room (unless you have a sound recording device)

however in electronic world the logs are monitored, and a single line is sufficient to trigger an alert, because unlike in physical world, an alert does not dissapear once it has been produced, and someone will be able to read it any time later, thus it make no logical sense to repeat the line.

the same when comm is lost: upsd politely tells me once

Broadcast message from nut@router (somewhere) (Wed Nov 29 16:19:58 2023):

Communications with UPS nutdev1@localhost lost

nov 29 16:19:58 router upsmon[494593]: Poll UPS [nutdev1@localhost] failed - Data stale
nov 29 16:19:58 router upsmon[494593]: Communications with UPS nutdev1@localhost lost

while upsmon begins spamming the logs every 5 seconds:

nov 29 16:20:13 router upsmon[494593]: Poll UPS [nutdev1@localhost] failed - Data stale
jimklimov commented 9 months ago

I agree, and a PR is welcome :)

For context, not always not everywhere logs are monitored (so as to make alerts) from them in real time. As a side note, NUT probably predates most of the fancier log-monitoring systems you see today, so the unseen/unborn techs were not a factor in the original design :)

In fact, upsmon has "notifications" to WALL/SYSLOG/EXEC for a number of events and state changes, and those are throttled (IIRC e.g. walling once in 5 mins by default for Data Stale).

no logical sense to repeat...

I believe it just runs the polling loop and reports encountered errors. To stderr typically, which is then sysadmin's or packager's "fault" if it ends up logged (though maybe upsdebugx()/upslogx() does ditto or redirect to syslog at some point like forking and/or based on internal config). It is extra work to remember that the same error was reported on last cycle (such work is the welcome PR) and e.g. report it only for non-zero debug verbosity for some time. Probably the throttling approavh for notifications can be used here.

mailinglists35 commented 9 months ago

PR

I am not a programmer, but I am wondering if the code logic that emits the similar log in upsd can not be applied to the routine logic blocks in upsmon? would it be feasible?

jimklimov commented 9 months ago

Had a bit of time to take a look at this locally, adding a dummy-ups driver definition to monitor, which is not actually serving data, and I see logged messages every 30 sec (not 5 as in original post), and notifications (stream landing to syslog) every 5 min:

# systemctl reload upsmon
Dec 01 09:30:03 pve systemd[1]: Reloading Network UPS Tools - power device monitor and shutdown controller.
Dec 01 09:30:03 pve nut-monitor[3743529]: Network UPS Tools upsmon 2.8.1-30-ge1fc248fb
Dec 01 09:30:03 pve nut-monitor[8773]: Reloading configuration
Dec 01 09:30:03 pve nut-monitor[8773]: UPS: dummy (monitoring only)
Dec 01 09:30:03 pve nut-monitor[8773]: Applying debug level 0 from original command line arguments
Dec 01 09:30:03 pve nut-monitor[8773]: Applying pollfail_log_throttle_max=5 from upsmon.conf
Dec 01 09:30:03 pve systemd[1]: Reloaded Network UPS Tools - power device monitor and shutdown controller.
Dec 01 09:30:03 pve nut-monitor[8773]: Poll UPS [dummy] failure state code changed from -1 to 39; report below will only be repeated to syslog every 5 polling loop cycles:
Dec 01 09:30:03 pve nut-monitor[8773]: Poll UPS [dummy] failed - Driver not connected
Dec 01 09:30:03 pve nut-monitor[8773]: Communications with UPS dummy lost
Dec 01 09:30:03 pve nut-monitor[3743532]: Network UPS Tools upsmon 2.8.1-30-ge1fc248fb
...
Dec 01 09:35:13 pve nut-monitor[8773]: UPS dummy is unavailable
Dec 01 09:35:13 pve nut-monitor[3773560]: Network UPS Tools upsmon 2.8.1-30-ge1fc248fb
Dec 01 09:35:33 pve nut-monitor[8773]: Poll UPS [dummy] failed - Driver not connected
Dec 01 09:36:03 pve nut-monitor[8773]: Poll UPS [dummy] failed - Driver not connected
Dec 01 09:36:33 pve nut-monitor[8773]: Poll UPS [dummy] failed - Driver not connected
Dec 01 09:37:03 pve nut-monitor[8773]: Poll UPS [dummy] failed - Driver not connected
Dec 01 09:37:33 pve nut-monitor[8773]: Poll UPS [dummy] failed - Driver not connected
Dec 01 09:38:03 pve nut-monitor[8773]: Poll UPS [dummy] failed - Driver not connected
Dec 01 09:38:33 pve nut-monitor[8773]: Poll UPS [dummy] failed - Driver not connected
Dec 01 09:39:03 pve nut-monitor[8773]: Poll UPS [dummy] failed - Driver not connected
Dec 01 09:39:33 pve nut-monitor[8773]: Poll UPS [dummy] failed - Driver not connected
Dec 01 09:40:03 pve nut-monitor[8773]: Poll UPS [dummy] failed - Driver not connected
Dec 01 09:40:18 pve nut-monitor[8773]: UPS dummy is unavailable
Dec 01 09:40:18 pve nut-monitor[3805846]: Network UPS Tools upsmon 2.8.1-30-ge1fc248fb
Dec 01 09:40:33 pve nut-monitor[8773]: Poll UPS [dummy] failed - Driver not connected
Dec 01 09:41:03 pve nut-monitor[8773]: Poll UPS [dummy] failed - Driver not connected

Regarding the 30-sec poll-fail message cadence - I don't see anything to that effect in explicit config (the 5-minute NOCOMMWARNTIME 300 is there for example), although it may be an off-by-one error with "pollfail_log_throttle_max=5" handling (assuming (POLLFAIL_LOG_THROTTLE_MAX+1) * POLLFREQ = (5+1)*5 = 30 here); as a coder gotta suspect things like these:

# egrep '[0-9]' /etc/nut/upsmon.conf
MONITOR dummy 0 upsmon passw0rd primary
MONITOR eco650 1 upsmon "passw0rd" primary
MINSUPPLIES 1
SHUTDOWNCMD "/sbin/shutdown -h +0"
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
RBWARNTIME 43200
NOCOMMWARNTIME 300
FINALDELAY 5
#DEBUG_MIN 2
POLLFAIL_LOG_THROTTLE_MAX 5

After I actually upsdrvctl start dummy I see:

Dec 01 09:55:43 pve nut-monitor[8773]: Poll UPS [dummy] recovered from failure state code 39 - now -1
Dec 01 09:55:43 pve nut-monitor[8773]: Communications with UPS dummy established
Dec 01 09:55:43 pve nut-monitor[3921395]: Network UPS Tools upsmon 2.8.1-30-ge1fc248fb

...and it does not complain anymore.

Then I pause the driver process, so it is not dead but does not respond (SIGSTOP here), and this triggers a "Data stale" situation, reported every 30 sec like for a not-connected driver:

root@pve:~# kill -l
 1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
 6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
16) SIGSTKFLT   17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO       30) SIGPWR
31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
63) SIGRTMAX-1  64) SIGRTMAX

root@pve:~# ps -ef | grep dumm
nut      3921184       1  0 09:55 ?        00:00:00 /usr/local/ups/bin/dummy-ups -a dummy
root     3986247 3705589  0 09:57 pts/0    00:00:00 grep dumm

root@pve:~# kill -19 3921184

root@pve:~# 
Dec 01 09:58:23 pve nut-monitor[8773]: Poll UPS [dummy] failure state code changed from -1 to 10; report below will only be repeated to syslog every 5 polling loop cycles:
Dec 01 09:58:23 pve nut-monitor[8773]: Poll UPS [dummy] failed - Data stale
Dec 01 09:58:23 pve nut-monitor[8773]: Communications with UPS dummy lost
Dec 01 09:58:23 pve nut-monitor[4001392]: Network UPS Tools upsmon 2.8.1-30-ge1fc248fb
Dec 01 09:58:53 pve nut-monitor[8773]: Poll UPS [dummy] failed - Data stale
Dec 01 09:59:23 pve nut-monitor[8773]: Poll UPS [dummy] failed - Data stale
Dec 01 09:59:53 pve nut-monitor[8773]: Poll UPS [dummy] failed - Data stale
Dec 01 10:00:23 pve nut-monitor[8773]: Poll UPS [dummy] failed - Data stale
Dec 01 10:00:53 pve nut-monitor[8773]: Poll UPS [dummy] failed - Data stale
Dec 01 10:01:23 pve nut-monitor[8773]: Poll UPS [dummy] failed - Data stale
Dec 01 10:01:53 pve nut-monitor[8773]: Poll UPS [dummy] failed - Data stale
Dec 01 10:02:23 pve nut-monitor[8773]: Poll UPS [dummy] failed - Data stale
Dec 01 10:02:53 pve nut-monitor[8773]: Poll UPS [dummy] failed - Data stale
Dec 01 10:03:18 pve nut-monitor[8773]: UPS dummy is unavailable
Dec 01 10:03:18 pve nut-monitor[4048363]: Network UPS Tools upsmon 2.8.1-30-ge1fc248fb
Dec 01 10:03:23 pve nut-monitor[8773]: Poll UPS [dummy] failed - Data stale
...

Changing POLLFAIL_LOG_THROTTLE_MAX in upsmon.conf and reloading upsmon:

# systemctl reload upsmon
Dec 01 10:10:02 pve systemd[1]: Reloading Network UPS Tools - power device monitor and shutdown controller.
Dec 01 10:10:02 pve nut-monitor[4112411]: Network UPS Tools upsmon 2.8.1-30-ge1fc248fb
Dec 01 10:10:02 pve nut-monitor[8773]: Reloading configuration
Dec 01 10:10:02 pve nut-monitor[8773]: Applying debug level 0 from original command line arguments
Dec 01 10:10:02 pve nut-monitor[8773]: Applying pollfail_log_throttle_max=3 from upsmon.conf
Dec 01 10:10:02 pve systemd[1]: Reloaded Network UPS Tools - power device monitor and shutdown controller.

Yep, now it is logged every 20 sec ((3+1)*5) so an off-by-one case is here :)

Dec 01 10:10:12 pve nut-monitor[8773]: Poll UPS [dummy] failed - Data stale
Dec 01 10:10:32 pve nut-monitor[8773]: Poll UPS [dummy] failed - Data stale
Dec 01 10:10:52 pve nut-monitor[8773]: Poll UPS [dummy] failed - Data stale
Dec 01 10:11:12 pve nut-monitor[8773]: Poll UPS [dummy] failed - Data stale
Dec 01 10:11:32 pve nut-monitor[8773]: Poll UPS [dummy] failed - Data stale

With the setting commented away, and upsmon restarted (reload does not suffice for a disappeared config here) it indeed logs every 5 sec as in original post:

Dec 01 10:13:42 pve nut-monitor[4156012]: Poll UPS [dummy] failed - Data stale
Dec 01 10:13:42 pve nut-monitor[4156012]: UPS dummy is unavailable
Dec 01 10:13:42 pve nut-monitor[4158119]: Network UPS Tools upsmon 2.8.1-30-ge1fc248fb
Dec 01 10:13:47 pve nut-monitor[4156012]: Poll UPS [dummy] failed - Data stale
Dec 01 10:13:52 pve nut-monitor[4156012]: Poll UPS [dummy] failed - Data stale
Dec 01 10:13:57 pve nut-monitor[4156012]: Poll UPS [dummy] failed - Data stale
Dec 01 10:14:02 pve nut-monitor[4156012]: Poll UPS [dummy] failed - Data stale
Dec 01 10:14:07 pve nut-monitor[4156012]: Poll UPS [dummy] failed - Data stale
Dec 01 10:14:12 pve nut-monitor[4156012]: Poll UPS [dummy] failed - Data stale
Dec 01 10:14:17 pve nut-monitor[4156012]: Poll UPS [dummy] failed - Data stale

In docs/man/upsmon.conf.txt I see:

Also upsmon normally reports polling failures for each device that are in place for each POLLFREQ loop (e.g. "Data stale" or "Driver not connected") to system log as configured. If your devices are expected to be AWOL for an extended timeframe, you can use POLLFAIL_LOG_THROTTLE_MAX to reduce the stress on syslog traffic and storage, by posting these messages only once in every several loop cycles, and when the error condition has changed or cleared. A negative value means standard behavior, and a zero value means to never repeat the message (log only on start and end/change of the failure state).

...and similar in conf/upsmon.conf.sample.

jimklimov commented 9 months ago

A bit more sleuthing, and PR #1784 comes up, so it seems I've already fixed that earlier in the busy year and it got fully flushed from memory, oops -\_(o,o)_/=

The feature should be part of NUT codebase since January, so would be in v2.8.1 release (issued a month ago). Should not have the off-by-one behavior in v2.8.2 and later, thanks ;)

jimklimov commented 9 months ago

Following up on the experiment above: with the zero POLLFAIL_LOG_THROTTLE_MAX value it does not spam the stderr anymore at all; notifications (each 300 sec = 5 min) are not impacted:

Dec 01 10:16:07 pve systemd[1]: Reloading Network UPS Tools - power device monitor and shutdown controller.
Dec 01 10:16:07 pve nut-monitor[4189297]: Network UPS Tools upsmon 2.8.1-30-ge1fc248fb
Dec 01 10:16:07 pve nut-monitor[4156012]: Reloading configuration
Dec 01 10:16:07 pve nut-monitor[4156012]: Applying debug level 0 from original command line arguments
Dec 01 10:16:07 pve nut-monitor[4156012]: Applying pollfail_log_throttle_max=0 from upsmon.conf
Dec 01 10:16:07 pve nut-monitor[4156012]: Poll UPS [dummy] failure state code changed from -1 to 10; report below will not be repeated to syslog:
Dec 01 10:16:07 pve nut-monitor[4156012]: Poll UPS [dummy] failed - Data stale
Dec 01 10:16:07 pve systemd[1]: Reloaded Network UPS Tools - power device monitor and shutdown controller.
Dec 01 10:18:47 pve nut-monitor[4156012]: UPS dummy is unavailable
Dec 01 10:18:47 pve nut-monitor[17155]: Network UPS Tools upsmon 2.8.1-30-ge1fc248fb
Dec 01 10:23:52 pve nut-monitor[4156012]: UPS dummy is unavailable
Dec 01 10:23:52 pve nut-monitor[62402]: Network UPS Tools upsmon 2.8.1-30-ge1fc248fb

# kill -SIGCONT 3921184
Dec 01 10:28:32 pve nut-monitor[4156012]: Poll UPS [dummy] recovered from failure state code 10 - now -1
Dec 01 10:28:32 pve nut-monitor[4156012]: Communications with UPS dummy established
Dec 01 10:28:32 pve nut-monitor[94081]: Network UPS Tools upsmon 2.8.1-30-ge1fc248fb

So there's that. Noise-less mode also present and working.

A "design bug" was here, Now it is gone, Projects are like that :)

jimklimov commented 9 months ago

Thanks for the report!

While it was ultimately about an issue already solved and published in a NUT release (seems you've never mentioned which one you are running BTW - given that v2.8.1 is relatively recent, your distro's packaging might have not caught up yet), it also helped discover some issues in the earlier implementation so the next release would be even better :)