networkupstools / nut

The Network UPS Tools repository. UPS management protocol Informational RFC 9271 published by IETF at https://www.rfc-editor.org/info/rfc9271 Please star NUT on GitHub, this helps with sponsorships!
https://networkupstools.org/
Other
2k stars 349 forks source link

Looking at logs, `upsmon` `notify()` correlates with a new PID and program banner #1783

Closed jimklimov closed 6 months ago

jimklimov commented 1 year ago

Example from systemd journal:

Jan 07 22:00:37 pve nut-monitor[3986717]: Poll UPS [eco650] failed - Driver not connected
Jan 07 22:00:37 pve nut-monitor[3986717]: UPS eco650 is unavailable
Jan 07 22:00:37 pve nut-monitor[3996883]: Network UPS Tools upsmon 2.8.0-Windows-313-ga3fb36942
Jan 07 22:00:42 pve nut-monitor[3986717]: Poll UPS [eco650] failed - Driver not connected
...
Jan 07 22:05:42 pve nut-monitor[3986717]: Poll UPS [eco650] failed - Driver not connected
Jan 07 22:05:42 pve nut-monitor[3986717]: UPS eco650 is unavailable
Jan 07 22:05:42 pve nut-monitor[4017882]: Network UPS Tools upsmon 2.8.0-Windows-313-ga3fb36942
Jan 07 22:05:47 pve nut-monitor[3986717]: Poll UPS [eco650] failed - Driver not connected
Jan 07 22:05:52 pve nut-monitor[3986717]: Poll UPS [eco650] failed - Driver not connected

So it is every 5 minutes (corresponding to default NOCOMMWARNTIME) and notify() does fork() for NOTIFYCMD and popen() for wall(); however neither of these should print the Network UPS Tools upsmon 2.8.0-Windows-313-ga3fb36942 banner which I only see in main() :\

Systemd unit state (uptime) remains uninterrupted.

In greater detail around the notification event:


Jan 07 22:21:41 pve nut-monitor[4043648]:  110.005340        [D2] pollups: eco650
Jan 07 22:21:41 pve nut-monitor[4043648]:  110.005407        [D2] parse_status: [OL]
Jan 07 22:21:41 pve systemd[1]: Stopping Network UPS Tools - device driver for NUT device 'eco650'...
Jan 07 22:21:42 pve usbhid-ups[4044219]: Signal 15: exiting
Jan 07 22:21:42 pve upsd[3986705]: Can't connect to UPS [eco650] (usbhid-ups-eco650): No such file or directory
Jan 07 22:21:42 pve nut-server[3986705]: Can't connect to UPS [eco650] (usbhid-ups-eco650): No such file or directory
Jan 07 22:21:43 pve nut-driver@eco650[4045862]: fopen /var/state/ups/usbhid-ups-eco650.pid: No such file or directory
Jan 07 22:21:43 pve nut-driver@eco650[4045862]: Network UPS Tools - UPS driver controller 2.8.0-Windows-313-ge6d4008da
Jan 07 22:21:43 pve systemd[1]: nut-driver@eco650.service: Succeeded.
Jan 07 22:21:43 pve systemd[1]: Stopped Network UPS Tools - device driver for NUT device 'eco650'.
Jan 07 22:21:46 pve nut-monitor[4043648]:  115.005476        [D2] pollups: eco650
Jan 07 22:21:46 pve nut-monitor[4043648]:  115.005585        [D1] Poll UPS [eco650] failed - [Driver not connected]
Jan 07 22:21:46 pve nut-monitor[4043648]:  115.005592        [D2] do_notify: ntype 0x0005 (COMMBAD)
Jan 07 22:21:46 pve nut-monitor[4043648]:  115.005596        Communications with UPS eco650 lost
Jan 07 22:21:46 pve nut-monitor[4045962]: Network UPS Tools upsmon 2.8.0-Windows-313-ge6d4008da
Jan 07 22:21:51 pve nut-monitor[4043648]:  120.005787        [D2] pollups: eco650
Jan 07 22:21:51 pve nut-monitor[4043648]:  120.005859        [D1] Poll UPS [eco650] failed - [Driver not connected]
jimklimov commented 6 months ago

Easily reproduced by running a working setup, watching the logs, and stopping a driver. With DEBUG_MIN 6 in upsmon.conf:

### Monitoring OK

Apr 06 19:23:01 pve nut-monitor[3127289]:   15.002685        [D4:3127289] ups_on_line: eco650 (no change)
Apr 06 19:23:01 pve nut-monitor[3127289]:   15.002687        [D3:3127289] Current power value: 1
Apr 06 19:23:01 pve nut-monitor[3127289]:   15.002689        [D3:3127289] Minimum power value: 1

### Stopping driver

Apr 06 19:23:06 pve nut-monitor[3127289]:   20.002761        [D2:3127289] pollups: eco650
Apr 06 19:23:06 pve nut-monitor[3127289]:   20.002782        [D3:3127289] get_var: eco650 / status
Apr 06 19:23:06 pve nut-monitor[3127289]:   20.002867        [D3:3127289] pollups: Poll UPS [eco650] after getvar(status) failed: upserror=39
Apr 06 19:23:06 pve nut-monitor[3127289]:   20.002873        Poll UPS [eco650] failed - Driver not connected
Apr 06 19:23:06 pve nut-monitor[3127289]:   20.002877        [D2:3127289] do_notify: ntype 0x0005 (COMMBAD)
Apr 06 19:23:06 pve nut-monitor[3127289]:   20.002887        [D6:3127289] notify: sending notification for [eco650]: type COMMBAD with flags 0x000e: Communications with UPS eco650 lost
Apr 06 19:23:06 pve nut-monitor[3127289]:   20.002890        [D6:3127289] notify: NOTIFY_SYSLOG (as LOG_NOTICE)
Apr 06 19:23:06 pve nut-monitor[3127289]:   20.002893        Communications with UPS eco650 lost
Apr 06 19:23:06 pve nut-monitor[3127289]:   20.003070        [D6:3127289] notify (parent): forked a child to notify via subprocesses
Apr 06 19:23:06 pve nut-monitor[3127289]:   20.003080        [D3:3127289] Current power value: 1
Apr 06 19:23:06 pve nut-monitor[3127289]:   20.003085        [D3:3127289] Minimum power value: 1
Apr 06 19:23:06 pve nut-monitor[3134816]:   20.003118        [D6:3134816] notify (child): forked to notify via subprocesses
Apr 06 19:23:06 pve nut-monitor[3134816]:   20.003137        [D6:3134816] notify (child): NOTIFY_WALL
Apr 06 19:23:06 pve nut-monitor[3134816]:   20.005198        [D6:3134816] notify (child): NOTIFY_EXEC: calling NOTIFYCMD as '/usr/local/ups/bin/notifyme "Communications with UPS eco650 lost"'
Apr 06 19:23:06 pve nut-monitor[3134816]: Network UPS Tools upsmon 2.8.2-22-g60f76bfa9

### Next loop

Apr 06 19:23:11 pve nut-monitor[3127289]:   25.003157        [D2:3127289] pollups: eco650
Apr 06 19:23:11 pve nut-monitor[3127289]:   25.003171        [D3:3127289] get_var: eco650 / status
Apr 06 19:23:11 pve nut-monitor[3127289]:   25.003246        [D3:3127289] pollups: Poll UPS [eco650] after getvar(status) failed: upserror=39
Apr 06 19:23:11 pve nut-monitor[3127289]:   25.003250        Poll UPS [eco650] failed - Driver not connected
Apr 06 19:23:11 pve nut-monitor[3127289]:   25.003252        [D3:3127289] Current power value: 1
Apr 06 19:23:11 pve nut-monitor[3127289]:   25.003254        [D3:3127289] Minimum power value: 1

NOTE: Further debugging aid via envvar:

# cat /etc/systemd/system/nut-monitor.service.d/debug.conf
[Service]
Environment="NUT_DEBUG_PID=yes"
jimklimov commented 6 months ago

After adding some more debug messages, it seems that the banner appears as part of exit() handling for parent and child processes alike, although it may be not in upsmon.c itself -- don't see any atexit() registrations. Checking open_syslog() etc...

jimklimov commented 6 months ago

And this seems to happen for other NUT daemon exits, e.g.:

:;  systemctl stop nut-driver@eco650
...
Apr 06 20:08:32 pve systemd[1]: Stopping Network UPS Tools - device driver for NUT device 'eco650'...
Apr 06 20:08:32 pve nut-driver@eco650[4020967]: Signal 15: exiting
Apr 06 20:08:32 pve usbhid-ups[4020967]: Signal 15: exiting
Apr 06 20:08:33 pve nut-driver@eco650[4133136]: fopen /var/state/ups/usbhid-ups-eco650.pid: No such file or directory
Apr 06 20:08:33 pve nut-driver@eco650[4133136]: Network UPS Tools - UPS driver controller 2.8.2-24-g962c15437
Apr 06 20:08:33 pve systemd[1]: nut-driver@eco650.service: Succeeded.
Apr 06 20:08:33 pve systemd[1]: Stopped Network UPS Tools - device driver for NUT device 'eco650'.
jimklimov commented 6 months ago

Just beautiful! The message did come from main() initialization of the daemons. It was also stuck in the buffer (at least for non-interactive logging) and was cloned into forked child processes, so every exit() called a fflush() of open file descriptors while closing them - and that got the buffered message actually printed.

One give-away was that the banner was not actually seen in systemd journals when the service started.