Closed dacarson closed 4 months ago
Hi, sounds great!
One thing that caught my attention first is the single nut-driver.service - I don't think NUT currently ships one (2.7.4 did). Instead, there would be a nut-driver@.service template and instances generated by NDE (https://github.com/networkupstools/nut/wiki/nut%E2%80%90driver%E2%80%90enumerator-(NDE)) separately for each driver, so a failure of one does not cause a restart of all the others (and their run-time dependencies can differ, e.g. an SNMP UPS driver needs networking while USB/serial/i2c drivers do not and can start ASAP).
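To illustrate the per-driver split: each section of ups.conf ends up as its own nut-driver@NAME.service instance. Below is a minimal shell sketch of that mapping, assuming a simple ups.conf with plain section names (the real NDE also copes with names that need escaping, which this ignores). The function name list_driver_units is made up for this example and is not part of NUT.

```sh
#!/bin/sh
# Hypothetical helper (not part of NUT): print the systemd instance name
# that would correspond to each [section] of a ups.conf-style file.
# Simplification: assumes section names are plain words; the real NDE
# may mangle unusual names.
list_driver_units() {
    sed -n 's/^[[:space:]]*\[\([^]]*\)][[:space:]]*$/nut-driver@\1.service/p' "$1"
}

# Example: list_driver_units /etc/nut/ups.conf
```

With one instance per device, systemd can restart a single failed driver and give each instance its own dependencies.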
I see that your driver is based on a recent NUT upstream codebase - it can be helpful to update the rest of the running services to use your build, if only to get the more advanced debugging and other features. I hope there are no fatal incompatibilities (ABI- or protocol-wise) between older/newer drivers, servers and clients - those would be unfortunate and unplanned - but better safe than sorry in this regard too. Especially with an older upsd and newer drivers - they could use a newer iteration of the Unix socket protocol that the server would not accept and would at best ignore; this coupling is treated as more intimate and might be less "protected" than the networked protocol intended to talk to unspecified third-party clients.
I am not quickly sure what to make of this part:
May 14 10:20:04 missionpi upsmon[3952]: Using power down flag file /etc/killpower
May 14 10:20:04 missionpi upsmon[3952]: '/etc/killpower' exists, but we can't read from it: No such file or directory
May 14 10:20:04 missionpi upsmon[3952]: POWERDOWNFLAG (/etc/killpower) does not containthe upsmon magic string - disabling!
It seems like the file exists but at the same time "No such file or directory". I wonder if there is a confusing filesystem object (e.g. a symlink pointing nowhere, so there is a directory entry but indeed nothing to read from) or some bug in 2.7.4 about this?
Also not sure about the "disabling" part. It would make sense for upsmon -K (a separate program call checking late in shutdown that the daemonized copy of upsmon saved the killpower file upon an FSD) to ignore an invalid file; however the daemon startup should have at best removed the file and marched on (IIRC). So I've re-read the code, and the message with "disabling" (and a missing space) is in fact from clear_pdflag(), which should have removed the file but could not in this case - so the daemon treats the flag as not configured (it assumes the filename is occupied by something unrelated to NUT, so we should not corrupt/remove that file), which probably leads to your other issues with shutting down.
Normally, upsmon is started by root and depending on settings it either stays root, or by default splits into two daemons (a bit under root and most of the work happening under nut or nobody). The root-owned part is responsible for touch-file creation in case of FSD and for calling the SHUTDOWNCMD, and then exits (and/or is killed by OS shutdown processing that goes on a killing spree for remaining processes). For systems with late shutdown handling, either with old-style init scripts managing the whole OS life cycle until power-off, or with systemd shutdown hook support, there is a chance to run custom code after that killing spree. The nutshutdown script can be used (or adapted in non-Linux OSes) to check if the killpower flag exists, and run the driver program to tell the UPS to cut the power (if it supports such an operation). Typically such a script then sleeps for an hour and reboots, to work around UPSes that would not power off when the wall power has returned at the wrong time (during the shutdown), so your servers would not stay halted indefinitely.
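The late-shutdown logic described above can be sketched roughly as follows. This is not the nutshutdown script shipped with NUT, only an illustration of the same idea: the function name, the binary paths and the WAIT_SECONDS variable are assumptions, and the paths differ between distros.

```sh
#!/bin/sh
# Sketch of a late-shutdown hook. Paths are assumptions (they differ
# between distros) and can be overridden through the environment.
UPSMON="${UPSMON:-/sbin/upsmon}"
UPSDRVCTL="${UPSDRVCTL:-/sbin/upsdrvctl}"
REBOOT="${REBOOT:-/sbin/reboot}"
WAIT_SECONDS="${WAIT_SECONDS:-3600}"

late_shutdown_hook() {
    # "upsmon -K" exits successfully only if the POWERDOWNFLAG is set.
    if "$UPSMON" -K >/dev/null 2>&1; then
        echo "killpower flag is set, asking the driver(s) to power off the UPS"
        "$UPSDRVCTL" shutdown
        # If wall power came back during shutdown, the UPS may not cut
        # the load, so wait and then reboot instead of staying halted.
        sleep "$WAIT_SECONDS"
        "$REBOOT"
    else
        echo "killpower flag is not set, nothing to do"
    fi
}
```

A real hook would end by calling late_shutdown_hook, and would have to be installed somewhere that runs after the remaining processes have been killed.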
By the way, the thread you've mentioned leads to some other investigations and write-ups about this, notably https://github.com/networkupstools/nut/wiki/Technicalities:-Work-with-PID-and-state-file-paths (noting that e.g. for upsmon the PID file is that of the child process, if they split ways and are not a monoprocess; if the child is killed the parent should exit too).
Hi, sounds great!
Thanks for your prompt and detailed response.
One thing that caught my attention first is the single nut-driver.service - I don't think NUT currently ships one (2.7.4 did). Instead, there would be a nut-driver@.service template and instances generated by NDE (https://github.com/networkupstools/nut/wiki/nut%E2%80%90driver%E2%80%90enumerator-(NDE)) separately for each driver - so a failure of one does not cause restart of everyone (and their run-time dependencies can differ, e.g. an SNMP UPS driver needs networking while USB/serial/i2c does not and can start ASAP).
I have seen the newer nut-driver@.service services. It makes sense to separate them. It looks like I do have an older version installed:
$ sudo apt show nut
Package: nut
Version: 2.7.4-13
Priority: optional
Section: metapackages
Maintainer: Laurent Bigonville <bigon@debian.org>
Installed-Size: 276 kB
Depends: nut-client, nut-server
Homepage: https://networkupstools.org/
Tag: admin::monitoring, hardware::power, hardware::power:ups,
interface::daemon, network::server, role::program, scope::utility
Download-Size: 247 kB
APT-Manual-Installed: yes
APT-Sources: http://deb.debian.org/debian bullseye/main arm64 Packages
Description: network UPS tools - metapackage
Network UPS Tools (NUT) is a client/server monitoring system that
allows computers to share uninterruptible power supply (UPS) and
power distribution unit (PDU) hardware. Clients access the hardware
through the server, and are notified whenever the power status
changes.
.
This package is a metapackage that installs both nut-server and nut-client,
in most cases it is sufficient for a basic UPS monitoring system.
Though it says that it is the newest version:
$ sudo apt-get -s install nut
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
nut is already the newest version (2.7.4-13).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
I see that your driver is based on recent NUT upstream codebase - it can be helpful to update the rest of the running services to use your build, if only to get the more advanced debugging and other features. I hope there are no fatal incompatibilities (ABI or protocol wise) between older/newer drivers and servers and clients - those would be unfortunate and unplanned - but better safe than sorry in this regard too.
Until this point, my newer driver has been running fine. I monitor the data from it on another RPi running Grafana. I do occasionally see this from the nut-driver service:
May 14 10:21:50 missionpi upsplus[4039]: upsnotify: failed to notify about state 2: no notification tech defined, will not spam more about it
May 14 10:21:51 missionpi upsplus[4039]: sock_connect: enabling asynchronous mode (auto)
May 14 10:24:45 missionpi upsplus[4039]: WARNING: send_to_all: write 32 bytes to socket 6 failed (ret=-1), disconnecting: Broken pipe
May 14 10:24:45 missionpi upsplus[4039]: sock_connect: enabling asynchronous mode (auto)
I've tried to understand this issue, but it seems others have just ignored it.
Especially with older upsd and newer drivers - they could use a newer iteration of Unix socket protocol that the server would not accept and at best ignore; this coupling is treated as more intimate and might be less "protected" than the networked protocol intended to talk to unspecified third-party clients.
Are there instructions somewhere for where I can find a newer release?
Presently I do have a fork of main for development, as I do hope to submit my driver back.
I am not quickly sure what to make of this part:
May 14 10:20:04 missionpi upsmon[3952]: Using power down flag file /etc/killpower
May 14 10:20:04 missionpi upsmon[3952]: '/etc/killpower' exists, but we can't read from it: No such file or directory
May 14 10:20:04 missionpi upsmon[3952]: POWERDOWNFLAG (/etc/killpower) does not containthe upsmon magic string - disabling!
It seems like the file exists but at the same time "No such file or directory". Wondering if there is a confusing filesystem object (e.g. a symlink pointing nowhere, so there is a directory entry but indeed nothing to read from) or some bug in 2.7.4 about this?
I am confused by this too. The file is there. I just assumed that something (magic string?) needs to be inside it:
$ ls -al /etc/killpower
-rw-r--r-- 1 root root 0 Sep 28 2023 /etc/killpower
Also not sure about the "disabling" part. It would make sense for upsmon -K (a separate program call checking late in shutdown that the daemonized copy of upsmon saved the killpower file upon an FSD) to ignore an invalid file; however the daemon startup should have at best removed the file and marched on (IIRC).
I haven't tried testing with upsmon -K, though I have been using upsdrvctl -t shutdown and thus the driver command line /lib/nut/upsplus -a pi -k to make sure that the UPS does actually shut down (which it does :-) ):
$ sudo upsdrvctl -t shutdown
Network UPS Tools - UPS driver controller 2.7.4
*** Testing mode: not calling exec/kill
0.000000
If you're not a NUT core developer, chances are that you're told to enable debugging
to see why a driver isn't working for you. We're sorry for the confusion, but this is
the 'upsdrvctl' wrapper, not the driver you're interested in.
Below you'll find one or more lines starting with 'exec:' followed by an absolute
path to the driver binary and some command line option. This is what the driver
starts and you need to copy and paste that line and append the debug flags to that
line (less the 'exec:' prefix).
0.000666 Shutdown UPS: hotwater
0.000785 exec: /lib/nut/usbhid-ups -a hotwater -k
0.000872 Shutdown UPS: pi
0.000929 exec: /lib/nut/upsplus -a pi -k
So I've re-read the code, and the message with "disabling" (and a missing space) is in fact from clear_pdflag() which should have removed it but could not in this case - so the daemon treats the flag as not configured (assumes the filename is occupied by something unrelated to NUT so we should not corrupt/remove that file) and probably leads to your other issues with shutting down.
Ah! It sounds like the file shouldn't be there. So I could try just deleting the killpower file and then see what happens...
Normally, upsmon is started by root and depending on settings it either stays root, or by default splits into two daemons (a bit under root and most of work happening under nut or nobody). The root-owned part is responsible for touch-file creation in case of FSD and calling the SHUTDOWNCMD, and exits (and/or is killed by OS shutdown processing that goes on a killing spree for remaining processes). For systems with late shutdown handling, either with old-style init scripts managing the whole OS life cycle until power-off, or with systemd shutdown hook support, there is a chance to run custom code after that killing spree.
This did magically happen once, but I don't know why and haven't been able to see it happen again. Here is a sample from syslog of the one time it worked:
May 14 09:06:26 missionpi upsmon[34621]: Communications with UPS pi@localhost lost
May 14 09:06:31 missionpi upsmon[34621]: Communications with UPS pi@localhost established
May 14 09:06:31 missionpi upsmon[34621]: UPS pi@localhost battery is low
May 14 09:06:31 missionpi upsd[34541]: Client upsmon@::1 set FSD on UPS [pi]
May 14 09:06:31 missionpi upsmon[34621]: Executing automatic power-fail shutdown
May 14 09:06:31 missionpi upsmon[34621]: Auto logout and shutdown proceeding
May 14 09:06:36 missionpi systemd[1]: nut-monitor.service: Succeeded.
May 14 09:06:36 missionpi systemd[1]: unattended-upgrades.service: Succeeded.
May 14 09:06:36 missionpi systemd[1]: Stopping Session 1 of user pi.
May 14 09:06:36 missionpi systemd[1]: Stopping Session 3 of user pi.
May 14 09:06:36 missionpi systemd[1]: Stopping Session 4 of user pi.
May 14 09:06:36 missionpi systemd[1]: Removed slice system-modprobe.slice.
...
May 14 09:06:36 missionpi systemd[1]: Stopping Make remote CUPS printers available locally...
May 14 09:06:36 missionpi systemd[1]: Stopping dphys-swapfile - set up, mount/unmount, and delete a swap file...
May 14 09:06:36 missionpi systemd[1]: Stopping Getty on tty1...
May 14 09:06:36 missionpi systemd[1]: glamor-test.service: Succeeded.
May 14 09:06:36 missionpi systemd[1]: Stopped Check for glamor.
May 14 09:06:36 missionpi systemd[1]: gldriver-test.service: Succeeded.
May 14 09:06:36 missionpi systemd[1]: Stopped Check for v3d driver.
The nutshutdown script can be used (or adapted in non-Linux OSes) to check if the killpower flag exists, and run the driver program to tell the UPS to cut the power (if it supports such operation).
It does support such an operation. When I run /lib/nut/upsplus -a pi -k, the UPS shutdown timer starts (via upsdrv_shutdown()), and then the UPS shuts down. Though the one time the upsmon-driven shutdown magically happened, the UPS didn't cut power. I need to debug that.
Typically such script then sleeps for an hour and reboots, to work around UPSes that would not power off when the wall power has returned at the wrong time (during the shutdown), so your servers would not stay halted indefinitely.
I'll have to look into this more as I believe my UPS has an issue here.
Thanks for your help! The RPi is shutting down reliably now!
I had two issues:
(a) the /etc/killpower file: I needed to remove it. Not sure how it got there in the first place; and
(b) while reading through the PID issue, I saw a comment suggesting changing the service launch type from Forking to Simple. I tried that out, but it didn't work. When I changed it back to Forking, I mistyped it as Folking.
Fixing these two items fixed it. I guess it happened magically before I messed with (b), and maybe the file in (a) wasn't there.
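A typo such as Folking can be caught before restarting the service. On a systemd host, running systemd-analyze verify on the unit file should report it; the standalone sketch below shows the same idea without needing systemd. The function name check_service_type is hypothetical, and the list of accepted values is taken from the systemd.service documentation, which spells them in lowercase.

```sh
#!/bin/sh
# Hypothetical pre-flight check (not a systemd tool): verify that every
# Type= line in a unit file uses a service type that systemd documents.
check_service_type() {
    rc=0
    for value in $(sed -n 's/^[[:space:]]*Type=[[:space:]]*//p' "$1"); do
        case "$value" in
            simple|exec|forking|oneshot|dbus|notify|notify-reload|idle)
                echo "ok: Type=$value" ;;
            *)
                echo "unknown service type: Type=$value"
                rc=1 ;;
        esac
    done
    return "$rc"
}

# Example: check_service_type /lib/systemd/system/nut-monitor.service
```

The function returns non-zero if any Type= value is not recognised, so it can gate a restart in a script.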
However, I don't see the shutdown sent to the UPS as specified in step 8 of the shutdown flow:
8. init then runs your shutdown script. This checks for the POWERDOWNFLAG, finds it, and tells the UPS driver(s) to power off the load by sending commands to the connected UPS device(s) they manage.
Where can I find what the "commands" are that are sent to the UPS device?
With debug=1 my driver logs each of the values that are retrieved. I would expect to see the Shutdown Timer set to a non-zero value if the driver was told to shut down. (For my driver, 0 means that the timer isn't running.) These are the last entries in syslog before the shutdown:
May 14 15:24:58 missionpi upsplus[4039]: [D1] Battery Charge Level: 73%
May 14 15:24:58 missionpi bluetoothd[858]: Stopping SDP server
May 14 15:24:58 missionpi upsplus[4039]: [D1] INA219 Battery Voltage: 4.000V
May 14 15:24:58 missionpi bluetoothd[858]: Exit
May 14 15:24:58 missionpi upsplus[4039]: [D1] INA219 Battery Power: 12.532W
May 14 15:24:58 missionpi upsplus[4039]: [D1] Battery Voltage High: 4.200V
May 14 15:24:58 missionpi upsplus[4039]: [D1] Battery runtime: 6165s
May 14 15:24:58 missionpi upsd[4158]: mainloop: Interrupted system call
May 14 15:24:58 missionpi upsd[4158]: Signal 15: exiting
May 14 15:24:58 missionpi systemd[1]: rpi-eeprom-update.service: Succeeded.
May 14 15:24:58 missionpi upsplus[4039]: [D1] INA219 Battery Current: -3.152A
May 14 15:24:58 missionpi upsplus[4039]: WARNING: send_to_all: write 33 bytes to socket 6 failed (ret=-1), disconnecting: Broken pipe
May 14 15:24:58 missionpi upsplus[4039]: [D1] INA219 Output Voltage: 4.928V
May 14 15:24:58 missionpi upsplus[4039]: [D1] INA219 Output Power: 7.414W
May 14 15:24:58 missionpi upsplus[4039]: [D1] UPS Load: 32.952%
May 14 15:24:58 missionpi upsplus[4039]: [D1] INA219 Output Current: 1.593A
May 14 15:24:58 missionpi upsplus[4039]: [D1] MicroUSB Input Voltage: 62.465V
May 14 15:24:58 missionpi upsplus[4039]: [D1] Device uptime: 21196s
May 14 15:24:58 missionpi upsplus[4039]: [D1] Shutdown Timer: 0s
May 14 15:24:58 missionpi systemd[1]: Stopped Check for Raspberry Pi EEPROM updates.
May 14 15:24:58 missionpi upsplus[4039]: [D1] Reboot Timer: 0s
May 14 15:24:58 missionpi upsplus[4039]: [D1] Auto restart on external power: yes
May 14 15:24:58 missionpi upsplus[4039]: [D1] Power status: normal
May 14 15:24:58 missionpi upsplus[4039]: [D1] Battery Voltage Low: 3.360V
May 14 15:24:58 missionpi upsplus[4039]: [D1] Battery Voltage High: 4.200V
May 14 15:24:58 missionpi upsplus[4039]: [D1] Battery Status: Low
May 14 15:24:58 missionpi upsplus[4039]: [D1] Battery Status: Resting
May 14 15:24:58 missionpi upsplus[4039]: [D1] Battery Voltage High: 4.200V
For installing the current NUT codebase (e.g. your fork) over packaged versions (release/update cadence is the distro thing, can't help much here) with as much re-use of their build configuration as is deemed reasonable, check this wiki doc: https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests
This should in particular deliver a /usr/lib/systemd/system-shutdown/nutshutdown which systemd should run late during power-off. That's the "step 8" :)
Your /etc/killpower file was maybe touch'ed for manual experiments? I made that mistake some time ago, but don't remember reads of it claiming "No such file". Maybe if the fgets() involved in that magic token parsing encounters an EOF right away (empty file), it claims that there is no file? By permissions it would have been at least readable. :\
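To tell these cases apart on a live system (dangling symlink, empty file, or a file with content), a small diagnostic along these lines may help. It is only a sketch; the function name classify_flag is made up for this example and is not part of NUT.

```sh
#!/bin/sh
# Hypothetical diagnostic (not part of NUT): describe what kind of
# filesystem object sits at the POWERDOWNFLAG path.
classify_flag() {
    path=$1
    if [ -L "$path" ] && [ ! -e "$path" ]; then
        echo "dangling-symlink"   # directory entry exists, target does not
    elif [ ! -e "$path" ]; then
        echo "missing"
    elif [ ! -f "$path" ]; then
        echo "not-a-regular-file"
    elif [ ! -s "$path" ]; then
        echo "empty"              # e.g. created with touch; a read hits EOF at once
    else
        echo "has-content"
    fi
}

# Example: classify_flag /etc/killpower
```

For the 0-byte regular file in the ls output earlier in this thread, this would print "empty".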
Note that it can be prudent to use a non-default location under /run or /dev/shm rather than /etc to avoid hitting the flash chips with this not very important data (it should disappear after a reboot and a new start of upsmon anyway).
For issue "b", besides changing the service type there should also have been a change of the daemon's behavior (running daemons "foregrounded" as far as NUT is concerned, e.g. by enabling debug with -D, which has worked since forever, or with new versions by using an explicit -F option).
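As a sketch of making both changes together, the helper below writes a systemd drop-in that sets Type=simple and starts the daemon in the foreground. Everything specific here is an assumption for illustration: the function name write_foreground_dropin, the /sbin/upsmon path, and the choice of upsmon as the daemon. The packaged unit may also carry other settings (such as a PID file) that need attention.

```sh
#!/bin/sh
# Hypothetical helper: write a systemd drop-in that switches a NUT daemon
# to Type=simple AND keeps the daemon in the foreground, since the two
# changes only work together.
# Assumptions: the binary path /sbin/upsmon and the target directory are
# examples; -F needs a newer NUT, older releases would use -D instead.
write_foreground_dropin() {
    dropin_dir=$1
    foreground_flag=${2:--F}
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/override.conf" <<EOF
[Service]
Type=simple
# An empty ExecStart= clears the packaged command before setting a new one.
ExecStart=
ExecStart=/sbin/upsmon $foreground_flag
EOF
}

# Example (as root), followed by "systemctl daemon-reload":
#   write_foreground_dropin /etc/systemd/system/nut-monitor.service.d -D
```

Passing -D as the second argument selects the older way of staying in the foreground.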
For installing the current NUT codebase (e.g. your fork) over packaged versions (release/update cadence is the distro thing, can't help much here) with as much re-use of their build configuration as is deemed reasonable, check this wiki doc: https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests
This is very useful, thank you. Wish I had found that earlier. I had to reverse engineer the existing NUT binaries to work out what I needed to put on the configure line to build a driver that would work in place. For my platform, and for my driver, I am using:
$ ./configure --with-linux_i2c --with-statepath=/run/nut --sysconfdir=/etc/nut --with-user=nut --with-group=nut --with-pidpath=/run/nut
I created the man page for my driver, but I was never able to get the documentation building. With the link above that specifies the packages, I hope to be able to build everything end to end.
This should in particular deliver a /usr/lib/systemd/system-shutdown/nutshutdown which systemd should run late during power-off. That's the "step 8" 😊
I do have a /etc/init.d/nut-server that actually does have a poweroff step, though when I search syslog I don't see any of the log messages that it should produce.
Your /etc/killpower file maybe was touch'ed for manual experiments? I made that mistake some time ago, but don't remember its reads claiming "No such file". Maybe if fgets() involved in that magic token parsing encounters an EOF right away (empty file), it claims that there is no file? By permissions it would have been at least readable. :\
I could have created it for manual experiments, following an example I found online. Now that I have removed it, I wonder if it is being created correctly. If it wasn't created, that would explain why the RPi is shutting down but not the UPS.
Note that it can be prudent to use a non-default location under /run or /dev/shm rather than /etc to avoid hitting the flash chips with this not very important data (should disappear after reboot and a new start of upsmon anyway).
For issue "b", beside changing the service type there should have been also a change of its behavior (running daemons "foregrounded" as far as NUT is concerned, e.g. by enabling debug with -D since forever, or with new versions - using an explicit -F option).
I did add the -D option when I switched it to Simple. When I switched back, I removed the -D but misspelt Forking.
My next step is to try replacing my NUT deployment with a current version. There seems to be a lot to do to accomplish this. I need to install all the needed components, and what is confusing is, I need to make sense of the DEB scripts and how they are meant to be used.
Feel free to close this issue now. If, and when, I get the time to work through replacing my NUT deployment and I run into issues, I will file a new issue.
Thanks for your help.
For Debian packaging, there's generally debuild (IIRC) to run and wrap the operation. It has been a while since I touched that, but there is a backlogged issue to pick up "reference" packaging scripts from the 42ITy NUT fork (FTY branch here) and clarify what to do with them, primarily to help users roll their own non-distro packages and to have a sort of file-based dialog suggesting what distros can do with NUT integration (and letting them feed back their ideas via PRs to us).
I do have a /etc/init.d/nut-server that actually does have a poweroff step, though when I search syslog I don't see any of the log messages that it should produce.
The posted implementation seems to take care of telling the UPS to power off, sleeping and rebooting the server (if still powered); however it does not seem to ask if the FSD flag was raised in the first place. Also, somebody (who probably checks upsmon -K for the flag) should call this script with this argument during your shutdown, and there should be some way for this sleep before a reboot to be exempt from the killing spree for all remaining processes (if your OS does that) - e.g. the "systemd-shutdown" hooks achieve just that.
Also not sure if invoke-rc.d is right for an OS with systemd, although it can be correct for other frameworks (upstart IIRC?) and might be a portability alias in systemd itself? :-\
FWIW I was able to get to the newer version of NUT by moving from Debian 11 bullseye to Debian 12 bookworm. Now:
$ sudo apt show nut
Package: nut
Version: 2.8.0-7
...
Hi All, I have been having great fun writing a full NUT driver for a UPSPlus/EP-0136. This device is a HAT for Raspberry Pi, and communicates via i2c. My fork is here. I will see if I can submit it at some point.
The purpose of filing this issue, though, is that I cannot work out how to get upsmon to shut down the RPi reliably when it hits LB (low battery).
I followed the NUT information/instructions 6.3. Configuring automatic shutdowns for low battery events.
And I searched for some of the errors I am seeing and found this thread on upsmon PID issues, though I am unable to work out if this is (a) the problem or (b) how to work around it. I do sometimes see the upsmon.pid file appear, and it is owned by root:root, which is different to all the others (see below). But it doesn't appear all the time. Right now, it doesn't seem to want to become 'active'.
nut.conf:
upsd.conf (nothing configured, everything default)
ups.conf
upsd.users
upsmon.conf
nut-server status:
nut-driver status:
nut-monitor status:
PID files:
upsc pi: