networkupstools / nut


NUT service randomly stops connecting to UPS despite service active, restarting service workaround (Cyberpower CP1500PFCLCD) #2667

Open chocmake opened 3 weeks ago

chocmake commented 3 weeks ago

Twice in the last couple of months I noticed that the NUT server wasn't returning info about the UPS when queried. Each time it was resolved by either restarting the nut-server service or rebooting the server. Below is the most recent occurrence.

Note that this may have occurred months earlier too, but since nothing happened (no power outages, and nothing set up to push logs to me) I wouldn't have noticed any prior events.

After SSH'ing into the server I checked the nut-server and nut-monitor logs, and they were reporting that the connection to the UPS was 'unavailable' / timing out. Both services reported active status though.

journalctl -u nut-server -n 20 -f output (these messages repeat, and the latest are from a week ago for some reason):

Oct 22 03:48:45 pi upsd[467]: Connected to UPS [cyberpower]: usbhid-ups-cyberpower
Oct 22 03:49:11 pi upsd[467]: Data for UPS [cyberpower] is stale - check driver
Oct 22 04:35:04 pi upsd[467]: Send ping to UPS [cyberpower] failed: Resource temporarily unavailable

journalctl -u nut-monitor -n 20 -f output (this message just repeats):

Oct 30 14:45:59 pi upsmon[478]: UPS cyberpower@localhost is unavailable
Oct 30 14:48:14 pi upsmon[478]: UPS [cyberpower@localhost]: connect failed: Connection failure: Connection timed out
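
Since both services report active while the data is stale, the systemd status alone does not reveal this condition. Below is a minimal sketch of a direct check against upsd, assuming it listens on the default port 3493 on localhost and that the UPS is named `cyberpower` as in the logs. It relies on the `GET VAR` request and the `ERR DATA-STALE` / `ERR DRIVER-NOT-CONNECTED` replies of the NUT network protocol (RFC 9271). The function names `classify_reply` and `query_status` are hypothetical, chosen for illustration only.

```python
import socket


def classify_reply(line: str) -> str:
    """Map one upsd reply line to 'ok', 'stale', 'driver-down' or 'error'."""
    line = line.strip()
    if line.startswith("VAR "):
        return "ok"
    if line == "ERR DATA-STALE":
        return "stale"
    if line == "ERR DRIVER-NOT-CONNECTED":
        return "driver-down"
    # Anything else (other ERR codes, empty reply) is treated as an error.
    return "error"


def query_status(ups="cyberpower", host="127.0.0.1", port=3493, timeout=5.0):
    """Ask upsd for ups.status and return (state, raw_reply).

    The UPS name, host and port are assumptions matching this report.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(f"GET VAR {ups} ups.status\n".encode("ascii"))
            reader = sock.makefile("r", encoding="utf-8", errors="replace")
            reply = reader.readline()
    except OSError as exc:  # covers timeouts and refused connections
        return "unreachable", str(exc)
    return classify_reply(reply), reply.strip()
```

Calling `query_status()` on the Pi would return a tuple whose first element is one of the states above, which a monitoring script could act on instead of relying on the service's active status.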

When I tried to restart nut-server via sudo systemctl restart nut-server it failed:

Job for nut-server.service failed because a timeout was exceeded.
See "systemctl status nut-server.service" and "journalctl -xe" for details.  

systemctl status nut-server.service output:

● nut-server.service - Network UPS Tools - power devices information server
     Loaded: loaded (/lib/systemd/system/nut-server.service; enabled; vendor preset: enabled)
     Active: failed (Result: timeout) since <date>; 1min 3s ago
    Process: 115904 ExecStart=/sbin/upsd (code=exited, status=0/SUCCESS)
        CPU: 49ms

Oct 30 15:22:15 pi upsd[115904]: fopen /run/nut/upsd.pid: No such file or directory
Oct 30 15:22:15 pi upsd[115904]: listening on 0.0.0.0 port 3493
Oct 30 15:22:15 pi upsd[115904]: listening on 0.0.0.0 port 3493
Oct 30 15:23:45 pi systemd[1]: nut-server.service: start operation timed out. Terminating.
Oct 30 15:23:45 pi upsd[115904]: Can't connect to UPS [cyberpower] (usbhid-ups-cyberpower): Interrupted system call
Oct 30 15:23:45 pi upsd[115904]: Can't connect to UPS [cyberpower] (usbhid-ups-cyberpower): Interrupted system call
Oct 30 15:23:45 pi upsd[115918]: Startup successful
Oct 30 15:23:45 pi upsd[115918]: Signal 15: exiting
Oct 30 15:23:45 pi systemd[1]: nut-server.service: Failed with result 'timeout'.
Oct 30 15:23:45 pi systemd[1]: Failed to start Network UPS Tools - power devices information server.

However, immediately upon starting the service using sudo systemctl start nut-server the connection to the UPS was logged as being established again (both in the server and client terminal SSH sessions) and everything went back to normal.


Environment:

It should be noted that nothing else is running on the Pi server besides NUT. The server also uses a wired connection (Wi-Fi is disabled).


/etc/nut/ups.conf

maxretry = 3
pollinterval = 2

[cyberpower]
    desc = "Cyberpower CP1500PFCLCD"
    driver = "usbhid-ups"
    port = "auto"
    vendorid = "0764"
    productid = "0501"
    product = "CRJB103.551"
    serial = "CPS"
    vendor = "CP1500EPFCLCD"
    bus = "001"
    offdelay = 120
    ondelay = 0

desertwitch commented 3 weeks ago

NUT 2.7.4 is a couple of years old now, do you have any chance to try a newer version? There have been numerous improvements in all areas since then, so it might be worth a shot. See here on how to INSTALL from source: https://github.com/networkupstools/nut/blob/master/INSTALL.nut.adoc

chocmake commented 3 weeks ago

> NUT 2.7.4 is a couple of years old now, do you have any chance to try a newer version?

On Bullseye, 2.7.4-13 is the latest version available per apt-cache. It looks like Bookworm's latest is v2.8.0-7 (though I have read about various issues with Bookworm on Zero 2 W models).

Compiling from source looks a bit more involved, though if that's the only way around this I may have to try it sometime.

I suppose a workaround would be scheduling some script to periodically read the service's logs and trigger a service restart? Though I ran into the timeout issues with that above, so I'd also have to implement fallbacks.
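
A rough sketch of what such a script might look like, under several assumptions: it would run as root (for example from cron or a systemd timer), it only inspects the nut-server journal, and it treats the upsd messages quoted earlier in this thread as the trouble markers. The fallback mirrors what happened above, where `systemctl restart` timed out but a following `systemctl start` succeeded. All function names are hypothetical.

```python
import subprocess

# Substrings taken from the upsd log lines quoted in this report.
TROUBLE_MARKERS = ("is stale - check driver", "Send ping to UPS")
RECOVERY_MARKER = "Connected to UPS"


def looks_unhealthy(log_lines):
    """Return True if the newest relevant log line reports trouble."""
    for line in reversed(list(log_lines)):
        if RECOVERY_MARKER in line:
            return False
        if any(marker in line for marker in TROUBLE_MARKERS):
            return True
    return False  # no relevant lines found


def recent_log_lines(unit="nut-server", count=20):
    """Fetch the last few journal lines for a unit (message text only)."""
    result = subprocess.run(
        ["journalctl", "-u", unit, "-n", str(count), "--no-pager", "-o", "cat"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.splitlines()


def restart_with_fallback(unit="nut-server", run=subprocess.run):
    """Try 'systemctl restart'; if it fails, fall back to 'systemctl start'.

    Returns the action that succeeded, or None if both failed.
    """
    for action in ("restart", "start"):
        if run(["systemctl", action, unit], check=False).returncode == 0:
            return action
    return None


def check_once():
    """One watchdog pass: restart the service only if the log looks bad."""
    if looks_unhealthy(recent_log_lines()):
        return restart_with_fallback()
    return "healthy"
```

The `run` parameter exists only so the restart logic can be exercised without touching systemd. In real use `check_once()` would be the entry point.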

jimklimov commented 3 weeks ago

Well, 2.7.4 is actually close to 8.5 years old now (Mar 2016).

As for building, check also https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests - it lists dependencies/tools as well as the methodology; current recipes have a good chance to inherit build settings detected from the older installation to become a sort of in-place ad-hoc replacement.

With CPS, it may also help to increase the polling rate - their controllers apparently go into power-saving or something, if poked only every half a minute (default).
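
For context, the half-minute default mentioned here matches the documented default of the usbhid-ups `pollfreq` option (30 seconds, the interval for full updates). That option is separate from `pollinterval` (default 2 seconds, the interval for quick updates). Assuming that reading, the sketch below shows which values would be in effect for a given ups.conf. The parser is a simplified, hypothetical helper rather than NUT's own, and it ignores edge cases such as `#` inside quoted values.

```python
def parse_ups_conf(text):
    """Parse ups.conf-style text into {'': global options, name: options}."""
    sections = {"": {}}
    current = ""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (simplified)
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()
            sections.setdefault(current, {})
        elif "=" in line:
            key, value = line.split("=", 1)
            sections[current][key.strip()] = value.strip().strip('"')
        else:
            sections[current][line] = ""  # bare flag without a value
    return sections


def effective_polling(conf, ups):
    """Return the polling settings for one UPS, using documented defaults."""
    section = conf.get(ups, {})
    pollinterval = section.get("pollinterval", conf[""].get("pollinterval", 2))
    pollfreq = section.get("pollfreq", 30)  # usbhid-ups default, in seconds
    return {"pollinterval": int(pollinterval), "pollfreq": int(pollfreq)}
```

Applied to the ups.conf posted above, this would report `pollinterval` as 2 and `pollfreq` as 30, since only the former is set there. Adding a line such as `pollfreq = 10` to the `[cyberpower]` section would be one way to try more frequent full polls.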

chocmake commented 3 weeks ago

Thanks. My pollinterval is already rather low (2 seconds), so I upped my maxretry to 5 per the Arch wiki suggestion. I've also implemented the usb_resetter suggestion and I'll see how it fares over the coming month.

Otherwise I'll probably have to try the source route.