RobFisher opened 1 year ago
Well, given that libusb reports it cannot retrieve device data anymore, I suppose there is not much that NUT can do directly here. Perhaps the power surge or relay flips cause the UPS controller to be momentarily disconnected from the bus, or to reboot somehow.
One solution might be to change the driver configuration to `pollonly`, which would not rely on the interrupts and would instead regularly query the device. With some HW/OS combinations it is the only way that works; in others it is more computationally expensive and so avoided by default. Note we have #1624 logged, suggesting that `pollonly` might be less reliable at detecting outages (a claim still to be investigated as of now).
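For reference, enabling that mode is a one-line change in the driver's section of `ups.conf`. A minimal sketch, assuming a section named `eaton` (substitute your own section name and settings):

```ini
[eaton]
	driver = usbhid-ups
	port = auto
	# Flag (no value): skip USB Interrupt In transfers and poll the
	# HID reports on a timer instead; works around controllers whose
	# interrupt endpoint stalls or resets under load.
	pollonly
```

Restart the driver after editing for the flag to take effect.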
Another solution to try is building, or finding packaged, NUT v2.8.0 or newer (custom builds from GitHub) with libusb-1.0 support, and checking whether it handles those situations better.
Thanks, @jimklimov , all that makes sense.
Since it was something easy to try, I hooked the UPS to a nearby box running Ubuntu 20.04.5 LTS with NUT version 2.7.4-11ubuntu4. Obviously I am changing a lot of things at once here!
I still see the interrupt timeout but this time the driver seems to soldier on regardless; there is no input/output error.
The NUT version (2.7.4) is essentially the same; perhaps there is some FreeBSD vs. Linux difference at play here.
I'll see if I can get a newer NUT version for pfSense; in the meantime I might just use this box as the primary since it works better and probably doesn't make much practical difference in my application.
89.060764 upsdrv_updateinfo...
89.310939 libusb_get_interrupt: Connection timed out
89.310965 Got 0 HID objects...
89.310976 Quick update...
89.310984 Entering libusb_get_report
89.356853 Report[get]: (5 bytes) => 0a ff ff ff ff
89.356882 Path: UPS.PowerSummary.DelayBeforeStartup, Type: Feature, ReportID: 0x0a, Offset: 0, Size: 32, Value: 0
89.356894 Entering libusb_get_report
89.403425 Report[get]: (5 bytes) => 09 ff ff ff ff
89.403454 Path: UPS.PowerSummary.DelayBeforeShutdown, Type: Feature, ReportID: 0x09, Offset: 0, Size: 32, Value: 0
89.403471 Report[buf]: (4 bytes) => 01 30 00 00
89.403493 Path: UPS.PowerSummary.PresentStatus.ACPresent, Type: Feature, ReportID: 0x01, Offset: 0, Size: 1, Value: 0
89.403513 Report[buf]: (4 bytes) => 01 30 00 00
89.403529 Path: UPS.PowerSummary.PresentStatus.Discharging, Type: Feature, ReportID: 0x01, Offset: 4, Size: 1, Value: 1
89.403542 Report[buf]: (4 bytes) => 01 30 00 00
89.403551 Path: UPS.PowerSummary.PresentStatus.Charging, Type: Feature, ReportID: 0x01, Offset: 2, Size: 1, Value: 0
89.403559 Report[buf]: (4 bytes) => 01 30 00 00
89.403571 Path: UPS.PowerSummary.PresentStatus.BelowRemainingCapacityLimit, Type: Feature, ReportID: 0x01, Offset: 1, Size: 1, Value: 0
Just posting a comment here to follow and hopefully contribute. I am running pfSense Plus 23.01 on Netgate hardware (3100) with the latest NUT package (2.8.0_02) installed and an Eaton Ellipse ECO 650 UPS connected by serial. I was previously running the NUT 2.6.5 package on the previous pfSense release with no issues. With the latest versions (upgraded last weekend), after about 24 hours of operation the NUT package switches status to "Unable to retrieve UPS info" and emails me an error every 5 minutes. Simply restarting the UPS service on pfSense resolves the issue. The pfSense logs show lots of the following:
Mar 4 07:45:44 | upsmon | 76041 | UPS EatonUPS is unavailable
Mar 4 07:45:44 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:45:39 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:45:34 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:45:29 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:45:24 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:45:19 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:45:14 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:45:09 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:45:04 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:44:59 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:44:54 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:44:49 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:44:44 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:44:39 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:44:34 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:44:29 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:44:29 | upsd | 77559 | Can't connect to UPS [EatonUPS] (usbhid-ups-EatonUPS): Connection refused
Mar 4 07:44:24 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:44:19 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
When I logged in and restarted the UPS service the logs showed:
Mar 4 07:55:01 | php-cgi | 85002 | notify_monitor.php: Message sent to david@irwin-family.net OK
Mar 4 07:54:56 | upsmon | 83148 | Communications with UPS EatonUPS established
Mar 4 07:54:54 | upsd | 85051 | Connected to UPS [EatonUPS]: usbhid-ups-EatonUPS
Mar 4 07:54:52 | usbhid-ups | 93411 | Startup successful
Mar 4 07:54:51 | upsmon | 83148 | UPS EatonUPS is unavailable
Mar 4 07:54:51 | upsmon | 83148 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:54:51 | upsd | 85051 | User local-monitor@::1 logged into UPS [EatonUPS]
Mar 4 07:54:49 | php-cgi | 85002 | notify_monitor.php: Message sent to {EmailAddress} OK
Mar 4 07:54:47 | upsd | 85051 | Startup successful
Mar 4 07:54:47 | upsd | 85051 | Can't connect to UPS [EatonUPS] (usbhid-ups-EatonUPS): Connection refused
Mar 4 07:54:47 | upsd | 85051 | not listening on 127.0.0.1 port 3493
Mar 4 07:54:47 | upsd | 85051 | listening on ::1 port 3493
Mar 4 07:54:47 | upsd | 85051 | listening on 127.0.0.1 port 3493
Mar 4 07:54:47 | upsd | 85051 | not listening on {localIP} port 3493
Mar 4 07:54:47 | upsd | 85051 | listening on pfsense.irwazu.co.uk port 3493
Mar 4 07:54:46 | upsmon | 83148 | Communications with UPS EatonUPS lost
Mar 4 07:54:46 | upsmon | 83148 | UPS [EatonUPS]: connect failed: Connection failure: Connection refused
Mar 4 07:54:46 | upsmon | 83148 | Startup successful
Mar 4 07:54:46 | upsd | 77559 | Signal 15: exiting
Mar 4 07:54:46 | upsd | 77559 | mainloop: Interrupted system call
Mar 4 07:54:46 | upsd | 77559 | User local-monitor@::1 logged out from UPS [EatonUPS]
Mar 4 07:54:46 | upsmon | 76041 | Signal 15: exiting
Mar 4 07:54:43 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:54:38 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:54:33 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Mar 4 07:54:29 | upsd | 77559 | Can't connect to UPS [EatonUPS] (usbhid-ups-EatonUPS): Connection refused
Mar 4 07:54:28 | upsmon | 76041 | Poll UPS [EatonUPS] failed - Driver not connected
Is there any way I can force upsmon to restart the driver/service when this failure happens as a workaround?
@FirstRulez: Is your ECO connected by serial, or by USB (the log mentions usbhid-ups)?
As for restarting, the drivers should reconnect on their own (2.8.0 and the later master branch revised some of that logic). It is possible a driver would stall (e.g. in some uninterruptible I/O), though from your logs it looks more like it crashed. Or maybe it did stall; either way, it does not regularly ping-pong with the data server (upsd).
Restarting via upsmon may be possible - e.g. by integrating upssched and having an action handle a long outage of the driver. Alternately, on a home NAS (illumos-based) I had a crontab solution years ago, where it needed to recycle the USB link as if the cable had been replugged - maybe the general idea can help your case: https://github.com/networkupstools/nut/blob/master/scripts/Solaris/reset-ups-usb-solaris.sh.sample
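A minimal sketch of the upssched route, assuming the UPS section is named `EatonUPS` and the usual pfSense/FreeBSD NUT paths (both are assumptions - adjust to your install): route notifications through upssched in `upsmon.conf`, start a timer on `COMMBAD`/`NOCOMM` and cancel it on `COMMOK` in `upssched.conf`, and let the `CMDSCRIPT` restart the driver when the timer fires.

```ini
# upsmon.conf: hand notification events to upssched
NOTIFYCMD /usr/local/sbin/upssched
NOTIFYFLAG COMMBAD SYSLOG+EXEC
NOTIFYFLAG NOCOMM  SYSLOG+EXEC
NOTIFYFLAG COMMOK  SYSLOG+EXEC

# upssched.conf: fire a "driver-stale" timer if comms stay lost for 5 minutes
CMDSCRIPT /usr/local/bin/upssched-cmd
PIPEFN /var/run/nut/upssched.pipe
LOCKFN /var/run/nut/upssched.lock
AT COMMBAD EatonUPS START-TIMER driver-stale 300
AT NOCOMM  EatonUPS START-TIMER driver-stale 300
AT COMMOK  EatonUPS CANCEL-TIMER driver-stale
```

The `CMDSCRIPT` then receives the timer name as its first argument; a sketch (the script path and UPS name are the same assumptions as above):

```sh
#!/bin/sh
# /usr/local/bin/upssched-cmd - invoked by upssched, $1 is the timer name
case "$1" in
    driver-stale)
        logger -t upssched-cmd "driver unresponsive for 5 min, restarting it"
        /usr/local/sbin/upsdrvctl stop EatonUPS
        /usr/local/sbin/upsdrvctl start EatonUPS
        ;;
    *)
        logger -t upssched-cmd "unhandled event: $1"
        ;;
esac
```

Note this bounces only the driver; if the pfSense GUI package wraps the whole NUT service, restarting that service instead may fit better with its service management.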
Otherwise, the proper solution is to drill down to the root cause - e.g. add debugging to the driver (with 2.8.0+ this can be a `debug_min 6` option in its `ups.conf` section) and, when it fails, look at more system context around that timestamp: Were there reconnects in dmesg? Could the USB chip fall asleep to power-save? etc.
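For the record, that option looks like this (the section name `EatonUPS` here is just a placeholder for your own):

```ini
[EatonUPS]
	driver = usbhid-ups
	port = auto
	# NUT 2.8.0+: keep at least this debug verbosity even when the
	# driver daemonizes, so the failure moment is captured in the log.
	debug_min = 6
```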
Also check which NUT revision is actually packaged as 2.8.0_2 - is it recent, or a year old? - and try to build/run the newest git code: does the pkg already have the enhanced reconnection code from the master branch? Maybe the issue is already solved on the NUT side.
Sorry for the slow reply - been away with work.
> Is your ECO connected by serial? or USB (log mentions usbhid-ups)?

USB.
I've been doing some more searching and came across these posts on the Netgate forum, which seem to point to a few possible issues. Whilst I have an Eaton UPS, the symptoms are very similar to those described for the CyberPower units. It is unlikely to be the APC issue, as I am already running NUT as root.
https://forum.netgate.com/topic/102959/nut-package/1072?_=1678400957085
I have not used the "interruptonly" workaround, as it means my UPS does not report battery status. I am still trying to capture a log of when the driver fails - that needs me to spend more time in front of the computer monitoring it, rather than noticing it has been in a failed state for 3 days...
I have the same issue as #973, but I am using the usbhid-ups driver with an Eaton Ellipse ECO 1600.
When the UPS goes to battery, either communication is lost with the UPS straight away, or else the battery notification is received and shortly afterwards communication is lost with the UPS. In the latter case, the system shuts down.
I am running the NUT package for pfSense, nut-2.7.4_20.
Debug output below. I've checked that nothing is added to the dmesg log when this happens; if the USB connection were really failing, I would expect to see something there. In fact, if I physically pull the USB cable, I do see disconnection and re-connection messages in the dmesg log.
When everything is running OK: `/usr/local/libexec/nut/usbhid-ups -DDDD -u root -a rack1`
Then when power is cut:
Then when power is restored: