networkupstools / nut

The Network UPS Tools repository. UPS management protocol Informational RFC 9271 published by IETF at https://www.rfc-editor.org/info/rfc9271 Please star NUT on GitHub, this helps with sponsorships!
https://networkupstools.org/
Other
1.99k stars 349 forks source link

APC Smart UPS 1000 SUA communication problem #1886

Open VoiletDragon opened 1 year ago

VoiletDragon commented 1 year ago

Hi folks,

I've setup NUT Server on a RPI4 and connected two APC 1000 units via USB to monitor and to automate Shutdown on a few Servers I run in my DC however I have a problem, at first it works pull Power to either UPSs and console updates with UPS On Battery but issue is its slow on recognising that the UPS switches to battery its a bit hit and miss sometimes it takes 5 seconds or 20 seconds just seems very delayed. Now another problem, after NUT has been running a few days it just stops working, console never sees that either UPS is on battery but the strange thing is that the Web Gui updates, if I disconnect either UPS it detects that on both server and client side never detects that the UPS is on battery after a few days of being left on.

Here's what I've tried, different Raspberry Pi, different USB cables, different distros, tried switching from Ubuntu 22.04 to 18.04 also tried Debian. I've tried pollonly flag as well as polintervals but same problem.

I would like to get this working as I need to automate safe shutdowns on my Infrastructure, I think NUT is good it's just needing it to be reliable.

If anyone has any ideas then let me know as I am basically banging my head on the desk trying to figure it out.

Regards

jimklimov commented 1 year ago

Just in case: are your experiments using NUT as packaged by this or that OS release (so probably variants of old NUT v2.7.4), or a more recent one (2.8.0 or custom build of master branch)?

Worth checking if the problem is not with RPi itself (over past couple of years we had many reports about issues with them, search by the issue tag for more such questions, but maybe that just reflects their popularity and not flaws of the platform). E.g. can you query the system (or see in dmesg) if it power-saved the USB port, etc. over time?

However given that WebUI (and I suppose CLI upsc) report the status correctly, the problem is not with HW and NUT drivers, but rather about communications between the data server upsd and the client upsmon. Maybe some firewall gets restarted nightly and breaks the TCP connection, or some tmp-file cleanup removes stuff (should not matter though... at worst a PID file?) With NUT v2.8.0 or newer you can use debug_min setting in daemon config files to bump their verbosity, and when they get into the state of not updating the ups.status for upsmon you would have more troubleshooting hints (e.g. lost connection between the two - though that should also emit logged warnings I think). For 2.7.4 you would have to modify init/systemd scripts to include -D arguments, and possibly adapt the systemd unit Type to a non-forking behavior while the option is used.

As for some seconds of delay, there was a discussion just recently that this can be throttled by hardware (UPS controller is not eager or quick enough to report issues ASAP, especially if hoping that the power blip is irrelevantly short), beside the NUT driver polling frequency (if not using interrupts) and "upsmon polling of upsd" frequency. Thinking of it, can't say OTOH (gotta look in code) whether upsd has a way to quickly notify clients like upsmon as soon as it knows of a relevant event.

VoiletDragon commented 1 year ago

Just in case: are your experiments using NUT as packaged by this or that OS release (so probably variants of old NUT v2.7.4), or a more recent one (2.8.0 or custom build of master branch)?

Worth checking if the problem is not with RPi itself (over past couple of years we had many reports about issues with them, search by the issue tag for more such questions, but maybe that just reflects their popularity and not flaws of the platform). E.g. can you query the system (or see in dmesg) if it power-saved the USB port, etc. over time?

However given that WebUI (and I suppose CLI upsc) report the status correctly, the problem is not with HW and NUT drivers, but rather about communications between the data server upsd and the client upsmon. Maybe some firewall gets restarted nightly and breaks the TCP connection, or some tmp-file cleanup removes stuff (should not matter though... at worst a PID file?) With NUT v2.8.0 or newer you can use debug_min setting in daemon config files to bump their verbosity, and when they get into the state of not updating the ups.status for upsmon you would have more troubleshooting hints (e.g. lost connection between the two - though that should also emit logged warnings I think). For 2.7.4 you would have to modify init/systemd scripts to include -D arguments, and possibly adapt the systemd unit Type to a non-forking behavior while the option is used.

As for some seconds of delay, there was a discussion just recently that this can be throttled by hardware (UPS controller is not eager or quick enough to report issues ASAP, especially if hoping that the power blip is irrelevantly short), beside the NUT driver polling frequency (if not using interrupts) and "upsmon polling of upsd" frequency. Thinking of it, can't say OTOH (gotta look in code) whether upsd has a way to quickly notify clients like upsmon as soon as it knows of a relevant event.

Hi,

Thanks for your reply, Nut version installed is 2.7.4 from the Debian Repository, Should i try and build from source then?

Here is the configuration

  GNU nano 5.4                                                     ups.conf

pollinterval = 1
maxretry = 3

[vdn-ups01]
    driver = usbhid-ups
    port = auto
    desc = "VDN-UPS01 APC Smart-UPS 1000VA RM"
    vendorid = 051d
    productid = 0002
    serial = AS0613

[vdn-ups02]
    driver = usbhid-ups
    port = auto
    desc = "VDN-UPS02 APC Smart-UPS 1000VA RM"
    vendorid = 051d
    productid = 0002
    serial = AS0613
MONITOR vdn-ups01@localhost 1 admin se master
MONITOR vdn-ups02@localhost 1 admin se master

MINSUPPLIES 1
POLLFREQ 5
POLLFREQALERT 2

NOTIFYCMD "/etc/nut/notifycmd.sh"
# Notification events
NOTIFYFLAG ONLINE     SYSLOG+WALL+EXEC
NOTIFYFLAG ONBATT     SYSLOG+WALL+EXEC
NOTIFYFLAG LOWBATT    SYSLOG+WALL+EXEC
NOTIFYFLAG FSD        SYSLOG+WALL+EXEC
NOTIFYFLAG COMMOK     SYSLOG+WALL+EXEC
NOTIFYFLAG COMMBAD    SYSLOG+WALL+EXEC
NOTIFYFLAG SHUTDOWN   SYSLOG+WALL+EXEC
NOTIFYFLAG REPLBATT   SYSLOG+WALL+EXEC
NOTIFYFLAG NOCOMM     SYSLOG+WALL+EXEC
NOTIFYFLAG NOPARENT   SYSLOG+WALL+EXEC

Regards

jimklimov commented 1 year ago

Wondering if serial numbers on the two devices are really identical (AS0613) or you just obfuscated the config?

If they are indeed the same (happens to several vendors) then a master-branch build can help in that matching by additional bus and/or device entries should be functional (had bugs fixed after 2.8.0 release). It can also help generally as a newer and much revised codebase, with no build warnings (so some might be hidden errors) and friendlier for debugging, troubleshooting and maintenance, and with proper support for libusb-1.0.

In particular for APC devices there was a debacle with their encoding of some values in USB protocol (#1024 et al for context) that was addressed in 2.8.0 and later too. Curious if that is relevant here...

VoiletDragon commented 1 year ago

Wondering if serial numbers on the two devices are really identical (AS0613) or you just obfuscated the config?

If they are indeed the same (happens to several vendors) then a master-branch build can help in that matching by additional bus and/or device entries should be functional (had bugs fixed after 2.8.0 release). It can also help generally as a newer and much revised codebase, with no build warnings (so some might be hidden errors) and friendlier for debugging, troubleshooting and maintenance, and with proper support for libusb-1.0.

In particular for APC devices there was a debacle with their encoding of some values in USB protocol (#1024 et al for context) that was addressed in 2.8.0 and later too. Curious if that is relevant here...

Just rebuilt the Pi, using from Source it's worst now, takes 30 seconds to detect the UPS is on battery. I am out of ideas with this, tried everything and can't seem to fix the problem.

VoiletDragon commented 1 year ago

Update, just tested SNMP on a 9617 Management Card same problem, On Battery is delayed after a re-build it's taking around 30 seconds. This is not good 20% of Battery has drained by this time, When AC is restored it doesn't even detect that for about 30 seconds. I am out of ideas on this.

VoiletDragon commented 1 year ago

Update, Just tried on a Desktop Computer and it is the same, takes around 30 seconds to detect that it's on battery / AC

jimklimov commented 1 year ago

Puzzled here... with the SNMP card however, you likely have a web interface? Does the card report change of status more quickly? If you manually snmpwalk the OID for UPS status, does it reflect online/onbatt instantly or also with a long delay (e.g. is the problem in UPS or in NUT)?

Still, 30 seconds sounds much like default polling frequency. You can make that tick faster - at a stress to both monitoring station and to the NMC (ones I had 20 years ago tended to stall and reboot whenever anything nontrivial hit them).

VoiletDragon commented 1 year ago

Puzzled here... with the SNMP card however, you likely have a web interface? Does the card report change of status more quickly? If you manually snmpwalk the OID for UPS status, does it reflect online/onbatt instantly or also with a long delay (e.g. is the problem in UPS or in NUT)?

Still, 30 seconds sounds much like default polling frequency. You can make that tick faster - at a stress to both monitoring station and to the NMC (ones I had 20 years ago tended to stall and reboot whenever anything nontrivial hit them).

The APC9617 update is instant on the Web Gui but no on NUT itself, How would i change the default polling frequency over USB or SNMP ?

Regards

jimklimov commented 1 year ago

"When nothing helps, RTFM!" heh? ;)

pollfreq should be it for most drivers, between "full updates":

There is also pollinterval for "quick updates" (I think ups.status should be in that selection).

Try running the drivers with higher debug level for a while (simplest by specifying debug_min in ups.conf if using a 2.8.0+ codebase) to see how often they really update stuff and when they see the status change.

On client (upsmon) side, there's also a POLLFREQ of its own - see https://github.com/networkupstools/nut/blob/master/docs/man/upsmon.conf.txt#L265

VoiletDragon commented 1 year ago

Ive noticed on both of my UPS's when running upsc that driver.parameter.pollfreq is set to 30 yet in upsmon.conf pollfreq is set to 5 and pollfreqalert is set to 2 is this normal? it's either that or the settings i am change is not sticking, i've tried changing pollfreq and pollfreqalert in upsmon.conf but not making any difference.

Regards

VoiletDragon commented 1 year ago

You can see here the delay, this is from console over SSH. Broadcast message from root@vdn-ups-mgmt (somewhere) (Fri Mar 31 22:55:44 2023)

UPS ups01@localhost on battery

Broadcast message from root@vdn-ups-mgmt (somewhere) (Fri Mar 31 22:56:04 2023)

UPS ups01@localhost on line power

Broadcast message from root@vdn-ups-mgmt (somewhere) (Fri Mar 31 22:56:34 2023)

UPS ups01@localhost on battery

Broadcast message from root@vdn-ups-mgmt (somewhere) (Fri Mar 31 22:56:54 2023)

UPS ups01@localhost on line power

jimklimov commented 1 year ago

Drivers are configured in ups.conf; client in upsmon.conf, server in upsd.conf, etc.

VoiletDragon commented 1 year ago

Drivers are configured in ups.conf; client in upsmon.conf, server in upsd.conf, etc.

Okay, so how come isn't driver.parameter.pollfreq is set to 30 changing when I set it to 5?

jimklimov commented 1 year ago

i've tried changing pollfreq and pollfreqalert in upsmon.conf but not making any difference.

Sorry, maybe it's too late in the night here but I don't see a statement that you've tried setting pollfreq in ups.conf (section for the device) specifically. And it goes without saying, but just in case - you would need to restart the driver ;)

Also, did you have a chance to test with snmpwalk how soon after the physical state change does the SNMP card reflect it - is there any substantial delay in NMC itself?

Thinking of it, upsmon wall broadcasts may be throttled (at least the data-stale warnings, and maybe on-battery ones, are by default repeated only every 300 sec = 5 min). State changes would probably be reported on the next loop though.

VoiletDragon commented 1 year ago

i've tried changing pollfreq and pollfreqalert in upsmon.conf but not making any difference.

Sorry, maybe it's too late in the night here but I don't see a statement that you've tried setting pollfreq in ups.conf (section for the device) specifically. And it goes without saying, but just in case - you would need to restart the driver ;)

Also, did you have a chance to test with snmpwalk how soon after the physical state change does the SNMP card reflect it - is there any substantial delay in NMC itself?

Thinking of it, upsmon wall broadcasts may be throttled (at least the data-stale warnings, and maybe on-battery ones, are by default repeated only every 300 sec = 5 min). State changes would probably be reported on the next loop though.

i've not tested since with SNMP, I switched to USB, not sure if this is correct or not but i've added it to ups.conf

pollinterval = 1 maxretry = 3 pollfreq = 5

[ups01] driver = usbhid-ups port = auto desc = "VDN-UPS01 APC Smart-UPS 1000VA RM" vendorid = 051d productid = 0002 serial = "AS06132

VoiletDragon commented 1 year ago

i've tried changing pollfreq and pollfreqalert in upsmon.conf but not making any difference.

Sorry, maybe it's too late in the night here but I don't see a statement that you've tried setting pollfreq in ups.conf (section for the device) specifically. And it goes without saying, but just in case - you would need to restart the driver ;)

Also, did you have a chance to test with snmpwalk how soon after the physical state change does the SNMP card reflect it - is there any substantial delay in NMC itself?

Thinking of it, upsmon wall broadcasts may be throttled (at least the data-stale warnings, and maybe on-battery ones, are by default repeated only every 300 sec = 5 min). State changes would probably be reported on the next loop though.

Update,

I managed to figure it out, I had to add it such as,

pollinterval = 1 maxretry = 3

[ups01] driver = usbhid-ups pollfreq = 5 port = auto desc = "VDN-UPS01 APC Smart-UPS 1000VA RM" vendorid = 051d productid = 0002 serial = "AS0613"

Now upsc is showing, upsc ups01@localhost Init SSL without certificate database battery.charge: 100 battery.charge.low: 10 battery.charge.warning: 50 battery.mfr.date: 2021/03/21 battery.runtime: 1260 battery.runtime.low: 120 battery.temperature: 24.3 battery.type: PbAc battery.voltage: 27.7 battery.voltage.nominal: 24.0 device.mfr: American Power Conversion device.model: Smart-UPS 1000 RM device.serial: AS0613230404 device.type: ups driver.name: usbhid-ups driver.parameter.pollfreq: 5 driver.parameter.pollinterval: 1

Time to try this again.

Regards

VoiletDragon commented 1 year ago

I think I am somewhat making progress, Switch to battery now takes around 4 seconds but switching to AC Power takes around 7 seconds, which i think is a improvement but just need to get the AC Power at the same time, any suggestions?

jimklimov commented 1 year ago

Same as before: boost debug, see in real-time what happens on the driver side. Cross-check with other tools/media where possible.

VoiletDragon commented 1 year ago

Same as before: boost debug, see in real-time what happens on the driver side. Cross-check with other tools/media where possible.

How do you enable Boost Debug? It is defiantly better that's for sure

jimklimov commented 1 year ago

Please do read the replies. It is tedious and time-wasting to repeat same words every few posts :\

With NUT v2.8.0 or newer you can use debug_min setting in daemon config files to bump their verbosity, and when they get into the state of not updating the ups.status for upsmon you would have more troubleshooting hints (e.g. lost connection between the two - though that should also emit logged warnings I think). For 2.7.4 you would have to modify init/systemd scripts to include -D arguments, and possibly adapt the systemd unit Type to a non-forking behavior while the option is used.

Daemons in this context are drivers, upsd and ups.conf - all with manageable verbosity.

VoiletDragon commented 1 year ago

Please do read the replies. It is tedious and time-wasting to repeat same words every few posts :\

With NUT v2.8.0 or newer you can use debug_min setting in daemon config files to bump their verbosity, and when they get into the state of not updating the ups.status for upsmon you would have more troubleshooting hints (e.g. lost connection between the two - though that should also emit logged warnings I think). For 2.7.4 you would have to modify init/systemd scripts to include -D arguments, and possibly adapt the systemd unit Type to a non-forking behavior while the option is used.

Daemons in this context are drivers, upsd and ups.conf - all with manageable verbosity.

Sorry about that, It's late here but i am determined to get this working.

I think I have found the sweet spot you know. Setting both driver.parameter.pollfreq:5 and driver.parameter.pollinterval: 1 is a sweet spot. Just going to test it a few times before i apply it to the 6 Servers.

Regards

VoiletDragon commented 1 year ago

Well after many hours it's 3am here but I think I have cracked it, Setting both driver.parameter.pollfreq:5 and driver.parameter.pollinterval has solved it, just need to test it a few more times before i commit to deploying it on Prod. Right now I'm calling it a good night.

Thanks for your time and help Jimklimov. I shall report back in the next few days.

Cheers

Regards.

jimklimov commented 1 year ago

Feel free to word a chapter to the FAQ if there is not a relevant helpful one there, and post a PR with your findings.