Open jimklimov opened 5 years ago
privet @jimklimov I'd like to make a setup to reproduce this, since the fix may not be that trivial. Questions:
We are plagued by the same problem. In our case we would like to monitor a Eaton 93PS and tried netxml-ups and snmp-ups and get the stale/no longer stale cycle with both. I would be able to provide tcpdump data and we are able to test/compile patches on that machine if someone is interested in solving the problem.
Might this be solved with https://github.com/networkupstools/nut/issues/542 ??
@aquette : for the device in that issue, it was a generic usbhid-ups
:
[eaton]
driver = usbhid-ups
synchronous = yes
port = /dev/usb/463.ffff/0
desc = "EATON Protection Station AN2E49008"
# pollonly = true
The MAXAGE
is not customized, so should be 15 sec per comments in the config.
I had the same issue here with the snmp-ups driver and a HP R/T3000 UPS with a network module.
The polling rate is 15s and I've tried to change MAXAGE up to 60s to no effect. The weird thing is that "is stale" and then "no longer stale" were occurring within the same second, which I couldn't explain, except by some race condition or something in upsd.
But I then sniffed the network interface of my NUT host for SNMP traffic. There was indeed no timeout issue, all the requests/replies occurring in time. But then I noticed that during its series of GETs, sometimes (but no always!) the driver was requesting an additional but non-existent (incomplete) SNMP OID. In that case the UPS was replying with a noSuchName error. There were as many is-stale/not-stale as there were GETs for such non-existent SNMP variables. See below:
tail -f /var/log/messages
Mar 22 10:38:42 myhost upsd: Data for UPS [rt3000] is stale - check driver
Mar 22 10:38:42 myhost upsd[22486]: UPS [rt3000] data is no longer stale
Mar 22 10:38:42 myhost upsd: UPS [rt3000] data is no longer stale
Mar 22 10:38:43 myhost upsd[22486]: Data for UPS [rt3000] is stale - check driver
Mar 22 10:38:43 myhost upsd: Data for UPS [rt3000] is stale - check driver
Mar 22 10:38:43 myhost upsd[22486]: UPS [rt3000] data is no longer stale
Mar 22 10:38:43 myhost upsd: UPS [rt3000] data is no longer stale
Mar 22 10:38:43 myhost upsd[22486]: Data for UPS [rt3000] is stale - check driver
Mar 22 10:38:43 myhost upsd: Data for UPS [rt3000] is stale - check driver
Mar 22 10:38:43 myhost upsd[22486]: UPS [rt3000] data is no longer stale
Mar 22 10:38:43 myhost upsd: UPS [rt3000] data is no longer stale
Mar 22 10:38:43 myhost upsd[22486]: Data for UPS [rt3000] is stale - check driver
Mar 22 10:38:43 myhost upsd: Data for UPS [rt3000] is stale - check driver
Mar 22 10:38:43 myhost upsd[22486]: UPS [rt3000] data is no longer stale
Mar 22 10:38:43 myhost upsd: UPS [rt3000] data is no longer stale
tcpdump -i ens192 'udp port 161'
10:38:42.952190 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.4.1.0
10:38:42.959678 IP ups.snmp > myhost.51497: GetResponse(34) E:232.165.3.4.1.0=10
10:38:42.959763 IP myhost.51497 > ups.snmp: GetRequest(34) E:232.165.3.4.4.1.4
10:38:42.967535 IP ups.snmp > myhost.51497: GetResponse(34) noSuchName@1 E:232.165.3.4.4.1.4=
10:38:42.967613 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.9.3.0
10:38:42.975343 IP ups.snmp > myhost.51497: GetResponse(35) E:232.165.3.9.3.0=3000
10:38:42.975449 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.4.5.0
10:38:42.982738 IP ups.snmp > myhost.51497: GetResponse(34) E:232.165.3.4.5.0=3
10:38:42.982809 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.2.5.0
10:38:42.989580 IP ups.snmp > myhost.51497: GetResponse(34) E:232.165.3.2.5.0=4
10:38:42.989646 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.4.5.0
10:38:42.996967 IP ups.snmp > myhost.51497: GetResponse(34) E:232.165.3.4.5.0=3
10:38:42.997034 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.7.2.0
10:38:43.004655 IP ups.snmp > myhost.51497: GetResponse(34) E:232.165.3.7.2.0=2
10:38:43.004725 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.8.1.0
10:38:43.011727 IP ups.snmp > myhost.51497: GetResponse(34) E:232.165.3.8.1.0=0
10:38:43.011780 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.8.2.0
10:38:43.019192 IP ups.snmp > myhost.51497: GetResponse(34) E:232.165.3.8.2.0=0
10:38:43.019262 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.6.1.0
10:38:43.026700 IP ups.snmp > myhost.51497: GetResponse(34) E:232.165.3.6.1.0=30
10:38:43.026771 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.6.2.0
10:38:43.034295 IP ups.snmp > myhost.51497: GetResponse(34) E:232.165.3.6.2.0=0
10:38:43.034364 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.6.3.0
10:38:43.041262 IP ups.snmp > myhost.51497: GetResponse(34) E:232.165.3.6.3.0=40
10:38:43.041335 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.2.4.0
10:38:43.048508 IP ups.snmp > myhost.51497: GetResponse(34) E:232.165.3.2.4.0=99
10:38:43.048574 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.2.1.0
10:38:43.055751 IP ups.snmp > myhost.51497: GetResponse(35) E:232.165.3.2.1.0=4221
10:38:43.055837 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.2.2.0
10:38:43.063191 IP ups.snmp > myhost.51497: GetResponse(35) E:232.165.3.2.2.0=130
10:38:43.063258 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.2.3.0
10:38:43.069987 IP ups.snmp > myhost.51497: GetResponse(34) E:232.165.3.2.3.0=0
10:38:43.070058 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.3.3.0
10:38:43.077273 IP ups.snmp > myhost.51497: GetResponse(34) E:232.165.3.3.3.0=1
10:38:43.077342 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.3.1.0
10:38:43.084524 IP ups.snmp > myhost.51497: GetResponse(35) E:232.165.3.3.1.0=498
10:38:43.084608 IP myhost.51497 > ups.snmp: GetRequest(34) E:232.165.3.3.4.1.2
10:38:43.091429 IP ups.snmp > myhost.51497: GetResponse(34) noSuchName@1 E:232.165.3.3.4.1.2=
10:38:43.091516 IP myhost.51497 > ups.snmp: GetRequest(35) E:232.165.3.3.4.1.2.1
10:38:43.099143 IP ups.snmp > myhost.51497: GetResponse(37) E:232.165.3.3.4.1.2.1=230
10:38:43.099243 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.9.2.0
10:38:43.106727 IP ups.snmp > myhost.51497: GetResponse(35) E:232.165.3.9.2.0=230
10:38:43.106795 IP myhost.51497 > ups.snmp: GetRequest(34) E:232.165.3.3.4.1.3
10:38:43.114091 IP ups.snmp > myhost.51497: GetResponse(34) noSuchName@1 E:232.165.3.3.4.1.3=
10:38:43.114166 IP myhost.51497 > ups.snmp: GetRequest(35) E:232.165.3.3.4.1.3.1
10:38:43.121087 IP ups.snmp > myhost.51497: GetResponse(36) E:232.165.3.3.4.1.3.1=3
10:38:43.121170 IP myhost.51497 > ups.snmp: GetRequest(34) E:232.165.3.3.4.1.4
10:38:43.128752 IP ups.snmp > myhost.51497: GetResponse(34) noSuchName@1 E:232.165.3.3.4.1.4=
10:38:43.128828 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.3.2.0
10:38:43.136010 IP ups.snmp > myhost.51497: GetResponse(34) E:232.165.3.3.2.0=1
10:38:43.136092 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.4.3.0
10:38:43.143319 IP ups.snmp > myhost.51497: GetResponse(34) E:232.165.3.4.3.0=1
10:38:43.143392 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.4.2.0
10:38:43.150226 IP ups.snmp > myhost.51497: GetResponse(35) E:232.165.3.4.2.0=498
10:38:43.150308 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.9.4.0
10:38:43.158029 IP ups.snmp > myhost.51497: GetResponse(35) E:232.165.3.9.4.0=500
10:38:43.158107 IP myhost.51497 > ups.snmp: GetRequest(35) E:232.165.3.4.4.1.2.1
10:38:43.165591 IP ups.snmp > myhost.51497: GetResponse(37) E:232.165.3.4.4.1.2.1=230
10:38:43.165671 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.9.1.0
10:38:43.173157 IP ups.snmp > myhost.51497: GetResponse(35) E:232.165.3.9.1.0=230
10:38:43.173233 IP myhost.51497 > ups.snmp: GetRequest(35) E:232.165.3.4.4.1.3.1
10:38:43.180234 IP ups.snmp > myhost.51497: GetResponse(36) E:232.165.3.4.4.1.3.1=1
10:38:43.180312 IP myhost.51497 > ups.snmp: GetRequest(33) E:232.165.3.10.1.0
10:38:43.188018 IP ups.snmp > myhost.51497: GetResponse(34) E:232.165.3.10.1.0=2
10:38:43.188100 IP myhost.51497 > ups.snmp: GetRequest(35) E:232.165.3.10.2.1.2.1
10:38:43.195808 IP ups.snmp > myhost.51497: GetResponse(36) E:232.165.3.10.2.1.2.1=1
10:38:43.195887 IP myhost.51497 > ups.snmp: GetRequest(35) E:232.165.3.10.2.1.2.2matter of los
10:38:43.203640 IP ups.snmp > myhost.51497: GetResponse(36) E:232.165.3.10.2.1.2.2=1
In cases where stale states were occurring during several consecutive polls (>DEADTIME), it was triggering false positive NOCOMM notifications which I was receiving by email. Very annoying...
My snmp-ups driver was configured for HP's specific SNMP MIB (cpqpower) but then I tried with standard IETF's MIB (which my UPS supports): the problem disappeared!...
So, in my case at least, is-stale/not-stale was not a matter of network connection, nor an update time issue. There should be something buggy in the snmp-ups driver that makes it sometimes (!) requesting wrong OIDs (at least with cpqpower MIB but maybe with other MIBs too), which then cause SNMP errors, which are reported as errors to upsd, which considers the UPS faulty (which indeed is incorrect).
I agree that UPSD is too eager to report Data stale...
On our relatively busy NAS server there are occasionally console messages that
Data for UPS X is stale
. In original default settings, the messages fordata is no longer stale
do not get to the console (which looked worrysome, to see only lots of failures, until I updated logging settings).Now that both types of messages are here, I see that stale and not-stale often happen in the same second, so with a bit of patience much of this spam could be avoided :)
To me it seems like some off-by-one math error in patience calculation, or perhaps a hint to induce one to avoid spurious messages on systems a bit too busy to context-switch fast?
Guessing from cases I've seen before in different codebases, maybe there is a rounding error of sec/msec/usec (including granularity of time reports from an OS - more pronounced for filesystem objects which may be off by 1-2 seconds) and comparisons of those numbers assuming infinite (or at least same) precision?
Also sometimes, less often, there are connection losses and re-establishments logged by
upsmon
some 9-10 seconds apart (probably hitting a 10-second frequency of status evaluation somewhere)... I am not sure how those could be avoided. Perhaps,upsmon
could do some double-check (and/or tellupsd
to do so) before resorting to a verdict of connection loss.Example "screenshot" below: