networkupstools / nut

The Network UPS Tools repository. UPS management protocol Informational RFC 9271 published by IETF at https://www.rfc-editor.org/info/rfc9271 Please star NUT on GitHub, this helps with sponsorships!
https://networkupstools.org/

Wishlist: add a configurable timeout between seeing "OB LB" and actual shutdown #321

Closed: dark-penguin closed this issue 6 months ago

dark-penguin commented 8 years ago

In many cases, it would be very convenient to have the system not react immediately to an "OB LB" status, but wait a little to make sure it doesn't go away in a couple of seconds. For example:

See this thread in the mailing lists for more info and extremely complicated workarounds that could be solved easily: http://lists.alioth.debian.org/pipermail/nut-upsuser/2016-September/010260.html

clepple commented 8 years ago

For the first case, the ignorelb option can be added to ups.conf: http://networkupstools.org/docs/man/ups.conf.html#_ups_fields (only the driver would need to be restarted - if this is complicated, please request that your distribution simplify that procedure).

I'm not sure I follow the second part. If you can't trust the UPS to get through a self-test without signaling low battery, how can you be sure that there is enough power left to shut down properly in a real power failure? I don't think many UPSes reliably report that a test is in progress, so I doubt we could add generic logic to "ignore LB during a test".

In the current driver architecture, the only state that is carried over between polling cycles is the connection information (to reconnect to USB devices) and whether or not the previous poll worked (for data-stale notification). Adding a timeout, while it sounds simple, would require adding an extra history layer to drivers to keep track of when the LB flag was last seen, and to handle all of the possible transitions. As someone who expects the UPS to provide a working LB signal, I would prefer that any such changes happen outside of the driver and upsmon.

There has been talk of integrating a Lua interpreter into drivers - maybe there is room for another upsmon/upssched hybrid which uses a scripting language to capture the intricacies of situations like these.

dark-penguin commented 8 years ago

OK, I understand that it would be hard to implement due to the current driver architecture. More thoughts about this later; first, let me explain other things so that everything else is clear.

For the first case of "periodical training": yes, adding "ignorelb" manually every time would help, but it would be even easier to just stop the driver for the training time or disconnect the interface cable. That's what I'm trying to avoid; if I could just add a 10-second timeout, I wouldn't have to do anything at all - 10 seconds is enough for me to flip the power switch back on.

I don't suggest that we go to extreme complications to implement this, but it would be a very useful feature to have in general; I'll explain with more examples. And I'm not talking about trying to find out whether it's "just a self-test" or something like that. What I'm talking about is:

1) It is not uncommon at all to have a power loss (OB-state) for only a moment. The possible reasons are:

2) It is also not uncommon at all to have your UPS in a LB-state, sometimes for prolonged periods of time, for no good reason. The possible reasons are:

3) When those two happen at the same time, which is really not as uncommon as we hope, everything shuts down immediately.

So, consider these examples:

For some cases, it's indeed possible to add the "ignorelb" option, and configure other options and overrides to have NUT set the LB itself. But that's more complex, and in some cases, that wouldn't help. And anyway, shutting down without waiting even for a moment does seem like a hasty decision to me.

So, the question is, how to implement it. I'm not very familiar with the inner details of NUT, but based on what I see... We have FINALDELAY and HOSTSYNC. What I expected when I read the manual was:

So, when the UPS goes into "OB LB" state, slaves see it, but don't react without a command from the master (unless the command never arrives). The master tells everyone to get ready for shutdown, then waits a little to see if the power comes back a few seconds later, and only then sends the "OK, shutdown now!" command. Then the master waits for the slaves to shut down, and starts its own shutdown procedure.

This way, after the shutdown command has been sent, there is no way back; but before the command is sent, after waiting for FINALDELAY - it's still not too late to cancel the shutdown!

Would this be possible to implement somehow? If it changes too much in the established shutdown order, this may very well be optional, toggled by a special parameter in nut.conf or something.
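To make the proposed order concrete, here is a minimal Python model of it: warn the slaves, wait up to FINALDELAY while re-checking the UPS, and only then send the irreversible command. This models the behaviour I expected from the manual, not what upsmon currently does. The function name, the `read_status`/`notify`/`sleep` hooks and the message strings are all hypothetical and exist only for illustration.

```python
from enum import Enum, auto


class Outcome(Enum):
    CANCELLED = auto()   # power returned before the point of no return
    SHUTDOWN = auto()    # the shutdown command was sent


def run_shutdown_sequence(read_status, notify, sleep,
                          final_delay=5, poll_interval=1):
    """Warn, wait, re-check, then commit (hypothetical model, not NUT code).

    read_status() returns a set of status tokens such as {"OB", "LB"}.
    notify(msg) delivers a message to the slaves (hypothetical hook).
    sleep(seconds) is injected so the sketch can run without real waiting.
    """
    notify("PREPARE")                    # tell everyone to get ready
    waited = 0
    while waited < final_delay:
        sleep(poll_interval)
        waited += poll_interval
        status = read_status()
        if not {"OB", "LB"} <= status:   # critical state cleared in time
            notify("CANCEL")
            return Outcome.CANCELLED
    notify("SHUTDOWN")                   # past this point there is no way back
    return Outcome.SHUTDOWN
```

The only design point that matters here is that "SHUTDOWN" is sent after the wait rather than before it, so the whole waiting period stays cancellable.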

(POWEROFF_WAIT: In case you've missed the Debian bug report, here it is: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=835634 ) (By the way, "/sbin/upsmon -K" doesn't work either - https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=835555 , but the Debian bugs policy is to not post bugs upstream if they are already posted in Debian, so I didn't copy it...)

jimklimov commented 6 months ago

I've finally gotten around to reading through this thread and the referenced Debian bugs. Some of this still seems relevant, but not all of it (after the years of changes).

Regarding POWEROFF_WAIT and systemd, this should have been fixed by the nutshutdown scripting changes included in NUT v2.8.1 and later releases. Note that this is generally system-dependent; for example, the Solaris 10+/illumos SMF core imposes hard timeouts to kill everything and halt the system when told to (maybe turning this power-race-avoidance logic into a kernel driver that would block and reboot could be a solution).

Checking the concerns about upsmon -K not working, I found that the POWERDOWNFLAG value must be set in upsmon.conf: there is no compiled-in default, but this bit of info is not really exposed - PR pending now. With the file existing and containing the magic string, upsmon -K currently (checked with NUT master after v2.8.2) does return exit code 0, so shell chaining with && works.

A timer for "OB LB" delay might indeed be an option - I suppose for short-lived glitches we could use a similar mechanism to what was recently introduced to avoid shutdowns during calibrations etc. when the UPS reports cycling different states - sometimes hovering in bogus limbo for a few seconds. This would probably also be tied to the number of POLLFREQ(ALERT) cycles - e.g. "ignore the state for X cycles in a row, issue FSD if not cleared by then". Now there's precedent for something like this in NUT v2.8.2 (maybe 2.8.1 already); need to check if this particular use-case was not actually addressed by now, e.g.:

CC @desertwitch for a "second opinion" :)

jimklimov commented 6 months ago

For a practical pointer, in the is_ups_critical() method the spot to extend would be just after the ST_CAL check, with throttling logic and data/state storage similar to pollfail_log_throttle_max and ups->pollfail_log_throttle_count implementation (peppered around the source):

https://github.com/networkupstools/nut/blob/60f76bfa9ab288b91b73ce90393971bc813f89d0/clients/upsmon.c#L1111-L1136
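As a rough illustration of the counter-based variant ("ignore the state for X cycles in a row"), here is a small Python model rather than the actual upsmon.c code. The class name `CriticalCounter` and the `max_cycles` setting are hypothetical; the real change would keep the counter in the per-UPS structure, much like ups->pollfail_log_throttle_count.

```python
class CriticalCounter:
    """Counts consecutive polls that report both OB and LB (hypothetical model)."""

    def __init__(self, max_cycles):
        self.max_cycles = max_cycles  # hypothetical setting: polls to tolerate
        self.count = 0                # state carried over between polls

    def is_critical(self, status):
        """Return True once OB+LB has been seen on more than max_cycles polls in a row."""
        if "OB" in status and "LB" in status:
            self.count += 1
        else:
            self.count = 0            # any clean poll resets the streak
        return self.count > self.max_cycles
```

With `max_cycles=0` the first "OB LB" poll is already critical, which would preserve the current behaviour as the default.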

desertwitch commented 6 months ago

Thanks for the CC, commits look good and reasonable. I will read more and report back tomorrow, my (mental) bandwidth is a bit limited at the weekends. 😉

desertwitch commented 6 months ago

So I've just given this some thought, and I think the only thing that hasn't been addressed yet is an "OB LB" switch to ignore the condition for a configurable time and only trigger FSD once that time has elapsed. For the most part I agree with Charles that "OB LB" is a rather critical status to ignore, but given the recent anomalies with some UPSes reporting it as part of calibration cycles, I think it would be fair to offer this as a non-default option for cases where "ignorelb" would be too much. I'd keep the default behavior of instant FSD upon encountering the "OB LB" status, just to be safe, but allow users to modify it with a setting similar to the one we've implemented for the intermittent OFF states we've seen (OFFDURATION). It could be either a configurable time or a number of encounters of the status, although I think a configurable time would be easier to approximate when factoring in the individual battery condition, seeing as most people probably know roughly how many minutes of load their batteries can still hold. So it is probably easier to think in time here rather than cycles, which would also match better with OFFDURATION. Perhaps LBDURATION? Anyhow, the place recommended by Jim seems perfect for this.
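A minimal Python sketch of the time-based idea, assuming a setting called LBDURATION as floated above (the name, the class `OblbTimer` and the "0 means instant" convention are assumptions for illustration, not existing NUT options):

```python
class OblbTimer:
    """Triggers FSD only after OB+LB has persisted for lb_duration seconds."""

    def __init__(self, lb_duration=0):
        self.lb_duration = lb_duration  # seconds; 0 keeps the instant-FSD default
        self.first_seen = None          # timestamp of the first OB+LB poll in a row

    def should_fsd(self, status, now):
        """status is a set of tokens; now is a monotonic timestamp in seconds."""
        if not ("OB" in status and "LB" in status):
            self.first_seen = None      # condition cleared, forget the streak
            return False
        if self.first_seen is None:
            self.first_seen = now
        return now - self.first_seen >= self.lb_duration
```

In real use `now` would come from a monotonic clock (for example `time.monotonic()`), so that wall-clock adjustments during an outage cannot shorten or stretch the grace period.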

jimklimov commented 6 months ago

Posted the remaining PR for the "payload" of this wish - testing/review would be welcome :)