Possibly dangerous interactions with APC BR1000G after recovery from full discharge #1992

Open ms8x8x opened 1 year ago

ms8x8x commented 1 year ago

NUT version: 2.7.4-13 (Debian package installed via apt)
UPS: APC BR1000G
OS: PVE 7.3.3

In most cases, NUT works just fine.

But if input power is lost for too long, the UPS battery becomes fully discharged; that is quite a common case, right? When input power is restored, the UPS goes ONLINE, and PVE powers on automatically too. That is my power-restore configuration.

Unfortunately, less than 10 seconds into the PVE boot process, the UPS reliably "detects" a loss of input power and falls back to ONBATT while the input power is ON!!! Since the UPS has just been fully discharged, it soon shuts down completely!!! A few seconds later, the UPS comes back ONLINE and works fine.

This problem has caused data damage on my NAS and PVE platform. 👎

Why NUT? After the problem occurred, I contacted the APC support team and ran the following tests under the same conditions:

  1. connecting the UPS to Windows 10, with APC's PowerChute: works fine
  2. connecting the UPS to PVE, without NUT installed: works fine
  3. connecting the UPS to PVE, with apcupsd: works fine
  4. connecting the UPS to PVE, with NUT installed: BOOM...
jimklimov commented 1 year ago

Hi, trying to clarify a few things: when you say the UPS goes down "dangerously", is this about a clean shutdown or a dirty power cut (command sent to UPS to turn off ASAP), or no command and UPS just turns off (maybe as per earlier settings posted by NUT)? Are there indications of "FSD" (Forced ShutDown) state, whether newly discovered or somehow inherited from the previous up-time?

Is there a setting recognized by your UPS (maybe not through NUT) for how eagerly it turns on load (e.g. after 10% or 100% charged)? And when it raises an LB/FSD alarm or similar, which NUT can interpret as a call to immediate shutdown?

ms8x8x commented 1 year ago

> Hi, trying to clarify a few things: when you say the UPS goes down "dangerously", is this about a clean shutdown or a dirty power cut (command sent to UPS to turn off ASAP), or no command and UPS just turns off (maybe as per earlier settings posted by NUT)? Are there indications of "FSD" (Forced ShutDown) state, whether newly discovered or somehow inherited from the previous up-time?

The UPS just "lost" input power and fell back on battery, but the input power is normal. That is not a safe behavior coz the battery has already fully discharged and soon shutdown. I don't know why the UPS will "lose" input power when PVE OS booting with nut. Checking thru logs, there was no FSD state, just shutdown directly for battery shortage, I think.

Since apcupsd and PowerChute behave normally, how does NUT interact with the UPS when it is ONLINE and the battery is very low? I said "dangerously" because this quick switching and shutting down is unexpected and harmful for Linux boxes, including a NAS. The storage on one Synology has been damaged.

> Is there a setting recognized by your UPS (maybe not through NUT) for how eagerly it turns on load (e.g. after 10% or 100% charged)? And when it raised an LB/FSD alarm?

There was no such setting, the load was less than about 25%, and there was no LB/FSD alarm either.

jimklimov commented 1 year ago

I think one similar-sounding practical use-case I've seen was with enterprise UPSes (which send their own FSD alarms to subscribed clients, and NUT was among those). They effectively try to guarantee that if the UPS feeds its load, there is enough time for safe shutdowns in case of an outage. So if there was an outage and the battery is depleted, the UPS would normally not start the load until it charges sufficiently (so another outage is survivable in a clean fashion).

However, if it has been turned on manually before that threshold charge is reached (e.g. by admins to minimize data service outage when they know the disruption was intentional/fixed and an imminent new outage is not expected), the UPS still emits FSDs - so any clients boot, detect the alarm, and go down again.

Some rack APCs I've had some 20 years ago did have a delayed start-up so it could take an hour after power restoration for the racks to light up automatically, but I don't think a manual start-up was compromised into auto-shutdowns.

jimklimov commented 1 year ago

The UPS just "lost" input power and fell back on battery, but the input power is normal.

Are there any indications (LED lights, relay clicks etc.) that the UPS indeed thinks it is at that moment offline and deeply discharged - and so causes the shutdown (and/or the empty battery gives out)?

Thinking of it, are there such indications at other times the systems cold-boot while the UPS is charged?

One idea OTOH could be if the driver start-up (or some other services tied into packaging) would cause a UPS calibration or similar health-check - which indeed can tell it to go offline to check the true battery state. I am not aware of NUT drivers doing that blindly; however, there is command support for calibration.start and calibration.stop, as well as detection of CAL among ups.status values.
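
For reference, a minimal sketch of how to check this from the command line, assuming the UPS section is named "apc" in ups.conf (substitute your own section name):

    # List the instant commands the driver exposes; calibration.start / calibration.stop
    # will only show up if the device and driver support them
    upscmd -l apc

    # Watch the status flags; CAL indicates a calibration run,
    # OL/OB/LB are online / on-battery / low-battery
    upsc apc ups.status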

Otherwise, that does sound like a device behavior, not quite manageable by software. Maybe the burst-stress of powering several boxes (and they typically run through POST and disk spin-up at full amps, only adding power-saving/management later in the OS lifecycle)?..

jimklimov commented 1 year ago

Also, to check for the possibility that it might be some left-over of an earlier NUT lifetime (e.g. some bad interaction with an /etc/killpower file - location subject to your configs, and it should get ignored and removed upon boot if it is on a persistent filesystem at all): you can try draining the UPS so it powers off, then disconnecting the servers and charging the UPS alone, then turning it off again, connecting all the load and powering everything on. If the UPS is charged, but something in the filesystem is mis-interpreted and NUT (upsmon) decides to shut down again, this could get detected.
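
A minimal sketch of checking that hypothesis, assuming the common Debian paths (/etc/nut/upsmon.conf, plus the /etc/killpower flag file mentioned above):

    # See where upsmon is configured to write its "kill power" flag file
    grep POWERDOWNFLAG /etc/nut/upsmon.conf

    # Ask upsmon whether that flag is currently considered set
    # (per upsmon(8), -K exits with status 0 when the flag file is set)
    upsmon -K; echo "flag check exit code: $?"

    # If a stale flag file somehow survived into the new boot, inspect it
    ls -l /etc/killpower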

An opposite experiment could be to somehow disable NUT systemd units from starting, e.g. per https://www.freedesktop.org/software/systemd/man/systemd.unit.html by a line like:

[Unit]
ConditionPathExists=!/etc/nut/disable-autostart

added into the units (directly for tests, or ideally via drop-in configuration extension snippets like /etc/systemd/system/nut-monitor.service.d/disable-autostart.conf); don't forget to systemctl daemon-reload after editing the units one way or another. Note you can also try to dissect who causes the problem, e.g. nut-driver.service (driver start-up and interaction with the device) or all the way to nut-monitor.service or nut-client.service for the upsmon?..
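
A concrete sketch of that drop-in approach, using the nut-monitor.service unit name mentioned above (repeat for nut-driver.service etc. as needed):

    # Create a drop-in that prevents the unit from starting while the marker file exists
    mkdir -p /etc/systemd/system/nut-monitor.service.d
    printf '[Unit]\nConditionPathExists=!/etc/nut/disable-autostart\n' \
      > /etc/systemd/system/nut-monitor.service.d/disable-autostart.conf

    # Let systemd pick up the new drop-in
    systemctl daemon-reload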

Then you can touch /etc/nut/disable-autostart, unplug the UPS, let things shut down (including NUT's reaction to that), and re-plug it for things to power on. If the systemd setting works, NUT-related programs would not even be tried in the new lifetime (and perhaps you can start them manually one by one, after renaming/removing that touch-file), so you can learn if some particular NUT program's start-up is at fault causing the shutdowns.
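
The test cycle itself could then look roughly like this (a sketch; the unit names follow the Debian packaging discussed in this thread - check systemctl list-unit-files for the exact set on your system):

    # Arm the marker so the NUT units stay down on the next boot
    touch /etc/nut/disable-autostart

    # ...unplug the UPS, let everything shut down, re-plug it, let the hosts power on...

    # After the reboot, drop the marker and start the NUT pieces one at a time,
    # watching the UPS for the spurious ONBATT/shutdown after each step
    rm /etc/nut/disable-autostart
    systemctl start nut-driver.service    # driver <-> device interaction only
    systemctl start nut-server.service    # upsd
    systemctl start nut-monitor.service   # upsmon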

If it still mis-behaves with no NUT even starting (but after NUT being involved in the shutdown), this may be about some last settings/commands sent to the UPS as part of "kill power" processing.

ms8x8x commented 1 year ago

The UPS just "lost" input power and fell back on battery, but the input power is normal.

Are there any indications (LED lights, relay clicks etc.) that the UPS indeed thinks it is at that moment offline and deeply discharged - and so causes the shutdown (and/or the empty battery gives out)?

Yes, the BR1000G has an LED indicating that input power is lost, and the battery icon blinks empty. I think the shutdown itself is normal, but why does it lose input power while the NUT driver is loading?

> Thinking of it, are there such indications at other times the systems cold-boot while the UPS is charged?

No. In other cases, NUT and the UPS co-operate just fine; I only get the problem during recovery from a full discharge.

> One idea OTOH could be if the driver start-up (or some other services tied into packaging) would cause a UPS calibration or similar health-check - which indeed can tell it to go offline to check the true battery state. I am not aware of NUT drivers doing that blindly; however, there is command support for calibration.start and calibration.stop, as well as detection of CAL among ups.status values.
>
> Otherwise, that does sound like a device behavior, not quite manageable by software. Maybe the burst-stress of powering several boxes (and they typically run through POST and disk spin-up at full amps, only adding power-saving/management later in the OS lifecycle)?..

I think NUT is related because other software behaves as expected. I want NUT to work fine because my NAS needs it.

jimklimov commented 1 year ago

By the way, for clarity: is this a USB-connected UPS? Which driver are you using?

> Is there a setting recognized by your UPS (maybe not through NUT) for how eagerly it turns on load (e.g. after 10% or 100% charged)? And when it raised an LB/FSD alarm?
>
> There was no such setting, the load was less than about 25%, and there was no LB/FSD alarm either.

I meant a few % of battery charge. I believe your "<25% load" actually meant load (that the NAS and PVE server wattage are well within the UPS's 1000VA rating), right?

Depending on model, there may be an ups.delay.start setting you can try to set with upsrw; if supported by the device, it would tell the UPS to sit and charge for some time after an outage before turning on the load.
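
For example, a minimal sketch of probing and setting it, again with "apc" as a placeholder UPS name and placeholder credentials for an upsd user that has "actions = SET" in upsd.users:

    # List the read/write variables the device exposes; ups.delay.start only
    # appears here if the driver/device supports changing it
    upsrw apc

    # If it is listed, ask the UPS to wait e.g. 120 seconds before re-powering the load
    upsrw -s ups.delay.start=120 -u admin -p secret apc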

jimklimov commented 1 year ago

> Thinking of it, are there such indications at other times the systems cold-boot while the UPS is charged?
>
> No. In other cases, NUT and the UPS co-operate just fine; I only get the problem during recovery from a full discharge.

Trying to rule out here whether something like a calibration (and so an "offline" state) is caused by NUT programs. So if the systems boot with the UPS charged, but it still blinks the LED or clicks the relay about going offline/online when NUT (probably the driver) starts, that would implicate the driver logic - it just would not be "fatal", since the battery is not also empty at that moment to cause a shutdown.

ms8x8x commented 1 year ago

> Also, to check for the possibility that it might be some left-over of an earlier NUT lifetime (e.g. some bad interaction with an /etc/killpower file - location subject to your configs, and it should get ignored and removed upon boot if it is on a persistent filesystem at all): you can try draining the UPS so it powers off, then disconnecting the servers and charging the UPS alone, then turning it off again, connecting all the load and powering everything on. If the UPS is charged, but something in the filesystem is mis-interpreted and NUT (upsmon) decides to shut down again, this could get detected.

If the battery is not drained, the whole rebooting process is just fine.

> An opposite experiment could be to somehow disable NUT systemd units from starting, e.g. per https://www.freedesktop.org/software/systemd/man/systemd.unit.html by a line like:
>
> [Unit]
> ConditionPathExists=!/etc/nut/disable-autostart
>
> added into the units (directly for tests, or ideally via drop-in configuration extension snippets like /etc/systemd/system/nut-monitor.service.d/disable-autostart.conf); don't forget to systemctl daemon-reload after editing the units one way or another. Note you can also try to dissect who causes the problem, e.g. nut-driver.service (driver start-up and interaction with the device) or all the way to nut-monitor.service or nut-client.service for the upsmon?..

I tried removing NUT and switching to apcupsd; when rebooting in this recovery scenario, the UPS behaved normally. I have not done deeper testing to check which service - driver, server or client - causes the problem. Thanks for your advice.

> Then you can touch /etc/nut/disable-autostart, unplug the UPS, let things shut down (including NUT's reaction to that), and re-plug it for things to power on. If the systemd setting works, NUT-related programs would not even be tried in the new lifetime (and perhaps you can start them manually one by one, after renaming/removing that touch-file), so you can learn if some particular NUT program's start-up is at fault causing the shutdowns.
>
> If it still mis-behaves with no NUT even starting (but after NUT being involved in the shutdown), this may be about some last settings/commands sent to the UPS as part of "kill power" processing.

jimklimov commented 1 year ago

> I tried removing NUT and switching to apcupsd; when rebooting in this recovery scenario, the UPS behaved normally.

Yes, hence the contrived experiments suggested above.

ms8x8x commented 1 year ago

> By the way, for clarity: is this a USB-connected UPS? Which driver are you using?

USB.

> Is there a setting recognized by your UPS (maybe not through NUT) for how eagerly it turns on load (e.g. after 10% or 100% charged)? And when it raised an LB/FSD alarm?
>
> There was no such setting, the load was less than about 25%, and there was no LB/FSD alarm either.
>
> I meant a few % of battery charge. I believe your "<25% load" actually meant load (that the NAS and PVE server wattage are well within the UPS's 1000VA rating), right?

Yes, only the PVE host was on the UPS as load during testing, within the UPS limits.

> Depending on model, there may be an ups.delay.start setting you can try to set with upsrw; if supported by the device, it would tell the UPS to sit and charge for some time after an outage before turning on the load.

jimklimov commented 1 year ago

As one more option to try - NUT 2.7.4 is several years old. A 2.8.0 release went out last year with many changes and bug fixes, with some more fixes added since in the main branch (no release was cut yet though).

You can try to custom-build the current codebase to check whether something that impacts this behavior was actually fixed in the years since the code you are using: https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests
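
Roughly, such a test build follows the usual autotools flow described on that wiki page - a sketch with the configure options trimmed to a bare minimum (the page lists the prerequisites and the options matching a packaged layout):

    # Fetch the current development sources and build them in place
    git clone https://github.com/networkupstools/nut.git
    cd nut
    ./autogen.sh
    ./configure --with-usb --with-user=nut --with-group=nut
    make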

ms8x8x commented 1 year ago

> As one more option to try - NUT 2.7.4 is several years old. A 2.8.0 release went out last year with many changes and bug fixes, with some more fixes added since in the main branch (no release was cut yet though).
>
> You can try to custom-build the current codebase to check whether something that impacts this behavior was actually fixed in the years since the code you are using: https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests

Got it, thanks a lot, I will try the latest version.