networkupstools / nut

The Network UPS Tools repository. UPS management protocol Informational RFC 9271 published by IETF at https://www.rfc-editor.org/info/rfc9271 Please star NUT on GitHub, this helps with sponsorships!
https://networkupstools.org/

Powercool 3U Rackmount 1200VA device mostly usable... #1344

Closed: lws-team closed this issue 2 years ago

lws-team commented 2 years ago

... with git build of NUT and hunnox subdriver on nutdrv_qx. So thanks for the work to make that happen.

But there is one thing... if I turn the load off using upscmd, I cannot turn the load back on with upscmd. The UPS shows "SHUT DN" on its LCD and doesn't exit this state unless I hold the front-panel button down for 3s.

Is this considered normal behaviour? I really want to be able to programmatically power down the load on the UPS completely and power it back on when there is something to do, independent of its battery backup behaviours.
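
For readers who want to script the same off/on cycle, here is a minimal sketch in Python. It assumes the standard NUT instant commands load.off and load.on are offered by the driver (upscmd -l lists what a given device supports); the UPS name, user and password are placeholders, not values from this setup.

```python
import subprocess


def upscmd_argv(ups, command, user, password):
    """Build the argv for one NUT instant command (no shell involved)."""
    return ["upscmd", "-u", user, "-p", password, ups, command]


def load_off_then_on(ups, user, password, run=subprocess.run):
    """Send load.off, then load.on, and return the two exit codes.

    `run` is injectable so the function can be exercised without a UPS.
    """
    codes = []
    for command in ("load.off", "load.on"):
        result = run(upscmd_argv(ups, command, user, password))
        codes.append(result.returncode)
    return codes
```

A call such as load_off_then_on("myups@localhost", "admin", "secret") would run both commands in sequence. A zero exit code does not by itself prove that the load changed state, which is exactly the symptom described above.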

lws-team commented 2 years ago

A bit more information: the load does actually come back on about 30 minutes to an hour after the command was sent.

Looking at the nutdrv_qx debug output while running upsrw -s ups.delay.start=1, the driver does not seem to issue anything to the UPS at that point. It notes that a server variable was set, but no further debug output shows a command going out over USB.

The upsc output while it's powering the load and after trying the upsrw is

battery.voltage: 27.60
device.type: ups
driver.name: nutdrv_qx
driver.parameter.langid_fix: 0x409
driver.parameter.pollfreq: 30
driver.parameter.pollinterval: 2
driver.parameter.port: auto
driver.parameter.product: MEC0003
driver.parameter.productid: 0000
driver.parameter.subdriver: hunnox
driver.parameter.synchronous: auto
driver.parameter.vendorid: 0001
driver.version: 2.7.4-5059-ga8e3687a
driver.version.data: Q1 0.07
driver.version.internal: 0.32
driver.version.usb: libusb-1.0.23 (API: 0x1000107)
input.frequency: 50.1
input.voltage: 243.0
input.voltage.fault: 0.0
output.voltage: 245.0
ups.beeper.status: disabled
ups.delay.shutdown: 30
ups.delay.start: 0
ups.load: 22
ups.productid: 0000
ups.status: OL
ups.temperature: 29.0
ups.type: online
ups.vendorid: 0001
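
A listing like this is easy to check from a script. The following is a minimal Python sketch, not part of NUT, that parses the "name: value" lines into a dictionary; the function name parse_upsc is made up for illustration, and the sample is a four-line excerpt of the output above.

```python
def parse_upsc(text):
    """Turn 'name: value' lines from upsc into a dict of strings."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep:  # skip blank or malformed lines
            values[name.strip()] = value.strip()
    return values


sample = """\
ups.delay.shutdown: 30
ups.delay.start: 0
ups.load: 22
ups.status: OL
"""

state = parse_upsc(sample)
print(state["ups.delay.start"])             # 0
print("OL" in state["ups.status"].split())  # True
```

Note that the listing still reports ups.delay.start as 0 after the upsrw attempt.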

lws-team commented 2 years ago

I ended up delving into the UPS innards and finding it has its own problems outside of NUT.

I wrote it up here for interested parties.

https://warmcat.com/2022/03/26/efficient-ci-ups.html

jimklimov commented 2 years ago

Thanks, great article!

As a writer to writer, I'll try to comment here a few clarifications for the peace of mind :)

1) Regarding features in packaged vs. git NUT: indeed, the last release was a few years too many ago; hopefully new 2.8.0 is just around the corner by now. Distros tend to not package the latest and greatest, from git HEADs, not even the "rolling model" ones. With things that do benefit from testing against obscure hardware, I don't blame them :)

2) About "required" python... it is a bit of an Achilles' heel indeed. The long story is that autoconf requires the templates that the configure script will eventually parse to be present before we convert configure.ac (m4) to a shell script. Templates may be empty for a null parse, so the autogen.sh script should have recommended to export WITHOUT_NUT_AUGEAS=true if it did not find a python interpreter; then it would just touch an empty file instead of generating one.

Historically, I guess since this file is generated from sources it was not tracked in Git. Some others are however, to the point that a mismatched generation (unexpected git diff) is among sanity checks for those during CI iterations.

On the other hand, historically developers were expected to have a large toolkit for dependencies, and packagers had the nut-x.y.z.tar.gz archive created by make dist and published as a release, usable "as is" without regenerating autoconf etc.

3) Your example shows to chgrp nobody /var/state/ups - this depends on run-time user accounts for the data server (upsd) and driver daemons, who they all run as to create/write/read the socket files there. Indeed, nobody:nogroup (or nobody:nobody on systems without nogroup) is the configure default, but it is all tunable.

4) Not sure if "defaults to install in /usr/local/ups/..." was a complaint :) But generally non-packaged custom builds go there by old defaults, and something vetted by distro maintainers (anyone can fill that role for their organization) would go into system FS trees. Some OSes put it to /usr, some even upstairs (/bin, /libexec...) to have the UPS related programs available when all non-rootfs partitions or datasets got unmounted.

5) Driver not detecting that the UPS became "stale" sounds like a bug. At least, NUT does have a concept for connections going AWOL and coming back. If the device is queried and responds with a lie, that is another story... but not answering should be treated as a connection loss; gibberish might need better handling. Perhaps if you could conjure up and test in-vivo a PR to handle this better... would be great :)

Something along the lines of: we were last known to tell the UPS to power down (or we don't know, and somebody else powered it off), but just now it responds with garbage to a status query, so ask for something else, and diagnose it as "stale" until it begins to answer reasonably at some point.
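
To make that idea concrete, here is a rough Python sketch of the bookkeeping, not the actual nutdrv_qx code. The class name, the threshold of three bad replies, and the validity check (a Q1-style status reply starting with "(" and ending with a carriage return) are all assumptions made for illustration.

```python
class StaleTracker:
    """Mark UPS data stale after several consecutive bad status replies."""

    def __init__(self, max_bad=3):
        self.max_bad = max_bad    # hypothetical threshold, not a NUT setting
        self.bad_in_a_row = 0
        self.last_good = None

    def feed(self, reply):
        """Record one raw reply (bytes, or None for a timeout)."""
        if self._looks_valid(reply):
            self.bad_in_a_row = 0
            self.last_good = reply
        else:
            self.bad_in_a_row += 1

    @property
    def stale(self):
        """True once max_bad replies in a row were missing or garbled."""
        return self.bad_in_a_row >= self.max_bad

    @staticmethod
    def _looks_valid(reply):
        # Assumption: a good Q1-style reply is framed as b"(...\r".
        return bool(reply) and reply.startswith(b"(") and reply.endswith(b"\r")
```

Feeding it a timeout (None) or a lone 0x05 byte counts as a bad reply; the first well-formed reply resets the counter, so the device stops being "stale" as soon as it answers reasonably again.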

6) I read up on that a few times, in code and docs - the sticky FSD (once raised, can't be cleared from upsd until daemon restart) seems to be there by design :) Essentially, you don't want a rack half-powered off and in unpredictable state. If your power went critical, you want every device recycled (usually by telling everyone to halt, and telling UPS to power down and eventually boot when the coast is clear and batteries are charged), and everything started back in proper order. The term in docs was "power-race condition" about wall-power appearing back during your shutdown and so precluding the UPS from discharging and powering off unless commanded to - a situation that happens way too often in practice.

On a similar note, some UPSes may also claim their power is critical even if they are already charging, but are not full enough to guarantee that they would let the load shut down safely in case of a new outage. In this case, if the rack load is powered back on too early (e.g. manually), NUT would claim a new FSD and tell everyone to abandon ship ASAP. Which is annoying, but technically also by design (not quite NUT's but rather all physical parties involved) and makes sense after a while.
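
The sticky behaviour from point 6 can be pictured with a toy model in Python. This is only an illustration of "once raised, cleared only by a restart" and does not mirror upsd internals; the class and method names are invented.

```python
class StickyFsd:
    """Toy model of a forced-shutdown flag that only a restart clears."""

    def __init__(self):
        # Creating a fresh instance stands in for restarting the daemon.
        self.fsd = False

    def raise_fsd(self):
        # Deliberately no counterpart method to lower the flag again.
        self.fsd = True

    def report(self, driver_status):
        """Combine the driver's status string with the sticky flag."""
        flags = driver_status.split()
        if self.fsd and "FSD" not in flags:
            flags.insert(0, "FSD")
        return " ".join(flags)
```

In this model, after raise_fsd() the reported status stays "FSD OL" even when the driver says the UPS is back online, until a new StickyFsd() is created.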

As an idea, I suppose you can fiddle with the "master"/"primary" mode upsmon on your NUT RPi4 set up in the MONITOR line as powering 0 inputs of the RPi... I think it might avoid raising the immutable FSD, but would instead rely on all other upsmon instances on your servers (running as "slave"/"secondary" with a non-zero inputs count summing up to the MINSUPPLIES PSU count of the box) going down ASAP - as soon as they detect a critical power state via upsd (on battery + low battery). Not fully sure that would work, but might :)
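
As a rough illustration of the arithmetic behind MONITOR power values and MINSUPPLIES, the Python sketch below sums the power values of the UPSes that are not critical and compares the total with the minimum. It is a simplification under my own assumptions (it ignores timing and primary/secondary coordination), and the function names are made up.

```python
def is_critical(status):
    """Treat a UPS as critical when on battery and low, or when FSD is set."""
    flags = set(status.split())
    return "FSD" in flags or {"OB", "LB"} <= flags


def must_shut_down(monitored, minsupplies):
    """Decide whether this host lacks enough healthy power supplies.

    monitored: list of (status, powervalue) pairs, one per MONITOR line.
    """
    available = sum(pv for status, pv in monitored if not is_critical(status))
    return available < minsupplies
```

For a server with a single PSU on this UPS, must_shut_down([("OB LB", 1)], 1) is true, while a monitor-only box modelled as power value 0 with a minimum of 0 never trips on this rule.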

Also maybe look into upssched that allows for more complex logic to react to power events, instead of upsmon aimed at orchestrating timely shutdowns.

7) Regarding the nut-monitor.service being After=nut-server.service -- was that issue seen with the older packaged NUT on the clients (2.7.4 or older)? I believe the current service file in git should have this line as well as Wants=nut-server.service (could be Requires earlier or delivered by distros as "their" choice). The current code is expected to tell systemd to try and start the nut-server, and if it starts - to start nut-monitor after that; but I don't think it should fail if nut-server is disabled.


Looking at all these work-arounds, I wonder if NUT as a project should revise the stance that "packages is something distros do" and provide some sort of "reference packaging" that everyone can build from git consistently, installing which would set up the accounts, services, FS rights... hopefully that would be somehow cross-pollinated with what popular distros do, so maybe their next-release packages would be just calls of a reference make package :)

lws-team commented 2 years ago

Yes the article is meant to be informative, but also detailed to capture what I had to do so if I have to do it again, I'm not starting from scratch.

> Your example shows to chgrp nobody /var/state/ups - this depends on run-time user accounts

Right... it describes the case of using Rocky Linux, I found I had to do that. I don't know what else the packaging in some other OS would have to do.

> Not sure if "defaults to install in /usr/local/ups/..." was a complaint

No, I think it's fine; it would work well on, e.g., BSD or whatever. I highlighted it because that's not how it's packaged in Fedora: the package build passes configure options to align it with Fedora style (e.g., putting things in /usr/bin). You have to be aware of this when switching between packaged and git builds because, e.g., /etc is expected to be somewhere else, so it's not looking at the same config.

> Driver not detecting that the UPS became "stale" sounds like a bug.

I think so... the hunnox driver does understand that it received 0x05 and considers that unexpected. But it handles it as if only that one transaction was corrupted; the caller doesn't conclude from it that the status is now "unknown", or that the status has some TTL. It behaves as if it just couldn't update the values it holds, which it keeps treating as valid. It doesn't seem to see 0x05 as a status with a specific meaning, just as something garbled.

> immutable FSD

I sometimes saw a reported status of FSD OL FSD. I think there are at least two sticky points for it: one in nut-server, since I can clear it by restarting that service (that is my workaround atm), and the other is the stale UPS data.

> The current code is expected to tell systemd to try and start the nut-server,

For whatever reason, nut-monitor would be DOA on reboot on noi, the big rack server (not the RPi4) after it was started; starting it by hand worked OK subsequently. I had to make those changes to have it come up from boot. noi doesn't have the Fedora package installed so presumably it's using the git service files.

> test in-vivo a PR to do handle this better.

I'm going to digest all this and have a think about what to do next, if anything. I didn't expect the hardware to be like it is. Unless I missed the point, I don't think it can actually be made to work properly from the NUT side alone: the communication seems shot once it is OFF on the hot side, and we have to come around via the button as a side channel, so I don't know how load.on can work (again, unless I missed the point and got a wrong idea somewhere). Initially I thought the trouble was around the NUT driver, but now I think it may not be able to work right without a hardware hack. I guess if you never try to control it programmatically, but let it run during an outage until the batteries are exhausted, it may work just fine for that; or maybe people go around and press the button by hand to bring it back after an outage and don't really mind. But that won't do for my use case.

I appreciate this is a difficult space to provide FOSS for, for the reasons out of anyone's control that I described in the article. The code made me want to unify it, and there are signs the authors feel the same. It's not that there's a lack of experience; the situation simply resists any consolidation that could be meaningfully tested.

jimklimov commented 2 years ago

For a bit of a sanity check: was the nut-monitor service enabled when it did not start? And was it wanted by something that starts with the system (nut targets wanted by the multi-user target, or a similar chain)?

lws-team commented 2 years ago

Yes, nut-monitor.service was enabled on the client. systemctl status nut-monitor showed that it had not tried to start, but starting it by hand always worked, which led me to After. What I ended up with that worked OK was this:

[Unit]
Description=Network UPS Tools - power device monitor and shutdown controller
After=local-fs.target network.target
Wants=nut-server.service
PartOf=nut.target

[Service]
EnvironmentFile=-/usr/local/ups/etc/nut.conf
SyslogIdentifier=%N
ExecStart=/usr/local/ups/sbin/upsmon -F
ExecReload=/usr/local/ups/sbin/upsmon -c reload
PIDFile=/var/run/upsmon.pid

[Install]
WantedBy=nut.target
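
To see at a glance what the unit above depends on versus what it is ordered after, a small Python sketch with the standard configparser module can read the [Unit] section. The helper name unit_values is made up, and the parsing is a simplification: systemd accumulates repeated keys, while configparser here keeps only the last one.

```python
import configparser


def unit_values(unit_text, key):
    """Return the space-separated values of one [Unit] key, or []."""
    # interpolation=None keeps specifiers such as %N from being expanded.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # systemd keys are case-sensitive
    parser.read_string(unit_text)
    return parser.get("Unit", key, fallback="").split()


unit = """\
[Unit]
After=local-fs.target network.target
Wants=nut-server.service
PartOf=nut.target
"""

print("nut-server.service" in unit_values(unit, "After"))  # False
print("nut-server.service" in unit_values(unit, "Wants"))  # True
```

So this variant still pulls in nut-server.service via Wants, but no longer orders nut-monitor after it.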

jimklimov commented 2 years ago

Thanks, that's a good concern to investigate some more before cutting a release, then :)

jimklimov commented 2 years ago

Can't confirm on Rocky Linux 8.5 and Ubuntu 20.04:

et voila:

(Note: first attempts did not move away the nut-server.service definition so systemd would forget about its existence; the nut-client.service still was able to start :) )

lws-team commented 2 years ago

Well, thanks for looking at that. That particular NUT-client box is actually using Fedora 35 but I don't think it would make much difference.

The big problem for me is around waking that UPS device programmatically after shutdown.stayoff, sending C doesn't do it... I can do it but it requires the button hack. That can be a problem with the box, like the random restarting. At any rate with the hack I can go on.