networkupstools / nut

The Network UPS Tools repository. UPS management protocol Informational RFC 9271 published by IETF at https://www.rfc-editor.org/info/rfc9271 Please star NUT on GitHub, this helps with sponsorships!
https://networkupstools.org/

Smart UPS snmp-ups - delayed shutdown (reboot) #1956

Open LudekH2 opened 1 year ago

LudekH2 commented 1 year ago

I wonder what command the driver sends to the UPS after a shutdown request. The shutdown sequence seems to be: "/etc/init.d/nut-server poweroff", which translates to "/sbin/upsdrvctl shutdown", which in turn translates to "/lib/nut/snmp-ups -a APCUPSname -k".

I run the nut-server in a VM on an ESXi host that is being shut down, and I need to trigger a delayed UPS shutdown/reboot so that ESXi has time to finish its shutdown before the power is cut, as described here: https://community.se.com/t5/APC-UPS-Data-Center-Enterprise/Power-on-after-shutdown/td-p/344648 " ... the upsAdvControlRebootShutdownUps trap (.1.3.6.1.4.1.318.1.1.1.6.2.2) which does exactly what I want. It will go to sleep after configured Shutdown Delay which will power down the equipment and after some minutes (don't know where to configure that yet) it will power off the UPS. When the power returns, it switches on again."

Here is the respective OID description: https://oidref.com/1.3.6.1.4.1.318.1.1.1.6.2.2

Regards, Ludek

jimklimov commented 1 year ago

Hello, the exact answer here depends on the driver NUT uses for the particular medium and protocol (here snmp-ups), and on the device for protocol dialects and other minute details (here I assume APC, so one of the MIBs relevant to that; for a UPS, likely https://github.com/networkupstools/nut/blob/master/drivers/apc-mib.c or the fallback https://github.com/networkupstools/nut/blob/master/drivers/ietf-mib.c).

In those *-mib.c files you can review the mapping of NUT values and commands to the OIDs involved. For value queries there may be several definitions, and the first one listed that yields a reply "wins"; this allows compact handling of nuances between mostly identical devices that differ a bit due to evolution, brand acquisition, etc.

For commands see entries with SU_TYPE_CMD flag; generally (in different drivers) there are several similar options depending on whether you want the device to turn off and stay off, or return/powercycle if "wall power" is/becomes available; immediately or with delay, etc. - subject to support by actual hardware. The off/on delays involved are usually handled by a setting (RW value) so the actual command refers to that ("please turn off after that whatever timeout we requested earlier").
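To see which of these commands a given driver actually exposes at run time, the stock NUT client upscmd can list them. Below is a minimal sketch that filters that list down to the shutdown-related families. The function name and the UPSCMD_BIN override are hypothetical (added only so the filter can be tried without a live server), and the "name - description" line format is what upscmd -l normally prints.

```shell
#!/bin/sh
# Sketch: list the shutdown-related instant commands a running driver exposes.
# UPSCMD_BIN is a hypothetical override for dry runs; by default the stock
# upscmd client from NUT is used.
UPSCMD_BIN="${UPSCMD_BIN:-upscmd}"

list_shutdown_cmds() {
    # $1 is the UPS identifier, e.g. APCUPSname@localhost
    # "upscmd -l" prints one "name - description" line per supported command;
    # keep only the shutdown.* and load.* families.
    "$UPSCMD_BIN" -l "$1" | grep -E '^(shutdown|load)\.'
}
```

Something like list_shutdown_cmds APCUPSname@localhost would then show whether shutdown.return, shutdown.stayoff and friends are available; running upsrw with the same UPS identifier lists the writable settings, which is where any off/on delay values would appear if the device supports them.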

jimklimov commented 1 year ago

Having said all of the above, I see that the OID you suggested above is in fact mentioned in apc-mib.c as APCC_OID_REBOOT, but has been commented out for the past 19 years :\

https://github.com/networkupstools/nut/blame/master/drivers/apc-mib.c#L287

Maybe @aquette can remember why that was not enabled; OTOH I'd suppose lack of testing to be sure it works. Currently a "shutdown.return" definition points to ".1.3.6.1.4.1.318.1.1.1.6.1.1.0" instead.

LudekH2 commented 1 year ago

Many thanks for the reply. The OID ".1.3.6.1.4.1.318.1.1.1.6.1.1.0" seems to translate to upsBasicControlConserveBattery:

https://oidref.com/1.3.6.1.4.1.318.1.1.1.6.1.1.0
https://networkupstools.org/protocols/snmp/APC-Powernet.pdf

"Cause a UPS running on battery to turn off its outlets to conserve battery runtime and then wait in “sleep mode” until acceptable input power returns.

• noTurnOffUps (1): The value always returned for a GET. Setting this value has no effect.
• turnOffUpsToConserveBattery (2): The UPS, if running on battery, waits in “sleep mode” until acceptable input power returns. If the UPS is not on battery, a badValue error is returned."

That gives me an idea of what I can expect, though it is not ideal. The ESXi host (and perhaps the NUT server; I am not sure when the /etc/vmware-tools/scripts/poweroff-vm-default.d/ scripts are triggered) will not shut down gracefully.

I will wait to see if @aquette comments.
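The practical consequence of that MIB text is that a set of value 2 is only accepted while the UPS is on battery. A minimal sketch of that precondition, assuming the UPS state is read from a NUT-style ups.status string in which the word OB means "on battery" (the function name is hypothetical):

```shell
# Sketch: would setting upsBasicControlConserveBattery to 2 be accepted?
# Per the quoted MIB text, only while the UPS is running on battery.
# $1 is a NUT-style ups.status string such as "OL" or "OB LB".
conserve_battery_would_apply() {
    for flag in $1; do
        # OB = on battery; any other status word is ignored here
        [ "$flag" = "OB" ] && return 0
    done
    return 1
}
```

For example, conserve_battery_would_apply "OB LB" succeeds, while conserve_battery_would_apply "OL" fails, which corresponds to the badValue case described in the MIB.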

Best regards Ludek


jimklimov commented 1 year ago

Just in case: what do you consider a graceful shutdown here? Who is your NUT server (with the driver, telling the UPS to turn off): the hypervisor or a VM? If the former (and if properly integrated with the OS), it should send that command as late in the shutdown as it would normally tell its ATX PSU to power off, which means the VMs are long down, filesystems are unmounted or R/O, and the shutdown ends gracefully.

If the UPS does honour a delay setting, so actually cuts power say 10 or 60 sec after the command - so much the better. Then a VM could do it, as long as it is last in queue to shut down when the host goes down.

LudekH2 commented 1 year ago

Thanks again for your input. The nut-server with drivers and the master client (monitor) are running in a VM. The slave NUT client service inside the ESXi host is supposed to tell ESXi to shut down when FSD is raised by the NUT server. ESXi will shut down the VMs, with the NUT server VM last in order. When the NUT server VM gets the shutdown command from ESXi, the vSphere client executes its shutdown script. This script looks for the FSD flag file and, if it is found, sends the shutdown command to the UPS.

I expect this switches the UPS output off immediately, leaving no room for ESXi to finish its shutdown, because I have no option for the UPS command "reboot/sleep with delay" mentioned earlier. Unfortunately, I can't test it much, because it is a production environment. Also, the shutdown event of the NUT server's master monitor does nothing, since the shutdown command will come from the ESXi host going down, so we wait for it.

Best regards, Ludek.


jimklimov commented 1 year ago

Well, you can try custom-building NUT (essentially to use the built snmp-ups binary in the shutdown sequence to kill power; the driver daemons would be stopped by then and the power-killers are spawned separately), if you can schedule a maintenance window. There you would be able to test different OID(s?).

Note however that when an FSD event happens, the "primary" (ex-"master") upsmon only waits some time for other "secondary" upsmon instances to log off from the data server - meaning they have shut down. After all other clients are away, OR after a timeout expires regardless of the remaining other clients, the primary begins its own shutdown and kills power as part of that, where possible. I'm afraid the ESXi host bringing down all its VMs can take a while, so the NUT VM would lose patience and cut power before the host is ready to let go.
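The "wait for the others, but not forever" rule in the paragraph above can be sketched as a small loop. This is only an illustration of the logic, not upsmon's actual code: the function name and the count_cmd callback are hypothetical, while the real upsmon setting that bounds this wait is HOSTSYNC in upsmon.conf.

```shell
# Sketch of the "wait for secondaries, but not forever" rule.
# $1 (count_cmd): hypothetical command printing how many other clients are
#                 still logged in to the data server.
# $2 (timeout):   maximum number of seconds to wait.
wait_for_secondaries() {
    count_cmd=$1
    timeout=$2
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        remaining=$("$count_cmd")
        if [ "$remaining" -eq 0 ]; then
            echo "all secondaries gone after ${waited}s"
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # Timeout expired: the primary goes on with its own shutdown regardless
    echo "timeout after ${timeout}s, proceeding anyway"
    return 1
}
```

The second branch is the one that matters here: if the ESXi host needs longer than the timeout to bring down its VMs, the primary proceeds while the host is still logged in.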

LudekH2 commented 1 year ago

Jim, I greatly appreciate your input. I will try the custom-built snmp-ups, though without lab UPS hardware for testing it is going to be difficult to verify. I am not very experienced at compiling code, but I can certainly give it a try. Do I understand you correctly that I could place my custom-built snmp-ups executable in a separate location, to be called only by the shutdown script, and would not need to touch the original binary "/lib/nut/snmp-ups"?

Regarding your last paragraph (the master upsmon losing patience), I intend to solve this with the following line in upsmon.conf on the master NUT VM: SHUTDOWNCMD "/usr/bin/logger \"Should run /sbin/shutdown -H now, but waiting for ESXi host instead\"" I hope that does it. Credits to Oleg Semyonov here: https://serverfault.com/questions/462993/vmware-esxi-shutdown-triggered-by-apc-ups-connected-via-usb

Also from Oleg Semyonov I have the idea of using an /etc/vmware-tools/scripts/poweroff-vm-default.d/ script (here: https://pastebin.com/KkEeanK1) to call "/etc/init.d/ups-monitor poweroff" as the final UPS command. Only later did I notice the file /lib/systemd/system-shutdown/nutshutdown provided by the NUT package, which should run /sbin/upsdrvctl shutdown during OS shutdown if the FSD flag is raised. I do not know which of the two locations is better; I have disabled the nutshutdown script for now. Both should do the same thing, but the VMware script only runs for ESXi-initiated shutdowns, which seems somewhat better in my config. On the other hand, if nutshutdown is called later in the shutdown process, it could be an advantage in a situation where the UPS shutdown delay isn't working. The nutshutdown script on Debian contains a single line: "/sbin/upsmon -K >/dev/null 2>&1 && /sbin/upsdrvctl shutdown"
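Whichever location ends up hosting it, the guarded power-kill step boils down to the same check-then-act pattern as the Debian one-liner. A minimal sketch, with the function name and the two *_BIN overrides being hypothetical additions so the logic can be dry-run with stand-ins instead of the real binaries:

```shell
# Sketch of the guarded power-kill step, parameterized for dry runs.
# The defaults are the packaged paths mentioned in this thread.
UPSMON_BIN="${UPSMON_BIN:-/sbin/upsmon}"
UPSDRVCTL_BIN="${UPSDRVCTL_BIN:-/sbin/upsdrvctl}"

kill_ups_power_if_flagged() {
    # "upsmon -K" exits successfully only if the power-down flag is set,
    # i.e. this shutdown was caused by a power event.
    if "$UPSMON_BIN" -K >/dev/null 2>&1; then
        echo "power-down flag found, telling the UPS to shut down"
        "$UPSDRVCTL_BIN" shutdown
    else
        echo "no power-down flag, leaving the UPS alone"
    fi
}
```

Calling kill_ups_power_if_flagged from the VMware poweroff script or from a system-shutdown hook would behave the same way; the difference between the two locations is only how late in the shutdown it runs.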

Best regards, Ludek


jimklimov commented 1 year ago

Yes, at least I hope so :) The new snmp-ups binary would refer to lib(net)snmp same as the packaged build, but otherwise is independent of the packaged build.

It would help to configure the custom build with the same run-time user account and configuration location, so it would find the same device definition to manage. For this, https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests may be useful. However, you may want to ensure the binaries go to a different location than the packaged build (e.g. to the default /usr/local/ups/bin/...), or make sure you do not run sudo make install and overwrite anything in the system. It being a VM, snapshots are a good idea :)

Neutering the SHUTDOWNCMD here is a good idea. I believe that when ESXi tells this host to turn off, NUT code would still see the "killpower" flag file and cause the activity, but that would be timed to when the hypervisor says that this is the last machine standing, so it sounds reasonable.

The logical chain here is that in olden days, everything was managed by scripts and so end-users tweaked their SYSV init-scripts to handle shutdown and call their driver(s) by name to command the UPS to turn off, in case /etc/killpower or similar file existed. A starting init-script made sure this file was deleted.

Later (still ~20 years ago) the upsdrvctl came along (and recently also upsdrvsvcctl for systemd/SMF integration) to process devices by name (or all devices) from ups.conf and so the same scripted code would in effect call the drivers and their configurations as defined on particular end-user machine.

In your case with a custom snmp-ups binary for the shutdown experiments, you may have to change those scripts to call this binary in legacy style, instead of upsdrvctl (which would look in packaged paths to call a driver).
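A "legacy style" call of that kind could look like the sketch below. The install path and the UPS name are assumptions taken from examples earlier in this thread (the default /usr/local/ups prefix and the APCUPSname section), and the function name is hypothetical; adjust all three to the actual build and ups.conf.

```shell
# Sketch: call a custom-built driver directly to kill power, bypassing upsdrvctl.
# CUSTOM_DRIVER and UPS_NAME are assumptions; adjust to the real build prefix
# and to the section name used in ups.conf.
CUSTOM_DRIVER="${CUSTOM_DRIVER:-/usr/local/ups/bin/snmp-ups}"
UPS_NAME="${UPS_NAME:-APCUPSname}"

custom_driver_killpower() {
    # Refuse to continue if the custom binary cannot be found or executed
    if ! command -v "$CUSTOM_DRIVER" >/dev/null 2>&1; then
        echo "custom driver not found at $CUSTOM_DRIVER" >&2
        return 1
    fi
    # -a selects the ups.conf section, -k asks the driver to send the
    # shutdown command to the device
    "$CUSTOM_DRIVER" -a "$UPS_NAME" -k
}
```

In a shutdown script this would take the place of the upsdrvctl shutdown call, still guarded by the same power-down flag check as before.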

jimklimov commented 1 year ago

Thanks for the reference to that SO answer and scripts, posting to Wiki :) https://github.com/networkupstools/nut/wiki/NUT-and-VMWare-(ESXi)

jimklimov commented 1 year ago

And with #1961 we now have newer VIB package referenced as well as the recipe to build them.