networkupstools / nut

The Network UPS Tools repository. UPS management protocol Informational RFC 9271 published by IETF at https://www.rfc-editor.org/info/rfc9271 Please star NUT on GitHub, this helps with sponsorships!
https://networkupstools.org/

Missing pid-file upsmon.pid and no browsing possible #1721

Closed ullix closed 1 year ago

ullix commented 1 year ago

So far I am failing to install nut 2.8.0 on my "Linux Mint 21" computer. This may be one of the issues, seen when trying a test shutdown:

$ upsmon -c fsd
Network UPS Tools upsmon 2.8.0
fopen /run/nut/upsmon.pid: No such file or directory

and there is no shutdown. The content of /run/nut is this:

image

Something else is working, though, so there is hope:

$ upsc eaton@10.0.0.51 ups.model
Init SSL without certificate database
Ellipse PRO 650 

Trying to access it with a web browser has failed completely. Apache is installed, and when using the URL http://10.0.0.51:3493/, it takes 1 ... 2 minutes(!) before I get back:

ERR INVALID-ARGUMENT
ERR UNKNOWN-COMMAND
ERR UNKNOWN-COMMAND
ERR UNKNOWN-COMMAND
ERR UNKNOWN-COMMAND
ERR UNKNOWN-COMMAND
ERR UNKNOWN-COMMAND
ERR UNKNOWN-COMMAND
ERR UNKNOWN-COMMAND
ERR UNKNOWN-COMMAND

Same outcome when I try e.g. 10.0.0.51:3493/upsstats.cgi or variations of it.

jimklimov commented 1 year ago

Ugh, did you read the docs on NUT architecture and setup? :) Some hints:

ullix commented 1 year ago

Gosh, I think I read until my eyes fell off. I must say I find it poorly written in many aspects, and not giving some kind of summary for a really interested user, and I see no tools to tell me whatever may be missing.

You are trying to continue this by providing "hints"! Hell, why can't you just tell me what commands I need for my browser? Please!

I know what the fsd flag is meant for. I just see no obvious way to test my installation for correctness, and GUIDANCE what to do about it.

I had nut running years ago, and as it went flawless, I forgot about the details. Now I need to reestablish it on new systems. So, help is needed, not hints. Thank you.

jimklimov commented 1 year ago

Sorry about that, I mostly post answers while commuting, so I don't have good access to docs or to a system for modeling the setup, just memory. Hence the "hints". For summaries there are many blogs that do a decent job. PRs are also welcome: uncomfortable docs are really something that passers-by notice better than people who have seen them for years and "got used" to them :\ Also the harder-core techie style apparently served well 15-20 years ago when most of this was written, and today more people want words minced on a platter.

And to "browse" you do need to set up a web-server, enable CGI support, and make it use NUT's *.cgi programs to serve them on the web. Most of this is non-NUT business and depends on the web server you use (apache, nginx, etc). Some of this has examples in https://github.com/networkupstools/nut/blob/master/data/html/README

The NUT-specific part is to configure the /etc/nut/upsd.users file, hosts.conf, and upsset.conf, as listed at e.g. https://networkupstools.org/docs/user-manual.chunked/ar01s02.html#_cgi_programs

"Testing for correctness" generally goes to set up the NUT driver and data server on same system (you've got that working according to upsc responding in original post), and set up upsd.users and upsmon.conf to play together for access. Then you temporarily set SHUTDOWNCMD to touch some file (so you know upsmon tried to do its job), pull the plug and wait for low-battery status and check that the fake-shutdown was triggered.

Then with fsd you can try a real one, to check that your UPS accepts a command to cut the power. Depending on the OS you may need to hook into late shutdown scripts to avoid a "power race condition" (just reboot if wall power returned while you were shutting down, or stay up to drain the UPS and so power-cycle it if it cannot be reliably told to power off).
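
A minimal sketch of checking the fake-shutdown test described above, assuming SHUTDOWNCMD in upsmon.conf was temporarily set to touch a marker file. The marker path and the function names are hypothetical and not part of NUT:

```shell
# Hypothetical marker path; it must match the temporary
# SHUTDOWNCMD "touch <path>" line in upsmon.conf.
MARKER="${MARKER:-/tmp/nut-fake-shutdown.marker}"

# Succeeds if upsmon ran the fake SHUTDOWNCMD and created the marker.
fake_shutdown_fired() {
    [ -e "$1" ]
}

# Prints a short verdict for the given marker path.
report_fake_shutdown() {
    if fake_shutdown_fired "$1"; then
        echo "fired"
    else
        echo "not fired"
    fi
}

# Usage, after pulling the plug and waiting for low battery:
#   report_fake_shutdown "$MARKER"
```

Remove the marker before each test run, otherwise a stale file from an earlier attempt would read as a success.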

ullix commented 1 year ago

Good Lord, was it that complicated to use a web browser some years ago? I did use it, but don't recollect any need for so much action. I think I'll pass on it until the rest is solved.

The rest is this:

~$ sudo systemctl restart nut-server
~$ sudo systemctl restart nut-monitor

~$ sudo upsd -c reload
Network UPS Tools upsd 2.8.0
fopen /run/nut/upsd.pid: No such file or directory

~$ sudo upsmon -c reload
Network UPS Tools upsmon 2.8.0
fopen /run/nut/upsmon.pid: No such file or directory

~$ sudo upsmon -c fsd
Network UPS Tools upsmon 2.8.0
fopen /run/nut/upsmon.pid: No such file or directory

~$ upsc eaton@10.0.0.51 ups.model
Init SSL without certificate database
Ellipse PRO 650 

This is the complete output of the commands given in that order. No error message of any kind after the systemctl calls.

But upsd and upsmon fail, yet upsc works correctly. What gives?

P.S. Thank you for the thorough answer!

jimklimov commented 1 year ago

Looking at recent discussions about Fedora's failed packaging, largely in the area of PID-file path setting, I wonder if your setup got similarly messed up (and notably there was a problem fixed in master branch just recently about configuring systemd-tmpfiles for similar issues) :\

Do you use a custom build of NUT or one packaged by distro?

Can you please check if:

My guess here is that either the programs trying to "command" their already running daemon instances are looking for a PID file in the wrong location (several reasons for "wrong" are possible), or that the daemons are trying to write into an absent directory (e.g. systemd-tmpfiles was never configured for NUT) or one they do not have rights to write into (may be related to that recently fixed issue).

It could help to add a few -D arguments for higher debug verbosities (up to 6) to track where the programs and daemons try to write their PID files, etc. Since NUT 2.8.0 you can also use a debug_min config-file option for drivers and upsd, to avoid changing init-scripts or systemd units; I am not sure OTOH if there was an equivalent made for upsmon.

ullix commented 1 year ago

I am using Linux Mint-Mate 21. All nut packages from its repositories.

The attached file should have all the answers to your questions.

qnuts.txt

jimklimov commented 1 year ago

Thanks.

the word "pip" is not present in any of these files by case insensitive search

FWIW, "pid" not "pip" ;)

Seeing this:

/run/nut:
total 4
drwxrwx---  2 root nut    80 Nov 28 13:46 .
drwxr-xr-x 42 root root 1340 Nov 30 08:48 ..
srw-rw----  1 nut  nut     0 Nov 28 13:46 usbhid-ups-eaton
-rw-r--r--  1 nut  nut     5 Nov 28 13:46 usbhid-ups-eaton.pid

...it seems the default STATEPATH and ALTPIDPATH built into the package are (/var?)/run/nut, and the directory is owned correctly. So at least upsd (nut-server.service if systemd is involved) should have left its PID file here, which suggests it may have failed to start. If you first started it before configuring ups.conf with device sections, it could have refused to run (there is an option to start without devices and wait for "reload" to slurp new configurations, but it is not the default and is intended for automated mass-monitoring systems per #766).

jimklimov commented 1 year ago

Good point raised in a different issue's discussion: lack of PID files for signaling may be due to "foregrounding" of the daemons under systemd units, to avoid "extra forking" and more difficult child-process tracking - newly with 2.8.0 release. In their logs:

Nov 30 19:06:53 mythtv.billgee.local nut-server[28918]: Running as foreground process, not saving a PID file
Nov 30 19:06:53 mythtv.billgee.local upsd[28918]: Running as foreground process, not saving a PID file

This might explain the successful unit startup vs. lack of PID files for upsd (nut-server). There is actually support for an upsd -FF (double F) option just for that, to save the PID file anyway. Not sure which one the systemd units in that packaging use.

As for upsmon however, it should always save the PID file (note: for unprivileged child half, if it is running in a dual-process model -- splitting root bit for shutdowns vs unprivileged for most of lifetime).

jimklimov commented 1 year ago

And looking up some more, the systemd unit integration actually says as much: https://github.com/networkupstools/nut/blob/master/scripts/systemd/nut-server.service.in#L23-L26

So systemctl reload nut-server should do the right thing (using $MAINPID as tracked by systemd) and would be idiomatically correct - managing a service completely by one framework, without confusing impacts outside its control.
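
The choice between the two reload paths can be sketched as below. This is only an illustration: the function names are hypothetical, and the /run/nut default is the PID directory seen in this thread, which may differ in other packaging:

```shell
# Decide how a running NUT daemon can be asked to reload.
# Prints "pidfile" if a non-empty PID file exists, else "systemd".
reload_method() {
    daemon="$1"                   # e.g. upsd or upsmon
    piddir="${2:-/run/nut}"       # PID directory as seen in this thread
    if [ -s "$piddir/$daemon.pid" ]; then
        echo "pidfile"            # "<daemon> -c reload" can find the process
    else
        echo "systemd"            # let systemd signal its tracked $MAINPID
    fi
}

# Reload through whichever path is available (needs root in practice).
reload_nut_daemon() {
    case "$(reload_method "$1" "${3:-/run/nut}")" in
        pidfile) "$1" -c reload ;;
        systemd) systemctl reload "$2" ;;
    esac
}

# Usage:
#   reload_nut_daemon upsd nut-server
#   reload_nut_daemon upsmon nut-monitor
```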

ullix commented 1 year ago

;-) sorry for confusing with the typo. I did in fact search for pid, and verified again it is not there. Pip is found in a few pipe*.

The /var/run/ exists, but is a link to /run/.

I am not fully following your arguments. The nut.conf file has this section

#~ ALLOW_NO_DEVICE=true
#~ export ALLOW_NO_DEVICE

which I had un-commented at some point - as it sounded benign - but later re-commented. If this were the problem maker, could I recover by un-commenting again? Or what else would I have to do?

ullix commented 1 year ago

Tried it:

~$ sudo systemctl reload nut-server

~$ sudo upsmon -DDDD -c reload
Network UPS Tools upsmon 2.8.0
   0.000000 fopen /run/nut/upsmon.pid: No such file or directory

~$ sudo upsd -DDDD -c reload
Network UPS Tools upsd 2.8.0
   0.000000 fopen /run/nut/upsd.pid: No such file or directory

no change.

jimklimov commented 1 year ago

As long as you have the device configured in ups.conf (the eaton entry per earlier posts), this ALLOW_NO_DEVICE should not be needed (upsd may start because it already knows devices to represent), nor would it hurt. It just would not kick in :)

I was rather wondering if it could be a reason for nut-server to not start up initially (might be among reasons for why you had no PID file).

jimklimov commented 1 year ago

What did systemctl status nut-server or a possibly more detailed journalctl -lu nut-server show after systemctl reload?

Note: with that systemd unit remaining as is, the PID file won't appear and upsd -c reload verbatim would still not work. The unit would use upsd -c reload -P $PID arguments to tell that daemon process to reload. You can change to start upsd -FF in the nut-server.service definition and systemctl daemon-reload; systemctl restart nut-server to get that into effect and have it save a PID file too.

For that matter, do systemctl status nut-monitor or journalctl -lu nut-monitor expose any faults?

ullix commented 1 year ago

Uiiih, lots of failure reports, see the attached file. I notice in line 14: user monuser not found. This user thing has confused me a lot. I expected monuser to be known only to nut, or is it a Linux user? And I think a Linux user 'nut' was also used somewhere.

It also didn't help that this username is switched between monuser and upsmon, like in file upsmon.conf (line ~73):

#   [upsmon]
#       password  = blah
#       upsmon primary  # (or secondary)

To me this also looks like [upsmon] defines the name, but in upsmon primary it defines something else?

Anyway, my settings are:

in upsmon.conf:
RUN_AS_USER monuser
MONITOR eaton@10.0.0.51 1 monuser mypass primary

in upsd.users:
[monuser]
   password  = mypass
   upsmon primary
   actions = SET
   instcmds = ALL

qstatus-monitor.txt

jimklimov commented 1 year ago

That's on to something... So indeed, the RUN_AS_USER refers to the OS (Linux) account, e.g. nut who owns those directories for PID files. Apparently upsmon fails to become_user() for the unprivileged part and so fails to start => no PID file and no daemon.

The monuser matched in both upsd.users section name AND in MONITOR line of upsmon.conf is the NUT-defined account for monitoring with allowed role upsmon primary (so persistent connection and certain messages that may be exchanged).
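
The two kinds of "user" can be cross-checked with a rough sketch like the following. The function names are hypothetical, and the parsing assumes the simple one-directive-per-line layout shown in this thread rather than NUT's full config grammar:

```shell
# Print the last uncommented RUN_AS_USER value from an upsmon.conf.
# This must be an OS account.
run_as_user_from_conf() {
    awk '$1 == "RUN_AS_USER" { user = $2 } END { if (user != "") print user }' "$1"
}

# Succeeds if the name is a known OS account.
os_user_exists() {
    id -u "$1" >/dev/null 2>&1
}

# Print the NUT usernames from MONITOR lines:
#   MONITOR <system> <powervalue> <username> <password> <type>
monitor_users() {
    awk '$1 == "MONITOR" { print $4 }' "$1"
}

# Succeeds if upsd.users ($1) has a [name] section for the NUT user ($2).
upsd_has_user() {
    awk -v want="[$2]" '
        { gsub(/^[ \t]+|[ \t]+$/, "") }
        $0 == want { found = 1 }
        END { exit !found }
    ' "$1"
}

# Usage (paths as used in this thread):
#   os_user_exists "$(run_as_user_from_conf /etc/nut/upsmon.conf)" || echo "no such OS user"
#   for u in $(monitor_users /etc/nut/upsmon.conf); do
#       upsd_has_user /etc/nut/upsd.users "$u" || echo "NUT user $u missing in upsd.users"
#   done
```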

I'll check if the docs suggest mismatched example names, that would be unfortunate.

jimklimov commented 1 year ago

Looking at the status log above, for the past day the nut-monitor was not running (because "monuser not found"), and for some attempts before that (since Nov 26 18:23:06). Previously it did see the changes in UPS state, so the data-server and driver also ran. And the FSD test at Nov 26 17:55:42 apparently succeeded, with a real shutdown (of the OS at least).

ullix commented 1 year ago

Snailing forward:

~$ sudo systemctl restart nut-server
~$ sudo systemctl restart nut-monitor
~$ sudo upsd -DDDD -c reload
Network UPS Tools upsd 2.8.0
   0.000000 fopen /run/nut/upsd.pid: No such file or directory

~$ systemctl status nut-monitor
● nut-monitor.service - Network UPS Tools - power device monitor and shutdown controller
     Loaded: loaded (/lib/systemd/system/nut-monitor.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2022-12-01 11:47:56 CET; 5min ago
   Main PID: 76225 (upsmon)
      Tasks: 2 (limit: 8742)
     Memory: 2.5M
        CPU: 28ms
     CGroup: /system.slice/nut-monitor.service
             ├─76225 /lib/nut/upsmon -F
             └─76226 /lib/nut/upsmon -F

Dec 01 11:47:56 ullix-ProLiant-MicroServer-Gen10 systemd[1]: Started Network UPS Tools - power device monitor and shutdown controller.
Dec 01 11:47:56 ullix-ProLiant-MicroServer-Gen10 nut-monitor[76225]: fopen /run/nut/upsmon.pid: No such file or directory
Dec 01 11:47:56 ullix-ProLiant-MicroServer-Gen10 nut-monitor[76225]: Could not find PID file to see if previous upsmon instance is already running!
Dec 01 11:47:56 ullix-ProLiant-MicroServer-Gen10 nut-monitor[76225]: UPS: eaton@10.0.0.51 (primary) (power value 1)
Dec 01 11:47:56 ullix-ProLiant-MicroServer-Gen10 nut-monitor[76225]: Using power down flag file /etc/killpower
Dec 01 11:47:56 ullix-ProLiant-MicroServer-Gen10 nut-monitor[76226]: Init SSL without certificate database

~$ journalctl -lu nut-monitor
...
Dec 01 11:47:56 ullix-ProLiant-MicroServer-Gen10 systemd[1]: Started Network UPS Tools - power device monitor and shutdown controller.
Dec 01 11:47:56 ullix-ProLiant-MicroServer-Gen10 nut-monitor[76225]: fopen /run/nut/upsmon.pid: No such file or directory
Dec 01 11:47:56 ullix-ProLiant-MicroServer-Gen10 nut-monitor[76225]: Could not find PID file to see if previous upsmon instance is already running!
Dec 01 11:47:56 ullix-ProLiant-MicroServer-Gen10 nut-monitor[76225]: UPS: eaton@10.0.0.51 (primary) (power value 1)
Dec 01 11:47:56 ullix-ProLiant-MicroServer-Gen10 nut-monitor[76225]: Using power down flag file /etc/killpower
Dec 01 11:47:56 ullix-ProLiant-MicroServer-Gen10 nut-monitor[76226]: Init SSL without certificate database

ullix commented 1 year ago

More hope: upsmon did find the pid file, but upsd did not?

~$ sudo upsmon -DDDD -c reload
Network UPS Tools upsmon 2.8.0

~$ sudo upsd -DDDD -c reload
Network UPS Tools upsd 2.8.0
   0.000000 fopen /run/nut/upsd.pid: No such file or directory

jimklimov commented 1 year ago

Each daemon has its own PID file.

Now that upsmon started successfully (RUN_AS_USER fixed back?) it wrote the file and so can receive commands at the PID recorded there.

The upsd would not record the PID file unless you change nut-server.service to ExecStart=.../upsd -FF -- and you might never need to change that at all, if you would systemctl reload nut-server instead. Not that you should do that often...

If the new nut-driver-enumerator.(path|service) is enabled on your system, edits of ups.conf would be picked up to dynamically (re-)define nut-driver@eaton.service ("eaton" being the section name), and (re-)start/reload needed services.

jimklimov commented 1 year ago

I've also posted PR #1724 where I first thought to "fix" the systemd units, but ended up rather documenting why the current situation makes sense (for daemons managed by the OS service framework), and hopefully made it more debuggable in the field, with more actionable error messages.

ullix commented 1 year ago

Yes, fixed back; I have now:

in upsmon.conf:
RUN_AS_USER nut

but still failure:

$ systemctl reload nut-server

~$ sudo upsd -DDDD -c reload
Network UPS Tools upsd 2.8.0
   0.000000 fopen /run/nut/upsd.pid: No such file or directory

~$ sudo upsmon -DDDD -c reload
Network UPS Tools upsmon 2.8.0

jimklimov commented 1 year ago

I've posted probable reasons. I don't have info if your nut-server service currently runs the daemon as upsd -FF and so did/tried-to save a PID file and we now fail to use it, or never tried to save it (as upsd -F by systemd unit default).

The lines logged in systemctl status nut-server may show whether the preceding systemctl reload nut-server reported any errors (using the $MAINPID number directly, without the file).

Also don't know if you need to reload these daemons this way all day long in practice (vs. experimentation), so whether the remaining point is moot? ;)

ullix commented 1 year ago

I have no idea what you mean. With this upsd -FF the results are:

~$ sudo upsd -FF
Network UPS Tools upsd 2.8.0
fopen /run/nut/upsd.pid: No such file or directory
Could not find PID file '/run/nut/upsd.pid' to see if previous upsd instance is already running!

not listening on 10.0.0.51 port 3493
no listening interface available

jimklimov commented 1 year ago

You said you start it as nut-server systemd unit. That in turn starts upsd with some arguments (I suppose -F so no PID file). Now above you started another copy of the daemon, which failed to listen on the already-occupied port 3493 I guess.

ullix commented 1 year ago

Ok, I get this. But if the -FF option is important to write a pid file then how can I start upsd with double F?

Or, the reverse: when you programmers decided to make single F the default start, which does NOT write a pid file, isn't it reasonable to assume that this pid file is not needed at all? But then why did you make upsd continue to complain about missing something that was intentionally left out?

jimklimov commented 1 year ago

How ... ?

That is a non-NUT question but rather one of managing your OS of choice and its ways of starting services ;P

Technically, for a quick fix you can look up the systemd unit configuration file used in your system and edit it directly (*):

$ systemctl status nut-server
...
     Loaded: loaded (/lib/systemd/system/nut-server.service; enabled; vendor preset: enabled)
...

$ sudo vi /lib/systemd/system/nut-server.service
> ExecStart=...

$ sudo systemctl daemon-reload
$ sudo systemctl restart nut-server

(*) Note that this quick-edit is practical, but prone to overwrites by packaging when your OS is upgraded and new NUT build lands. Systemd supports several locations to customize "drop-in" amendments to unit definitions (by packaging, site-local, run-time, etc.) so you could use that:

:; sudo mkdir -p /etc/systemd/system/nut-server.service.d
:; sudo vi /etc/systemd/system/nut-server.service.d/use-pid.conf
[Service]
# Clear the packaged ExecStart first, otherwise systemd sees two of them
ExecStart=
# Copy path from your packaged definition
ExecStart=.../upsd -FF

:; sudo systemctl daemon-reload
:; sudo systemctl restart nut-server

Why ... ?

May be attributed to human error :) and lack of that use-case in one's practice. Conversely, may be deemed intentional, as now strongly implied by PR #1724 - if you (or your distro packagers) set up the system entities to be managed by an official framework like systemd, then people should not poke sticks into the wheels by directly sending signals and looking at PID files. It may work for indefinite future, but is plain untidy to fully support two potentially opposing approaches at the same time. At least, now the error messages would be indicative of that - if you use "systemd" for the services, do so throughout:

$ ./server/upsd -c reload
Network UPS Tools upsd 2.8.0-175-g3d5537378
fopen /var/state/ups/upsd.pid: No such file or directory
Try 'systemctl reload nut-server.service' or add '-P $PID' argument

Note there are still systems without systemd or SMF which would use the decades-old approaches with init-scripts and PID files. Also on the deeper technical background side, upsd -c something means that your new upsd program instance signals the running daemon instance - so one way or another, should find it (currently by PID file or explicit -P NUMBER argument). Hence complaints when it can not fulfill your request.
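
The PID-file lookup that such a signalling call depends on can be approximated from the shell. This is a hedged sketch with a hypothetical function name, not how upsd itself is implemented:

```shell
# Succeeds if the PID file exists, is non-empty, and names a process that
# the current user may signal. Signal 0 only probes; nothing is delivered.
pidfile_alive() {
    pidfile="$1"
    [ -s "$pidfile" ] || return 1
    pid="$(cat "$pidfile")"
    kill -0 "$pid" 2>/dev/null
}

# Usage (PID path as seen in this thread; run as root, because probing
# another user's process without privileges fails with a permission error):
#   pidfile_alive /run/nut/upsmon.pid && echo "upsmon can be signalled"
```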


All this hassle is practically really needed only if you expect to edit config files which impact upsd (ups.conf with new/deleted device entries, upsd.conf, upsd.users) and to reload it "live" as opposed to just restarting the service with a couple of seconds of its downtime. In case of driver definition addition/removal, different ways are available including nut-driver-enumerator (service and script) to define nut-driver instances and reload upsd, or to manually run systemctl reload nut-server instead of direct upsd -c reload.

Also note that the original problem was "missing upsmon.pid" and eventually that one got solved, and you can send direct upsmon -c fsd signals which do have a practical use :)

ullix commented 1 year ago

I really have no inclination of fiddling with the start-up files of any package, though it is great to get some insight into the complexity of it :-).

But what all this means is that the upsd error message about the missing pid should just be ignored, because the lack of this pid is intentional? Please, have some mercy on willing users, who need as little confusion as possible, and remove the message when it is not needed.

Back to upsmon. Why has it lost the pid on the second call?

~$ sudo systemctl restart nut-monitor.service 

~$ sudo upsmon -c reload
Network UPS Tools upsmon 2.8.0

~$ sudo upsmon -c reload
Network UPS Tools upsmon 2.8.0
fopen /run/nut/upsmon.pid: No such file or directory

~$ sudo upsmon -c reload
Network UPS Tools upsmon 2.8.0
fopen /run/nut/upsmon.pid: No such file or directory

~$ sudo systemctl restart nut-monitor.service 

~$ sudo upsmon -c reload
Network UPS Tools upsmon 2.8.0

~$ sudo upsmon -c reload
Network UPS Tools upsmon 2.8.0
fopen /run/nut/upsmon.pid: No such file or directory

jimklimov commented 1 year ago

Well, digging deep was useful for you and for the project, uncovering more rough edges buried in "accustomedness" :)

Partial "problem" is that utility commands try to do what they were asked about or log inability to do that, without a context that inability could be expected. In some cases this is addressed by suffixing another message:

Generally the user may know (or dig to discover) the problematic situation - e.g. if you end up with two upsd instances running and conflicting (like you had with inability to start due to no free port to listen on). There may be myriad bad scenarios, so different bread-crumbs add up to a diagnosis.

jimklimov commented 1 year ago

Ran some experiments detailed in https://github.com/networkupstools/nut/issues/1728#issuecomment-1335371798 and as much as I tried, I could not reproduce the problem in plain command line, with upsmon (mis?-)behavior alone.

So far tending to blame systemd for the mishaps... Could you please try to reproduce the issue again and post the systemd unit's journal?

:; sudo journalctl -flu nut-monitor & PIDJ=$! ; sleep 1
:; (set -x; sudo systemctl restart nut-monitor.service ; sleep 5 ; sudo upsmon -c reload ; sleep 5; sudo upsmon -c reload ; sleep 5 )
:; kill $PIDJ

I wonder if systemd detects an incoming signal to monitored process and somehow toxically processes it (e.g. deletes the PID file it claims to have problem monitoring)?..

jimklimov commented 1 year ago

Hm, also just noticed that in your tests you were doing systemctl restart nut-monitor.service (not reload) - did you check if the daemon was actually running after that?

Maybe it failed soon but not instantly after start, or did not even exit yet, so handled the signal first time if delay was short (although it does not remove the PID file on its own anyway)?.. Just guessing and grasping at straws :)

ullix commented 1 year ago

So, the first request:

~$ sudo journalctl -flu nut-monitor & PIDJ=$! ; sleep 1
[1] 98550

[1]+  Stopped                 sudo journalctl -flu nut-monitor

~$ (set -x; sudo systemctl restart nut-monitor.service ; sleep 5 ; sudo upsmon -c reload ; sleep 5; sudo upsmon -c reload ; sleep 5 )
+ sudo systemctl restart nut-monitor.service
[sudo] password for ullix:       
+ sleep 5
+ sudo upsmon -c reload
Network UPS Tools upsmon 2.8.0
+ sleep 5
+ sudo upsmon -c reload
Network UPS Tools upsmon 2.8.0
fopen /run/nut/upsmon.pid: No such file or directory
+ sleep 5

~$ kill $PIDJ

~$

and this is: journalctl -lu nut-monitor for today only:

Dec 03 09:55:10 ullix-ProLiant-MicroServer-Gen10 systemd[1]: Started Network UPS Tools - power device monitor and shutdown controller.
Dec 03 09:55:10 ullix-ProLiant-MicroServer-Gen10 nut-monitor[98557]: fopen /run/nut/upsmon.pid: No such file or directory
Dec 03 09:55:10 ullix-ProLiant-MicroServer-Gen10 nut-monitor[98557]: Could not find PID file to see if previous upsmon instance is already running!
Dec 03 09:55:10 ullix-ProLiant-MicroServer-Gen10 nut-monitor[98557]: UPS: eaton@10.0.0.51 (primary) (power value 1)
Dec 03 09:55:10 ullix-ProLiant-MicroServer-Gen10 nut-monitor[98557]: Using power down flag file /etc/killpower
Dec 03 09:55:10 ullix-ProLiant-MicroServer-Gen10 nut-monitor[98559]: Init SSL without certificate database
Dec 03 09:55:10 ullix-ProLiant-MicroServer-Gen10 nut-monitor[98559]: UPS eaton@10.0.0.51: forced shutdown in progress
Dec 03 09:55:10 ullix-ProLiant-MicroServer-Gen10 nut-monitor[98559]: Executing automatic power-fail shutdown
Dec 03 09:55:10 ullix-ProLiant-MicroServer-Gen10 nut-monitor[98564]: wall: /dev/pts/4: No such file or directory
Dec 03 09:55:10 ullix-ProLiant-MicroServer-Gen10 nut-monitor[98561]: Network UPS Tools upsmon 2.8.0
Dec 03 09:55:10 ullix-ProLiant-MicroServer-Gen10 nut-monitor[98565]: wall: /dev/pts/4: No such file or directory
Dec 03 09:55:10 ullix-ProLiant-MicroServer-Gen10 nut-monitor[98559]: Auto logout and shutdown proceeding
Dec 03 09:55:10 ullix-ProLiant-MicroServer-Gen10 nut-monitor[98568]: wall: /dev/pts/4: No such file or directory
Dec 03 09:55:10 ullix-ProLiant-MicroServer-Gen10 nut-monitor[98566]: Network UPS Tools upsmon 2.8.0
Dec 03 09:55:15 ullix-ProLiant-MicroServer-Gen10 nut-monitor[98559]: Network UPS Tools upsmon 2.8.0
Dec 03 09:55:15 ullix-ProLiant-MicroServer-Gen10 nut-monitor[98557]: Network UPS Tools upsmon 2.8.0
Dec 03 09:55:15 ullix-ProLiant-MicroServer-Gen10 systemd[1]: nut-monitor.service: Deactivated successfully.

ullix commented 1 year ago

... did you check if the daemon was actually running after that?

much of this stuff is beyond my pay-grade ;-) . Have mercy and be specific: how do I check for the daemon running ?

That is what I thought you might have meant:

~$ systemctl restart nut-monitor.service

~$ upsc eaton@10.0.0.51
Init SSL without certificate database
battery.charge: 100
battery.charge.low: 20
...

ullix commented 1 year ago

This gets weirder by the minute. I did (full output, no other commands in between):

~$ sudo systemctl restart nut-monitor.service

~$ upsmon -c reload
Network UPS Tools upsmon 2.8.0
kill: Operation not permitted

~$ upsmon -c reload
Network UPS Tools upsmon 2.8.0
fopen /run/nut/upsmon.pid: No such file or directory

~$ 

Where did the kill come from? And not seen in this terminal, but in one on another computer, ssh connected to the nut server computer:

ullix@ullix-ProLiant-MicroServer-Gen10:~$ 
Broadcast message from nut@ullix-ProLiant-MicroServer-Gen10 (somewhere) (Sat De
UPS eaton@10.0.0.51: forced shutdown in progress                               
Broadcast message from nut@ullix-ProLiant-MicroServer-Gen10 (somewhere) (Sat De
Executing automatic power-fail shutdown                                        
Broadcast message from nut@ullix-ProLiant-MicroServer-Gen10 (somewhere) (Sat De
Auto logout and shutdown proceeding                                            

Unfortunately, everything beyond 80 chars is cut-off, so no date and time visible (is this intended by nut, or do I have a setting which makes the cut-off?)

To be clear: I have NOT given a command with the -c fsd option, and my ups is running fine at 100% charge, and full line power. In upsmon.conf I have:

SHUTDOWNCMD "touch /home/ullix/SHUTDOWNCMD.msg"
POWERDOWNFLAG /etc/killpower
# NOTIFYCMD /bin/notifyme
NOTIFYCMD /usr/bin/notify-send

jimklimov commented 1 year ago

The cut-off would be something about OS console settings, NUT should not trim any messages it emits.

So far from those log messages it seems that your earlier experiments with upsmon -c fsd caused it to raise the forced-shutdown in-memory flag in nut-server (upsd) data entry for that UPS. This sort of flag never goes down - server farm shutdowns can not be generally safely aborted at a random point mid-flight, so once they start they must complete and systems be rebooted to reinitialize cleanly. I think you can systemctl restart nut-server to clear the flag, however.
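
A small sketch for spotting a latched flag before restarting nut-monitor, assuming the flag shows up as an FSD token in the ups.status value reported by upsc. The function name is hypothetical:

```shell
# Succeeds if the given ups.status string contains the FSD token.
status_has_fsd() {
    case " $1 " in
        *" FSD "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (device name and address as used in this thread):
#   if status_has_fsd "$(upsc eaton@10.0.0.51 ups.status 2>/dev/null)"; then
#       echo "FSD is latched; restart nut-server before nut-monitor"
#   fi
```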

My guess now would be that upsmon starts (and again, you restarted and not reloaded it in recent posts too); then soon afterwards it detects that FSD must happen, so triggers SHUTDOWNCMD and exits itself. Not sure who (systemd?) deletes upsmon.pid - but for the second attempt it is not there. Notably, in this particular situation systemctl reload nut-monitor would not succeed because there is nobody to signal :)

Also:

SHUTDOWNCMD "touch /home/ullix/SHUTDOWNCMD.msg"

Does nut have rights to write into /home/ullix? Maybe world-writable /tmp, /dev/shm or /var/tmp would fare better for the tests.

jimklimov commented 1 year ago

... did you check if the daemon was actually running after that?

much of this stuff is beyond my pay-grade ;-)

Mine too, I do this for free, for fun and experience :)

Have mercy and be specific: how do I check for the daemon running ?

That is what I thought you might have meant:

~$ systemctl restart nut-monitor.service

~$ upsc eaton@10.0.0.51
Init SSL without certificate database
battery.charge: 100
battery.charge.low: 20
...

I meant restarting and checking the same daemon -- something like systemctl restart nut-monitor.service ; sleep 10; systemctl status nut-monitor.service and whether the latter would say the service is active, running, with such and such PID number - or if it had exited/died shortly after starting (by the time your second attempt usually does not find the PID file).
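
As a rough sketch, that check can be wrapped into a helper. The function name and the SYSTEMCTL override variable are hypothetical; the override exists only so the helper can be tried without touching real services:

```shell
# Command used to talk to systemd; override for dry runs.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Restart a unit, wait, then succeed only if it is still active.
unit_survives() {
    unit="$1"
    delay="${2:-10}"              # seconds to wait before checking
    "$SYSTEMCTL" restart "$unit" || return 1
    sleep "$delay"
    "$SYSTEMCTL" is-active --quiet "$unit"
}

# Usage (as root):
#   unit_survives nut-monitor.service || systemctl status nut-monitor.service
```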

On a side note, upsc responding means that two other daemons (NUT driver for this device, and NUT server to represent it on the network) do currently run and communicate.

jimklimov commented 1 year ago

~$ sudo systemctl restart nut-monitor.service

~$ upsmon -c reload
Network UPS Tools upsmon 2.8.0
kill: Operation not permitted

Where did the kill come from?

That attempt probably succeeded in finding the PID file during the split second that the upsmon daemon was alive, so upsmon -c ... found a process number to signal. By the time it tried to do so, the process was gone so signalling failed. Unix signals are sent with the kill() function in libc (the actual "kill" being just one of many signals), hence the message.

ullix commented 1 year ago

Does nut have rights to write into /home/ullix? Maybe world-writable /tmp, /dev/shm or /var/tmp would fare better for the tests.

Advice taken, thanks.

So the whole caboodle was due to a left-over fsd flag? If so, then a prominent comment in the docs would be worthwhile!

In the meantime I have rebooted the server (and renamed the too lengthy hostname, in case you wonder). But it does not seem to have improved; pid file is missing:

ullix@Gen10:~$ systemctl restart nut-monitor.service ; sleep 10; systemctl status nut-monitor.service
● nut-monitor.service - Network UPS Tools - power device monitor and shutdown controller
     Loaded: loaded (/lib/systemd/system/nut-monitor.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2022-12-03 11:41:23 CET; 10s ago
   Main PID: 5880 (upsmon)
      Tasks: 2 (limit: 8742)
     Memory: 832.0K
        CPU: 9ms
     CGroup: /system.slice/nut-monitor.service
             ├─5880 /lib/nut/upsmon -F
             └─5882 /lib/nut/upsmon -F

Dec 03 11:41:23 Gen10 systemd[1]: Started Network UPS Tools - power device monitor and shutdown controller.
Dec 03 11:41:23 Gen10 nut-monitor[5880]: fopen /run/nut/upsmon.pid: No such file or directory
Dec 03 11:41:23 Gen10 nut-monitor[5880]: Could not find PID file to see if previous upsmon instance is already running!
Dec 03 11:41:23 Gen10 nut-monitor[5880]: UPS: eaton@10.0.0.51 (primary) (power value 1)
Dec 03 11:41:23 Gen10 nut-monitor[5880]: Using power down flag file /etc/killpower
Dec 03 11:41:23 Gen10 nut-monitor[5882]: Init SSL without certificate database

Strange, because:

ullix@Gen10:~$ sudo systemctl restart nut-monitor.service
ullix@Gen10:~$ sudo upsmon -c reload
Network UPS Tools upsmon 2.8.0
ullix@Gen10:~$ sudo upsmon -c reload
Network UPS Tools upsmon 2.8.0
ullix@Gen10:~$ sudo upsmon -c reload
Network UPS Tools upsmon 2.8.0
ullix@Gen10:~$ sudo upsmon -c reload
Network UPS Tools upsmon 2.8.0
ullix@Gen10:~$ sudo upsmon -c reload
Network UPS Tools upsmon 2.8.0

;-))

jimklimov commented 1 year ago

Well, the first time it starts, there is no PID file. It says as much - "can't tell if I am alone or would conflict with a sibling" the best way computers can tell.

jimklimov commented 1 year ago

As for docs... https://networkupstools.org/docs/user-manual.chunked/ar01s06.html "6.3. Configuring automatic shutdowns for low battery events" does say that:

Warning

By design, since we require power-cycling the load and don’t want some systems to be powered off while others remain running if the "wall power" returns at the wrong moment as usual, the "FSD" flag can not be removed from the data server unless its daemon is restarted. If we do take the first step in critical mode, then we intend to go all the way — shut down all the servers gracefully, and power down the UPS.

Keep in mind that some UPS devices and corresponding drivers would latch the "FSD" again even if "wall power" is available, but the remaining battery charge is below a threshold configured as "safe" in the device (usually if you manually power on the UPS after a long power outage). This is by design of respective UPS vendors, since in such situation they can not guarantee that if a new power outage happens, their UPS would safely shut down your systems again. So it is deemed better and safer to stay dark until batteries become sufficiently charged.

I guess it might be expanded in https://networkupstools.org/docs/man/upsmon.html though (PRs welcome)

Too many docs touch on different aspects of the same concepts, and there are tons of nuances to make people aware of... somewhere... so writers get lost in the maze too.

ullix commented 1 year ago

Now that I know, I understand (partially). But this points to the problem: you are explaining the background to the people "in the know", and are not explaining to the (new) user!

Re this part man upsmon (as pic to keep format):

image

You see visually that the paragraph "Forced Shutdown" is separated by two new paragraphs from "SIMULATING POWER FAILURES" while these two really belong together. And after mentioning the "dummy SHUTDOWNCMD setting" - I cannot imagine that anyone building a new nut system would not use a dummy command - you fail to EXPLICITLY say what command to issue to finalize and get rid of the fsd flag.

That command, in my understanding now, is sudo systemctl restart nut-server (correct?).

Otherwise I would have to reboot my server, kind of defeating the purpose of a dummy command to avoid putting the server into a reboot.

jimklimov commented 1 year ago

Yes, probably. As far as NUT goes, one needs to "restart data server". Nuances on doing that differ between modern Linux with systemd, legacy Linux before systemd, *BSD, macOS, Solaris, Windows and wherever else NUT can run, and are really OS management detail, not NUT detail. You picked the OS you run, you learn to use it :)
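For the systemd case discussed here, a small check can tell whether the flag is still latched before and after the restart. This is only a sketch under assumptions: the helper name has_fsd is hypothetical, and it assumes the data server reports the flag as an FSD token inside ups.status.

```shell
#!/bin/sh
# Succeed if a ups.status string (e.g. "FSD OL" or "OL CHRG")
# contains FSD as a separate whitespace-delimited token.
has_fsd() {
    case " $1 " in
        *" FSD "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On the system from this thread it could be used as has_fsd "$(upsc eaton@10.0.0.51 ups.status)" && sudo systemctl restart nut-server.service, followed by a second check to confirm the token is gone. If the UPS itself latches the condition because the battery is still below its "safe" threshold, as the quoted warning describes, the token may reappear after the restart.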

I understand this is not the comfy answer one might like, and that many FOSS projects are torpedoed by their docs. But as long as this is nearly a one-man show with a budget of a few hours a month, and those are the hours of someone well-versed in the ecosystem and inclined to work through the infinite backlog of technical challenges, it is the end-users who are in a position to notice that docs are insufficient, and to give suggestions or, better yet, PRs to make them better.

Good point on shuffling paragraphs closer together so they make better sense with the content available today. Probably someone some time ago edited this to add a few more lines to a short chapter here and there, and the islands of related text drifted further and further apart.

On a side note, though probably part of the problem, the way developers see docs they write is this:

Styles and colors and block-panels are something that appears in a renderer for HTML, PDF, etc. and not something we look at every year. Too many too heavy tools are needed to generate that at all, and with development often happening via terminal to some Linux/Unix system (not on local desktop) too much work is needed to get those pretty renditions seen in a graphical system.

ullix commented 1 year ago

Amen.

But still, I need more help!

jimklimov commented 1 year ago

"Help me help you" - what was the question? What NUT technical problem remains un-solved? :)

(A daemon that starts for the first time in the current OS uptime and tells you there was no previous PID file is not a problem. The message is a troubleshooting clue for the case where you try to start the same daemon 10 times and fail because a sibling is running and blocking resources, yet no PID file exists for whatever reason your forensic quest would discover.)

jimklimov commented 1 year ago

As for browsing, it is mostly about setup of your web-server of choice. If you use Apache, this doc still can help: https://github.com/networkupstools/nut/blob/master/data/html/README
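As a rough illustration of the CGI side: port 3493 is the NUT data server's own protocol port, not an HTTP endpoint, so the browser has to go through the web server, and the CGI programs need to be told which UPS to query. They read that from hosts.conf, one MONITOR line per device. The helper below is a hypothetical convenience that only formats such a line; the file's location (often /etc/nut/hosts.conf on Debian-style systems) is an assumption to verify on your install.

```shell
#!/bin/sh
# Format one hosts.conf entry for the NUT CGI programs.
# $1 = UPS identifier as given to upsc, $2 = free-form description
hosts_conf_line() {
    printf 'MONITOR %s "%s"\n' "$1" "$2"
}
```

For the device in this thread, hosts_conf_line eaton@10.0.0.51 "Ellipse PRO 650" prints a line that can be appended to hosts.conf; the statistics page is then requested through the web server's CGI path rather than through port 3493.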

Particular tools are documented at

ullix commented 1 year ago

Sorry, that was not intended as a rough comment, as you seem to have taken it. There was a cumulative effect of several misunderstandings, based on the problem that the docs are written more for those "in the know" than for newcomers like me. Now that the first problem is solved, I need to attach other computers to the shutdown sequence.

I am wondering what problems I may stumble over. I'll try and see what comes up.

Thank you for your support.