Open heitorPB opened 3 years ago
The only time `check_ps_service()` uses the `/sbin/service` command is when performing a requested action (e.g., stop, restart) after completing the process search. So that can't be to blame for NHC not finding the `sshd` process to begin with. (Also, systemd supports `/sbin/service` too...that's why I use it instead of invoking `systemctl` directly.)
Without additional info, I'm not sure I can help much; I do not run any Ubuntu systems. (I got a "bad taste in my mouth" from Debian many, many moons ago, and I've been so deep in the RedHat/RPM world since then that I've never felt like I needed to go back. Not to mention I'm waaaaaaay too old to start over! :stuck_out_tongue_winking_eye:)
If you run `nhc` with either the `-d` option for debugging or the `-x` option for tracing, that should help you pin down exactly what NHC is looking for and why it's not getting a match. You could also copy-and-paste the actual `ps` command (from the top of lbnl_ps.nhc) that NHC is using and run it by hand. But if all else fails, wading through the trace output (`-x`) will show you literally every single command NHC runs, what `check_ps_service()` is trying to match against `sshd`, and what it found.
For example, here's a snippet of the output on my system of `nhc -x -l - -e 'check_ps_service -u root -S sshd'` (which succeeds):
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:412:check_ps_service()]> (( i++ ))
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:412:check_ps_service()]> (( i < 378 ))
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:413:check_ps_service()]> THIS_PID=2174
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:414:check_ps_service()]> [[ 0 == 1 ]]
>>[{L6/S0/D2/R1}@lbnl_ps.nhc:417:check_ps_service()]> ARGS=(${PS_ARGS[$THIS_PID]})
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:418:check_ps_service()]> THIS_SVC=/usr/sbin/sshd
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:420:check_ps_service()]> dbg 'Checking 2174: "*sshd" vs. "/usr/sbin/sshd"'
>>[{L6/S0/D3/R0}@nhc:95:dbg()]> local PREFIX=
>>[{L6/S0/D3/R0}@nhc:97:dbg()]> [[ 0 == \1 ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:421:check_ps_service()]> mcheck /usr/sbin/sshd '*sshd'
>>[{L6/S0/D3/R0}@common.nhc:323:mcheck()]> local STRING=/usr/sbin/sshd
>>[{L6/S0/D3/R0}@common.nhc:324:mcheck()]> local 'MATCH=*sshd'
>>[{L6/S0/D3/R0}@common.nhc:325:mcheck()]> local i NEG=0
>>[{L6/S0/D3/R0}@common.nhc:328:mcheck()]> [[ * == \! ]]
>>[{L6/S0/D3/R0}@common.nhc:334:mcheck()]> [[ * == \/ ]]
>>[{L6/S0/D3/R0}@common.nhc:341:mcheck()]> [[ * == \{ ]]
>>[{L6/S0/D3/R0}@common.nhc:348:mcheck()]> (( i=0 ))
>>[{L6/S0/D3/R0}@common.nhc:348:mcheck()]> (( i < 0 ))
>>[{L6/S0/D3/R0}@common.nhc:356:mcheck()]> mcheck_glob /usr/sbin/sshd '*sshd'
>>[{L6/S0/D4/R0}@common.nhc:286:mcheck_glob()]> [[ /usr/sbin/sshd == *sshd ]]
>>[{L6/S0/D4/R0}@common.nhc:287:mcheck_glob()]> dbg 'Glob match check: /usr/sbin/sshd matches *sshd'
>>[{L6/S0/D5/R0}@nhc:95:dbg()]> local PREFIX=
>>[{L6/S0/D5/R0}@nhc:97:dbg()]> [[ 0 == \1 ]]
>>[{L6/S0/D4/R0}@common.nhc:288:mcheck_glob()]> return 0
>>[{L6/S0/D3/R0}@common.nhc:357:mcheck()]> return 0
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:424:check_ps_service()]> dbg 'Checking 2174: "root" vs. "root"'
>>[{L6/S0/D3/R0}@nhc:95:dbg()]> local PREFIX=
>>[{L6/S0/D3/R0}@nhc:97:dbg()]> [[ 0 == \1 ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:425:check_ps_service()]> [[ -n root ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:426:check_ps_service()]> mcheck root root
>>[{L6/S0/D3/R0}@common.nhc:323:mcheck()]> local STRING=root
>>[{L6/S0/D3/R0}@common.nhc:324:mcheck()]> local MATCH=root
>>[{L6/S0/D3/R0}@common.nhc:325:mcheck()]> local i NEG=0
>>[{L6/S0/D3/R0}@common.nhc:328:mcheck()]> [[ r == \! ]]
>>[{L6/S0/D3/R0}@common.nhc:334:mcheck()]> [[ r == \/ ]]
>>[{L6/S0/D3/R0}@common.nhc:341:mcheck()]> [[ r == \{ ]]
>>[{L6/S0/D3/R0}@common.nhc:348:mcheck()]> (( i=0 ))
>>[{L6/S0/D3/R0}@common.nhc:348:mcheck()]> (( i < 0 ))
>>[{L6/S0/D3/R0}@common.nhc:356:mcheck()]> mcheck_glob root root
>>[{L6/S0/D4/R0}@common.nhc:286:mcheck_glob()]> [[ root == root ]]
>>[{L6/S0/D4/R0}@common.nhc:287:mcheck_glob()]> dbg 'Glob match check: root matches root'
>>[{L6/S0/D5/R0}@nhc:95:dbg()]> local PREFIX=
>>[{L6/S0/D5/R0}@nhc:97:dbg()]> [[ 0 == \1 ]]
>>[{L6/S0/D4/R0}@common.nhc:288:mcheck_glob()]> return 0
>>[{L6/S0/D3/R0}@common.nhc:357:mcheck()]> return 0
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:431:check_ps_service()]> [[ 0 -eq 1 ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:431:check_ps_service()]> [[ 0 -eq 1 ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:431:check_ps_service()]> [[ -n '' ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:528:check_ps_service()]> return 0
>[{L6/S0/D1/R0}@nhc:717:()]> exit 0
As you can see above, the trace output from BASH includes filename, line number, and function information along with some BASH info and the return code of the prior command (e.g., `R0` means the previous line was executed by BASH and returned an exit code of `0`).
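Since the `R` field in each prefix carries the previous command's exit code, one quick way to skim a long trace is to keep only the lines where that field is non-zero. The following is a minimal sketch of that idea; `nonzero_rc_lines` is a hypothetical helper (not part of NHC), and the regular expression assumes the prefix layout shown in the snippet above.

```bash
# Hypothetical helper (not part of NHC): print only trace lines whose
# prefix carries a non-zero R field, e.g. {L6/S0/D4/R1}.
nonzero_rc_lines() {
    grep -E '^>+\[\{[^}]*/R[1-9][0-9]*\}'
}

# Example usage (requires nhc). 2>&1 merges stderr into the pipe, on the
# assumption that the trace may be written there:
#   nhc -x -l - -e 'check_ps_service -u root -S sshd' 2>&1 | nonzero_rc_lines
```

Keep in mind that a non-zero `R` on a line describes the command on the line before it, so the interesting command is usually one line up in the full trace.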
I re-ran the same command using `sssshd` instead of `sshd` to give you a hint of what to look for in the trace. Here is the block of the loop (in `check_ps_service()` from here to here) where NHC attempted to match the command to the requested service:
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:412:check_ps_service()]> (( i++ ))
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:412:check_ps_service()]> (( i < 378 ))
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:413:check_ps_service()]> THIS_PID=2174
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:414:check_ps_service()]> [[ 0 == 1 ]]
>>[{L6/S0/D2/R1}@lbnl_ps.nhc:417:check_ps_service()]> ARGS=(${PS_ARGS[$THIS_PID]})
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:418:check_ps_service()]> THIS_SVC=/usr/sbin/sshd
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:420:check_ps_service()]> dbg 'Checking 2174: "*sssshd" vs. "/usr/sbin/sshd"'
>>[{L6/S0/D3/R0}@nhc:95:dbg()]> local PREFIX=
>>[{L6/S0/D3/R0}@nhc:97:dbg()]> [[ 0 == \1 ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:421:check_ps_service()]> mcheck /usr/sbin/sshd '*sssshd'
>>[{L6/S0/D3/R0}@common.nhc:323:mcheck()]> local STRING=/usr/sbin/sshd
>>[{L6/S0/D3/R0}@common.nhc:324:mcheck()]> local 'MATCH=*sssshd'
>>[{L6/S0/D3/R0}@common.nhc:325:mcheck()]> local i NEG=0
>>[{L6/S0/D3/R0}@common.nhc:328:mcheck()]> [[ * == \! ]]
>>[{L6/S0/D3/R0}@common.nhc:334:mcheck()]> [[ * == \/ ]]
>>[{L6/S0/D3/R0}@common.nhc:341:mcheck()]> [[ * == \{ ]]
>>[{L6/S0/D3/R0}@common.nhc:348:mcheck()]> (( i=0 ))
>>[{L6/S0/D3/R0}@common.nhc:348:mcheck()]> (( i < 0 ))
>>[{L6/S0/D3/R0}@common.nhc:356:mcheck()]> mcheck_glob /usr/sbin/sshd '*sssshd'
>>[{L6/S0/D4/R0}@common.nhc:286:mcheck_glob()]> [[ /usr/sbin/sshd == *sssshd ]]
>>[{L6/S0/D4/R1}@common.nhc:290:mcheck_glob()]> dbg 'Glob match check: /usr/sbin/sshd does not match *sssshd'
As you can see above, NHC has turned the service name specified (`sshd` in your case but `sssshd` in the above example) into a match string glob expression (`*sssshd`) that it then compares to the command name of each process (`/usr/sbin/sshd` above, but it's possible yours will differ -- which would explain why it's not finding the right process!) to see if it can find a running process that matches. You can see above that the process for `sshd` didn't match (on purpose, obviously), so NHC continues on its merry way.
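The comparison in the trace boils down to a bash `[[ STRING == PATTERN ]]` glob test, so it can be reproduced by hand without NHC. Here is a small sketch of that; `glob_match` is a hypothetical stand-in for illustration, not NHC's actual `mcheck_glob()`, and the command names in the loop are made-up test inputs.

```bash
# Hypothetical helper mirroring the [[ STRING == PATTERN ]] test seen in the trace.
glob_match() {
    local string="$1" pattern="$2"
    # The right-hand side is deliberately unquoted so bash treats it as a glob.
    [[ "$string" == $pattern ]]
}

# Try the glob against a few illustrative command names.
for cmd in /usr/sbin/sshd sshd sshd: ; do
    if glob_match "$cmd" '*sshd'; then
        echo "$cmd matches *sshd"
    else
        echo "$cmd does not match *sshd"
    fi
done
```

Because `*sshd` anchors the literal text at the end of the string, anything trailing after `sshd` in the command name is enough to make the test fail.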
Eventually it runs out of processes to check; once that happens, NHC knows that the specified process isn't there, so the check fails. The handling of that case begins here. (The exact line number may vary depending on which version you're running, but it's whatever line sets the initial value for `$MSG` as shown in the first line below.)
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:541:check_ps_service()]> MSG='check_ps_service: Service sssshd (process sssshd) owned by root not running; start'
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:553:check_ps_service()]> [[ 0 -eq 1 ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:553:check_ps_service()]> [[ 0 -eq 1 ]]
>>[{L6/S0/D2/R1}@lbnl_ps.nhc:573:check_ps_service()]> MSG='check_ps_service: Service sssshd (process sssshd) owned by root not running; start in progress'
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:552:check_ps_service()]> /bin/bash -c '/sbin/service sssshd start'
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:576:check_ps_service()]> [[ 0 -eq 1 ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:582:check_ps_service()]> die 1 'check_ps_service: Service sssshd (process sssshd) owned by root not running; start in progress'
The trace output you see will continue inside the `die()` function, but I removed that from the above for brevity (and because it's irrelevant to your question/issue).
Hopefully that will help you troubleshoot! If I had to guess, I'd bet it's something about the `root`-owned `sshd` process not matching `*sshd` for some reason (like additional text following `sshd`). So that's where I'd start. Happy hunting!
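To avoid reading the whole trace, it may help to reduce it to just the `dbg 'Checking ...'` lines, which show each pattern next to the command name it was compared against. This is a rough sketch; `show_match_attempts` is a hypothetical helper, and its pattern is based only on the trace lines quoted above.

```bash
# Hypothetical helper (not part of NHC): keep only the trace lines where
# check_ps_service() logs the pattern/command pair it is about to compare.
show_match_attempts() {
    grep -E "dbg 'Checking [0-9]+: "
}

# Example usage (requires nhc). 2>&1 merges stderr into the pipe, on the
# assumption that the trace may be written there:
#   nhc -x -l - -e 'check_ps_service -u root -S sshd' 2>&1 | show_match_attempts
```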
This is very insightful! The `sshd` process in Ubuntu is reported as `sshd:` instead of `sshd`. And that's why it doesn't find it:
$ nhc -l - -e 'check_ps_service -u root sshd'
ERROR: nhc: Health check failed: check_ps_service: Service sshd (process sshd) owned by root not running
$ nhc -l - -e 'check_ps_service -u root sshd:'
$ ps ax | grep sshd
root 256 0.0 0.0 12176 7272 ? Ss 15:51 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
root 10962 0.0 0.0 13808 9024 ? Ss 16:58 0:00 sshd: ubuntu [priv]
ubuntu 11020 0.0 0.0 13944 6072 ? S 16:58 0:00 sshd: ubuntu@pts/0
$ ps ax | grep munge
10754 ? Sl 0:00 /usr/sbin/munged
It is interesting that `sshd` has this special name in Ubuntu. I wonder why? What is the best way forward? Should I check for `sshd:`?
I had a feeling it would turn out to be something like that. :grin:
It's actually not uncommon to have a mismatch between the name of the service itself and the name of the process(es) associated with that service. For example, Postfix uses the service `postfix`, but the daemon process is `master`; you've also got small discrepancies like `winbind`'s `winbindd`, seemingly disjoint ones (like `nfslock` with `rpc.statd`), and so on. I could cite all kinds of them!
That's why `check_ps_service()` provides the `-d daemon` and `-m mstr` options. In many cases, like with `winbind` as I mentioned above, it's just the executable that differs, so NHC provides the `-d` flag which converts `daemon` into a match string `*daemon`. If you need greater flexibility -- as with your situation above -- you can directly specify the match string to use via the `-m` flag. In your case, I would recommend using either `-m 'sshd: *'` or `-m '/^sshd: /'`; they're functionally identical, so you can use whichever you prefer.
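To see why the glob form and the `/regex/` form come out the same here, the two styles can be tried side by side in plain bash. The sketch below is a simplified, hypothetical dispatcher (`match_string` is not NHC's real `mcheck()`, which handles more cases), and it assumes the string being compared is the full command line as `ps` printed it, which may not be what NHC compares by default.

```bash
# Simplified, hypothetical dispatcher: /.../ is treated as a regular
# expression, anything else as a glob. Not NHC's actual implementation.
match_string() {
    local string="$1" match="$2"
    if [[ "$match" == /*/ ]]; then
        # Strip the surrounding slashes and apply bash's =~ regex operator.
        local re="${match#/}"
        re="${re%/}"
        [[ "$string" =~ $re ]]
    else
        # Unquoted right-hand side so it is interpreted as a glob.
        [[ "$string" == $match ]]
    fi
}

# Command line copied from the ps output earlier in this thread.
cmdline='sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups'

for m in 'sshd: *' '/^sshd: /' '*sshd'; do
    if match_string "$cmdline" "$m"; then
        echo "match:    $m"
    else
        echo "no match: $m"
    fi
done
```

Both recommended forms require the string to begin with `sshd: `, which is why either one should select the same processes.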
You can read up on all the available `check_ps_service()` flags and arguments here: https://github.com/mej/nhc/#check_ps_service
The check `* || check_ps_service -u root -S sshd` fails on Ubuntu 20.04. I know `sshd` is running on the node because I logged in to it with `ssh`, and `systemctl is-active sshd` confirms it.
Changing `sshd` to `ssh`, testing for another user, or no user, also does not work.
I think the reason might be that this function uses `/sbin/service` instead of systemd?
I compiled nhc from the `dev` branch.