mej / nhc

LBNL Node Health Check

check_ps_service sshd fails on Ubuntu #99

Open heitorPB opened 3 years ago

heitorPB commented 3 years ago

The check * || check_ps_service -u root -S sshd fails on Ubuntu 20.04. I know sshd is running on the node because I logged in to it over ssh, and systemctl is-active sshd confirms it.

Changing sshd to ssh, testing for another user, or no user, also does not work.

I think the reason might be that this function uses /sbin/service instead of systemd?

I compiled nhc from the dev branch.

mej commented 3 years ago

The only time(s) check_ps_service() uses the /sbin/service command is when performing a requested action (e.g., stop, restart) after completing the process search. So that can't be to blame for NHC not finding the sshd process to begin with. (Also, systemd supports /sbin/service too...that's why I use it instead of invoking systemctl directly.)

Without additional info, I'm not sure I can help much; I do not run any Ubuntu systems. (I got a "bad taste in my mouth" from Debian many, many moons ago, and I've been so deep in the RedHat/RPM world since then that I've never felt like I needed to go back. Not to mention I'm waaaaaaay too old to start over! :stuck_out_tongue_winking_eye:)

If you run nhc with either the -d option for debugging or the -x option for tracing, that should help you pin down exactly what NHC is looking for and why it's not getting a match. You could also copy-and-paste the actual ps command (from the top of lbnl_ps.nhc) that NHC is using and run it by hand. But if all else fails, wading through the trace output (-x) will show you literally every single command NHC runs, what check_ps_service() is trying to match against sshd, and what it found.

For example, here's a snippet of the output on my system of nhc -x -l - -e 'check_ps_service -u root -S sshd' (which succeeds):

>>[{L6/S0/D2/R0}@lbnl_ps.nhc:412:check_ps_service()]> (( i++ ))
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:412:check_ps_service()]> (( i < 378 ))
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:413:check_ps_service()]> THIS_PID=2174
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:414:check_ps_service()]> [[ 0 == 1 ]]
>>[{L6/S0/D2/R1}@lbnl_ps.nhc:417:check_ps_service()]> ARGS=(${PS_ARGS[$THIS_PID]})
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:418:check_ps_service()]> THIS_SVC=/usr/sbin/sshd
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:420:check_ps_service()]> dbg 'Checking 2174:  "*sshd" vs. "/usr/sbin/sshd"'
>>[{L6/S0/D3/R0}@nhc:95:dbg()]> local PREFIX=
>>[{L6/S0/D3/R0}@nhc:97:dbg()]> [[ 0 == \1 ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:421:check_ps_service()]> mcheck /usr/sbin/sshd '*sshd'
>>[{L6/S0/D3/R0}@common.nhc:323:mcheck()]> local STRING=/usr/sbin/sshd
>>[{L6/S0/D3/R0}@common.nhc:324:mcheck()]> local 'MATCH=*sshd'
>>[{L6/S0/D3/R0}@common.nhc:325:mcheck()]> local i NEG=0
>>[{L6/S0/D3/R0}@common.nhc:328:mcheck()]> [[ * == \! ]]
>>[{L6/S0/D3/R0}@common.nhc:334:mcheck()]> [[ * == \/ ]]
>>[{L6/S0/D3/R0}@common.nhc:341:mcheck()]> [[ * == \{ ]]
>>[{L6/S0/D3/R0}@common.nhc:348:mcheck()]> (( i=0 ))
>>[{L6/S0/D3/R0}@common.nhc:348:mcheck()]> (( i < 0 ))
>>[{L6/S0/D3/R0}@common.nhc:356:mcheck()]> mcheck_glob /usr/sbin/sshd '*sshd'
>>[{L6/S0/D4/R0}@common.nhc:286:mcheck_glob()]> [[ /usr/sbin/sshd == *sshd ]]
>>[{L6/S0/D4/R0}@common.nhc:287:mcheck_glob()]> dbg 'Glob match check:  /usr/sbin/sshd matches *sshd'
>>[{L6/S0/D5/R0}@nhc:95:dbg()]> local PREFIX=
>>[{L6/S0/D5/R0}@nhc:97:dbg()]> [[ 0 == \1 ]]
>>[{L6/S0/D4/R0}@common.nhc:288:mcheck_glob()]> return 0
>>[{L6/S0/D3/R0}@common.nhc:357:mcheck()]> return 0
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:424:check_ps_service()]> dbg 'Checking 2174:  "root" vs. "root"'
>>[{L6/S0/D3/R0}@nhc:95:dbg()]> local PREFIX=
>>[{L6/S0/D3/R0}@nhc:97:dbg()]> [[ 0 == \1 ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:425:check_ps_service()]> [[ -n root ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:426:check_ps_service()]> mcheck root root
>>[{L6/S0/D3/R0}@common.nhc:323:mcheck()]> local STRING=root
>>[{L6/S0/D3/R0}@common.nhc:324:mcheck()]> local MATCH=root
>>[{L6/S0/D3/R0}@common.nhc:325:mcheck()]> local i NEG=0
>>[{L6/S0/D3/R0}@common.nhc:328:mcheck()]> [[ r == \! ]]
>>[{L6/S0/D3/R0}@common.nhc:334:mcheck()]> [[ r == \/ ]]
>>[{L6/S0/D3/R0}@common.nhc:341:mcheck()]> [[ r == \{ ]]
>>[{L6/S0/D3/R0}@common.nhc:348:mcheck()]> (( i=0 ))
>>[{L6/S0/D3/R0}@common.nhc:348:mcheck()]> (( i < 0 ))
>>[{L6/S0/D3/R0}@common.nhc:356:mcheck()]> mcheck_glob root root
>>[{L6/S0/D4/R0}@common.nhc:286:mcheck_glob()]> [[ root == root ]]
>>[{L6/S0/D4/R0}@common.nhc:287:mcheck_glob()]> dbg 'Glob match check:  root matches root'
>>[{L6/S0/D5/R0}@nhc:95:dbg()]> local PREFIX=
>>[{L6/S0/D5/R0}@nhc:97:dbg()]> [[ 0 == \1 ]]
>>[{L6/S0/D4/R0}@common.nhc:288:mcheck_glob()]> return 0
>>[{L6/S0/D3/R0}@common.nhc:357:mcheck()]> return 0
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:431:check_ps_service()]> [[ 0 -eq 1 ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:431:check_ps_service()]> [[ 0 -eq 1 ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:431:check_ps_service()]> [[ -n '' ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:528:check_ps_service()]> return 0
>[{L6/S0/D1/R0}@nhc:717:()]> exit 0

As you can see above, the trace output from BASH includes filename, line number, and function information along with some BASH info and the return code of the prior command (e.g., R0 means the previous line was executed by BASH and returned an exit code of 0).
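Because the R field carries the exit code of the previous command, one way to skim a long trace is to print only the lines whose prefix shows a non-zero R value. The following is a minimal sketch, assuming the {L../S../D../R..} prefix format shown above; nonzero_rc_lines and the file name nhc-trace.log are hypothetical names used for illustration and are not part of NHC.

```shell
#!/usr/bin/env bash
# Sketch only: nonzero_rc_lines is a hypothetical helper, not part of NHC.
# It assumes the {L../S../D../R..} prefix format shown in the trace above.

# Read trace lines on stdin and print those whose R field is non-zero,
# i.e. the lines that were traced right after a command returned failure.
nonzero_rc_lines() {
    local line re='\{L[0-9]+/S[0-9]+/D[0-9]+/R([0-9]+)\}'
    while IFS= read -r line; do
        if [[ $line =~ $re ]] && (( BASH_REMATCH[1] != 0 )); then
            printf '%s\n' "$line"
        fi
    done
}

# Three lines copied from the trace above; only the R1 line is printed.
nonzero_rc_lines <<'EOF'
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:414:check_ps_service()]> [[ 0 == 1 ]]
>>[{L6/S0/D2/R1}@lbnl_ps.nhc:417:check_ps_service()]> ARGS=(${PS_ARGS[$THIS_PID]})
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:418:check_ps_service()]> THIS_SVC=/usr/sbin/sshd
EOF

# With a saved trace (hypothetical file name):
#   nonzero_rc_lines < nhc-trace.log
```

Note that a non-zero R value is not necessarily an error: in the sample, R1 simply records that the preceding [[ 0 == 1 ]] test was false.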

I re-ran the same command using sssshd instead of sshd to give you a hint of what to look for in the trace. Here is the block of the loop in check_ps_service() (lines 412-421 of lbnl_ps.nhc in this trace) where NHC attempted to match the command to the requested service:

>>[{L6/S0/D2/R0}@lbnl_ps.nhc:412:check_ps_service()]> (( i++ ))
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:412:check_ps_service()]> (( i < 378 ))
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:413:check_ps_service()]> THIS_PID=2174
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:414:check_ps_service()]> [[ 0 == 1 ]]
>>[{L6/S0/D2/R1}@lbnl_ps.nhc:417:check_ps_service()]> ARGS=(${PS_ARGS[$THIS_PID]})
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:418:check_ps_service()]> THIS_SVC=/usr/sbin/sshd
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:420:check_ps_service()]> dbg 'Checking 2174:  "*sssshd" vs. "/usr/sbin/sshd"'
>>[{L6/S0/D3/R0}@nhc:95:dbg()]> local PREFIX=
>>[{L6/S0/D3/R0}@nhc:97:dbg()]> [[ 0 == \1 ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:421:check_ps_service()]> mcheck /usr/sbin/sshd '*sssshd'
>>[{L6/S0/D3/R0}@common.nhc:323:mcheck()]> local STRING=/usr/sbin/sshd
>>[{L6/S0/D3/R0}@common.nhc:324:mcheck()]> local 'MATCH=*sssshd'
>>[{L6/S0/D3/R0}@common.nhc:325:mcheck()]> local i NEG=0
>>[{L6/S0/D3/R0}@common.nhc:328:mcheck()]> [[ * == \! ]]
>>[{L6/S0/D3/R0}@common.nhc:334:mcheck()]> [[ * == \/ ]]
>>[{L6/S0/D3/R0}@common.nhc:341:mcheck()]> [[ * == \{ ]]
>>[{L6/S0/D3/R0}@common.nhc:348:mcheck()]> (( i=0 ))
>>[{L6/S0/D3/R0}@common.nhc:348:mcheck()]> (( i < 0 ))
>>[{L6/S0/D3/R0}@common.nhc:356:mcheck()]> mcheck_glob /usr/sbin/sshd '*sssshd'
>>[{L6/S0/D4/R0}@common.nhc:286:mcheck_glob()]> [[ /usr/sbin/sshd == *sssshd ]]
>>[{L6/S0/D4/R1}@common.nhc:290:mcheck_glob()]> dbg 'Glob match check:  /usr/sbin/sshd does not match *sssshd'

As you can see above, NHC has turned the service name specified (sshd in your case but sssshd in the above example) into a match string glob expression (*sssshd) that it then compares to the command name of each process (/usr/sbin/sshd above, but it's possible yours will differ -- which would explain why it's not finding the right process!) to see if it can find a running process that matches. You can see above that the process for sshd didn't match (on purpose, obviously), so NHC continues on its merry way.
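The comparison in the trace boils down to a BASH glob test, which can be tried by hand. Here is a minimal sketch, assuming nothing beyond the [[ STRING == *service ]] test visible in the trace; glob_match is a hypothetical helper and /usr/sbin/sshd.bin is a made-up command name used only to show the effect of trailing text.

```shell
#!/usr/bin/env bash
# Sketch only: glob_match is a hypothetical helper that mirrors the
# [[ STRING == *service ]] test visible in the trace; it is not NHC code.
glob_match() {
    local string=$1 pattern=$2
    # The right-hand side is deliberately unquoted so BASH treats it as a glob.
    [[ $string == $pattern ]]
}

# Print whether a command name matches the glob "*<service>".
show() {
    local string=$1 service=$2
    if glob_match "$string" "*$service"; then
        echo "$string matches *$service"
    else
        echo "$string does not match *$service"
    fi
}

show /usr/sbin/sshd sshd        # the successful case from the first trace
show /usr/sbin/sshd sssshd      # the deliberate mismatch from the second trace
show /usr/sbin/sshd.bin sshd    # hypothetical name with trailing text
```

The third call shows why trailing text matters: the glob *sshd only matches names that end in sshd.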

Eventually it runs out of processes to check; once that happens, NHC knows that the specified process isn't there, so the check fails. The handling of that case begins at the line that sets the initial value for $MSG, shown in the first line below. (The exact line number may vary depending on which version you're running.)

>>[{L6/S0/D2/R0}@lbnl_ps.nhc:541:check_ps_service()]> MSG='check_ps_service:  Service sssshd (process sssshd) owned by root not running; start'
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:553:check_ps_service()]> [[ 0 -eq 1 ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:553:check_ps_service()]> [[ 0 -eq 1 ]]
>>[{L6/S0/D2/R1}@lbnl_ps.nhc:573:check_ps_service()]> MSG='check_ps_service:  Service sssshd (process sssshd) owned by root not running; start in progress'
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:552:check_ps_service()]> /bin/bash -c '/sbin/service sssshd start'
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:576:check_ps_service()]> [[ 0 -eq 1 ]]
>>[{L6/S0/D2/R0}@lbnl_ps.nhc:582:check_ps_service()]> die 1 'check_ps_service:  Service sssshd (process sssshd) owned by root not running; start in progress'

The trace output you see will continue inside the die() function, but I removed that from the above for brevity (and because it's irrelevant to your question/issue).

Hopefully that will help you troubleshoot! If I had to guess, I'd bet it's something about the root-owned sshd process not matching *sshd for some reason (like additional text following sshd). So that's where I'd start. Happy hunting!

heitorPB commented 3 years ago

This is very insightful! The sshd process in Ubuntu is reported as sshd: instead of sshd. And that's why it doesn't find it:

$ nhc -l - -e 'check_ps_service -u root sshd'
ERROR: nhc: Health check failed: check_ps_service: Service sshd (process sshd) owned by root not running
$ nhc -l - -e 'check_ps_service -u root sshd:'
$ ps ax | grep sshd
root 256 0.0 0.0 12176 7272 ? Ss 15:51 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
root 10962 0.0 0.0 13808 9024 ? Ss 16:58 0:00 sshd: ubuntu [priv]
ubuntu 11020 0.0 0.0 13944 6072 ? S 16:58 0:00 sshd: ubuntu@pts/0
$ ps ax | grep munge
10754 ? Sl 0:00 /usr/sbin/munged
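The mismatch can be reproduced outside NHC with a few lines of BASH. This is only a sketch: it assumes the comparison is made against the first word of the command line, as the THIS_SVC=/usr/sbin/sshd line in the trace suggests, and first_word and report are hypothetical helpers rather than NHC functions.

```shell
#!/usr/bin/env bash
# Sketch only: first_word and report are hypothetical helpers, not NHC code.
# Assumption: the match is made against the first word of the command line.

# Print the first whitespace-separated word of a command line.
first_word() {
    local words
    read -r -a words <<< "$1"
    printf '%s\n' "${words[0]}"
}

# Report whether that first word matches the glob "*<service>".
report() {
    local cmdline=$1 service=$2 name pattern
    name=$(first_word "$cmdline")
    pattern="*$service"
    # Unquoted right-hand side, so the pattern is treated as a glob.
    if [[ $name == $pattern ]]; then
        printf '"%s" matches "%s"\n' "$name" "$pattern"
    else
        printf '"%s" does not match "%s"\n' "$name" "$pattern"
    fi
}

# Command lines taken from the ps output above.
report 'sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups' sshd
report 'sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups' sshd:
report '/usr/sbin/munged' munged
```

Under that assumption, the trailing colon is enough to defeat *sshd, while munged is unaffected because its command line starts with the executable path.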

it is interesting that sshd has this special name in Ubuntu. I wonder why? What is the best way forward, should I check for sshd:?

mej commented 3 years ago

root 256 0.0 0.0 12176 7272 ? Ss 15:51 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
root 10962 0.0 0.0 13808 9024 ? Ss 16:58 0:00 sshd: ubuntu [priv]
ubuntu 11020 0.0 0.0 13944 6072 ? S 16:58 0:00 sshd: ubuntu@pts/0

it is interesting that sshd has this special name in Ubuntu. I wonder why? What is the best way forward, should I check for sshd:?

I had a feeling it would turn out to be something like that. :grin:

It's actually not uncommon to have a mismatch between the name of the service itself and the name of the process(es) associated with that service. For example, Postfix uses the service postfix, but the daemon process is master; you've also got small discrepancies like winbind's winbindd, seemingly disjoint ones (like nfslock with rpc.statd), and so on. I could cite all kinds of them!

That's why check_ps_service() provides the -d daemon and -m mstr options. In many cases, like with winbind as I mentioned above, it's just the executable that differs, so NHC provides the -d flag which converts daemon into a match string *daemon. If you need greater flexibility -- as with your situation above -- you can directly specify the match string to use via the -m flag. In your case, I would recommend using either -m 'sshd: *' or -m '/^sshd: /'; they're functionally identical, so you can use whichever you prefer.
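To illustrate how the two suggested match strings relate, here is a simplified sketch of a matcher that treats /.../ as a regular expression and anything else as a glob. match_string is a hypothetical stand-in, not NHC's mcheck(), and it ignores the other match-string forms. It also assumes the comparison is made against the whole command line; the trace earlier in the thread shows NHC comparing a single word (THIS_SVC=/usr/sbin/sshd), so it is worth confirming with nhc -d which string your version actually compares.

```shell
#!/usr/bin/env bash
# Simplified sketch; match_string is a hypothetical stand-in, not NHC's mcheck().
match_string() {
    local string=$1 match=$2 re
    if [[ $match == /*/ ]]; then
        # /regex/ form: strip the delimiting slashes and do a regex match.
        re=${match:1:${#match}-2}
        [[ $string =~ $re ]]
    else
        # Anything else is treated as a glob (right-hand side left unquoted).
        [[ $string == $match ]]
    fi
}

# Sample command lines taken from the ps output earlier in the thread.
for cmdline in \
    'sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups' \
    'sshd: ubuntu [priv]' \
    '/usr/sbin/munged'
do
    for match in 'sshd: *' '/^sshd: /'; do
        if match_string "$cmdline" "$match"; then
            result='match'
        else
            result='no match'
        fi
        printf '%-12s %-9s %s\n' "$match" "$result" "$cmdline"
    done
done
```

Against these full command lines, the glob and the regex agree on every sample, which is the sense in which the two forms are interchangeable.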

You can read up on all the available check_ps_service() flags and arguments here: https://github.com/mej/nhc/#check_ps_service