mej / nhc

LBNL Node Health Check

External Match doesn't work for me #135

Closed Aelmazaty closed 1 year ago

Aelmazaty commented 1 year ago

Hello, I'm trying to group NHC checks per partition instead of per node. I have some nodes in a non-standard partition and need to run different checks on them. Unfortunately, they follow the same naming convention as the standard nodes, so wildcards cannot be used. I learned about the external match feature here: https://github.com/mej/nhc/blob/9c4a38c0c9f48f92005c9120ca88145c33841dac/scripts/common.nhc#LL296

I've tried to add the following to /etc/sysconfig/nhc:

NHC_MCHECK_DELIM=( [0]="@" )
NHC_MCHECK_COMMAND=( \
    [0]="sinfo -p %m --format="%n" | grep -v HOSTNAMES | fgrep -w %h" \
)

and in "nhc.conf" I use @sra@ as "sra" is the partition name.

However, this doesn't seem to work. Judging by the logs, "mcheck_external()" is never called; NHC seems to be trying to match the string as a glob instead:

">[{L2/S0/D5/R1}@common.nhc:290:mcheck_glob()]> dbg 'Glob match check: hl-codon-113-01 does not match @sra@'"

Any hints on how to make this work? Thanks

mej commented 1 year ago

I have been digging into this off and on, in between coding and other work, since I first saw your post, and I'm afraid I can't say definitively that I know what's wrong. But I'll share my efforts so far, and hopefully we can get this fixed for you anyway!

First off, you didn't mention the specific version of NHC you're using. Everything in this post is based on the current dev branch here on GitHub (specifically, commit dc10825ea9); I did not test prior releases.

In order to get a clear view of what was going on, I started out by running nhc in trace mode with a very brief, very contrived configuration. It changed a bit over the course of all my testing, but in the end, I wound up with just 4 lines (the first two being for debugging/tracking purposes):

### test.conf
    *    || declare -p NHC_MCHECK_DELIM NHC_MCHECK_COMMAND
    *    || set | fgrep NHC_MCHECK_
  @gpu@  || echo "GPU node"
 !@gpu@  || echo "Not a GPU node"

For expedience, I opted to put the external match settings on the command line, at least initially, using essentially the same settings you provided above (save the partition name, of course), and I ran nhc for both a GPU and a non-GPU node. The commands I used were:

nhc -avl - -c test.conf HOSTNAME=some-gpu-node    NHC_MCHECK_DELIM[0]=@ NHC_MCHECK_COMMAND[0]='"sinfo -hp %m --format=\"%n\" | fgrep -qw %h"'
nhc -avl - -c test.conf HOSTNAME=some-nongpu-node NHC_MCHECK_DELIM[0]=@ NHC_MCHECK_COMMAND[0]='"sinfo -hp %m --format=\"%n\" | fgrep -qw %h"'

The above settings/commands work perfectly, so I'd be curious to hear whether or not they work on your system. You may have noticed the single- and the double-quoting of the sinfo command; careful quoting is essential due to the multiple layers of shell interpretation involved.
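To illustrate why both quote layers matter, here is a minimal sketch in plain bash. It assumes (and this is only an assumption, not something confirmed from the NHC source) that the receiving side evaluates the VAR=value argument one more time; DEMO_CMD is a made-up variable name used only for this demo.

```shell
#!/usr/bin/env bash
# The invoking shell strips the outer single quotes, so the argument that
# arrives still carries the double quotes and the backslash-escaped inner quotes.
arg='"sinfo -hp %m --format=\"%n\" | fgrep -qw %h"'

# Assumption: the assignment is evaluated once more on the receiving side.
# That second pass consumes the double quotes and keeps the pipeline as a
# single string instead of running it or splitting it into words.
eval "DEMO_CMD[0]=$arg"

# Show what actually ended up in the array element.
printf '%s\n' "${DEMO_CMD[0]}"
```

Under that assumption, the stored value is the bare pipeline with only the quotes around %n left, which is the same string the sysconfig form ends up with.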

Unfortunately, when I tried to move the external match settings from the command line to the global config in /etc/sysconfig/nhc, hilarity ensued; i.e., I started seeing exactly the behavior you described, with mcheck() treating the match string as a glob instead of an external match. I sprinkled copies of that declare -p command (the first line of the config) all over the place to figure out where things were going astray. Come to find out, when the variables were set using export or declare, the values vanished once nhcmain_load_sysconfig() returned. Long story short, since the sysconfig file is sourced inside a function, any variables you declare there become local to that function, just as if you had used local; plain assignments are not affected.
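The scoping behavior is easy to reproduce outside of NHC. The following is a minimal bash sketch, not NHC code: load_config, DEMO_DECLARED, and DEMO_ASSIGNED are hypothetical names, and the temporary file merely stands in for /etc/sysconfig/nhc.

```shell
#!/usr/bin/env bash
# Write a stand-in config file with one `declare` and one plain assignment.
conf=$(mktemp)
cat > "$conf" <<'EOF'
declare -a DEMO_DECLARED=( [0]=@ )
DEMO_ASSIGNED=( [0]=@ )
EOF

# Source the file from inside a function, as described above.
load_config() {
    . "$conf"
}

load_config
rm -f "$conf"

# The declared array was local to load_config and is gone after it returned;
# the plainly assigned array is global and still set.
echo "DEMO_DECLARED[0] after return: '${DEMO_DECLARED[0]:-}'"
echo "DEMO_ASSIGNED[0] after return: '${DEMO_ASSIGNED[0]:-}'"
```

In bash, declare -g would also create a global variable from inside a function, but a plain assignment is the simplest form.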

So I was able to get it to work reliably using exactly these two lines:

NHC_MCHECK_DELIM=( [0]=@ )
NHC_MCHECK_COMMAND=( [0]='sinfo -hp %m --format="%n" | fgrep -qw %h' )

Can you try exactly those settings and see if they work for you?
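For anyone wondering how the %m and %h placeholders in that command come into play, here is a rough, self-contained sketch of the idea. It is an illustration under stated assumptions, not NHC's actual implementation: external_match is a hypothetical helper, and fake_sinfo is a stub that stands in for sinfo so the example runs without Slurm.

```shell
#!/usr/bin/env bash
# Stub standing in for `sinfo -hp <partition> --format="%n"` (hypothetical data).
fake_sinfo() {
    case "$1" in
        gpu) printf '%s\n' node-01 node-02 ;;
        sra) printf '%s\n' node-03 ;;
    esac
}

# Command template in the same style as NHC_MCHECK_COMMAND[0].
template='fake_sinfo %m | grep -Fqw %h'

# Hypothetical helper: substitute the placeholders, run the command,
# and use its exit status as the match result.
external_match() {
    local host=$1 match=$2 cmd=$template
    cmd=${cmd//\%m/$match}   # %m: the text between the delimiters, e.g. "gpu"
    cmd=${cmd//\%h/$host}    # %h: the hostname being checked
    eval "$cmd"              # exit status 0 means the node matches
}

external_match node-01 gpu && echo "node-01: GPU node"
external_match node-03 gpu || echo "node-03: Not a GPU node"
```

The -q flag matters in this style of command: the match is decided purely by the exit status, so the pipeline should not print anything itself.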

Aelmazaty commented 1 year ago

Thanks a lot! I can confirm that the solution you suggested works on version 1.4.3-1.

mej commented 1 year ago

Awesome! I'm very glad to hear you got it working. :)

I'll go ahead and close this, but let me know if you run into any other issues!