Closed Aelmazaty closed 1 year ago
I have been digging into this off and on, in between coding and other work, since I first saw your post, and I'm afraid I can't say definitively that I know what's wrong. But I'll share my efforts so far, and hopefully we can get this fixed for you anyway!
First off, you didn't mention the specific version of NHC you're using. Everything in this post is based on the current dev branch here on GitHub (specifically, commit dc10825ea9); I did not test prior releases.
In order to get a clear view of what was going on, I started out by running nhc in trace mode with a very brief, very contrived configuration. It changed a bit over the course of all my testing, but in the end, I wound up with just 4 lines (the first two being for debugging/tracking purposes):
### test.conf
* || declare -p NHC_MCHECK_DELIM NHC_MCHECK_COMMAND
* || set | fgrep NHC_MCHECK_
@gpu@ || echo "GPU node"
!@gpu@ || echo "Not a GPU node"
For expedience, I opted to put the external match settings on the command line, at least initially, using essentially the same settings you provided above (save the partition name, of course), and I ran nhc for both a GPU and a non-GPU node. The commands I used were:
nhc -avl - -c test.conf HOSTNAME=some-gpu-node NHC_MCHECK_DELIM[0]=@ NHC_MCHECK_COMMAND[0]='"sinfo -hp %m --format=\"%n\" | fgrep -qw %h"'
nhc -avl - -c test.conf HOSTNAME=some-nongpu-node NHC_MCHECK_DELIM[0]=@ NHC_MCHECK_COMMAND[0]='"sinfo -hp %m --format=\"%n\" | fgrep -qw %h"'
The above settings/commands work perfectly, so I'd be curious to hear whether or not they work on your system. You may have noticed the single- and the double-quoting of the sinfo command; careful quoting is essential due to the multiple layers of shell interpretation involved.
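As a rough illustration of those quoting layers, here is a small bash sketch. It assumes that a VAR=value argument is evaluated once more as a shell assignment after the command line has been parsed; that is an assumption made for illustration, not something verified against the nhc source.

```shell
#!/usr/bin/env bash
# Layer 1: the calling shell strips the outer single quotes, so the
# program receives this text as one argument.
arg='NHC_MCHECK_COMMAND[0]="sinfo -hp %m --format=\"%n\" | fgrep -qw %h"'

# Layer 2 (assumed): the argument is evaluated as a shell assignment, which
# strips the outer double quotes and unescapes the inner ones.
eval "$arg"

# What remains is the command text with its inner double quotes intact.
printf '%s\n' "${NHC_MCHECK_COMMAND[0]}"
```

Under that assumption, the stored value is sinfo -hp %m --format="%n" | fgrep -qw %h. Dropping either layer of quoting would let the pipe or the inner quotes be consumed one step too early.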
Unfortunately, when trying to move the external match settings from the command line to the global config in /etc/sysconfig/nhc, hilarity ensued; i.e., I started seeing exactly the behavior you described, with mcheck() trying to treat it as a glob instead of an external match. I threw numerous instances of that declare -p command (1st line of the config) all over the place to try and figure out where things were going astray. Come to find out, when setting the variables using export or declare, the values vanished once nhcmain_load_sysconfig() returned. Long story short, since the sysconfig file is sourced inside a function, any variables you set with declare become local to that function, just as they would with local.
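The scoping behavior described above can be reproduced outside of NHC with a few lines of bash. This is only a sketch: the temporary file, the load_config function, and the variable names WITH_DECLARE and PLAIN_ASSIGN are all hypothetical stand-ins, not anything taken from nhc itself.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a sysconfig file; the names are illustrative only.
conf=$(mktemp)
cat > "$conf" <<'EOF'
declare -a WITH_DECLARE=( [0]=@ )
PLAIN_ASSIGN=( [0]=@ )
EOF

# Stand-in for a loader function that sources the config file.
load_config() {
    . "$conf"
}

load_config
rm -f "$conf"

# declare inside a function creates a local variable, so WITH_DECLARE is
# gone once load_config returns; the plain assignment stays global.
echo "WITH_DECLARE: ${WITH_DECLARE[0]:-<unset>}"
echo "PLAIN_ASSIGN: ${PLAIN_ASSIGN[0]:-<unset>}"
```

Running this in bash should report WITH_DECLARE as unset and PLAIN_ASSIGN as @, which is why a plain array assignment is the safer form for a file that gets sourced from inside a function.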
So I was able to get it to work reliably using exactly these two lines:
NHC_MCHECK_DELIM=( [0]=@ )
NHC_MCHECK_COMMAND=( [0]='sinfo -hp %m --format="%n" | fgrep -qw %h' )
Can you try exactly those settings and see if they work for you?
Thanks a lot! I confirm the solution you suggested works on version 1.4.3-1.
Awesome! I'm very glad to hear you got it working. :)
I'll go ahead and close this, but let me know if you run into any other issues!
Hello, I'm trying to group NHC checks per partition instead of per node. I have some nodes in a non-standard partition and will need to run different checks on them. Unfortunately, they have the same naming convention as the standard nodes, so wildcards cannot be used. I learned about the external match possibility here: https://github.com/mej/nhc/blob/9c4a38c0c9f48f92005c9120ca88145c33841dac/scripts/common.nhc#LL296
I've tried to add the following to /etc/sysconfig/nhc:
NHC_MCHECK_DELIM=( [0]="@" )
NHC_MCHECK_COMMAND=( \
[0]="sinfo -p %m --format="%n" | grep -v HOSTNAMES | fgrep -w %h" \
)
and in "nhc.conf" I use @sra@, as "sra" is the partition name.
However, this doesn't seem to be working. Checking the logs, "mcheck_external()" never seems to be called. Instead, it seems to be trying to match it as a glob: ">[{L2/S0/D5/R1}@common.nhc:290:mcheck_glob()]> dbg 'Glob match check: hl-codon-113-01 does not match @sra@'"
Any hints on how to make this work? Thanks