mej / nhc

LBNL Node Health Check

Pattern error message #130

Open brunoagneray opened 1 year ago

brunoagneray commented 1 year ago

Hi,

We use version lbnl-nhc-1.4.3-1.

We have nodes with names like cluster-n[01-99] and storage nodes with names like cluster-nfs[01-99].

We have the following lines in our nhc.conf file:

{cluster-n[01-99]} || export NHC_RM=
{cluster-nfs[01-99]} || export NHC_RM=

When executing the command 'nhc -a' on a storage node in cluster-nfs, we get an error message like the following:

/etc/nhc/scripts/common.nhc: line 201: [[: 10#fs15: value too great for base (error token is "10#fs15")

Regards,

Bruno

mej commented 1 year ago

I think what's going on here is that NHC is getting confused by the fact that the leading portion (what the code refers to as the PREFIX) of the hostname cluster-nfs15, when matched against the range expression cluster-n[01-99], is taken to be cluster-n. Once that gets trimmed off, it then tries to treat the remainder of the hostname (i.e., fs15) as a number that it then tries to compare with the range 01-99 to see if the "number" fs15 falls within that range.
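To make the failure mode above concrete, here is a minimal Bash sketch of the prefix-trimming step. It is not NHC's actual code: the function name `remainder_after_prefix` is hypothetical, and it only assumes that the prefix is everything before the first `[` in the range expression.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not NHC's code: print what is left of HOST once the
# literal prefix of a range expression such as "cluster-n[01-99]" is trimmed.
remainder_after_prefix() {
    local host="$1" expr="$2"
    local prefix="${expr%%\[*}"      # text before the first "[", e.g. "cluster-n"
    if [[ "$host" == "$prefix"* ]]; then
        printf '%s\n' "${host#"$prefix"}"
    else
        return 1                     # the prefix does not match at all
    fi
}

remainder_after_prefix cluster-n15   'cluster-n[01-99]'    # prints 15
remainder_after_prefix cluster-nfs15 'cluster-n[01-99]'    # prints fs15
```

The second call shows why cluster-nfs15 is a problem: the prefix cluster-n matches, so the non-numeric remainder fs15 is what reaches the numeric comparison.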

Because Bash auto-interprets numbers in bases other than 10 under certain circumstances (a leading 0 means octal, for example), the range-matching code prepends 10# to the numeric variables (https://github.com/mej/nhc/blob/1.4.3/scripts/common.nhc#L201) to ensure they get treated as base-10 numbers in all cases. In this situation, however, fs is getting erroneously lumped into the numeric value, and as the error message says, f and s don't fall within the range of digits that are valid for base-10 numbers.
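The effect of the 10# prefix can be reproduced by hand with a few lines of Bash. This is a sketch under stated assumptions, not the code from common.nhc: `in_range_base10` is a hypothetical helper that compares a value against a low and a high bound inside `[[ ]]`.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not NHC's code: succeeds when VALUE lies in [LOW, HIGH],
# with every operand forced to base 10 through the 10# prefix.
in_range_base10() {
    local value="$1" low="$2" high="$3"
    [[ "10#$value" -ge "10#$low" && "10#$value" -le "10#$high" ]]
}

# Without 10#, a leading zero means octal, so 08 is not a valid number.
if [[ 08 -le 99 ]] 2>/dev/null; then echo "plain 08 accepted"; else echo "plain 08 rejected"; fi

# With 10#, zero-padded node numbers compare as expected.
in_range_base10 08 01 99 && echo "08 is in 01-99"

# A non-numeric remainder fails and Bash reports "value too great for base" on stderr.
in_range_base10 fs15 01 99 || echo "fs15 is rejected"
```

The last call returns a non-zero status rather than aborting the script, which is consistent with the error being noisy but not breaking the match result.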

I'll see if I can reproduce the problem myself by hand, but if you'd be willing to attach the output from running nhc -ax on that cluster-nfs15 host, that'd help a lot! 😀 In the meantime, though, the error shouldn't be causing any actual breakage -- range expression matching should still be working accurately, right?

Thanks for reporting the bug!

PS: As a possible workaround, for the time being, you could change to a glob expression like cluster-n[0-9][0-9] or a regular expression like /^cluster-n[[:digit:]]+$/.
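For anyone who wants to check those two workaround patterns before editing nhc.conf, the sketch below tries them against both hostnames. It uses Bash's own glob and regex matching, which is an assumption about how the patterns behave and not a call into NHC's matcher; the surrounding slashes of the NHC regex syntax are dropped because `=~` does not use them.

```shell
#!/usr/bin/env bash
# Illustration only: Bash's built-in matching stands in for NHC's matcher.
glob='cluster-n[0-9][0-9]'
regex='^cluster-n[[:digit:]]+$'

for host in cluster-n15 cluster-nfs15; do
    if [[ "$host" == $glob ]]; then g="match"; else g="no match"; fi
    if [[ "$host" =~ $regex ]]; then r="match"; else r="no match"; fi
    printf '%s: glob=%s regex=%s\n' "$host" "$g" "$r"
done
```

Both patterns accept cluster-n15 and reject cluster-nfs15, because neither allows the letters fs between the prefix and the digits.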

brunoagneray commented 1 year ago

Hi Michael,

Thanks for your answer.

Please find below the output of the 'nhc -ax' command on cluster-nfs15.

> PS: As a possible workaround, for the time being, you could change to a glob expression like cluster-n[0-9][0-9] or a regular expression like /^cluster-n[[:digit:]]+$/.

We use the same nhc.conf on all our nodes (heterogeneous nodes, some managed by SLURM and some not), so there are 42 patterns (pdsh-style patterns with {}) to modify.

As the errors only appear on our spiro-nfs[01-15] nodes, and the cause of these messages has been identified and has no impact, we will be patient.

Many thanks for your support!

Best regards,

Bruno

@. ~]# nhc -ax
@.:342:nhcmain_parse_cmdline()]> dbg 'BASH tracing active.'
@.:99:dbg()]> local PREFIX=
@.:101:dbg()]> [[ '' == \1 ]]
@.:328:nhcmain_parse_cmdline()]> getopts :D:ac:de:fhl:n:qt:vx OPTION
@.:347:nhcmain_parse_cmdline()]> shift 1
@.:348:nhcmain_parse_cmdline()]> [[ ! -z '' ]]
@.:352:nhcmain_parse_cmdline()]> return 0
@.:729:main()]> nhcmain_load_sysconfig
@.:359:nhcmain_load_sysconfig()]> [[ -f /etc/sysconfig/nhc ]]
@.:730:main()]> nhcmain_finalize_env
@.:367:nhcmain_finalize_env()]> CONFDIR=/etc/nhc
@.:368:nhcmain_finalize_env()]> CONFFILE=/etc/nhc/nhc.conf
@.:369:nhcmain_finalize_env()]> INCDIR=/etc/nhc/scripts
@.:370:nhcmain_finalize_env()]> HELPERDIR=/usr/libexec/nhc
@.:371:nhcmain_finalize_env()]> ONLINE_NODE=/usr/libexec/nhc/node-mark-online
@.:372:nhcmain_finalize_env()]> OFFLINE_NODE=/usr/libexec/nhc/node-mark-offline
@.:373:nhcmain_finalize_env()]> LOGFILE='>>/var/log/nhc.log 2>&1'
@.:374:nhcmain_finalize_env()]> RESULTFILE=/var/run/nhc/nhc.status
@.:375:nhcmain_finalize_env()]> DEBUG=0
@.:376:nhcmain_finalize_env()]> TS=0
@.:377:nhcmain_finalize_env()]> SILENT=0
@.:378:nhcmain_finalize_env()]> VERBOSE=0
@.:379:nhcmain_finalize_env()]> MARK_OFFLINE=1
@.:380:nhcmain_finalize_env()]> DETACHED_MODE=0
@.:381:nhcmain_finalize_env()]> DETACHED_MODE_FAIL_NODATA=0
@.:382:nhcmain_finalize_env()]> TIMEOUT=30
@.:383:nhcmain_finalize_env()]> NHC_CHECK_ALL=1
@.:384:nhcmain_finalize_env()]> NHC_CHECK_FORKED=0
@.:385:nhcmain_finalize_env()]> export NHC_SID=0
@.:385:nhcmain_finalize_env()]> NHC_SID=0
@.:388:nhcmain_finalize_env()]> kill -s 0 -- -784937
@.:389:nhcmain_finalize_env()]> [[ 0 -eq 0 ]]
@.:391:nhcmain_finalize_env()]> dbg 'NHC process 784937 is session leader.'
@.:99:dbg()]> local PREFIX=
@.:101:dbg()]> [[ 0 == \1 ]]
@.:392:nhcmain_finalize_env()]> NHC_SID=-784937
@.:405:nhcmain_finalize_env()]> [[ -n '' ]]
@.:410:nhcmain_finalize_env()]> [[ >>/var/log/nhc.log 2>&1 != >>\/\v\a\r\/\l\o\g\/\n\h\c.\l\o\g\ \2>\&\1 ]]
@.:413:nhcmain_finalize_env()]> [[ >>/var/log/nhc.log 2>&1 == - ]]
@.:418:nhcmain_finalize_env()]> [[ -z '' ]]
@.:419:nhcmain_finalize_env()]> nhcmain_find_rm
@.:455:nhcmain_find_rm()]> local DIR
@.:456:nhcmain_find_rm()]> local -a DIRLIST
@.:458:nhcmain_find_rm()]> [[ -d /var/spool/torque ]]
@.:461:nhcmain_find_rm()]> [[ -n '' ]]
@.:468:nhcmain_find_rm()]> type -a -p -f -P scontrol
@.:471:nhcmain_find_rm()]> type -a -p -f -P pbsnodes
@.:474:nhcmain_find_rm()]> type -a -p -f -P qselect
@.:477:nhcmain_find_rm()]> type -a -p -f -P badmin
@.:477:nhcmain_find_rm()]> type -a -p -f -P sbatchd
@.:482:nhcmain_find_rm()]> [[ -z '' ]]
@.:483:nhcmain_find_rm()]> dbg 'Unable to detect resource manager.'
@.:99:dbg()]> local PREFIX=
@.:101:dbg()]> [[ 0 == \1 ]]
@.:484:nhcmain_find_rm()]> return 1
@.:420:nhcmain_finalize_env()]> ONLINE_NODE=:
@.:421:nhcmain_finalize_env()]> OFFLINE_NODE=:
@.:422:nhcmain_finalize_env()]> MARK_OFFLINE=0
@.:425:nhcmain_finalize_env()]> [[ '' == \s\g\e ]]
@.:436:nhcmain_finalize_env()]> [[ 0 -ne 0 ]]
@.:443:nhcmain_finalize_env()]> [[ -n '' ]]
@.:445:nhcmain_finalize_env()]> [[ 0 -eq 1 ]]
@.:451:nhcmain_finalize_env()]> export NAME CONFDIR CONFFILE INCDIR HELPERDIR ONLINE_NODE OFFLINE_NODE LOGFILE DEBUG TS SILENT TIMEOUT NHC_RM
@.:731:main()]> [[ -n '' ]]
@.:736:main()]> nhcmain_redirect_output
@.:489:nhcmain_redirect_output()]> [[ -n >>/var/log/nhc.log 2>&1 ]]
@.:490:nhcmain_redirect_output()]> exec
@.***:710:nhcmain_finish()]> exit 0