mej / nhc

LBNL Node Health Check

Add option to tell nhc to use the long or short hostname when marking node state #110

Closed taleintervenor closed 11 months ago

taleintervenor commented 2 years ago

Consider node names under the domain "pi.sjtu.edu.cn", such as "node838.pi.sjtu.edu.cn". The current version of nhc always uses the long hostname:

function nhcmain_init_env() {
    ...
    if [[ -r /proc/sys/kernel/hostname ]]; then
        read HOSTNAME < /proc/sys/kernel/hostname
    elif [[ -z "$HOSTNAME" ]]; then
        HOSTNAME="localhost"
    fi
    HOSTNAME_S=${HOSTNAME/%.*}
    ...
}
function nhcmain_mark_online() {
    if [[ -n "$NHC_RM" && "$MARK_OFFLINE" -eq 1 ]]; then
        eval $ONLINE_NODE "'$HOSTNAME'"
    fi
}

But slurm can be configured to use short hostnames, and when it is told to use the short one, it does not recognize the long hostname. In that case nhc fails to co-operate with slurm:

> tail /var/log/nhc.log
20211216 14:43:52 [slurm] /usr/libexec/nhc/node-mark-online node838.pi.sjtu.edu.cn
/usr/libexec/nhc/node-mark-online:  Not sure how to handle node state "" on node838.pi.sjtu.edu.cn

From sinfo the reason is obvious:

> hostname
node838.pi.sjtu.edu.cn
> sinfo --node=node838
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
lencpu*      up 3-00:00:00      1   idle node838
> sinfo --node=node838.pi.sjtu.edu.cn
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
lencpu*      up 3-00:00:00      0    n/a

So I suggest that nhc implement an option that lets the user choose which hostname format nhc uses when co-operating with slurm.
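
A minimal sketch of what such an option could look like, not a patch against nhc itself. The setting name NHC_HOSTNAME_FORMAT and the helper nhc_rm_hostname are hypothetical names made up for illustration; nhc already derives a short name (HOSTNAME_S) in nhcmain_init_env, so the real change might only need to choose between the two existing variables.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: NHC_HOSTNAME_FORMAT and nhc_rm_hostname are
# illustrative names, not existing nhc settings or functions.

# Print the node name to hand to the resource manager.
#   $1: full hostname (e.g. the contents of /proc/sys/kernel/hostname)
#   $2: "long" (default) or "short"
nhc_rm_hostname() {
    local full="$1" format="${2:-long}"
    case "$format" in
        # Strip everything from the first dot onward.
        short) printf '%s\n' "${full%%.*}" ;;
        # Pass the name through unchanged.
        long)  printf '%s\n' "$full" ;;
        *)     printf 'unknown hostname format: %s\n' "$format" >&2
               return 1 ;;
    esac
}

# Example call: defaults to the long form unless the (assumed) setting is "short".
nhc_rm_hostname "node838.pi.sjtu.edu.cn" "${NHC_HOSTNAME_FORMAT:-long}"
```

With the setting at "short", the mark-online and mark-offline helpers would receive node838, which is the name slurm recognizes in the sinfo output above. Keeping "long" as the default would preserve the current behaviour for existing sites.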

mej commented 11 months ago

I totally agree here! As it happens, I've opened an Issue for this (#129) that lays out my thoughts and potential avenues to address this. I'm going to close this one in favor of that one, but I'd love to hear any feedback you might have on the proposals I've made in #129! 😀