On this particular machine I have a bad BMC that does not respond to local IPMI commands (cold-resetting it through the network does not help).
In some other tooling I use a predefined timeout, and when that timeout is reached I set the value of a bmc_health fact/metric accordingly to indicate the status.
Perhaps a similar approach could be used here to report a bmc_up/bmc_down metric?
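To illustrate the timeout idea, here is a minimal sketch in Python. It is not the exporter's actual code. The helper names `probe_bmc` and `render_metric`, the 5-second default, and the choice of `ipmitool mc info` as the probe command are all my own assumptions. The point is only that the child process gets killed once the timeout expires, and the result is reported as a 0/1 value instead of leaving the process behind.

```python
import subprocess


def probe_bmc(cmd, timeout_s=5.0):
    """Run a probe command; return 1 if it exits 0 within timeout_s, else 0.

    subprocess.run() kills the child when the timeout expires, so a hung
    probe is not left running in the background.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The probe did not answer in time: treat the BMC as down.
        return 0
    except OSError:
        # The binary is missing or not executable: also report down.
        return 0
    return 1 if result.returncode == 0 else 0


def render_metric(up):
    """Format the value in Prometheus text exposition style.

    The metric name bmc_up follows the suggestion above.
    """
    return f"bmc_up {up}"
```

Usage would be something like `render_metric(probe_bmc(["ipmitool", "mc", "info"]))`, which would give `bmc_up 0` on a machine like mine if the probe hangs or fails.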
With an unresponsive BMC, ipmi_exporter keeps spawning ipmitool processes that never exit (so it will eventually kill the system):
ps -ef | grep export
root 142647 1 0 13:47 ? 00:00:23 /opt/spc/ipmi_exporter
root 142650 1 0 13:47 ? 00:00:29 /opt/spc/node_exporter
ps -ef | grep ipmitool | wc -l
655
ps -ef | grep ipmitool
root 154401 142647 0 16:19 ? 00:00:00 ipmitool raw 0x06 0x52 0x07 0x78 0x01 0x97
root 154402 142647 0 16:19 ? 00:00:00 ipmitool raw 0x06 0x52 0x07 0x7a 0x01 0x97
<snip>
root 156929 142647 0 16:46 ? 00:00:00 ipmitool sensor
root 156934 142647 0 16:46 ? 00:00:00 ipmitool sensor