lovoo / ipmi_exporter

IPMI Exporter for prometheus.io, written in Go.
BSD 3-Clause "New" or "Revised" License

ipmi_exporter does not handle unresponsive BMCs properly #31

Open korekhov opened 6 years ago

korekhov commented 6 years ago

When the BMC is unresponsive, ipmi_exporter keeps spawning ipmitool processes, which will eventually bring the system down:

ps -ef | grep export

root 142647 1 0 13:47 ? 00:00:23 /opt/spc/ipmi_exporter
root 142650 1 0 13:47 ? 00:00:29 /opt/spc/node_exporter

ps -ef | grep ipmitool | wc -l

655

ps -ef | grep ipmitool

root 154401 142647 0 16:19 ? 00:00:00 ipmitool raw 0x06 0x52 0x07 0x78 0x01 0x97
root 154402 142647 0 16:19 ? 00:00:00 ipmitool raw 0x06 0x52 0x07 0x7a 0x01 0x97
<snip>
root 156929 142647 0 16:46 ? 00:00:00 ipmitool sensor
root 156934 142647 0 16:46 ? 00:00:00 ipmitool sensor

On this particular machine I have a bad BMC that does not respond to local IPMI commands (cold-resetting it through the network does not help).

In some other tooling I use a predefined timeout value; when that timeout is reached, I set the value of a bmc_health fact/metric accordingly to indicate the status.

Perhaps a similar approach could be used here to report a bmc_up/bmc_down metric?

pmb311 commented 6 years ago

It would be great to get this fixed soon.