prometheus-community / ipmi_exporter

Remote IPMI exporter for Prometheus
MIT License

Bmc watchdog #176

Closed agaoglu closed 10 months ago

agaoglu commented 10 months ago

Our Supermicro BMCs provide watchdog functionality, i.e. they take a specified action if a timer is not reset within a specified time. The freeipmi tools include a bmc-watchdog command to control such a watchdog and report its current status. This collector reports that information.

An example output for bmc-watchdog from freeipmi:

$ sudo bmc-watchdog --get
Timer Use:                   BIOS FRB2
Timer:                       Running
Logging:                     Enabled
Timeout Action:              Power Cycle
Pre-Timeout Interrupt:       None
Pre-Timeout Interval:        1 seconds
Timer Use BIOS FRB2 Flag:    Clear
Timer Use BIOS POST Flag:    Clear
Timer Use BIOS OS Load Flag: Clear
Timer Use BIOS SMS/OS Flag:  Clear
Timer Use BIOS OEM Flag:     Clear
Initial Countdown:           600 seconds
Current Countdown:           541 seconds
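
The output above is a list of "Key: Value" lines, so it can be split into fields fairly easily. Below is a minimal Go sketch of that idea; it is not the code from this PR, and the names parseWatchdog and parseSeconds are hypothetical helpers made up for illustration. It assumes every line has the shape shown in the sample.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseWatchdog (hypothetical helper) splits each "Key: Value" line of
// `bmc-watchdog --get` output at the first colon and trims the padding.
func parseWatchdog(output string) map[string]string {
	fields := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		key, value, found := strings.Cut(scanner.Text(), ":")
		if !found {
			continue // not a "Key: Value" line
		}
		fields[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return fields
}

// parseSeconds (hypothetical helper) converts a value such as
// "600 seconds" to a float64 suitable for a gauge.
func parseSeconds(value string) (float64, error) {
	return strconv.ParseFloat(strings.TrimSuffix(value, " seconds"), 64)
}

func main() {
	// A few lines copied from the sample output above.
	sample := "Timer:                       Running\n" +
		"Timeout Action:              Power Cycle\n" +
		"Initial Countdown:           600 seconds\n"
	fields := parseWatchdog(sample)
	secs, err := parseSeconds(fields["Initial Countdown"])
	fmt.Println(fields["Timer"], fields["Timeout Action"], secs, err)
}
```

Splitting at the first colon only keeps the sketch safe should a value ever contain a colon itself; none of the values in the sample do.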

I've ignored the timer use flags (the "Timer Use ... Flag" lines) for now. The other fields seem to work:

# HELP ipmi_bmc_watchdog_current_countdown_seconds Watchdog initial countdown in seconds
# TYPE ipmi_bmc_watchdog_current_countdown_seconds gauge
ipmi_bmc_watchdog_current_countdown_seconds 584
# HELP ipmi_bmc_watchdog_initial_countdown_seconds Watchdog initial countdown in seconds
# TYPE ipmi_bmc_watchdog_initial_countdown_seconds gauge
ipmi_bmc_watchdog_initial_countdown_seconds 600
# HELP ipmi_bmc_watchdog_logging_state Watchdog log flag (1: Enabled, 0: Disabled / note: reverse of freeipmi)
# TYPE ipmi_bmc_watchdog_logging_state gauge
ipmi_bmc_watchdog_logging_state 1
# HELP ipmi_bmc_watchdog_pretimeout_interrupt_state Watchdog pre-timeout interrupt (1: active, 0: inactive)
# TYPE ipmi_bmc_watchdog_pretimeout_interrupt_state gauge
ipmi_bmc_watchdog_pretimeout_interrupt_state{interrupt="Messaging Interrupt"} 0
ipmi_bmc_watchdog_pretimeout_interrupt_state{interrupt="NMI / Diagnostic Interrupt"} 0
ipmi_bmc_watchdog_pretimeout_interrupt_state{interrupt="None"} 1
ipmi_bmc_watchdog_pretimeout_interrupt_state{interrupt="SMI"} 0
# HELP ipmi_bmc_watchdog_pretimeout_interval_seconds Watchdog pre-timeout interval in seconds
# TYPE ipmi_bmc_watchdog_pretimeout_interval_seconds gauge
ipmi_bmc_watchdog_pretimeout_interval_seconds 1
# HELP ipmi_bmc_watchdog_timeout_action_state Watchdog timeout action (1: active, 0: inactive)
# TYPE ipmi_bmc_watchdog_timeout_action_state gauge
ipmi_bmc_watchdog_timeout_action_state{action="Hard Reset"} 0
ipmi_bmc_watchdog_timeout_action_state{action="None"} 0
ipmi_bmc_watchdog_timeout_action_state{action="Power Cycle"} 1
ipmi_bmc_watchdog_timeout_action_state{action="Power Down"} 0
# HELP ipmi_bmc_watchdog_timer_state Watchdog timer running (1: running, 0: stopped)
# TYPE ipmi_bmc_watchdog_timer_state gauge
ipmi_bmc_watchdog_timer_state 1
# HELP ipmi_bmc_watchdog_timer_use_state Watchdog timer use (1: active, 0: inactive)
# TYPE ipmi_bmc_watchdog_timer_use_state gauge
ipmi_bmc_watchdog_timer_use_state{name="BIOS FRB2"} 1
ipmi_bmc_watchdog_timer_use_state{name="BIOS POST"} 0
ipmi_bmc_watchdog_timer_use_state{name="OEM"} 0
ipmi_bmc_watchdog_timer_use_state{name="OS LOAD"} 0
ipmi_bmc_watchdog_timer_use_state{name="SMS/OS"} 0
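
The labelled metrics above follow a state-set pattern: one series per known value, with 1 for the value currently reported and 0 for the rest. A minimal Go sketch of that mapping follows; it is not the collector's actual code, stateSet is a hypothetical name, and the list of actions is simply copied from the labels in the sample output.

```go
package main

import "fmt"

// stateSet (hypothetical helper) returns 1 for the known state equal to
// current and 0 for every other known state.
func stateSet(current string, known []string) map[string]float64 {
	states := make(map[string]float64, len(known))
	for _, name := range known {
		if name == current {
			states[name] = 1
		} else {
			states[name] = 0
		}
	}
	return states
}

func main() {
	// Assumption: these are the only timeout actions, taken from the
	// action labels in the sample output above.
	actions := []string{"Hard Reset", "None", "Power Cycle", "Power Down"}
	states := stateSet("Power Cycle", actions)
	for _, action := range actions {
		fmt.Printf("ipmi_bmc_watchdog_timeout_action_state{action=%q} %g\n", action, states[action])
	}
}
```

Exposing every known value, instead of only the active one, keeps the set of series stable when the state changes. The trade-off is that a value missing from the known list would show up as all zeros.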

bitfehler commented 10 months ago

Hi there,

first of all, thanks a lot. This certainly looks interesting and very comprehensive (docs! :raised_hands:). Code also looks pretty good at first glance, but please give me a bit more time to review the details.

For now, I'd be curious: how did you determine the possible fixed values (e.g. watchdogTimerUses, watchdogTimeoutActions, etc.)? Is this defined in the IPMI spec? Or documented somewhere?

agaoglu commented 10 months ago

Hi

I got them from the freeipmi documentation for the bmc-watchdog command. There are some differences between the values written in the manual and the way they're reported, though, so I had to try each one by setting it and reading the output again.

bitfehler commented 10 months ago

Ok, thanks for your patience, I think this looks pretty good. Could you kindly sign off your commits (git rebase --signoff) so that the tests are happy?

Thanks a lot!

agaoglu commented 10 months ago

I guess that clears it :)

Thank you.

bitfehler commented 10 months ago

Thanks a lot! :tada: