prometheus-community / ipmi_exporter

Remote IPMI exporter for Prometheus
MIT License
459 stars 131 forks source link

Added Power Measurement State to DCMI collector #160

Closed k0ste closed 1 year ago

k0ste commented 1 year ago

Added DCMI metric to definitely determine avail of Power Measurement feature

In some cases, for example with cheap Power Supply Units power consumption may be not available at all, to control this added ipmi_dcmi_power_measurement_state metric. The value is 1 for power measure is available on, and 0 otherwise.

The main point for this changes is a try to knowledge: the measurements isn't accessed due PSU broken (FW/cable/etc issues) or not avail at all on system level (and system know about that)

The host with "normal PSU":

root@host# curl -Ss "http://127.0.0.1:9291/ipmi?module=default&target=10.249.0.9" | grep -i dcmi
# HELP ipmi_dcmi_power_consumption_watts Current power consumption in Watts.
# TYPE ipmi_dcmi_power_consumption_watts gauge
ipmi_dcmi_power_consumption_watts 353
# HELP ipmi_dcmi_power_measurement_state Availability of Power Measurement (0=false, 1=true).
# TYPE ipmi_dcmi_power_measurement_state gauge
ipmi_dcmi_power_measurement_state 1
ipmi_up{collector="dcmi"} 1

The host with "strange PSU's"

root@host# curl -Ss "http://127.0.0.1:9291/ipmi?module=default&target=10.249.0.6" | grep -i dcmi
# HELP ipmi_dcmi_power_consumption_watts Current power consumption in Watts.
# TYPE ipmi_dcmi_power_consumption_watts gauge
ipmi_dcmi_power_consumption_watts 0
# HELP ipmi_dcmi_power_measurement_state Availability of Power Measurement (0=false, 1=true).
# TYPE ipmi_dcmi_power_measurement_state gauge
ipmi_dcmi_power_measurement_state 0
ipmi_up{collector="dcmi"} 1
k0ste commented 1 year ago

Ping @bitfehler

bitfehler commented 1 year ago

Hi @k0ste, sorry and thanks for the ping. Thanks for the patch! In general, this is a welcome addition. One issue though: I'd much prefer the metric to simply not be there if it is unsupported. Could this be done, or is there a reason why the additional metric is needed? For example, is it possible for the ipmi_dcmi_power_measurement_state to change in the lifetime of a PSU? From what I understand it just depends on the hardware and never changes? If so, I'd prefer using the result of the availability check to simply drop the power consumption metric completely if not available.

k0ste commented 1 year ago

Hi, @bitfehler! When I was imagine, how's better to handle this, I'm also thinked to this side, but:

  1. Zero watts is also a metric, i.e. 0 != Nan
  2. This will be breaking changes - some teams may use ipmi_dcmi_power_consumption_watts == 0 as query, and this queries should be rewritten (to unless(), I guess)
  3. If we drop the ipmi_dcmi_power_consumption_watts, when the measurement is zero, this will not help to debug: why the power metrics was absent or lowered:
    • This because the "Intel Cloud Redundancy" is not setup'ed? [by default, Intel servers use power only from one PSU, we use raw ipmi command for turn on "shared mode"
    • This because something bad between PSU and Motherboard? In case, when ipmi_dcmi_power_measurement_state == 1 & ipmi_dcmi_power_consumption_watts == 0
    • This because measurement impossible on system level? In case, when ipmi_dcmi_power_measurement_state == 0 & ipmi_dcmi_power_consumption_watts == 0

Thanks!

bitfehler commented 1 year ago

Zero watts is also a metric, i.e. 0 != Nan

Absolutely!

3. If we drop the ipmi_dcmi_power_consumption_watts, when the measurement is zero

I was actually thinking more like this: do perform the test for active/not available that you added, if "active" output metric regardless of value, if not available drop metric. Shouldn't that do?

k0ste commented 1 year ago

I was actually thinking more like this: do perform the test for active/not available that you added, if "active" output metric regardless of value, if not available drop metric. Shouldn't that do?

The 3rd point ā˜ļø exactly about this idea šŸŒš

bitfehler commented 1 year ago

I am not sure I understand. Your 3rd point says

If we drop the ipmi_dcmi_power_consumption_watts, when the measurement is zero

but my suggestion is to only drop it if the new test you added in this PR says not available (i.e. when ipmi_dcmi_power_measurement_state would be 0).

As I understand it, with your current patch, the decision table would be:

ipmi_dcmi_power_consumption_watts ipmi_dcmi_power_measurement_state Deduction
0 0 BMC/PSU don't support it
0 1 something is amiss, investigation needed
>0 0 impossible
>0 1 metric as intended

Or am I missing something?

My suggestion would simply be to translate this into the following decision table:

ipmi_dcmi_power_consumption_watts Deduction
N/A BMC/PSU don't support it
0 something is amiss, investigation needed
> 0 metric as intended

Note the absence of an "impossible" state, hence I think it would make for a better solution. Sorry if I am misunderstanding s/t, maybe you can elaborate on your third point?

k0ste commented 1 year ago

Note the absence of an "impossible" state, hence I think it would make for a better solution. Sorry if I am misunderstanding s/t, maybe you can elaborate on your third point?

You understanding is totally correctly. (But) If we translate "my table" to "your table" - this will be breaking changes in current how ipmi_exporter performed, where ipmi_dcmi_power_consumption_watts is always present

k0ste commented 1 year ago

If the project okay for small breaking changes, I can prepare the release 1.7 with notes to CHANGELOG šŸ™‚

bitfehler commented 1 year ago

@k0ste a breaking change is totally fine. can you update the patch accordingly? thanks a lot!

k0ste commented 1 year ago

@bitfehler updated! The checks:

Host with Power Measurement

root@host# curl -Ss "http://localhost:9291/ipmi?module=default&target=10.249.0.9" | grep -i power
# HELP ipmi_chassis_power_state Current power state (1=on, 0=off).
# TYPE ipmi_chassis_power_state gauge
ipmi_chassis_power_state 1
# HELP ipmi_dcmi_power_consumption_watts Current power consumption in Watts (no value if Power Measurement is not avail)
# TYPE ipmi_dcmi_power_consumption_watts gauge
ipmi_dcmi_power_consumption_watts 372
ipmi_sensor_state{id="4024",name="PS1 Status",type="Power Supply"} 0
ipmi_sensor_state{id="4091",name="PS2 Status",type="Power Supply"} 0
ipmi_sensor_value{id="4024",name="PS1 Status",type="Power Supply"} NaN
ipmi_sensor_value{id="4091",name="PS2 Status",type="Power Supply"} NaN

Host without Power Measurement

root@host# curl -Ss "http://localhost:9291/ipmi?module=default&target=10.249.0.6" | grep -i power
# HELP ipmi_chassis_power_state Current power state (1=on, 0=off).
# TYPE ipmi_chassis_power_state gauge
ipmi_chassis_power_state 1
ipmi_sensor_state{id="4158",name="PS1 Status",type="Power Supply"} 0
ipmi_sensor_state{id="4225",name="PS2 Status",type="Power Supply"} 0
ipmi_sensor_value{id="4158",name="PS1 Status",type="Power Supply"} NaN
ipmi_sensor_value{id="4225",name="PS2 Status",type="Power Supply"} NaN
k0ste commented 1 year ago

ping @bitfehler

k0ste commented 1 year ago

Changes maden šŸ™