jenningsloy318 / redfish_exporter

exporter to get metrics from redfish based hardware such as lenovo/dell/superc servers
Apache License 2.0

Add redfish_chassis_temperature_sensor_health_state metric #73

Closed ulikl closed 9 months ago

ulikl commented 10 months ago

Hi,

The current temperature metrics look like this:

redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature",  sensor="CPU1 Temp", sensor_id="0"} 37
redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature",  sensor="CPU2 Temp", sensor_id="1"} 32
redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature",  sensor="System Board Exhaust Temp", sensor_id="4"} 30
redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature",  sensor="System Board GPU7 Temp", sensor_id="3"} 32
redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature",  sensor="System Board Inlet Temp", sensor_id="2"} 19
redfish_chassis_temperature_sensor_state{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature",  sensor="CPU1 Temp", sensor_id="0"} 1
redfish_chassis_temperature_sensor_state{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature",  sensor="CPU2 Temp", sensor_id="1"} 1
redfish_chassis_temperature_sensor_state{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature",  sensor="System Board Exhaust Temp", sensor_id="4"} 1
redfish_chassis_temperature_sensor_state{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature",  sensor="System Board GPU7 Temp", sensor_id="3"} 1
redfish_chassis_temperature_sensor_state{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature",  sensor="System Board Inlet Temp", sensor_id="2"} 1

Note: for this test I set the warning threshold for the sensor "System Board Inlet Temp" to 17, which is below its reading of 19. The only state/health metrics with a value > 1 in this case are:

redfish_system_health_state{cluster="steyr-prod-gpu",environment="prod",instance="steyr-prod-gpu__lp05edge02008",job="redfish-exporter",node="lp05edge02008",prometheus="victoriametrics/central",resource="system",scrape_from="edge-tooling",system_id="System.Embedded.1"} 2
redfish_chassis_health{chassis_id="System.Embedded.1",cluster="steyr-prod-gpu",environment="prod",instance="steyr-prod-gpu__lp05edge02008",job="redfish-exporter",node="lp05edge02008",prometheus="victoriametrics/central",resource="chassis",scrape_from="edge-tooling"} 2

So in this case we can only get an unspecific chassis alert, or we need to define an alert on redfish_chassis_temperature_celsius with separate thresholds in the alert definition, which might not match the thresholds configured on the servers.

But, at least for our Dell servers, a Health value is also provided per sensor via: https:///redfish/v1/Chassis/System.Embedded.1/Sensors/SystemBoardInletTemp

For example:

{
    "@odata.context": "/redfish/v1/$metadata#Sensor.Sensor",
    "@odata.id": "/redfish/v1/Chassis/System.Embedded.1/Sensors/SystemBoardInletTemp",
    "@odata.type": "#Sensor.v1_5_0.Sensor",
    "Name": "System Board Inlet Temp",
    "Id": "SystemBoardInletTemp",
    "Description": "Instance of Sensor Id",
    "ReadingType": "Temperature",
    "ReadingUnits": "Cel",
    "Status": {
        "Health": "Warning",
        "State": "Enabled"
    },
    "Reading": 20.0,
   ...
}

Can the redfish_exporter be extended with such a temperature health metric?
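
To illustrate the request, here is a minimal sketch of how the Status.Health field from a payload like the one above could be turned into a numeric gauge value. It is not the exporter's code: the helper names are hypothetical, and the encoding OK=1, Warning=2, Critical=3 (with 0 for unknown) is an assumption.

```python
import json

# Assumed numeric encoding for Redfish Status.Health (hypothetical, for illustration).
HEALTH_VALUES = {"OK": 1, "Warning": 2, "Critical": 3}


def health_to_value(health):
    """Return the gauge value for a Status.Health string, or 0 if unknown or missing."""
    return HEALTH_VALUES.get(health, 0)


def sensor_health_sample(payload):
    """Extract (sensor name, health value) from a Redfish Sensor resource given as JSON text."""
    sensor = json.loads(payload)
    status = sensor.get("Status") or {}
    return sensor.get("Name", ""), health_to_value(status.get("Health"))


# Trimmed copy of the payload quoted above; only fields shown there are used.
EXAMPLE = """
{
    "Name": "System Board Inlet Temp",
    "Id": "SystemBoardInletTemp",
    "ReadingType": "Temperature",
    "Status": {"Health": "Warning", "State": "Enabled"},
    "Reading": 20.0
}
"""

if __name__ == "__main__":
    print(sensor_health_sample(EXAMPLE))
```

With such a per-sensor value, an alert can point at the specific sensor instead of the whole chassis, and it follows the thresholds configured on the server itself.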

jenningsloy318 commented 9 months ago

I checked the code. We have redfish_chassis_temperature_celsius and redfish_chassis_temperature_sensor_state, but we don't have redfish_chassis_temperature_sensor_health. I will check whether we can add it.

jenningsloy318 commented 9 months ago

@ulikl the latest commit adds this metric. Please build and test it, since I don't have a device to test on.

ulikl commented 9 months ago

@jenningsloy318, thank you very much. It's working:

# HELP redfish_chassis_temperature_celsius celsius of temperature on this chassis component
# TYPE redfish_chassis_temperature_celsius gauge
redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1",resource="temperature",sensor="CPU1 Temp",sensor_id="0"} 36
redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1",resource="temperature",sensor="CPU2 Temp",sensor_id="1"} 36
redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1",resource="temperature",sensor="System Board Exhaust Temp",sensor_id="3"} 37
redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1",resource="temperature",sensor="System Board Inlet Temp",sensor_id="2"} 27
# HELP redfish_chassis_temperature_sensor_health status health of temperature on this chassis component,1(Enabled),2(Disabled),3(StandbyOffinline),4(StandbySpare),5(InTest),6(Starting),7(Absent),8(UnavailableOffline),9(Deferring),10(Quiesced),11(Updating)
# TYPE redfish_chassis_temperature_sensor_health gauge
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="CPU1 Temp",sensor_id="0"} 1
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="CPU2 Temp",sensor_id="1"} 1
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="System Board Exhaust Temp",sensor_id="3"} 1
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="System Board Inlet Temp",sensor_id="2"} 1

With the inlet temperature over the warning threshold:

# TYPE redfish_chassis_temperature_sensor_health gauge
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="CPU1 Temp",sensor_id="0"} 1
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="CPU2 Temp",sensor_id="1"} 1
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="System Board Exhaust Temp",sensor_id="3"} 1
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="System Board Inlet Temp",sensor_id="2"} 2
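
As a quick way to check output like the above by hand, here is a rough sketch that scans exposition text and lists the sensors whose health value is above 1. The function name and regular expressions are hypothetical, and the parser is deliberately simplified (it only handles samples with a label set), so treat it as an illustration rather than a full parser for the exposition format.

```python
import re

# Simplified patterns: metric name, a {label="value",...} block, then the sample value.
SAMPLE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def unhealthy_sensors(text, metric="redfish_chassis_temperature_sensor_health"):
    """Return (sensor, value) pairs for samples of `metric` whose value is > 1."""
    result = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        value = float(match.group("value"))
        if value > 1:
            result.append((labels.get("sensor", ""), value))
    return result


SCRAPE = '''
# TYPE redfish_chassis_temperature_sensor_health gauge
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="CPU1 Temp",sensor_id="0"} 1
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="System Board Inlet Temp",sensor_id="2"} 2
'''

if __name__ == "__main__":
    print(unhealthy_sensors(SCRAPE))
```

In Prometheus itself the same check would be the expression `redfish_chassis_temperature_sensor_health > 1`, which keeps the `sensor` label so an alert can name the affected sensor.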

fschlich commented 9 months ago

If "2" means Warning, the HELP text is wrong: it should use CommonHealthHelp instead of CommonStateHelp, no?