thomas-krenn / check_ipmi_sensor_v3

Monitoring plugin to check IPMI sensors
https://www.thomas-krenn.com/en/wiki/IPMI_Sensor_Monitoring_Plugin
GNU General Public License v3.0

What is BP0? #51

Closed maltewhiite closed 3 years ago

maltewhiite commented 3 years ago

The guy who set all this up doesn't work here anymore, so nobody knows what this Nagios alert means. It says:

IPMI Status: Critical [BP0 Presence = Critical]

And the "check_ipmi_sensor" command says:

IPMI Status: Critical [Presence = Critical, Presence = Critical, BP0 Presence = Critical] | 'Current Power'=366 'Temp'=63.00 'Temp'=76.00 'Inlet Temp'=20.00;3.00:33.00;-7.00:37.00 'Fan1'=6120.00;840.00:;480.00: 'Fan2'=6120.00;840.00:;480.00: 'Fan3'=6240.00;840.00:;480.00: 'Fan4'=6360.00;840.00:;480.00: 'Fan5'=6240.00;840.00:;480.00: 'Fan6'=6240.00;840.00:;480.00: 'Current 1'=0.80 'Current 2'=0.80 'Voltage 1'=486.00 'Voltage 2'=486.00 'Pwr Consumption'=726.00;~:2354.00;~:2596.00 'IO Usage'=0.00;~:101.00; 'MEM Usage'=0.00;~:101.00; 'SYS Usage'=4.00;~:101.00; 'CPU Usage'=4.00;~:101.00; 'Exhaust Temp'=44.00;8.00:75.00;3.00:80.00

Assume I know nothing about hardware. I only have a software background and was suddenly tasked with taking over the Nagios monitoring.

What is BP0?
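For readers who are also coming from the software side: the plugin output quoted above follows the usual Nagios layout of status text, a `|`, and then performance data. The following is only a minimal sketch of how that line could be taken apart; `parse_plugin_output` is a hypothetical helper, not part of check_ipmi_sensor, and it assumes flagged sensors always appear as `Name = State` inside square brackets.

```python
import re

def parse_plugin_output(line):
    """Split one plugin output line into status text, flagged sensors, and perfdata."""
    # Everything before the first "|" is human-readable status, the rest is perfdata.
    status_text, _, perf_text = line.partition("|")

    # Assumption: flagged sensors are listed as "Name = State" inside [ ... ].
    flagged = []
    bracket = re.search(r"\[(.*?)\]", status_text)
    if bracket:
        for item in bracket.group(1).split(","):
            name, _, state = item.rpartition("=")
            flagged.append((name.strip(), state.strip()))

    # Perfdata entries look like 'label'=value;warn;crit and labels may repeat,
    # so they are kept as a list of pairs rather than a dict.
    perfdata = [(label, float(value))
                for label, value in re.findall(r"'([^']+)'=([-0-9.]+)", perf_text)]
    return status_text.strip(), flagged, perfdata

sample = ("IPMI Status: Critical [Presence = Critical, Presence = Critical, "
          "BP0 Presence = Critical] | 'Current Power'=366 'Temp'=63.00 'Temp'=76.00")
status, flagged, perfdata = parse_plugin_output(sample)
print(flagged)
```

The point of the sketch is that only the bracketed part decides the alert state; the long tail after the `|` is measurement data for graphing.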

veitw commented 3 years ago

Hi maltewhiite,

The identifiers do not come from check_ipmi_sensor; they are defined by the system, mainboard, or BMC vendor. The same applies to the severity, which some vendors assign quite adventurously.

You might gain additional information by looking into the BMC via IPMI, SSH, or its web interface. For many vendors, BP is an abbreviation for backplane and usually refers to a storage backplane.
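As a rough illustration of looking into the BMC via IPMI, the sketch below wraps `ipmitool sdr elist` and filters its output for a sensor name. It assumes that `ipmitool` is installed, that the BMC is reachable over the network with the `lanplus` interface, and that the password is supplied in the `IPMI_PASSWORD` environment variable (which is what `-E` reads). The helper names, host, and user are hypothetical.

```python
import subprocess

def ipmitool_cmd(host, user, *subcommand):
    """Build an ipmitool argument list for a remote BMC (lanplus interface)."""
    # -E tells ipmitool to take the password from the IPMI_PASSWORD environment variable.
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *subcommand]

def find_sensor_lines(sdr_output, needle):
    """Return the output lines that mention `needle`, ignoring case."""
    return [ln for ln in sdr_output.splitlines() if needle.lower() in ln.lower()]

def query_sensor(host, user, needle="BP0"):
    """Run `ipmitool sdr elist` against the BMC and keep only matching sensor records."""
    result = subprocess.run(ipmitool_cmd(host, user, "sdr", "elist"),
                            capture_output=True, text=True, check=True)
    return find_sensor_lines(result.stdout, needle)
```

A call such as `query_sensor("bmc.example.com", "admin")` (placeholder host and user) would return the sensor records whose name contains `BP0`. How the vendor labels the entity behind that record varies from BMC to BMC.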

Also, some vendors push presence assertions to the System Event Log (SEL) for some or even all supported but optional hardware components, which may therefore not be present. This happens either at Power-On Self-Test (POST) or when upgrading the system firmware (UEFI) or the BMC/IPMI controller firmware, even if such hardware options were never connected to the system.

I suppose this is the case here. Thus, after checking in the BMC/IPMI controller that there is no persisting problem, you could make the alert disappear by deleting the corresponding event(s) from the SEL and/or clearing the SEL.
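The review-then-clear sequence described above could be scripted roughly as follows. This is only a sketch under assumptions: it uses `ipmitool sel elist` and `ipmitool sel clear`, expects the password in `IPMI_PASSWORD`, and assumes the sensor name shows up in the SEL listing, which depends on the vendor. The function names and the `confirm` parameter are hypothetical.

```python
import subprocess

def sel_cmd(host, user, action):
    """Build an `ipmitool sel <action>` argument list for a remote BMC."""
    if action not in ("elist", "clear"):
        raise ValueError("action must be 'elist' or 'clear'")
    # -E reads the password from the IPMI_PASSWORD environment variable.
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", "sel", action]

def matching_events(sel_output, needle):
    """Return the SEL lines that mention `needle`, ignoring case."""
    return [ln for ln in sel_output.splitlines() if needle.lower() in ln.lower()]

def review_then_clear(host, user, needle="BP0", confirm=False):
    """List matching SEL events; clear the whole SEL only if confirm is True."""
    listing = subprocess.run(sel_cmd(host, user, "elist"),
                             capture_output=True, text=True, check=True).stdout
    events = matching_events(listing, needle)
    if confirm:
        # Clearing is irreversible, so it stays behind an explicit flag.
        subprocess.run(sel_cmd(host, user, "clear"), check=True)
    return events
```

Running it once with `confirm=False` shows what would be discarded. Clearing removes every SEL entry, not just the presence events, so it is worth saving the listing first.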

If you are not content with clearing the SEL after reboots or firmware upgrades, I'd advise checking whether the BMC/IPMI controller can be configured so that it never reports absent optional hardware as missing. I STRONGLY advise never disabling SEL monitoring in check_ipmi_sensor: the SEL is monitored by default to alert you about non-persistent errors, such as unreliable power supplies or power cabling, corrected RAM errors that might not provoke an MCE, failed components without sensor values of their own, ...

HtH and best regards, // Veit

(In reply to maltewhiite's message of Thursday, 25.11.2021, 05:52 -0800.)

maltewhiite commented 3 years ago

Thanks a lot! I will forward this to our hardware team.