thomas-krenn / check_ipmi_sensor_v3

Monitoring plugin to check IPMI sensors
https://www.thomas-krenn.com/en/wiki/IPMI_Sensor_Monitoring_Plugin
GNU General Public License v3.0

error not clearing after applying fix #38

Open Dragao75 opened 4 years ago

Dragao75 commented 4 years ago

We just tested this script and it works pretty well.

But I may have misunderstood part of how the script works.

Situation: We remove one of the PSUs from the system. We receive a critical error saying there is a problem with the PSU (of course, as expected).

Now we reinsert the PSU, but we do not receive an OK; the message is still critical. I did see a log entry saying "PSU Failure: PSU asserted." My guess is that the check is triggered by the word "failure".

Clearing the log on the device sets things to green.

Is there some way around this?

Output with PSU inserted back:

IPMI Status: Critical [Power Supply 2 PS2 Status = Critical (Power Supply), Power Supply 2 PS2 Status = Critical (Power Supply)] | 'CPU1 Temp'=41.00;0.00:85.00;0.00:88.00 'CPU2 Temp'=35.00;0.00:90.00;0.00:93.00 'System Temp'=30.00;-5.00:80.00;-7.00:85.00 'Peripheral Temp'=53.00;-5.00:80.00;-7.00:85.00 'PCH Temp'=59.00;-5.00:90.00;-8.00:95.00 '10G Temp'=65.00;-5.00:90.00;-8.00:95.00 'P1-DIMMA TEMP'=34.00;4.00:80.00;2.00:85.00 'P1-DIMMB TEMP'=35.00;4.00:80.00;2.00:85.00 'P2-DIMME TEMP'=32.00;4.00:80.00;2.00:85.00 'P2-DIMMF TEMP'=32.00;4.00:80.00;2.00:85.00 'FAN1'=12200.00;700.00:25500.00;500.00:25500.00 'FAN2'=12200.00;700.00:25500.00;500.00:25500.00 'FAN3'=12100.00;700.00:25500.00;500.00:25500.00 'FAN4'=12300.00;700.00:25500.00;500.00:25500.00 'FANA'=5600.00;700.00:25500.00;500.00:25500.00 'FANB'=5600.00;700.00:25500.00;500.00:25500.00 'FANC'=5500.00;700.00:25500.00;500.00:25500.00 'FAND'=5500.00;700.00:25500.00;500.00:25500.00 'FANE'=5500.00;700.00:25500.00;500.00:25500.00 'FANF'=5500.00;700.00:25500.00;500.00:25500.00 'VTT'=0.99;0.91:1.34;0.86:1.39 'CPU1 Vcore'=0.91;0.54:1.49;0.51:1.52 'CPU2 Vcore'=0.99;0.54:1.49;0.51:1.52 'VDIMM AB'=1.49;1.20:1.65;1.15:1.70 'VDIMM CD'=1.49;1.20:1.65;1.15:1.70 'VDIMM EF'=1.49;1.20:1.65;1.15:1.70 'VDIMM GH'=1.49;1.20:1.65;1.15:1.70 '3.3V'=3.26;2.93:3.65;2.78:3.79 '+3.3VSB'=3.36;2.93:3.65;2.78:3.79 '5V'=5.06;4.48:5.50;4.29:5.70 '+5VSB'=5.06;4.48:5.50;4.29:5.70 '12V'=11.98;10.81:13.25;10.49:13.57 'VBAT'=3.17;2.69:3.55;2.54:3.70 'GPU2 Temp'=31.00;-5.00:85.00;-8.00:90.00 'GPU4 Temp'=32.00;-5.00:85.00;-8.00:90.00
CPU1 Temp = 41.00 (Status: Nominal)
CPU2 Temp = 35.00 (Status: Nominal)
System Temp = 30.00 (Status: Nominal)
Peripheral Temp = 53.00 (Status: Nominal)
PCH Temp = 59.00 (Status: Nominal)
10G Temp = 65.00 (Status: Nominal)
P1-DIMMA TEMP = 34.00 (Status: Nominal)
P1-DIMMB TEMP = 35.00 (Status: Nominal)
P2-DIMME TEMP = 32.00 (Status: Nominal)
P2-DIMMF TEMP = 32.00 (Status: Nominal)
FAN1 = 12200.00 (Status: Nominal)
FAN2 = 12200.00 (Status: Nominal)
FAN3 = 12100.00 (Status: Nominal)
FAN4 = 12300.00 (Status: Nominal)
FANA = 5600.00 (Status: Nominal)
FANB = 5600.00 (Status: Nominal)
FANC = 5500.00 (Status: Nominal)
FAND = 5500.00 (Status: Nominal)
FANE = 5500.00 (Status: Nominal)
FANF = 5500.00 (Status: Nominal)
VTT = 0.99 (Status: Nominal)
CPU1 Vcore = 0.91 (Status: Nominal)
CPU2 Vcore = 0.99 (Status: Nominal)
VDIMM AB = 1.49 (Status: Nominal)
VDIMM CD = 1.49 (Status: Nominal)
VDIMM EF = 1.49 (Status: Nominal)
VDIMM GH = 1.49 (Status: Nominal)
3.3V = 3.26 (Status: Nominal)
+3.3VSB = 3.36 (Status: Nominal)
5V = 5.06 (Status: Nominal)
+5VSB = 5.06 (Status: Nominal)
12V = 11.98 (Status: Nominal)
VBAT = 3.17 (Status: Nominal)
GPU2 Temp = 31.00 (Status: Nominal)
GPU4 Temp = 32.00 (Status: Nominal)
HDD Status = 'OK' (Status: Nominal)
Chassis Intru = 'OK' (Status: Nominal)
PS1 Status = 'Presence detected' (Status: Nominal)
PS2 Status = 'Presence detected' (Status: Nominal)

gschoenberger commented 4 years ago

Currently this is expected behavior. The error comes from the entries in the system event log (SEL) of the IPMI interface, which the plugin reads via ipmi-sel. As you said, clearing the log also sets things to green. Because all ipmi-sel entries are parsed and checked for failures, you will get a CRITICAL from the plugin as long as the SEL has the PSU failure in it. It is quite difficult to react to "Failure -> OK" situations, as the plugin would have to persist some state across multiple plugin calls about what is CRITICAL and what went back to OK. Normally plugins are stateless and have no information about past plugin calls. Cheers, Georg
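To illustrate the behavior described above, here is a minimal sketch in Python. It is a simplified model, not the plugin's actual code: the `evaluate` function, the dictionary layout, and the state strings are all hypothetical and only mirror the names seen in the output of this issue.

```python
# Simplified, hypothetical model of one stateless plugin run.
# This is NOT the plugin's real implementation.
OK, CRITICAL = 0, 2


def evaluate(sensor_states, sel_entries):
    """Return (exit_code, problems) from current sensors plus SEL entries."""
    # Sensors whose current reading is not nominal.
    problems = [name for name, state in sensor_states.items()
                if state != "Nominal"]
    # Every failure entry still present in the SEL is reported as well,
    # even if the sensor itself has recovered in the meantime.
    problems += [entry["sensor"] for entry in sel_entries
                 if entry["state"] == "Critical"]
    return (CRITICAL if problems else OK), problems


# Both PSUs are back and report nominal readings (assumed example data).
sensors = {"PS1 Status": "Nominal", "PS2 Status": "Nominal"}
# The old failure entry is still stored in the event log.
sel = [{"sensor": "PS2 Status", "state": "Critical"}]

code_with_log, problems_with_log = evaluate(sensors, sel)
code_after_clear, problems_after_clear = evaluate(sensors, [])

print(code_with_log, problems_with_log)
print(code_after_clear, problems_after_clear)
```

In this model each run sees only the current sensor readings and whatever is in the log at that moment, so the result stays CRITICAL until the failure entry is removed from the log, which matches what was observed after clearing the log on the device.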