thomas-krenn / check_ipmi_sensor_v3

Monitoring plugin to check IPMI sensors
https://www.thomas-krenn.com/en/wiki/IPMI_Sensor_Monitoring_Plugin
GNU General Public License v3.0
54 stars, 21 forks

Alert only on recent SEL entries #28

Open aieri opened 5 years ago

aieri commented 5 years ago

I like the current support for alerting on SEL entries, but I find it pushes us towards a suboptimal pattern: since we value such alerts, we are forced to keep the SEL empty. On the other hand, it can be convenient to retain a local history of failures for those situations where you're stuck wondering whether you should look into a replacement because "isn't this the server that keeps crashing?". I'd propose a new config option to specify an age limit beyond which events are ignored. For example, I could say check_ipmi_sensor --selcutoff 24 to indicate that any SEL entry older than 24 hours will be ignored.
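The age limit proposed above could be sketched as a filter over SEL lines. This is only an illustration, not something the plugin provides: sel_newer_than is a hypothetical helper, the input is assumed to be pipe-delimited with the date in the second column (Mon-DD-YYYY) and the time in the third, and GNU date is assumed for parsing.

```shell
# Sketch only: keep SEL lines newer than a cutoff given in hours.
# Assumed input columns: ID | Mon-DD-YYYY | HH:MM:SS | Name | Type | Event
# sel_newer_than is a hypothetical name; GNU date is required for "date -d".
sel_newer_than() {
    cutoff_hours=$1
    now=${2:-$(date +%s)}                 # optional "now" override, handy for testing
    limit=$((now - cutoff_hours * 3600))
    while IFS= read -r line; do
        # Ignore anything that does not look like a pipe-delimited record.
        case "$line" in
            *'|'*'|'*) ;;
            *) continue ;;
        esac
        # Turn "Sep-10-2019" plus "12:34:56" into "Sep 10 2019 12:34:56".
        stamp=$(printf '%s\n' "$line" | awk -F'|' '{ gsub(/-/, " ", $2); print $2, $3 }')
        # Lines whose date cannot be parsed (such as a header row) are skipped.
        epoch=$(date -d "$stamp" +%s 2>/dev/null) || continue
        if [ "$epoch" -ge "$limit" ]; then
            printf '%s\n' "$line"
        fi
    done
    return 0
}
```

Something like ipmi-sel | sel_newer_than 24 would then print only entries from the last 24 hours, assuming the output matches the column layout above. Timestamps are interpreted in the local time zone of the machine running the filter.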

Tejeev commented 5 years ago

I've been struggling with this same issue and tried documenting in comments in Thruk (but those quickly become unmanageable and sometimes get cleaned up without any other tracking), and cases (which become lost in the fog of caselog).

I think the age limit makes sense for us. We investigate the alert and downtime it if it's innocuous; the log is kept, and the alert becomes active again tomorrow.
That said, I foresee lots of these being downtimed for 24 hours, so we might miss subsequent failures during that time. Is it possible to have a check command that sets the allowed number of SEL entries to the current count, so that only new ones alert? Then we could use that command to effectively acknowledge the alerts once we've checked them, and still retain history.

aieri commented 5 years ago

Though saying "alert if SEL has more than n entries" would generally work, it wouldn't really scale: every server would need a different value, which would result in a configuration management nightmare. If you want to dream big, you could imagine alerting only for events that have not been deasserted. The problem there is that although some events do come in pairs (e.g. voltage too high / ok it's fine now), others do not (e.g. CPU n threw some error), so you'd need to build a lot of knowledge into this check. Doable, but quite a bit of work.

aieri commented 5 years ago

Ok, actually... there would be a third way to solve this: imagine a command like check_ipmi_sensor acksel that persisted the latest SEL entry on disk. Subsequent calls to check_ipmi_sensor would alert only if there are entries newer than the cached value. The downside is that this would make check_ipmi_sensor stateful, whereas Nagios plugins are generally stateless. I don't know if the upstream devs would be ok with going down this route.
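The bookkeeping behind such an acksel idea might look roughly like the sketch below. Everything here is hypothetical: sel_ack, sel_unacked and the state file path are made-up names, and record IDs are assumed to be increasing integers in the first pipe-delimited column. It does not show how this would be wired into the plugin itself.

```shell
# Sketch only: remember the highest acknowledged SEL record ID on disk.
# sel_ack, sel_unacked and SEL_STATE_FILE are hypothetical names.
# Assumes the record ID is an increasing integer in the first column.
SEL_STATE_FILE=${SEL_STATE_FILE:-/var/tmp/sel_last_acked}

# Read SEL lines on stdin and store the highest record ID seen.
sel_ack() {
    awk -F'|' 'BEGIN { max = 0 }
               $1 + 0 > max { max = $1 + 0 }
               END { print max }' > "$SEL_STATE_FILE"
}

# Read SEL lines on stdin and print only those newer than the stored ID.
sel_unacked() {
    last=0
    if [ -r "$SEL_STATE_FILE" ]; then
        last=$(cat "$SEL_STATE_FILE")
    fi
    # Non-numeric first columns (such as a header row) evaluate to 0 and are dropped.
    awk -F'|' -v last="$last" '$1 + 0 > last + 0'
}
```

One caveat with this approach: if the SEL is ever cleared, record IDs presumably start over, so the state file would have to be reset at the same time or new entries would be hidden.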

tniedermeier commented 5 years ago

Hello @aieri and @Tejeev,

I'm so sorry for my hugely delayed reply! Maybe you could give this option from ipmi-sel a try. Just add the following parameter to your plugin call: --seloptions '--date-range=09/01/2019-now'

Using this parameter, the check_ipmi_sensor plugin only displays SEL entries, and only alerts on events, that occurred between the first of September and today.

If there were no SEL entries in that specified time range, the plugin returns OK. More options for ipmi-sel: http://manpages.ubuntu.com/manpages/trusty/man8/ipmi-sel.8.html#ipmi-sel%20options

Example:

$ ./check_ipmi_sensor -H -U -P -L admin --seloptions '--date-range=09/10/2019-now'
IPMI Status: OK | 'CPU Temp'=46.00;0.00:95.00;0.00:100.00 [...]

$ ./check_ipmi_sensor -H -U -P -L admin --seloptions '--date-range=09/01/2019-now'
IPMI Status: Critical [1 system event log (SEL) entry present] | 'CPU Temp'=47.00;0.00:95.00;0.00:100.00 [...]

I hope this helps.

Best regards, Thomas

afreiberger commented 4 years ago

I'm wondering if having an "acknowledged sel entries" file that gets passed to --selexclude might be the best way to keep entries in SEL but ignore them.

Obviously, anything wrapped around this check will need a way to populate and clear that file when issues have been mitigated.
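A rough sketch of how such a file could be populated follows. It assumes SEL lines with the columns ID | Date | Time | Name | Type | Event, and it assumes the file passed to --selexclude holds one Name|Type pair per line, which is my reading of the plugin's help text and worth verifying. sel_ack_to_exclude is a hypothetical helper name.

```shell
# Sketch only: append "Name|Type" pairs from SEL lines to an exclude file.
# Assumed input columns: ID | Date | Time | Name | Type | Event
# Assumed exclude file format: one "Name|Type" pair per line.
# sel_ack_to_exclude is a hypothetical name.
sel_ack_to_exclude() {
    exclude_file=$1
    awk -F'|' 'NF >= 5 && $1 + 0 > 0 {
        name = $4; type = $5
        # Trim the padding around each column.
        gsub(/^[ \t]+|[ \t]+$/, "", name)
        gsub(/^[ \t]+|[ \t]+$/, "", type)
        print name "|" type
    }' >> "$exclude_file"
    # Drop duplicates so repeated acknowledgements do not grow the file.
    sort -u -o "$exclude_file" "$exclude_file"
}
```

If the exclude file really does match on name and type, acknowledging an entry this way would also hide future events of the same kind from the same sensor, which is coarser than acknowledging one specific entry. Clearing the file after mitigation would restore alerting.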

Tejeev commented 4 years ago

@tniedermeier I'm afraid I've moved on and no longer use this tooling. I do remember this being a major pain for me when working in operational response so while I can't offer any constructive communication on this, I would like to note that I still strongly +1 the investigation for a resolution to save all those that still use it. The workflows that grew up around this definitely hurt operations and colored my view of this alerting solution.

BrixSat commented 1 year ago

Hello,

Being able to ignore acknowledged events in the SEL is a major feature. The plugin should be able to show either all events or only new ones. It's imperative to keep all events in the SEL, so that in the future we can know what was wrong with the machine.

Hope this helps.

graham-collinson commented 1 year ago

As a quick solution for now I'm using a wrapper script to only look at logs for today:

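#!/bin/bash

# MM/DD/YYYY is the date format used by the ipmi-sel command
yest=$(date +%m/%d/%Y -d yesterday)

# Pass all arguments through and restrict SEL checks to yesterday onward
/usr/local/nagios/libexec/check_ipmi_sensor "$@" --seloptions "--date-range=$yest-now"
exitcode=$?
exit $exitcode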

There's a chance that something will hit the log just before midnight and not get picked up.