Closed bobykus31 closed 10 months ago
Hi there! I certainly understand your desire here. Unfortunately, there is no "nice" way of processing the SEL. For this specific case, I would recommend turning to something that collects the Linux Machine Check Exceptions (MCEs) instead. Unlike IPMI, this will of course require the system to be bootable, but I suppose if it doesn't boot anymore, looking at the SEL is the least of your problems.
I think this issue for the node exporter will give you some good pointers: https://github.com/prometheus/node_exporter/issues/986
I will keep this issue open for a while as a reminder to refresh my memory about how structured the SEL is, but I honestly don't see such functionality being added to this exporter, as it opens several cans of worms. If you pursue this problem by other means, such as the linked issue, I would appreciate some feedback on whether and how you got it to work, as I am sure it would be of interest to others. Thanks a lot!
Any thoughts on the very simple approach of reporting the number of existing entries in the SEL? Surely not ideal for every case, but at least an indicator.
@4xoc that would certainly be a good fit for a Prometheus metric. I will check whether there is a better way to get this than just dumping the entire SEL.
I released version 1.2.0, which includes a new sel collector (see the README). It gives you the number of entries in the SEL and the amount of free space still left in the SEL, both valuable metrics I think. I would be interested in any feedback if you get a chance to give it a try.
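For readers who want to see roughly what such a pair of metrics involves, here is a minimal Python sketch. It is not the exporter's actual implementation. It assumes the summary text looks like the output of FreeIPMI's `ipmi-sel --info`, with "Number of log entries" and "Free space remaining" lines; the sample text is made up for illustration, and the metric names `example_sel_entries` and `example_sel_free_bytes` are hypothetical, not the names the exporter uses.

```python
import re

# Illustrative sample only: the layout is an assumption about what
# `ipmi-sel --info` prints, not output captured from a real BMC.
SAMPLE_INFO = """\
SEL Information
Number of log entries                          : 12
Free space remaining                           : 15904 bytes
"""

ENTRIES_RE = re.compile(r"^Number of log entries\s*:\s*(\d+)\s*$", re.MULTILINE)
FREE_RE = re.compile(r"^Free space remaining\s*:\s*(\d+)\s*bytes\s*$", re.MULTILINE)


def parse_sel_info(text):
    """Extract the entry count and free space from SEL summary text."""
    entries = ENTRIES_RE.search(text)
    free = FREE_RE.search(text)
    if entries is None or free is None:
        raise ValueError("unexpected SEL info format")
    return {"entries": int(entries.group(1)), "free_bytes": int(free.group(1))}


def to_exposition(info):
    """Render the values in Prometheus text format (hypothetical metric names)."""
    return (
        "# TYPE example_sel_entries gauge\n"
        f"example_sel_entries {info['entries']}\n"
        "# TYPE example_sel_free_bytes gauge\n"
        f"example_sel_free_bytes {info['free_bytes']}\n"
    )


if __name__ == "__main__":
    print(to_exposition(parse_sel_info(SAMPLE_INFO)), end="")
```

Only the summary is parsed here, so the SEL itself never has to be dumped. An alert on a rising entry count would then tell you when to go and look at the log by hand.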
Unfortunately, some data is not available directly from the sensors but only from their log, via ipmi-sel. Most interesting for me is the "Memory Status" data, where you can get info about "Uncorrectable memory error" and possibly more. Is there any nice way to pull this data and use it in Prometheus?