prometheus-community / ipmi_exporter

Remote IPMI exporter for Prometheus
MIT License
471 stars 133 forks source link

"Failed to parse bmc-info" when the system does not provide "System Firmware Version" #57

Closed Supermathie closed 3 years ago

Supermathie commented 4 years ago

On my Supermicro servers, I'm getting the following error with the bmc collector enabled:

ERRO[0004] Failed to parse bmc-info data from [local]: Could not find value in output: Device ID             : 32
Device Revision       : 1
Device SDRs           : unsupported
Firmware Revision     : 1.52
Device Available      : yes (normal operation)
IPMI Version          : 2.0
Sensor Device         : supported
SDR Repository Device : supported
SEL Device            : supported
FRU Inventory Device  : supported
IPMB Event Receiver   : supported
IPMB Event Generator  : supported
Bridge                : unsupported
Chassis Device        : supported
Manufacturer ID       : Super Micro Computer Inc. (10876)
Product ID            : 6937
Auxiliary Firmware Revision Information : 00000000h

Device GUID : 00000000-0000-0000-0000-000000000000

System GUID : 00000000-d43a-46ef-ec3c-534d31303136
…

As you can see the BMC on SMCI doesn't always provide a System Firmware Version - can we make this optional?

dswarbrick commented 4 years ago

I also noticed this a few days ago when trying out v1.3.0 for the first time. I initially thought it was just one extremely old SMC board, but it seems to occur on a wider range of systems than expected.

bitfehler commented 4 years ago

Thanks for the report! This is certainly a bit problematic, I had not thought of this scenario. I will try to release a fixed version ASAP

Supermathie commented 4 years ago

@dswarbrick FWIW, the boards for which bmc-info is not reporting a System Firmware Version are brand new.

bitfehler commented 4 years ago

Released 1.3.1 - sorry it took so long!

dswarbrick commented 4 years ago

Just because we love IPMI so much... I have an old SMC host that does not seem to like running bmc-info without the --get-device-id param, which was removed in https://github.com/soundcloud/ipmi_exporter/commit/2c6a73b1510cb1b6095feb5f18ed5d1da9823616#diff-f1c80b360812fd52bef43d56f3c4ef40853f02711810857332ccedfd9f463f79L291

$ bmc-info -l user -u MONITOR -p $password -h tank-ipmi
Device ID             : 32
Device Revision       : 1
Device SDRs           : unsupported
Firmware Revision     : 3.29
Device Available      : yes (normal operation)
IPMI Version          : 2.0
Sensor Device         : supported
SDR Repository Device : supported
SEL Device            : supported
FRU Inventory Device  : supported
IPMB Event Receiver   : supported
IPMB Event Generator  : supported
Bridge                : unsupported
Chassis Device        : supported
Manufacturer ID       : Nedam ENG. Co., ltd. (47488)
Product ID            : 43025

Device GUID : 00000000-0000-0000-0000-000000000000

System GUID : 00000000-dfad-2990-2500-534d31303037

ipmi_cmd_get_system_info_parameters_system_firmware_version_first_set: privilege level insufficient

The fact that bmc-info exits with status code 1 means that the collector is treating it as a failed command, which is fair enough. But it results in the ipmi_bmc_info metric disappearing for that host. We don't really want to use a higher privilege level for monitoring.

Edit: Running bmc-info without --get-device-id, or running it explicitly with --get-system-info will trigger the failure. Also, using a higher privilege level user fails, with the same error, e.g.

$ bmc-info -l ADMIN -u ADMIN -p $adminpw -h tank-ipmi --get-system-info
ipmi_cmd_get_system_info_parameters_system_firmware_version_first_set: privilege level insufficient

Versions of freeipmi tested: 1.6.3, 1.6.4.

dswarbrick commented 4 years ago

I discovered about another 40 more systems which now fail the bmc-info command. I did not previously see them because the ipmi_exporter was running with --log.level=fatal.

In addition to the privilege level error in my previous comment, I've seen some fail with:

ipmi_cmd_get_system_info_parameters_system_firmware_version_first_set: command invalid or unsupported
bitfehler commented 4 years ago

Oh my. Thanks for the detailed report. Give me a moment to figure out how to solve this mess somewhat gracefully...

dswarbrick commented 3 years ago

Any chance this could be resolved before debian bullseye freeze? I'd quite like to update the package there (which I co-maintain), but am hesitant to use 1.3.x versions, as they will require a bit of patching to work around some of the rough edges.

bitfehler commented 3 years ago

Just wanted to provide a quick update: I had to take a break from maintenance for a while for private matters, sorry for letting this sit for so long. I am tackling this now, is before the hard freeze still sufficient, or is it already too late?

dswarbrick commented 3 years ago

@bitfehler So long as the diff isn't too huge, we can probably still get it into bullseye, since it's not a new package - it will just take a bit longer to transition. Give me a shout if you need any additional testing on a zoo of old(er) IPMI beasts.

bitfehler commented 3 years ago

@dswarbrick could you let me know if #65 seems like a reasonable solution that does not involve refactoring the heck out of the exporter? If you think it would help I would merge that and the (small) fix for #62 and then do another release, before doing a larger refactoring to account for the all the things i have learned about the dark sides of IPMI from maintaining this exporter for a while :innocent:

dswarbrick commented 3 years ago

@bitfehler As I mentioned above, running bmc-info without --get-device-id results in a non-zero exit status for some devices, which makes the collector treat it as a failed command, despite the partially good output.

My hack that that I ended up applying to our inhouse packages was to explicitly run bmc-info <target> --get-device-id, which should generally be fine, and additionally run bmc-info <target> --get-system-info, which can potentially return non-zero - and in such cases, set the systemFirmwareVersion var to "N/A".

bitfehler commented 3 years ago

The idea was that if you run into this issue, you just use the bmc-device-id collector for these machines, which will only do --get-device-id and set the system firmware version to "N/A". I would like to avoid doing two roundtrips for a single collector, as each one is pretty costly with IPMI.

dswarbrick commented 3 years ago

I agree that two IPMI roundtrips is not ideal, however it's not always feasible to know which devices are going to fail without the --get-device-id. And it means that we would need to group the scrape targets or somehow label them with the ipmi_exporter module that they should refer to in the scrape.

Part of the problem is the fact that ipmi_exporter treats the non-zero exit status of the bmc-info command as a completely failed command, despite the partially good output, as I wrote in https://github.com/soundcloud/ipmi_exporter/issues/57#issuecomment-714653779. Maybe if it didn't treat that as a completely failed command, it could parse the useful output of that command, and assume that bmc-info ... --get-system-info was going to fail, and thus preemptively just export the system firmware version as "N/A".

bitfehler commented 3 years ago

And it means that we would need to group the scrape targets or somehow label them with the ipmi_exporter module that they should refer to in the scrape

Hrm, that is a good point indeed, seems quite impractical.

Maybe if it didn't treat that as a completely failed command, it could parse the useful output of that command

I was hoping to avoid something like that, but it might actually be worth a try...

bitfehler commented 3 years ago

@dswarbrick how do you feel about https://github.com/soundcloud/ipmi_exporter/tree/bitfehler/bmc-info-recover ?

dswarbrick commented 3 years ago

@bitfehler That looks pretty okay to me.

bitfehler commented 3 years ago

Cool. Could I get you to give that a spin? I don't have any such hosts available that would trigger this path...

dswarbrick commented 3 years ago

Looks mostly fine, however it's going to be a bit noisy at log.level=error.

Feb 16 21:03:42 ber-prom1 prometheus-ipmi-exporter[31853]: time="2021-02-16T21:03:42Z" level=error msg="Error while calling bmc-info for tank-ipmi: Device ID             : 32\nDevice Revision       : 1\nDevice SDRs           : unsupported\nFirmware Revision     : 3.29\nDevice Available      : yes (normal operation)\nIPMI Version          : 2.0\nSensor Device         : supported\nSDR Repository Device : supported\nSEL Device            : supported\nFRU Inventory Device  : supported\nIPMB Event Receiver   : supported\nIPMB Event Generator  : supported\nBridge                : unsupported\nChassis Device        : supported\nManufacturer ID       : Nedam ENG. Co., ltd. (47488)\nProduct ID            : 43025\n\nDevice GUID : 00000000-0000-0000-0000-000000000000\n\nSystem GUID : 00000000-dfad-2990-2500-534d31303037\n\nipmi_cmd_get_system_info_parameters_system_firmware_version_first_set: privilege level insufficient\n" source="collector.go:278"

We're running most instances at log.level=fatal already however.

bitfehler commented 3 years ago

Thanks for the feedback, I demoted the severity of that log to warning. It was misguided to log an error at that point anyways. Current master should work ok for you now. I'll push one more fix and then make another release, the changes will be rather small compared to the next release, so I hope that works for you to get it into Bullseye.

dswarbrick commented 3 years ago

Super, thanks for your perseverance on this one!