Closed Supermathie closed 3 years ago
I also noticed this a few days ago when trying out v1.3.0 for the first time. I initially thought it was just one extremely old SMC board, but it seems to occur on a wider range of systems than expected.
Thanks for the report! This is certainly a bit problematic, I had not thought of this scenario. I will try to release a fixed version ASAP
@dswarbrick FWIW, the boards for which bmc-info is not reporting a System Firmware Version
are brand new.
Released 1.3.1 - sorry it took so long!
Just because we love IPMI so much... I have an old SMC host that does not seem to like running bmc-info
without the --get-device-id
param, which was removed in https://github.com/soundcloud/ipmi_exporter/commit/2c6a73b1510cb1b6095feb5f18ed5d1da9823616#diff-f1c80b360812fd52bef43d56f3c4ef40853f02711810857332ccedfd9f463f79L291
$ bmc-info -l user -u MONITOR -p $password -h tank-ipmi
Device ID : 32
Device Revision : 1
Device SDRs : unsupported
Firmware Revision : 3.29
Device Available : yes (normal operation)
IPMI Version : 2.0
Sensor Device : supported
SDR Repository Device : supported
SEL Device : supported
FRU Inventory Device : supported
IPMB Event Receiver : supported
IPMB Event Generator : supported
Bridge : unsupported
Chassis Device : supported
Manufacturer ID : Nedam ENG. Co., ltd. (47488)
Product ID : 43025
Device GUID : 00000000-0000-0000-0000-000000000000
System GUID : 00000000-dfad-2990-2500-534d31303037
ipmi_cmd_get_system_info_parameters_system_firmware_version_first_set: privilege level insufficient
The fact that bmc-info
exits with status code 1 means that the collector is treating it as a failed command, which is fair enough. But it results in the ipmi_bmc_info metric disappearing for that host. We don't really want to use a higher privilege level for monitoring.
Edit: Running bmc-info
without --get-device-id
, or running it explicitly with --get-system-info
will trigger the failure. Also, using a higher privilege level user fails, with the same error, e.g.
$ bmc-info -l ADMIN -u ADMIN -p $adminpw -h tank-ipmi --get-system-info
ipmi_cmd_get_system_info_parameters_system_firmware_version_first_set: privilege level insufficient
Versions of freeipmi tested: 1.6.3, 1.6.4.
I discovered about another 40 more systems which now fail the bmc-info
command. I did not previously see them because the ipmi_exporter was running with --log.level=fatal
.
In addition to the privilege level error in my previous comment, I've seen some fail with:
ipmi_cmd_get_system_info_parameters_system_firmware_version_first_set: command invalid or unsupported
Oh my. Thanks for the detailed report. Give me a moment to figure out how to solve this mess somewhat gracefully...
Any chance this could be resolved before debian bullseye freeze? I'd quite like to update the package there (which I co-maintain), but am hesitant to use 1.3.x versions, as they will require a bit of patching to work around some of the rough edges.
Just wanted to provide a quick update: I had to take a break from maintenance for a while for private matters, sorry for letting this sit for so long. I am tackling this now, is before the hard freeze still sufficient, or is it already too late?
@bitfehler So long as the diff isn't too huge, we can probably still get it into bullseye, since it's not a new package - it will just take a bit longer to transition. Give me a shout if you need any additional testing on a zoo of old(er) IPMI beasts.
@dswarbrick could you let me know if #65 seems like a reasonable solution that does not involve refactoring the heck out of the exporter? If you think it would help I would merge that and the (small) fix for #62 and then do another release, before doing a larger refactoring to account for the all the things i have learned about the dark sides of IPMI from maintaining this exporter for a while :innocent:
@bitfehler As I mentioned above, running bmc-info
without --get-device-id
results in a non-zero exit status for some devices, which makes the collector treat it as a failed command, despite the partially good output.
My hack that that I ended up applying to our inhouse packages was to explicitly run bmc-info <target> --get-device-id
, which should generally be fine, and additionally run bmc-info <target> --get-system-info
, which can potentially return non-zero - and in such cases, set the systemFirmwareVersion
var to "N/A".
The idea was that if you run into this issue, you just use the bmc-device-id
collector for these machines, which will only do --get-device-id
and set the system firmware version to "N/A"
. I would like to avoid doing two roundtrips for a single collector, as each one is pretty costly with IPMI.
I agree that two IPMI roundtrips is not ideal, however it's not always feasible to know which devices are going to fail without the --get-device-id
. And it means that we would need to group the scrape targets or somehow label them with the ipmi_exporter module that they should refer to in the scrape.
Part of the problem is the fact that ipmi_exporter treats the non-zero exit status of the bmc-info command as a completely failed command, despite the partially good output, as I wrote in https://github.com/soundcloud/ipmi_exporter/issues/57#issuecomment-714653779. Maybe if it didn't treat that as a completely failed command, it could parse the useful output of that command, and assume that bmc-info ... --get-system-info
was going to fail, and thus preemptively just export the system firmware version as "N/A".
And it means that we would need to group the scrape targets or somehow label them with the ipmi_exporter module that they should refer to in the scrape
Hrm, that is a good point indeed, seems quite impractical.
Maybe if it didn't treat that as a completely failed command, it could parse the useful output of that command
I was hoping to avoid something like that, but it might actually be worth a try...
@dswarbrick how do you feel about https://github.com/soundcloud/ipmi_exporter/tree/bitfehler/bmc-info-recover ?
@bitfehler That looks pretty okay to me.
Cool. Could I get you to give that a spin? I don't have any such hosts available that would trigger this path...
Looks mostly fine, however it's going to be a bit noisy at log.level=error
.
Feb 16 21:03:42 ber-prom1 prometheus-ipmi-exporter[31853]: time="2021-02-16T21:03:42Z" level=error msg="Error while calling bmc-info for tank-ipmi: Device ID : 32\nDevice Revision : 1\nDevice SDRs : unsupported\nFirmware Revision : 3.29\nDevice Available : yes (normal operation)\nIPMI Version : 2.0\nSensor Device : supported\nSDR Repository Device : supported\nSEL Device : supported\nFRU Inventory Device : supported\nIPMB Event Receiver : supported\nIPMB Event Generator : supported\nBridge : unsupported\nChassis Device : supported\nManufacturer ID : Nedam ENG. Co., ltd. (47488)\nProduct ID : 43025\n\nDevice GUID : 00000000-0000-0000-0000-000000000000\n\nSystem GUID : 00000000-dfad-2990-2500-534d31303037\n\nipmi_cmd_get_system_info_parameters_system_firmware_version_first_set: privilege level insufficient\n" source="collector.go:278"
We're running most instances at log.level=fatal
already however.
Thanks for the feedback, I demoted the severity of that log to warning. It was misguided to log an error at that point anyways. Current master should work ok for you now. I'll push one more fix and then make another release, the changes will be rather small compared to the next release, so I hope that works for you to get it into Bullseye.
Super, thanks for your perseverance on this one!
On my Supermicro servers, I'm getting the following error with the bmc collector enabled:
As you can see the BMC on SMCI doesn't always provide a
System Firmware Version
- can we make this optional?