jenningsloy318 / redfish_exporter

exporter to get metrics from redfish based hardware such as lenovo/dell/superc servers
Apache License 2.0

Add command line options to disable specific collectors #60

Open ostertagconrad opened 1 year ago

ostertagconrad commented 1 year ago

When we scrape some of our Redfish servers, a single scrape takes up to 40 seconds.

We tried disabling some collectors whose metrics we don't need and got down to roughly 10 to 20 seconds. To disable them, we simply removed them from the source code and rebuilt the exporter.

It would be very helpful if we could disable specific collectors via a command-line option or a config file. As a start, it may be enough to be able to disable one or two of the three collectors (manager, chassis, system). As a next step, a more fine-grained configuration, for example to disable memory and storage in the system collector, would be nice but is definitely more work to implement.
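A minimal sketch of what such an option could look like, using only Go's standard `flag` package. The flag name `collectors.disable` and the helpers `parseDisabled` and `enabledCollectors` are hypothetical and do not exist in the exporter; only the three collector names come from this issue.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// knownCollectors lists the three collectors named in this issue.
var knownCollectors = []string{"chassis", "manager", "system"}

// parseDisabled turns a comma-separated list such as "system,manager"
// into a set. Names are trimmed and lower-cased; unknown names are an error.
func parseDisabled(list string) (map[string]bool, error) {
	disabled := make(map[string]bool)
	for _, name := range strings.Split(list, ",") {
		name = strings.ToLower(strings.TrimSpace(name))
		if name == "" {
			continue // empty flag value or stray comma
		}
		known := false
		for _, k := range knownCollectors {
			if k == name {
				known = true
				break
			}
		}
		if !known {
			return nil, fmt.Errorf("unknown collector %q", name)
		}
		disabled[name] = true
	}
	return disabled, nil
}

// enabledCollectors returns the known collectors that were not disabled.
func enabledCollectors(disabled map[string]bool) []string {
	var out []string
	for _, name := range knownCollectors {
		if !disabled[name] {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	// Hypothetical flag; the exporter does not define it.
	list := flag.String("collectors.disable", "",
		"comma-separated collectors to skip (chassis, manager, system)")
	flag.Parse()

	disabled, err := parseDisabled(*list)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// A real exporter would register only these collectors.
	fmt.Println("enabled collectors:", enabledCollectors(disabled))
}
```

Running this with `-collectors.disable=system` would leave only the chassis and manager collectors enabled. A deny-list keeps the current behaviour as the default, so existing deployments would not need to change anything.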

stmcginnis commented 1 year ago

Out of curiosity, do you happen to have the names of the resources you removed that were taking a long time? There is some work recently added to make collection fetching parallel in the gofish library. Right now it is limited to one type that a user saw a need to speed up, but if there are a set of resources that generally take a longer time to collect, maybe we can expand that pattern in the library to speed up some of these other cases.

ostertagconrad commented 1 year ago

We have now disabled the whole system collector, and the scrape duration went down from 40 to 4 seconds. A colleague did some more testing and found that the PCI collection (functions and devices) took the most time (18 and 13 seconds). That is probably because our servers have around 60 PCI devices and 80 PCI functions. We don't know whether the exporter makes one long API call to collect all the info or many fast calls, but the large number of PCI devices and functions seems to be the main problem.

tazend commented 1 year ago

Hi, I'm the colleague of @ostertagconrad

Yeah, especially the PCI stuff took a long time. As @ostertagconrad said, that is because our servers have so many of them, and it seems that each PCI device/function has to be fetched with a separate API call.

Just for reference, this is the place in the gofish library where a separate call is made for each PCIDevice. I guess fetching these in parallel would definitely reduce execution time a lot.
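To illustrate the idea of fetching per-device resources in parallel, here is a rough sketch with a bounded number of concurrent requests. It is not the gofish implementation: `fetchFunc` and `fetchAll` are hypothetical names, the fetch is simulated with a sleep instead of an HTTP call, and the links are made up.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchFunc stands in for one GET against a single resource link.
type fetchFunc func(link string) (string, error)

// fetchAll calls fetch for every link with at most `workers` requests in
// flight, and returns results and errors in the same order as links.
func fetchAll(links []string, workers int, fetch fetchFunc) ([]string, []error) {
	if workers < 1 {
		workers = 1 // avoid blocking forever on an unbuffered semaphore
	}
	results := make([]string, len(links))
	errs := make([]error, len(links))
	sem := make(chan struct{}, workers) // counting semaphore
	var wg sync.WaitGroup

	for i, link := range links {
		wg.Add(1)
		sem <- struct{}{} // wait for a free slot
		go func(i int, link string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			// Each goroutine writes only to its own index, so no lock is needed.
			results[i], errs[i] = fetch(link)
		}(i, link)
	}
	wg.Wait()
	return results, errs
}

func main() {
	// 60 made-up links, mirroring the device count mentioned in this thread.
	links := make([]string, 60)
	for i := range links {
		links[i] = fmt.Sprintf("/example/PCIeDevices/%d", i)
	}

	// Simulated slow endpoint: every call takes 50ms.
	slow := func(link string) (string, error) {
		time.Sleep(50 * time.Millisecond)
		return "payload for " + link, nil
	}

	start := time.Now()
	results, _ := fetchAll(links, 10, slow)
	fmt.Printf("fetched %d resources in %v\n", len(results), time.Since(start))
}
```

The cap on concurrent requests matters because a BMC is usually a small embedded controller, and flooding it with many simultaneous requests may make things slower or trigger errors rather than help.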

stmcginnis commented 1 year ago

No worries if this isn't something you have time for, but would love it if you could try an updated gofish with this change to see if it makes anything better.

tazend commented 1 year ago

Thanks @stmcginnis , this looks good. I will try it out with our servers some time this week.

tazend commented 1 year ago

Hi @stmcginnis

I tried to test your changes but ran into some problems. The redfish-exporter uses the functions PCIeDevices and PCIeFunctions from here, which contain separate implementations to get all the PCIe devices / PCIe functions of the systems, instead of using the one you updated in pciedevices.go.

In that same place I then tried to replace the code to simply use the ListReferencedPCIeDevices function in the same manner as the other functions do.

However, the problem is that computersystems.pcieDevices is a []string, while ListReferencedPCIeDevices expects a single string, so I don't know what exactly should be passed to it or how to fix it.

I assume the computersystems.pcieDevices member could be updated to be just a string (which would then simply be the link to the PCIeDevice resource in the REST API)?
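One possible reason for the []string is sketched below, assuming the system payload lists its PCIe devices as an array of `@odata.id` links rather than as a single link to a collection. The types `odataLink` and `systemPayload`, the helper `pcieDeviceLinks`, and the sample JSON paths are all illustrative and not taken from gofish or from a real server.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// odataLink is a reference to another resource.
type odataLink struct {
	ID string `json:"@odata.id"`
}

// systemPayload holds only the field relevant here. Assumption: the
// server returns PCIeDevices as an array of links.
type systemPayload struct {
	PCIeDevices []odataLink `json:"PCIeDevices"`
}

// pcieDeviceLinks extracts the individual device links from a raw payload.
func pcieDeviceLinks(raw []byte) ([]string, error) {
	var p systemPayload
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, err
	}
	links := make([]string, 0, len(p.PCIeDevices))
	for _, l := range p.PCIeDevices {
		links = append(links, l.ID)
	}
	return links, nil
}

func main() {
	// Made-up sample payload for illustration.
	raw := []byte(`{
		"PCIeDevices": [
			{"@odata.id": "/example/Chassis/1/PCIeDevices/0"},
			{"@odata.id": "/example/Chassis/1/PCIeDevices/1"}
		]
	}`)

	links, err := pcieDeviceLinks(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, link := range links {
		// Each link would be fetched on its own.
		fmt.Println("device link:", link)
	}
}
```

If the payload really has this shape, each entry already points at one device, so there is no single collection link to pass along. Changing the member to a plain string would then drop information, and iterating over the links may be the more natural fit.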

stmcginnis commented 1 year ago

@tazend sorry for taking so long to get back to this. I've updated https://github.com/stmcginnis/gofish/pull/210 to make all collection retrieval happen with some parallelism. Would you be able to try out these changes?

tazend commented 1 year ago

Hi @stmcginnis, no worries - looks good! I will try to check it very soon.

rfpronk commented 1 year ago

I would love this feature as well. We are getting all kinds of errors from our Dell and HPE servers for components whose metrics we don't need anyway. Ideally the vendors would fix their firmware, but pragmatically, disabling those components solves the problem for me. I'm now using a custom build of the exporter and the gofish library that disables most components, and that works fine.

I've also included the branch/PR that makes fetching work in parallel, and that seems to work fine, but for me it only improves the speed and doesn't solve my problem.

stmcginnis commented 1 year ago

Hey @tazend, just wanted to check if you ever had a chance to try out the changes. I may go ahead and merge it and watch for any reported issues, but wanted to quickly check back here first. Thanks!

stmcginnis commented 1 year ago

Or @rfpronk - you mention using a fork with these changes included. Can you confirm things are working as expected against your hardware?

tazend commented 1 year ago

Hi @stmcginnis

Sorry, I haven't tried out the changes yet, but I plan to do so soon. I'll let you know.

rfpronk commented 1 year ago

Or @rfpronk - you mention using a fork with these changes included. Can you confirm things are working as expected against your hardware?

You mean https://github.com/stmcginnis/gofish/pull/210? If so, yes: the custom build I mentioned above includes that PR and it works fine, but it only improves speed and doesn't solve the errors I get with specific collectors (which is what this issue is about).