jenningsloy318 / redfish_exporter

exporter to get metrics from redfish based hardware such as lenovo/dell/superc servers
Apache License 2.0

Unexpected Panic Error Occurs (HPE Server) #64

Open sykim1009 opened 1 year ago

sykim1009 commented 1 year ago

While executing redfish_exporter, a panic error occurs in my infrastructure of over 10,000 HPE nodes.

goroutine 351152 [running]:
github.com/jenningsloy318/redfish_exporter/collector.parseEthernetInterface(0xc016200840, 0xc01bc7e878, 0x8, 0xc01ca5f600, 0xc0463ca040)
        /go/src/github.com/jenningsloy318/redfish_exporter/collector/system_collector.go:684 +0x465
created by github.com/jenningsloy318/redfish_exporter/collector.(*SystemCollector).Collect
        /go/src/github.com/jenningsloy318/redfish_exporter/collector/system_collector.go:532 +0xbdf

 - redfish_exporter.yml

hosts:
  0.0.0.0:
    username:admin
    username:admin
  ...
groups:
  redfish_hpe:
    username:admin
    username:admin

 - prometheus.yml

# my global config
global:
  scrape_interval: 60s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 60s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label job=<job_name> to any timeseries scraped from this config.

Does anyone have ideas on how to solve this problem? I don't know how to debug it, because the panic happens suddenly and without any specific log output.

Additionally, do you have any data on the scaling limits of this exporter?

Thank you!

jakubmikusek commented 1 year ago

I'm hitting a very similar issue:

2023/02/09 14:51:33  info app started. listening on :9610 app=redfish_exporter
level=info msg="TLS is disabled." http2=false
2023/02/09 15:02:30  info no network interface data found System=437XR1138R2 app=redfish_exporter collector=SystemCollector operation=system.NetworkInterfaces() target=10.128.0.7:8000
2023/02/09 15:02:31  info no PCI-E device function data found System=437XR1138R2 app=redfish_exporter collector=SystemCollector operation=system.PCIeFunctions() target=10.128.0.7:8000
2023/02/09 15:02:31  info collector scrape completed System=437XR1138R2 app=redfish_exporter collector=SystemCollector target=10.128.0.7:8000
panic: send on closed channel

goroutine 371 [running]:
github.com/jenningsloy318/redfish_exporter/collector.parseDevice(0xc000352678?, {0xc0000b61ba, 0x6}, {{0xc000282010, 0xa}, 0x0, {0x0, 0x0}, {0x0, 0x0}, ...}, ...)
        /go/src/collector/system_collector.go:675 +0x217
created by github.com/jenningsloy318/redfish_exporter/collector.(*SystemCollector).Collect
        /go/src/collector/system_collector.go:583 +0x1fed
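
For what it's worth, a "send on closed channel" panic raised in a goroutine that was created by Collect would be consistent with that goroutine still sending after Collect has returned, since the Prometheus registry closes the metrics channel once all collectors have returned. I have not confirmed that this is what happens in the exporter. The sketch below only illustrates the general pattern of waiting for every sender before returning; `collect`, `gather`, and the device names are hypothetical stand-ins, not the exporter's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// collect is a hypothetical stand-in for a Collect method: it starts one
// goroutine per device, and each goroutine sends one metric on ch.
// The WaitGroup makes collect block until every sender has finished,
// so the caller can safely close ch after collect returns.
func collect(ch chan<- string, devices []string) {
	var wg sync.WaitGroup
	for _, d := range devices {
		wg.Add(1) // one Add per goroutine actually started
		go func(name string) {
			defer wg.Done()
			ch <- "metric for " + name
		}(d)
	}
	wg.Wait() // without this, a late send could hit a closed channel
}

// gather is a hypothetical stand-in for the caller: it runs collect,
// closes the channel once collect has returned, and drains the metrics.
func gather(devices []string) []string {
	ch := make(chan string)
	go func() {
		collect(ch, devices)
		close(ch)
	}()
	var out []string
	for m := range ch {
		out = append(out, m)
	}
	return out
}

func main() {
	metrics := gather([]string{"eth0", "eth1", "eth2"})
	fmt.Println(len(metrics))
}
```

If the number of `wg.Add` calls were lower than the number of goroutines started, `wg.Wait` could return early and a remaining goroutine could panic in exactly this way, so that may be one thing to check.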

Any hints on how to debug this further?

Thanks!