prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

rapl collector crash: panic: "node_rapl_package-0-die-0_joules_total" is not a valid metric name #2299

Closed: baryluk closed this issue 2 years ago

baryluk commented 2 years ago

Host operating system: output of uname -a

Linux xyz 4.18.0-358.el8.x86_64 #1 SMP Mon Jan 10 13:11:20 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.3.1 (branch: tarball, revision: 4.el8)
  build user:       
  build date:       20220128
  go version:       go1.16.12
  platform:         linux/amd64

node_exporter command line flags

/usr/bin/prometheus-node-exporter --collector.textfile.directory /var/lib/prometheus/node-exporter --collector.tcpstat --collector.ntp --collector.interrupts --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/.+|run/k3s/containerd/.+|run/user/.+|mnt/ceph/.+|mnt/s3fs/.+)($|/) --collector.filesystem.fs-types-exclude=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs|tmpfs|ceph|fuse.s3fs|fuse.sshfs|fuse.portal|fuse.gvfsd-fuse)$

Are you running node_exporter in Docker?

no

What did you do that produced an error?

Running on bare metal on an AMD EPYC 7401P 24-Core Processor.

After upgrading the OS, kernel, and node_exporter, we are getting this panic:

panic: "node_rapl_package-0-die-0_joules_total" is not a valid metric name
goroutine 82 [running]:
panic(0x55f3e1a19920, 0xc0006260e0)
/usr/lib/golang/src/runtime/panic.go:1065 +0x565 fp=0xc00052ac80 sp=0xc00052abb8 pc=0x55f3e0ed72e5
github.com/prometheus/client_golang/prometheus.MustNewConstMetric(...)
/builddir/build/BUILD/node_exporter-1.3.1/vendor/github.com/prometheus/client_golang/prometheus/value.go:107
github.com/prometheus/node_exporter/collector.(*raplCollector).Update(0xc0002b4f20, 0xc0001138c0, 0x55f3e1f16fa0, 0x0)
/builddir/build/BUILD/node_exporter-1.3.1/collector/rapl_linux.go:88 +0xbf7 fp=0xc00052ae10 sp=0xc00052ac80 pc=0x55f3e154aeb7
github.com/prometheus/node_exporter/collector.execute(0x55f3e1582074, 0x4, 0x55f3e1b15408, 0xc0002b4f20, 0xc0001138c0, 0x55f3e1b14c08, 0xc0002ba500)
/builddir/build/BUILD/node_exporter-1.3.1/collector/collector.go:161 +0x86 fp=0xc00052af50 sp=0xc00052ae10 pc=0x55f3e14ee5e6
github.com/prometheus/node_exporter/collector.NodeCollector.Collect.func1(0xc0001138c0, 0xc000289c20, 0x55f3e1b14c08, 0xc0002ba500, 0xc00039eb30, 0x55f3e1582074, 0x4, 0x55f3e1b15408, 0xc0002b4f20)
/builddir/build/BUILD/node_exporter-1.3.1/collector/collector.go:152 +0x6f fp=0xc00052af98 sp=0xc00052af50 pc=0x55f3e156e3af
runtime.goexit()
....

discordianfish commented 2 years ago

@baryluk Thanks for the report. Looks like we need to sanitize the name by wrapping rz.Name in SanitizeMetricName() here: https://github.com/prometheus/node_exporter/blob/2b490d645e0e9773b644dfc3e3313e79eb565b27/collector/rapl_linux.go#L83. Want to take a stab at this?
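
For illustration, here is a minimal sketch of what such sanitizing could look like. It assumes the zone name "package-0-die-0" taken from the panic message, and it uses a hypothetical helper, sanitizeMetricName, written only for this example; the real SanitizeMetricName() in node_exporter may be implemented differently. Prometheus metric names may contain only ASCII letters, digits, underscores, and colons, so the hyphens are what trigger the panic.

```go
package main

import (
	"fmt"
	"regexp"
)

// invalidChars matches every character this sketch treats as invalid in a
// metric name. This is a simplification made for the example, not the
// pattern node_exporter itself uses.
var invalidChars = regexp.MustCompile(`[^a-zA-Z0-9_]`)

// sanitizeMetricName is a hypothetical helper that replaces each invalid
// character with an underscore.
func sanitizeMetricName(name string) string {
	return invalidChars.ReplaceAllString(name, "_")
}

func main() {
	// Zone name as it appears in the panic message above.
	zone := "package-0-die-0"
	metric := "node_rapl_" + sanitizeMetricName(zone) + "_joules_total"
	fmt.Println(metric) // prints "node_rapl_package_0_die_0_joules_total"
}
```

With the name sanitized before the metric is built, the hyphens never reach the client library's name validation.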

baryluk commented 2 years ago

Yes, no problem. I will check the test framework for it first.

alanorth commented 2 years ago

node_exporter is crashing every fifteen seconds when I scrape due to this bug. I'm also running on a bare metal AMD EPYC system:

# lscpu | grep Model\ name:
Model name:          AMD EPYC 7451 24-Core Processor
BIOS Model name:     AMD EPYC 7451 24-Core Processor

Are there plans to tag a new release for this bug fix? Thank you!

SuperQ commented 2 years ago

Re-opening this, as I want to add a "make this a label" option to the collector.
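
As a rough sketch of the label idea: the zone name can be carried as a label value rather than embedded in the metric name. Label values may contain any Unicode characters, so "package-0-die-0" needs no sanitizing there. The metric name node_rapl_joules_total, the label name zone, and the helper formatSample are all hypothetical, chosen only for this example; the option that actually lands in the collector may look different.

```go
package main

import "fmt"

// formatSample renders one sample in the style of the Prometheus text
// exposition format, with the zone name as a label value. The metric and
// label names are hypothetical placeholders.
func formatSample(zone string, joules float64) string {
	return fmt.Sprintf("node_rapl_joules_total{zone=%q} %g", zone, joules)
}

func main() {
	// Prints: node_rapl_joules_total{zone="package-0-die-0"} 1234.5
	fmt.Println(formatSample("package-0-die-0", 1234.5))
}
```

A side effect of this design is that all zones share a single metric name, so they can be aggregated or filtered by label in queries.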

stephankoelle commented 2 years ago

Same here with AMD EPYC 7401P 24-Core Processor

jardaKalus commented 2 years ago

Same here with AMD EPYC 7351 16-Core Processor

aneagoe commented 2 years ago

@SuperQ would be great to have a new release once this is closed. Thanks!