Open Xaraxia opened 2 years ago
See #2190 for a potential solution. The current perf collector setup is an all-or-nothing approach: if any one profiler fails to set up, the whole collector fails.
This is also occurring on our AMD EPYC 7662 64-Core Processor systems.
We get a different crash on Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, where stalled-cycles-backend and stalled-cycles-frontend are missing.
panic: Couldn't create metrics handler: couldn't create collector: Failed to setup stalled fronted cycles profiler: pid (-1) cpu (0) "no such file or directory"; Failed to setup stalled backend cycles profiler: pid (-1) cpu (0) "no such file or directory"
$ perf list hw
List of pre-defined events (to be used in -e):
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
ref-cycles [Hardware event]
I think #2191 picked up some of those changes. You probably need to be running 1.4.1 or higher. I opened #2553 for further improvements on error messages and handling of debugfs mounts.
Here's an example from a host without frontend/backend stalled instructions:
Host operating system:
4.18.0-348.el8.0.2.x86_64 #1 SMP Sun Nov 14 00:51:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
node_exporter version: output of
node_exporter --version
node_exporter command line flags
--collector.perf
Are you running node_exporter in Docker?
No
What did you do that produced an error?
Ran the exporter
What did you expect to see?
Correctly running, able to get metrics
What did you see instead?
(Note: this was run as root to rule out sysctl/capability issues, but we see the same issue when running as a user with these set appropriately):
I've done a fair bit of digging, and as far as I can tell this is being thrown all the way up the chain from https://pkg.go.dev/golang.org/x/sys/unix#PerfEventOpen
I'm not sure whether the issue is in the perf-utils or higher/lower in the chain.
Perf is installed; perf list hw gives the following:
So I'd expect bus-cycles to fail because it's not there. The source code uses PERF_COUNT_HW_REF_CPU_CYCLES rather than PERF_COUNT_HW_CPU_CYCLES, which seems appropriate from what I've read, but I suspect it is also not supported on our architecture.
I think there need to be some configuration options for which of these events are tracked (or automatic detection, though that would be more work to code up for little gain, IMO).
Thanks for the work on the exporter.
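For what it's worth, here is a rough sketch of what the "automatic detection" idea could look like, assuming we simply parse the text of perf list hw and skip any requested event that isn't listed. The function names parseHWEvents and filterSupported are hypothetical and are not part of node_exporter or perf-utils. Also note that an event appearing in perf list is only a hint; the authoritative check would still be whether PerfEventOpen succeeds for that event.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseHWEvents (hypothetical helper) extracts event names from the text
// printed by `perf list hw`. Lines look like
// "cpu-cycles OR cycles [Hardware event]"; every alias separated by " OR "
// is recorded. Lines without the "[Hardware event]" tag are ignored.
func parseHWEvents(output string) map[string]bool {
	events := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		idx := strings.Index(line, "[Hardware event]")
		if idx < 0 {
			continue // header or blank line
		}
		for _, name := range strings.Split(line[:idx], " OR ") {
			if name = strings.TrimSpace(name); name != "" {
				events[name] = true
			}
		}
	}
	return events
}

// filterSupported (hypothetical helper) splits the requested event names
// into those that were listed as available and those that should be skipped.
func filterSupported(requested []string, available map[string]bool) (enabled, skipped []string) {
	for _, name := range requested {
		if available[name] {
			enabled = append(enabled, name)
		} else {
			skipped = append(skipped, name)
		}
	}
	return enabled, skipped
}

func main() {
	// Sample output taken from the Intel host mentioned in this thread.
	sample := `List of pre-defined events (to be used in -e):
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
ref-cycles [Hardware event]
`
	available := parseHWEvents(sample)
	requested := []string{
		"cycles",
		"instructions",
		"stalled-cycles-frontend",
		"stalled-cycles-backend",
	}
	enabled, skipped := filterSupported(requested, available)
	fmt.Println("enabled:", enabled)
	fmt.Println("skipped:", skipped)
}
```

With something like this, the collector could log the skipped events and carry on with the rest instead of panicking when a single profiler cannot be set up.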