prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

perf collector not working on EPYC CPU #2469

Open Xaraxia opened 2 years ago

Xaraxia commented 2 years ago

Host operating system:

4.18.0-348.el8.0.2.x86_64 #1 SMP Sun Nov 14 00:51:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.3.1 (branch: HEAD, revision: a2321e7b940ddcff26873612bccdf7cd4c42b6b6)
  build user:       root@243aafa5525c
  build date:       20211205-11:09:49
  go version:       go1.17.3
  platform:         linux/amd64

node_exporter command line flags

--collector.perf

Are you running node_exporter in Docker?

No

What did you do that produced an error?

Ran the exporter

What did you expect to see?

Correctly running, able to get metrics

What did you see instead?

(Note: this was run as root to rule out sysctl/capability issues, but we see the same failure when running as a regular user with those set appropriately.)

ts=2022-09-14T00:55:11.418Z caller=node_exporter.go:182 level=info msg="Starting node_exporter" version="(version=1.3.1, branch=HEAD, revision=a2321e7b940ddcff26873612bccdf7cd4c42b6b6)"
ts=2022-09-14T00:55:11.418Z caller=node_exporter.go:183 level=info msg="Build context" build_context="(go=go1.17.3, user=root@243aafa5525c, date=20211205-11:09:49)"
ts=2022-09-14T00:55:11.418Z caller=node_exporter.go:185 level=warn msg="Node Exporter is running as root user. This exporter is designed to run as unpriviledged user, root is not required."
ts=2022-09-14T00:55:11.419Z caller=filesystem_common.go:111 level=info collector=filesystem msg="Parsed flag --collector.filesystem.mount-points-exclude" flag=^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+)($|/)
ts=2022-09-14T00:55:11.419Z caller=filesystem_common.go:113 level=info collector=filesystem msg="Parsed flag --collector.filesystem.fs-types-exclude" flag=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
panic: Couldn't create metrics handler: couldn't create collector: Failed to setup bus cycles profiler: pid (-1) cpu (0) "no such file or directory"; Failed to setup ref CPU cycles profiler: pid (-1) cpu (0) "no such file or directory"

goroutine 1 [running]:
main.newHandler(0x1, 0x28, {0xbfa3c0, 0xc00035a4c0})
    /app/node_exporter.go:66 +0x245
main.main()
    /app/node_exporter.go:188 +0x1165

I've done a fair bit of digging, and as far as I can tell the error is propagated all the way up the chain from https://pkg.go.dev/golang.org/x/sys/unix#PerfEventOpen

I'm not sure whether the issue is in the perf-utils or higher/lower in the chain.

Perf is installed, perf list hw gives the following:

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]

So I'd expect bus-cycles to fail, because it isn't listed. The source code uses PERF_COUNT_HW_REF_CPU_CYCLES rather than PERF_COUNT_HW_CPU_CYCLES, which seems appropriate from what I've read, but I suspect it is also unsupported on our architecture.
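To make that comparison less manual, here is a minimal Go sketch that parses `perf list hw` output like the listing above and reports which wanted events are absent. It is not part of node_exporter. The function names `parseHardwareEvents` and `missingEvents` are hypothetical, and the event names checked are only the ones discussed in this issue.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// epycOutput is the `perf list hw` listing from the EPYC host in this issue.
const epycOutput = `
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
`

// parseHardwareEvents (hypothetical helper) collects every event name,
// including aliases joined by " OR ", from lines tagged "[Hardware event]".
func parseHardwareEvents(output string) map[string]bool {
	events := make(map[string]bool)
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		line := scanner.Text()
		idx := strings.Index(line, "[Hardware event]")
		if idx < 0 {
			continue // header or blank line
		}
		for _, name := range strings.Split(line[:idx], " OR ") {
			if name = strings.TrimSpace(name); name != "" {
				events[name] = true
			}
		}
	}
	return events
}

// missingEvents (hypothetical helper) returns the wanted events that are
// not in available, preserving the order of wanted.
func missingEvents(available map[string]bool, wanted []string) []string {
	var missing []string
	for _, w := range wanted {
		if !available[w] {
			missing = append(missing, w)
		}
	}
	return missing
}

func main() {
	available := parseHardwareEvents(epycOutput)
	wanted := []string{"bus-cycles", "ref-cycles", "cpu-cycles"}
	fmt.Println(missingEvents(available, wanted)) // prints [bus-cycles ref-cycles]
}
```

On this host both bus-cycles and ref-cycles come back as missing, which matches the two profilers named in the panic.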

I think there need to be configuration options for which of these events are tracked (or automatic detection, though that would be more work to implement for little gain, IMO).
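As a rough illustration of the automatic-detection option, the sketch below tries each event and skips the ones that fail with ENOENT (the "no such file or directory" in the panic) instead of aborting. This is an assumption about how it could be done, not how node_exporter or perf-utils does it. `openFunc` and `setupProfilers` are hypothetical names, and the openers are stubs standing in for real perf_event_open calls.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"syscall"
)

// openFunc is a hypothetical stand-in for whatever opens one perf event.
type openFunc func() error

// setupProfilers tries every opener in name order. Events that fail with
// ENOENT are recorded as skipped; any other error aborts the setup.
func setupProfilers(openers map[string]openFunc) (enabled, skipped []string, err error) {
	names := make([]string, 0, len(openers))
	for name := range openers {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic order for logging and testing

	for _, name := range names {
		openErr := openers[name]()
		switch {
		case openErr == nil:
			enabled = append(enabled, name)
		case errors.Is(openErr, syscall.ENOENT):
			skipped = append(skipped, name) // event not available on this CPU
		default:
			return nil, nil, fmt.Errorf("failed to set up %s profiler: %w", name, openErr)
		}
	}
	return enabled, skipped, nil
}

// sampleOpeners returns stub openers mimicking the EPYC host in this issue.
func sampleOpeners() map[string]openFunc {
	return map[string]openFunc{
		"cpu-cycles": func() error { return nil },
		"bus-cycles": func() error { return syscall.ENOENT },
		"ref-cycles": func() error { return fmt.Errorf("perf_event_open: %w", syscall.ENOENT) },
	}
}

func main() {
	enabled, skipped, err := setupProfilers(sampleOpeners())
	fmt.Println("enabled:", enabled)
	fmt.Println("skipped:", skipped)
	fmt.Println("error:", err)
}
```

Treating only ENOENT as "unsupported" keeps permission or sysctl problems visible as hard errors, so a misconfigured host would still fail loudly.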

Thanks for the work on the exporter.

hodgesds commented 2 years ago

See #2190 for a potential solution. The current perf collector setup is an all-or-nothing approach.

gburek-fastly commented 1 year ago

This is also occurring on our AMD EPYC 7662 64-Core Processor systems.

We get a different crash on Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz where stalled-cycles-backend and stalled-cycles-frontend are missing.

panic: Couldn't create metrics handler: couldn't create collector: Failed to setup stalled fronted cycles profiler: pid (-1) cpu (0) "no such file or directory"; Failed to setup stalled backend cycles profiler: pid (-1) cpu (0) "no such file or directory"

$ perf list hw
List of pre-defined events (to be used in -e):
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]

#2190 looks like a good solution.

hodgesds commented 1 year ago

I think #2191 picked up some of those changes. You probably need to be running 1.4.1 or higher. I opened #2553 for further improvements on error messages and handling of debugfs mounts.

Here's an example from a host without frontend/backend stalled instructions: [image: 2022-12-21-21:10:40]