prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

Node exporter reporting data for too many CPUs on FreeBSD #2025

Open davehayes opened 3 years ago

davehayes commented 3 years ago

Host operating system: output of uname -a

FreeBSD 12.2-STABLE r368820 amd64

node_exporter version: output of node_exporter --version

node_exporter, version 1.0.1 (branch: release-1.0, revision: 0) build user: root build date:
go version: go1.15.6

node_exporter command line flags

--collector.textfile.directory=/some/where --collector.devstat --collector.ntp

Are you running node_exporter in Docker?

No.

What did you do that produced an error?

hw.ncpu is 16, which matches the 16 physical cores, and machdep.hyperthreading_allowed is set to 0. This is a Ryzen 3950X.

node_cpu_seconds_total{ cpu="30", mode="idle" } ... this value is 0, even though the machine has only 16 CPUs. According to our discussion on Matrix, that's a bug.

kern.cp_times is likely the culprit: after the 16 real per-CPU entries it has a run of zeros appended:

# sysctl kern.cp_times
kern.cp_times: 119169 344788 336514 125390 896317803 115801 274757 654468 69366 896129272 53798 361436 309501 74719 896444186 154879 386182 369249 78511 896254839 170832 359904 397446 2341 896313117 178544 288452 480246 2527 896293871 178332 351178 408500 3398 896302256 189411 403573 425310 2473 896222895 158723 367277 508940 2426 896206286 106010 304752 473477 2622 896356800 137008 400359 367762 2198 896336334 172556 412512 416235 2464 896239897 187877 375365 424755 2333 896253331 171498 308979 409341 2573 896351273 184604 406386 432323 2395 896217932 196308 415485 487813 2333 896141701 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
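For anyone unfamiliar with the layout: kern.cp_times is a flat list with five tick counters per CPU (user, nice, system, interrupt, idle on FreeBSD). Below is a minimal Python sketch, not node_exporter's actual code, that splits the list into per-CPU groups and flags the all-zero ones. The function names are hypothetical, and the sample is abbreviated to the first two CPUs from the output above plus two zero entries.

```python
# Mode order of the five per-CPU counters in kern.cp_times on FreeBSD.
MODES = ("user", "nice", "system", "interrupt", "idle")


def split_cp_times(raw):
    """Split the value of `sysctl -n kern.cp_times` into one dict per CPU."""
    ticks = [int(tok) for tok in raw.split()]
    if len(ticks) % len(MODES) != 0:
        raise ValueError("counter count is not a multiple of %d" % len(MODES))
    step = len(MODES)
    return [dict(zip(MODES, ticks[i:i + step])) for i in range(0, len(ticks), step)]


def all_zero_cpus(per_cpu):
    """Return the indices of entries whose counters are all zero."""
    return [idx for idx, counters in enumerate(per_cpu) if not any(counters.values())]


# Abbreviated sample: two real CPUs followed by two zero-filled entries.
sample = (
    "119169 344788 336514 125390 896317803 "
    "115801 274757 654468 69366 896129272 "
    "0 0 0 0 0 0 0 0 0 0"
)
per_cpu = split_cp_times(sample)
print(all_zero_cpus(per_cpu))  # [2, 3]
```

A CPU that has genuinely accumulated zero ticks in every mode is implausible on a running system, which is why the all-zero check is a reasonable way to spot the extra entries.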

Here's the dmesg information on the CPU I have:

CPU: AMD Ryzen 9 3950X 16-Core Processor             (3493.52-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x870f10  Family=0x17  Model=0x71  Stepping=0
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
  AMD Features2=0x75c237ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT,TCE,Topology,PCXC,PNXC,DBE,PL2I,MWAITX,<b30>>
  Structured Extended Features=0x219c91a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,PQM,PQE,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,SHA>
  Structured Extended Features2=0x400004<UMIP,RDPID>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  AMD Extended Feature Extensions ID EBX=0x10cb657<CLZERO,IRPerf,XSaveErPtr>
  SVM: (disabled in BIOS) NP,NRIP,VClean,AFlush,DAssist,NAsids=32768
  TSC: P-state invariant, performance statistics

What did you expect to see?

I expect to see one cpu label in node_cpu_seconds_total per actual CPU, with no cpu label greater than the value of hw.ncpu.

What did you see instead?

Let C be the value of hw.ncpu. I saw node_cpu_seconds_total with cpu labels from C to 2C-1 (16 through 31 here), each with a value of 0.

SuperQ commented 3 years ago

Thanks for the detailed issue. It'd be nice to get some more info from FreeBSD experts on this one.

davehayes commented 3 years ago

I've been searching around various mailing lists. It seems there's information I didn't see before that might help in this case. This information is from the machine that had the bug:

# sysctl -d kern.smp
kern.smp: Kernel SMP
kern.smp.forward_signal_enabled: Forwarding of a signal to a process on a different CPU
kern.smp.topology: Topology override setting; 0 is default provided by hardware.
kern.smp.cores: Number of physical cores online
kern.smp.threads_per_core: Number of SMT threads online per core
kern.smp.cpus: Number of CPUs online
kern.smp.disabled: SMP has been disabled from the loader
kern.smp.active: Indicates system is running in SMP mode
kern.smp.maxcpus: Max number of CPUs that the system was compiled for.
kern.smp.maxid: Max CPU ID.
# sysctl kern.smp
kern.smp.forward_signal_enabled: 1
kern.smp.topology: 0
kern.smp.cores: 16
kern.smp.threads_per_core: 1
kern.smp.cpus: 16
kern.smp.disabled: 0
kern.smp.active: 1
kern.smp.maxcpus: 256
kern.smp.maxid: 31

I think this section of the sysctl MIB tells you all you need to know. My suggestion is to use kern.smp.cores to limit how many kern.cp_times entries get reported.
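A rough Python sketch of that suggestion, under two assumptions that are not established in this thread: that the online CPUs occupy the first slots of kern.cp_times (true for the output above), and that kern.smp.cores equals the number of CPUs to report (true here because threads_per_core is 1; with SMT enabled kern.smp.cpus may be the better limit). The helper names are hypothetical and the tick values are placeholders.

```python
CPUSTATES = 5  # counters per CPU in kern.cp_times


def parse_sysctl(text):
    """Parse `name: value` lines as printed by sysctl into a dict."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep:
            values[name.strip()] = value.strip()
    return values


def limit_cp_times(ticks, smp):
    """Keep only the counters for the first kern.smp.cores CPUs."""
    cores = int(smp["kern.smp.cores"])
    return ticks[: cores * CPUSTATES]


smp = parse_sysctl(
    "kern.smp.cores: 16\n"
    "kern.smp.threads_per_core: 1\n"
    "kern.smp.cpus: 16\n"
    "kern.smp.maxid: 31\n"
)
# Placeholder data: 16 CPUs with non-zero ticks, then 16 zero-filled entries.
ticks = [1] * (16 * CPUSTATES) + [0] * (16 * CPUSTATES)
kept = limit_cp_times(ticks, smp)
print(len(kept) // CPUSTATES)  # 16
```

Truncating by count is the simplest option; if CPU IDs can be sparse, skipping all-zero entries while keeping the original index as the cpu label would be the safer variant.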

zalegrala commented 3 weeks ago

On a 14.1-RELEASE system, I have the following.

$ sysctl -d kern.smp
kern.smp: Kernel SMP
kern.smp.forward_signal_enabled: Forwarding of a signal to a process on a different CPU
kern.smp.topology: Topology override setting; 0 is default provided by hardware.
kern.smp.cores: Number of physical cores online
kern.smp.threads_per_core: Number of SMT threads online per core
kern.smp.cpus: Number of CPUs online
kern.smp.disabled: SMP has been disabled from the loader
kern.smp.active: Indicates system is running in SMP mode
kern.smp.maxcpus: Max number of CPUs that the system was compiled for.
kern.smp.maxid: Max CPU ID.

$ sysctl kern.smp
kern.smp.forward_signal_enabled: 1
kern.smp.topology: 0
kern.smp.cores: 8
kern.smp.threads_per_core: 1
kern.smp.cpus: 8
kern.smp.disabled: 0
kern.smp.active: 1
kern.smp.maxcpus: 1024
kern.smp.maxid: 7

$ sysctl kern.cp_times
kern.cp_times: 1355469 75965 1390756 319500 49046024 1365745 77106 1395159 315505 49034199 1332836 75027 1263967 529761 48986123 1364558 77122 1393860 321999 49030175 1365142 76037 1404528 312600 49029407 1368658 77946 1384338 309564 49047208 1366558 74164 1401735 315365 49029892 1360227 74035 1370883 357813 49024756

$ sysctl hw.ncpu
hw.ncpu: 8
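As a quick cross-check of the numbers above, a short Python sketch (hypothetical variable names, values copied from this comment) that counts the per-CPU entries in kern.cp_times and compares the count with hw.ncpu and kern.smp.maxid + 1. That the list length follows maxid + 1 is an assumption drawn from the two outputs in this thread, not something confirmed here.

```python
CPUSTATES = 5  # counters per CPU in kern.cp_times

cp_times = (
    "1355469 75965 1390756 319500 49046024 "
    "1365745 77106 1395159 315505 49034199 "
    "1332836 75027 1263967 529761 48986123 "
    "1364558 77122 1393860 321999 49030175 "
    "1365142 76037 1404528 312600 49029407 "
    "1368658 77946 1384338 309564 49047208 "
    "1366558 74164 1401735 315365 49029892 "
    "1360227 74035 1370883 357813 49024756"
)
hw_ncpu = 8  # from `sysctl hw.ncpu`
maxid = 7    # from kern.smp.maxid

entries = len(cp_times.split()) // CPUSTATES
consistent = entries == hw_ncpu == maxid + 1
print(entries, consistent)  # 8 True
```

On this system maxid + 1 equals hw.ncpu, so there are no extra entries; on the original reporter's machine maxid was 31 with 16 CPUs online.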