prometheus / procfs

procfs provides functions to retrieve system, kernel and process metrics from the pseudo-filesystem proc.
Apache License 2.0

Support metrics for offline CPUs #84

Open mjtrangoni opened 6 years ago

mjtrangoni commented 6 years ago

Hi @rtreffer @SuperQ, this issue is related to #873.

When parsing the /proc/stat file, the metrics for the last group of offline CPUs (cpu154 to cpu159) are missing. As @brian-brazil said, the CPU metrics should always be there.

See:

# lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                160
On-line CPU(s) list:   0,1,8,9,16,17,24,25,32,33,40,41,48,49,56,57,64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121,128,129,136,137,144,145,152,153
Off-line CPU(s) list:  2-7,10-15,18-23,26-31,34-39,42-47,50-55,58-63,66-71,74-79,82-87,90-95,98-103,106-111,114-119,122-127,130-135,138-143,146-151,154-159
Thread(s) per core:    2
Core(s) per socket:    5
Socket(s):             4
NUMA node(s):          4
Model:                 2.1 (pvr 004b 0201)
Model name:            POWER8E (raw), altivec supported
L1d cache:             64K
L1i cache:             32K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0,1,8,9,16,17,24,25,32,33
NUMA node1 CPU(s):     40,41,48,49,56,57,64,65,72,73
NUMA node16 CPU(s):    80,81,88,89,96,97,104,105,112,113
NUMA node17 CPU(s):    120,121,128,129,136,137,144,145,152,153
cpu  8955653 5338 10313729 6891866013 1194210 0 38962 0 0 0
cpu0 138803 9 56167 172504763 16187 0 1296 0 0 0
cpu1 322651 754 427280 171926334 25235 0 1291 0 0 0
cpu8 199865 3 91024 172386730 20071 0 646 0 0 0
cpu9 326453 474 412719 171934902 24410 0 723 0 0 0
cpu16 181309 2 66982 172442461 21437 0 788 0 0 0
cpu17 317509 348 398066 171978692 19749 0 711 0 0 0
cpu24 162611 8 61226 172478776 28065 0 707 0 0 0
cpu25 320518 335 402933 171988746 27002 0 653 0 0 0
cpu32 167024 9 60329 172464237 24645 0 857 0 0 0
cpu33 300664 484 388081 171994667 15890 0 721 0 0 0
cpu40 149963 1 97562 172440250 57631 0 1636 0 0 0
cpu41 349011 123 504120 171857581 42197 0 2032 0 0 0
cpu48 119442 1 74060 172508062 37574 0 2162 0 0 0
cpu49 346802 119 487142 171870884 36441 0 2296 0 0 0
cpu56 133608 3 73781 172488230 30166 0 1639 0 0 0
cpu57 340640 144 493004 171860490 33535 0 2412 0 0 0
cpu64 122117 5 68766 172506171 37048 0 1620 0 0 0
cpu65 346848 142 490790 171861649 44282 0 1396 0 0 0
cpu72 138939 3 67941 172506876 29311 0 1300 0 0 0
cpu73 349307 172 496688 171860930 35681 0 1120 0 0 0
cpu80 139125 92 94140 172450292 54207 0 659 0 0 0
cpu81 295747 183 411438 172009728 31455 0 593 0 0 0
cpu88 96750 62 60035 172573444 25950 0 563 0 0 0
cpu89 319147 509 476759 171926997 34378 0 489 0 0 0
cpu96 101846 22 78433 172521391 21805 0 624 0 0 0
cpu97 275081 192 401865 172034352 27952 0 491 0 0 0
cpu104 117902 134 74486 172523631 25512 0 683 0 0 0
cpu105 266655 380 426705 172028466 29963 0 488 0 0 0
cpu112 97858 34 47786 172583361 23911 0 598 0 0 0
cpu113 287918 184 437298 171997757 27889 0 468 0 0 0
cpu120 129084 14 66533 172521771 23712 0 795 0 0 0
cpu121 362991 50 524281 171833723 27812 0 779 0 0 0
cpu128 120565 4 57278 172552862 20093 0 858 0 0 0
cpu129 328225 144 477274 171917779 24834 0 657 0 0 0
cpu136 101479 4 55334 172573896 17711 0 723 0 0 0
cpu137 310827 84 447858 171966775 23698 0 619 0 0 0
cpu144 120642 0 52113 172561125 17948 0 696 0 0 0
cpu145 279375 37 412635 172039812 20124 0 559 0 0 0
cpu152 88002 7 48537 172507258 81117 0 633 0 0 0
cpu153 279513 63 420666 172016877 25035 0 969 0 0 0
intr 1382710716 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 301836321 4861826 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1074012 460 97 64 627135 48 28 22 711427 78 58 13 718673 25 63 26 71
4420 96 64 19 1227903 240 356 379 1135858 266 173 36 1087877 488 336 265 1165729 526 215 94 1034760 368 314 143 1028797 47 71 59 873159 77 42 50 832224 104 156 94 844309 90 106 96 793847 93
29 75 1050764 35 18 1 546452 434979 48 39 4 18 22 0 458336 399137 40 61 11 32 23 0 455018 358193 54 3 0 4 18 1 345891 369552 25 14 31 6 28 34 0 1915 1172145 572915 910960 391591 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 515 290 91 667877 63 14 41 723471 21 18 27 710718 113 42 98 698773 11 70 45 1212026 677 589 614 1161716 251 318 213 1130302
234 265 308 1156661 516 137 323 1038834 70 117 135 1007899 25 37 81 851490 53 60 57 827356 41 28 39 832789 54 25 7 784532 62 75 26 816984 42 22 64 552798 437519 40 66 23 22 36 2 448306 45867
0 47 71 1 12 18 15 433335 406005 29 9 41 13 1 0 16 0 0 0 6 377927 477440 58 25 19 13 15 7 0 0 4292776 1009617 403 169 108 685906 51 51 47 706091 73 85 65 724938 87 25 96 729496 78 31 58 1210
148 115 236 210 1169285 206 312 209 1195163 327 210 297 1227099 400 437 369 1082046 273 66 175 926198 51 62 55 858369 81 43 41 819595 74 39 31 834143 87 1 7 814523 114 64 14 766292 32 45 46
551690 541114 52 51 21 0 1 5 480895 501190 26 24 11 0 0 11 448565 377180 6 16 1 19 0 1 360328 394864 25 20 3 27 18 34 1124560 45651 467041 0 8597 15531 1 1 4 4 356295 344451 4 22 38 14 42976
8 376309 10 5 8 18 478115 484921 20 5 8 29 561639 546080 74 96 46 19 800801 42 0 0 13 0 785645 13 15 4 1 200 883121 38 37 63 171 241 1217454 325 352 205 537 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1336145 297 250 0 0 0 0 64 1189213 114 716263 56 106 728956 50 128 669205 49 60 656438 65 447284 836474 325 447284 447284 447284 0 0 447285 447285 447284 447284 447285 32 447283 447284 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 447285 447285 447285 447283 158 62 76 121 85 259 214 340 209 1271402 424 1222407 167 862852 0 820882 0 697519 21 15 1 19 34 0 18 22 20 1083606 1 1 1 1 1 1 1 1
 1 1 1 1 1 55650520 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 5 0 0 0 0 0 0 0 0 0 0 55658426 12660943 853479 80702 4518888 205165 71525 405256 18899159 0 0 0 0 0 0 0 2
2538547 3375254 1736399 680537 20693729 12337920 3928537 851492 0 0 0 0 0 0 0 0
ctxt 1489044540
btime 1521031332
processes 17180357
procs_running 1
procs_blocked 0
softirq 928568474 562 279676948 9739565 110787468 78190096 6693156 9823146 255584320 0 178073213
$ curl -s xxx:9100/metrics | egrep -w -v -e '(HELP|TYPE)' | grep node_cpu_seconds_total | grep 'cpu="151"'
node_cpu_seconds_total{cpu="151",mode="idle"} 0
node_cpu_seconds_total{cpu="151",mode="iowait"} 0
node_cpu_seconds_total{cpu="151",mode="irq"} 0
node_cpu_seconds_total{cpu="151",mode="nice"} 0
node_cpu_seconds_total{cpu="151",mode="softirq"} 0
node_cpu_seconds_total{cpu="151",mode="steal"} 0
node_cpu_seconds_total{cpu="151",mode="system"} 0
node_cpu_seconds_total{cpu="151",mode="user"} 0
$ curl -s xxx:9100/metrics | egrep -w -v -e '(HELP|TYPE)' | grep node_cpu_seconds_total | grep 'cpu="152"'                                                                                                                   
node_cpu_seconds_total{cpu="152",mode="idle"} 1.72526771e+06
node_cpu_seconds_total{cpu="152",mode="iowait"} 811.17
node_cpu_seconds_total{cpu="152",mode="irq"} 0
node_cpu_seconds_total{cpu="152",mode="nice"} 0.07
node_cpu_seconds_total{cpu="152",mode="softirq"} 6.33
node_cpu_seconds_total{cpu="152",mode="steal"} 0
node_cpu_seconds_total{cpu="152",mode="system"} 485.37
node_cpu_seconds_total{cpu="152",mode="user"} 880.05
$ curl -s xxx:9100/metrics | egrep -w -v -e '(HELP|TYPE)' | grep node_cpu_seconds_total | grep 'cpu="153"'                                                                                                                   
node_cpu_seconds_total{cpu="153",mode="idle"} 1.72036582e+06
node_cpu_seconds_total{cpu="153",mode="iowait"} 250.35
node_cpu_seconds_total{cpu="153",mode="irq"} 0
node_cpu_seconds_total{cpu="153",mode="nice"} 0.63
node_cpu_seconds_total{cpu="153",mode="softirq"} 9.69
node_cpu_seconds_total{cpu="153",mode="steal"} 0
node_cpu_seconds_total{cpu="153",mode="system"} 4206.78
node_cpu_seconds_total{cpu="153",mode="user"} 2795.17
$ curl -s xxx:9100/metrics | egrep -w -v -e '(HELP|TYPE)' | grep node_cpu_seconds_total | grep 'cpu="154"'       
(no metrics)
[...]
$ curl -s xxx:9100/metrics | egrep -w -v -e '(HELP|TYPE)' | grep node_cpu_seconds_total | grep 'cpu="159"'       
(no metrics)

To summarize: the metrics for every offline CPU up to cpu151 are all zero, while those for the last group, cpu{154..159}, are missing completely.

mjtrangoni commented 6 years ago

As I wrote in #873, one possible fix would be to iterate over the CPUs listed in the /sys/devices/system/cpu/present file. See:

grep . /sys/devices/system/cpu/{online,offline,possible,present}
/sys/devices/system/cpu/online:0-1,8-9,16-17,24-25,32-33,40-41,48-49,56-57,64-65,72-73,80-81,88-89,96-97,104-105,112-113,120-121,128-129,136-137,144-145,152-153
/sys/devices/system/cpu/offline:2-7,10-15,18-23,26-31,34-39,42-47,50-55,58-63,66-71,74-79,82-87,90-95,98-103,106-111,114-119,122-127,130-135,138-143,146-151,154-159
/sys/devices/system/cpu/possible:0-159
/sys/devices/system/cpu/present:0-159

grobie commented 6 years ago

Thanks for the information, @mjtrangoni. I took the liberty of renaming the issue a bit. There is no support for offline CPUs available in this procfs library at the moment. Furthermore, the node_exporter doesn't currently use this procfs library for CPU metrics, but this is something we should change. Your /proc/stat file will be of great help in writing a test.

mjtrangoni commented 6 years ago

Hi @grobie, I found this while looking at node_exporter's collector/cpu_linux.go, at line 199, where it calls the updateStat function. Iterating over the present file will give you the total number of present CPUs. In the /proc/stat file you see only the online ones. See also this

grobie commented 6 years ago

Thanks for the clarification @mjtrangoni, my bad, I had missed that the node_exporter uses procfs by now.

In either case, I don't think the procfs library does anything wrong here; it's meant to be a library for accessing procfs information from Go. It would be the node_exporter's responsibility to combine the information from /sys/devices/system/cpu/present and /proc/stat. We should support parsing the /sys/devices/system/cpu/* information in the sysfs package though.
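
To make the proposed split concrete, here is a minimal Go sketch of how a caller might combine the two sources. It is not code from procfs or node_exporter: parseCPUList and missingCPUs are hypothetical helpers, and the inputs are string literals copied from the sysfs output quoted earlier in this thread rather than files read from disk.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList expands a sysfs CPU list such as "0-1,8-9,16" into
// individual CPU numbers. Hypothetical helper, not part of procfs.
func parseCPUList(s string) ([]uint, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return nil, nil
	}
	var cpus []uint
	for _, part := range strings.Split(s, ",") {
		bounds := strings.SplitN(part, "-", 2)
		first, err := strconv.ParseUint(bounds[0], 10, 32)
		if err != nil {
			return nil, fmt.Errorf("invalid CPU list entry %q: %w", part, err)
		}
		last := first
		if len(bounds) == 2 {
			last, err = strconv.ParseUint(bounds[1], 10, 32)
			if err != nil {
				return nil, fmt.Errorf("invalid CPU list entry %q: %w", part, err)
			}
		}
		if last < first {
			return nil, fmt.Errorf("descending range %q", part)
		}
		for n := first; n <= last; n++ {
			cpus = append(cpus, uint(n))
		}
	}
	return cpus, nil
}

// missingCPUs returns the present CPUs that have no entry in seen,
// where seen stands for the cpuN lines found in /proc/stat.
func missingCPUs(present []uint, seen map[uint]bool) []uint {
	var missing []uint
	for _, n := range present {
		if !seen[n] {
			missing = append(missing, n)
		}
	}
	return missing
}

func main() {
	// Values copied from the sysfs output quoted in this thread.
	present, err := parseCPUList("0-159")
	if err != nil {
		panic(err)
	}
	online, err := parseCPUList("0-1,8-9,16-17,24-25,32-33,40-41,48-49,56-57,64-65,72-73,80-81,88-89,96-97,104-105,112-113,120-121,128-129,136-137,144-145,152-153")
	if err != nil {
		panic(err)
	}
	// /proc/stat only lists online CPUs, so the online set stands in
	// for "CPUs seen in /proc/stat" in this sketch.
	seen := make(map[uint]bool, len(online))
	for _, n := range online {
		seen[n] = true
	}
	missing := missingCPUs(present, seen)
	fmt.Printf("%d present, %d in /proc/stat, %d left for the caller to fill in\n",
		len(present), len(seen), len(missing))
}
```

Whether the caller then reports zeros for the missing CPUs or omits them is a policy decision that stays outside the parsing code, which is the separation argued for above.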

mjtrangoni commented 6 years ago

What I think is wrong, or at least not 100% right, is that in my case all the metrics of the intermediate offline CPUs are initialized to 0, while the ones from cpu154 onwards are not. I have to double-check that! I really like the idea of exposing the /sys/devices/system/cpu/* information, but I am not sure how much of it. I can make a PR exporting the /sys/devices/system/cpu/{online,offline,possible,present} information as a first implementation.
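
One way such an asymmetry can arise is sketched below in Go. This is an illustration under an assumption, not the library's actual code: it assumes the parser stores per-CPU stats in a slice indexed by CPU number and grows it on demand. The cpuStat type and parseDense function are hypothetical, and the sample lines are taken from the /proc/stat output above.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// cpuStat keeps two /proc/stat columns. Illustrative type only,
// not the procfs CPUStat.
type cpuStat struct {
	User   uint64
	System uint64
}

// parseDense stores each cpuN line at index N of a slice, growing the
// slice as needed (the assumed behaviour being illustrated).
func parseDense(stat string) ([]cpuStat, error) {
	var stats []cpuStat
	sc := bufio.NewScanner(strings.NewReader(stat))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Skip the aggregate "cpu" line and every non-CPU line.
		if len(fields) < 4 || fields[0] == "cpu" || !strings.HasPrefix(fields[0], "cpu") {
			continue
		}
		id, err := strconv.Atoi(strings.TrimPrefix(fields[0], "cpu"))
		if err != nil {
			return nil, err
		}
		user, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, err
		}
		system, err := strconv.ParseUint(fields[3], 10, 64)
		if err != nil {
			return nil, err
		}
		// Growing up to index id creates zero-valued entries for every
		// CPU number that was skipped in the file.
		for len(stats) <= id {
			stats = append(stats, cpuStat{})
		}
		stats[id] = cpuStat{User: user, System: system}
	}
	return stats, sc.Err()
}

func main() {
	sample := `cpu  8955653 5338 10313729 6891866013 1194210 0 38962 0 0 0
cpu0 138803 9 56167 172504763 16187 0 1296 0 0 0
cpu1 322651 754 427280 171926334 25235 0 1291 0 0 0
cpu8 199865 3 91024 172386730 20071 0 646 0 0 0
`
	stats, err := parseDense(sample)
	if err != nil {
		panic(err)
	}
	for id, s := range stats {
		fmt.Printf("cpu%d user=%d system=%d\n", id, s.User, s.System)
	}
}
```

With this layout, offline CPUs that sit between two online ones come out as all-zero entries, while offline CPUs numbered above the highest online CPU never get an entry at all.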

grobie commented 6 years ago

You're right, we're currently exposing an array with zero-value CPUStat entries for offline CPUs. It was an oversight on my side (I wasn't thinking of offline CPUs). I don't believe this library should make up any information that is not actually exposed by procfs, so we should change the returned data type. Either a map[uint]CPUStat would work (but maps are unordered in Go, which is not so great), or we add a Processor uint attribute to CPUStat.
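
The two options can be compared with a small Go sketch. The type and function names here (cpuTimes, cpuStatWithID, sortedProcessors) are hypothetical stand-ins rather than the library's definitions, and the values are taken from the cpu0, cpu1 and cpu8 lines above.

```go
package main

import (
	"fmt"
	"sort"
)

// cpuTimes is the value type for the map option; the CPU number lives
// in the map key. Hypothetical type for illustration.
type cpuTimes struct {
	User   uint64
	System uint64
}

// cpuStatWithID is the slice option: each entry carries its own CPU
// number, so the order of lines in /proc/stat is preserved.
type cpuStatWithID struct {
	Processor uint
	User      uint64
	System    uint64
}

// sortedProcessors returns the map keys in ascending order, which the
// map option needs whenever callers want deterministic output.
func sortedProcessors(m map[uint]cpuTimes) []uint {
	ids := make([]uint, 0, len(m))
	for id := range m {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	// Option A: map keyed by CPU number, no entries for offline CPUs.
	byID := map[uint]cpuTimes{
		0: {User: 138803, System: 56167},
		1: {User: 322651, System: 427280},
		8: {User: 199865, System: 91024},
	}
	for _, id := range sortedProcessors(byID) {
		fmt.Printf("map:   cpu%d user=%d\n", id, byID[id].User)
	}

	// Option B: slice with an explicit Processor field, again without
	// any made-up entries for offline CPUs.
	list := []cpuStatWithID{
		{Processor: 0, User: 138803, System: 56167},
		{Processor: 1, User: 322651, System: 427280},
		{Processor: 8, User: 199865, System: 91024},
	}
	for _, s := range list {
		fmt.Printf("slice: cpu%d user=%d\n", s.Processor, s.User)
	}
}
```

Both shapes avoid inventing data for CPUs that /proc/stat does not list. The map needs an explicit sort for stable iteration, while the slice keeps file order at the cost of a linear search when looking up a specific CPU.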