golang / go

The Go programming language
https://go.dev

proposal: runtime/pprof: add PMU-based profiles #36821

Open chabbimilind opened 4 years ago

chabbimilind commented 4 years ago

What version of Go are you using (go version)?

$ go version
go version go1.13.4 linux/amd64

Does this issue reproduce with the latest release?

Yes; tested on go version devel +74d366f484 Mon Jan 27 20:45:39 2020 +0000 linux/amd64

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/xxx/.cache/go-build"
GOENV="/home/xxx/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/xxx/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/xxx/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/xxx/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build696469159=/tmp/go-build -gno-record-gcc-switches"

What did you do?

The following experiments demonstrate that pprof CPU profiles lack accuracy (closeness to the ground truth) and precision (repeatability across runs). The tests are highly predictable in nature and incur little runtime overhead (allocation, GC, system calls, etc.). They are carefully designed so that we can compare pprof CPU profiles against our expectations, and the evaluation shows a wide deviation from those expectations. This should not be confused with a runtime issue; the problem is the OS timers used for sampling, which are coarse-grained and have a high skid. In a proposal/design document that will follow, I will propose a design to extend CPU profiling by sampling the CPU Performance Monitoring Unit (PMU), aka hardware performance counters. PMU event-based sampling is a mature technology available on almost all modern CPUs, and the PMU as a sampling agent offers many benefits.

There are two test cases goroutine.go and serial.go in this issue.

test 1 (parallel)

Download the first test case, the goroutine.go program, from https://github.com/chabbimilind/GoPprofDemo/blob/master/goroutine.go. It has ten identical goroutines, f1-f10, and uses pprof to collect the CPU profiles. Run the following command several times and notice that each time pprof reports a different amount of time spent in each of the ten routines.

go run goroutine.go && go tool pprof -top goroutine_prof

test 2 (serial)

Download the second test case, the serial.go program, from https://github.com/chabbimilind/GoPprofDemo/blob/master/serial.go. It has ten functions (A_expect_1_82 - J_expect_18_18). The function A_expect_1_82 is expected to consume 1.82% of the total execution time, J_expect_18_18 is expected to consume 18.18%, and so on. The code is serial, and there is complete data dependence between each function and each iteration of the loop in the functions to avoid any hardware-level optimizations. Run the following command several times:

go run serial.go && go tool pprof -top serial_prof
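For reference, such a serial test can be structured roughly as follows. This is a minimal sketch based on the description above, not the actual serial.go; the loop body, the use of the -m flag, and the profile setup are illustrative assumptions, and only three of the ten functions are spelled out.

```
package main

import (
	"flag"
	"fmt"
	"log"
	"os"
	"runtime/pprof"
)

// work performs n data-dependent additions so that neither the compiler
// nor the hardware can overlap or elide the iterations.
func work(n int) uint64 {
	acc := uint64(1)
	for i := 0; i < n; i++ {
		acc = acc*31 + uint64(i) // each iteration depends on the previous one
	}
	return acc
}

// Ten functions run 1x..10x the base iteration count, so the expected CPU
// share grows from 1/55 (1.82%) for A to 10/55 (18.18%) for J.
func A_expect_1_82(m int) uint64  { return work(1 * m) }
func B_expect_3_64(m int) uint64  { return work(2 * m) }
func J_expect_18_18(m int) uint64 { return work(10 * m) }

func main() {
	m := flag.Int("m", 100000, "base iteration count per function")
	flag.Parse()

	f, err := os.Create("serial_prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	var sink uint64
	for i := 0; i < 100; i++ { // the functions are called in order, repeatedly
		sink += A_expect_1_82(*m)
		sink += B_expect_3_64(*m)
		// ... C_expect_5_46 through I_expect_16_36 would be called here ...
		sink += J_expect_18_18(*m)
	}
	fmt.Println(sink) // keep the result observable
}
```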

What did you expect to see?

For goroutine.go, each function (f1-f10) should be attributed exactly (or almost exactly) 10% of the execution time on each run.

For serial.go the time attribution should roughly follow the distribution shown below in each run.

FUNCTION NAME           EXPECTED RELATIVE TIME

A_expect_1_82           1.82%
B_expect_3_64           3.64%
C_expect_5_46           5.46%
D_expect_7_27           7.27%
E_expect_9_09           9.09%
F_expect_10_91          10.91%
G_expect_12_73          12.73%
H_expect_14_546         14.546%
I_expect_16_36          16.36%
J_expect_18_18          18.18%

What did you see instead?

test 1 (parallel)

Runs 1, 2, and 3, shown below, show the pprof -top output of goroutine.go. You will notice that within a single run (say Run 1), f1-f10 have a wide variance in the time attributed to them, whereas the expectation is that each of them gets 10% of the execution time. There is up to a 6x difference between the function with the highest amount of attribution (main.f7, 4210ms, in Run 1) and the function with the lowest amount of attribution (main.f9, 700ms, in Run 1). This shows the poor accuracy (deviation from the ground truth) of pprof's timer-based profiles.

Furthermore, the time attributed to a function varies widely from run to run. Notice how the top-10 ordering changes: in Run 1, main.f7 is shown to run for 4210ms, whereas in Run 2 it is shown to run for only 520ms. The expectation is that the measurements remain the same from run to run. This shows the poor precision (unrepeatability of measurement) of pprof's timer-based profiles.

goroutine.go/Run 1:

File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:45pm (PST)
Duration: 6.70s, Total samples = 18060ms (269.37%)
Showing nodes accounting for 18060ms, 100% of 18060ms total
      flat  flat%   sum%        cum   cum%
    4210ms 23.31% 23.31%     4210ms 23.31%  main.f7
    2610ms 14.45% 37.76%     2610ms 14.45%  main.f2
    2010ms 11.13% 48.89%     2010ms 11.13%  main.f6
    1810ms 10.02% 58.91%     1810ms 10.02%  main.f10
    1780ms  9.86% 68.77%     1780ms  9.86%  main.f3
    1410ms  7.81% 76.58%     1410ms  7.81%  main.f1
    1310ms  7.25% 83.83%     1310ms  7.25%  main.f4
    1110ms  6.15% 89.98%     1110ms  6.15%  main.f5
    1110ms  6.15% 96.12%     1110ms  6.15%  main.f8
     700ms  3.88%   100%      700ms  3.88%  main.f9

goroutine.go/Run 2:

File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:45pm (PST)
Duration: 6.71s, Total samples = 17400ms (259.39%)
Showing nodes accounting for 17400ms, 100% of 17400ms total
      flat  flat%   sum%        cum   cum%
    3250ms 18.68% 18.68%     3250ms 18.68%  main.f2
    2180ms 12.53% 31.21%     2180ms 12.53%  main.f9
    2100ms 12.07% 43.28%     2100ms 12.07%  main.f1
    1770ms 10.17% 53.45%     1770ms 10.17%  main.f6
    1700ms  9.77% 63.22%     1700ms  9.77%  main.f5
    1550ms  8.91% 72.13%     1550ms  8.91%  main.f4
    1500ms  8.62% 80.75%     1500ms  8.62%  main.f8
    1440ms  8.28% 89.02%     1440ms  8.28%  main.f3
    1390ms  7.99% 97.01%     1390ms  7.99%  main.f10
     520ms  2.99%   100%      520ms  2.99%  main.f7

goroutine.go/Run 3:

File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:48pm (PST)
Duration: 6.71s, Total samples = 17.73s (264.31%)
Showing nodes accounting for 17.73s, 100% of 17.73s total
      flat  flat%   sum%        cum   cum%
     3.74s 21.09% 21.09%      3.74s 21.09%  main.f7
     2.08s 11.73% 32.83%      2.08s 11.73%  main.f9
     2.05s 11.56% 44.39%      2.05s 11.56%  main.f2
     1.85s 10.43% 54.82%      1.85s 10.43%  main.f10
     1.78s 10.04% 64.86%      1.78s 10.04%  main.f1
     1.43s  8.07% 72.93%      1.43s  8.07%  main.f3
     1.42s  8.01% 80.94%      1.42s  8.01%  main.f8
     1.18s  6.66% 87.59%      1.18s  6.66%  main.f6
     1.17s  6.60% 94.19%      1.17s  6.60%  main.f5
     1.03s  5.81%   100%      1.03s  5.81%  main.f4

test 2 (serial)

The output of go run serial.go && go tool pprof -top serial_prof for three runs is shown below. Comparing the flat% (or cum%) against the expected percentage for each function shows a large difference. For example, main.H_expect_14_546, which is expected to take 14.546% of the execution time, is attributed 25% of the execution time in Run 1. Furthermore, from run to run there is a lack of precision; for example, main.I_expect_16_36 is attributed 6.25% (20ms) of the execution time in Run 1, whereas it is attributed 21.88% (70ms) in Run 2.

serial.go/Run 1:

File: serial
Type: cpu
Time: Jan 27, 2020 at 1:42pm (PST)
Duration: 501.51ms, Total samples = 320ms (63.81%)
Showing nodes accounting for 320ms, 100% of 320ms total
      flat  flat%   sum%        cum   cum%
      80ms 25.00% 25.00%       80ms 25.00%  main.H_expect_14_546
      80ms 25.00% 50.00%       80ms 25.00%  main.J_expect_18_18
      60ms 18.75% 68.75%       60ms 18.75%  main.G_expect_12_73
      20ms  6.25% 75.00%       20ms  6.25%  main.B_expect_3_64
      20ms  6.25% 81.25%       20ms  6.25%  main.D_expect_7_27
      20ms  6.25% 87.50%       20ms  6.25%  main.F_expect_10_91
      20ms  6.25% 93.75%       20ms  6.25%  main.I_expect_16_36
      10ms  3.12% 96.88%       10ms  3.12%  main.A_expect_1_82
      10ms  3.12%   100%       10ms  3.12%  main.C_expect_5_46
         0     0%   100%      320ms   100%  main.main
         0     0%   100%      320ms   100%  runtime.main

serial.go/Run 2:

File: serial
Type: cpu
Time: Jan 27, 2020 at 1:44pm (PST)
Duration: 501.31ms, Total samples = 320ms (63.83%)
Showing nodes accounting for 320ms, 100% of 320ms total
      flat  flat%   sum%        cum   cum%
      70ms 21.88% 21.88%       70ms 21.88%  main.I_expect_16_36
      50ms 15.62% 37.50%       50ms 15.62%  main.J_expect_18_18
      40ms 12.50% 50.00%       40ms 12.50%  main.E_expect_9_09
      40ms 12.50% 62.50%       40ms 12.50%  main.F_expect_10_91
      40ms 12.50% 75.00%       40ms 12.50%  main.H_expect_14_546
      30ms  9.38% 84.38%       30ms  9.38%  main.D_expect_7_27
      20ms  6.25% 90.62%       20ms  6.25%  main.B_expect_3_64
      20ms  6.25% 96.88%       20ms  6.25%  main.G_expect_12_73
      10ms  3.12%   100%       10ms  3.12%  main.C_expect_5_46
         0     0%   100%      320ms   100%  main.main
         0     0%   100%      320ms   100%  runtime.main

serial.go/Run 3:

File: serial
Type: cpu
Time: Jan 27, 2020 at 1:45pm (PST)
Duration: 501.39ms, Total samples = 310ms (61.83%)
Showing nodes accounting for 310ms, 100% of 310ms total
      flat  flat%   sum%        cum   cum%
     110ms 35.48% 35.48%      110ms 35.48%  main.J_expect_18_18
      70ms 22.58% 58.06%       70ms 22.58%  main.G_expect_12_73
      60ms 19.35% 77.42%       60ms 19.35%  main.F_expect_10_91
      30ms  9.68% 87.10%       30ms  9.68%  main.I_expect_16_36
      20ms  6.45% 93.55%       20ms  6.45%  main.H_expect_14_546
      10ms  3.23% 96.77%       10ms  3.23%  main.B_expect_3_64
      10ms  3.23%   100%       10ms  3.23%  main.C_expect_5_46
         0     0%   100%      310ms   100%  main.main
         0     0%   100%      310ms   100%  runtime.main

Improved results with PMU-based profiling.

In a prototype PMU-based profiling implementation, below are the pprof profiles for the CPU cycles hardware performance counter for the same goroutine.go program. Notice that each function gets the same (or almost the same) CPU cycles attribution within a single run and across runs.

goroutine.go/Run 1:

File: goroutine
Type: cycles
Time: Jan 27, 2020 at 4:49pm (PST)
Showing nodes accounting for 234000000000, 100% of 234000000000 total
      flat  flat%   sum%        cum   cum%
23400000000 10.00% 10.00% 23400000000 10.00%  main.f1
23400000000 10.00% 20.00% 23400000000 10.00%  main.f10
23400000000 10.00% 30.00% 23400000000 10.00%  main.f2
23400000000 10.00% 40.00% 23400000000 10.00%  main.f3
23400000000 10.00% 50.00% 23400000000 10.00%  main.f4
23400000000 10.00% 60.00% 23400000000 10.00%  main.f5
23400000000 10.00% 70.00% 23400000000 10.00%  main.f6
23400000000 10.00% 80.00% 23400000000 10.00%  main.f7
23400000000 10.00% 90.00% 23400000000 10.00%  main.f8
23400000000 10.00%   100% 23400000000 10.00%  main.f9

goroutine.go/Run 2:

File: goroutine
Type: cycles
Time: Jan 27, 2020 at 4:51pm (PST)
Showing nodes accounting for 234000000000, 100% of 234000000000 total
      flat  flat%   sum%        cum   cum%
23800000000 10.17% 10.17% 23800000000 10.17%  main.f1
23500000000 10.04% 20.21% 23500000000 10.04%  main.f7
23500000000 10.04% 30.26% 23500000000 10.04%  main.f9
23400000000 10.00% 40.26% 23400000000 10.00%  main.f10
23400000000 10.00% 50.26% 23400000000 10.00%  main.f2
23400000000 10.00% 60.26% 23400000000 10.00%  main.f4
23400000000 10.00% 70.26% 23400000000 10.00%  main.f6
23400000000 10.00% 80.26% 23400000000 10.00%  main.f8
23300000000  9.96% 90.21% 23300000000  9.96%  main.f3
22900000000  9.79%   100% 22900000000  9.79%  main.f5

Below are the pprof profiles for the CPU cycles hardware performance counter for the same serial.go program. Notice that each function gets close to its expected CPU cycles attribution within a single run and across runs.

serial.go/Run 1:

File: serial
Type: cycles
Time: Jan 27, 2020 at 4:54pm (PST)
Showing nodes accounting for 1105000000, 100% of 1105000000 total
      flat  flat%   sum%        cum   cum%
 200000000 18.10% 18.10%  200000000 18.10%  main.J_expect_18_18
 183000000 16.56% 34.66%  183000000 16.56%  main.I_expect_16_36
 165000000 14.93% 49.59%  165000000 14.93%  main.H_expect_14_546
 137000000 12.40% 61.99%  137000000 12.40%  main.G_expect_12_73
 120000000 10.86% 72.85%  120000000 10.86%  main.F_expect_10_91
 100000000  9.05% 81.90%  100000000  9.05%  main.E_expect_9_09
  82000000  7.42% 89.32%   82000000  7.42%  main.D_expect_7_27
  63000000  5.70% 95.02%   63000000  5.70%  main.C_expect_5_46
  37000000  3.35% 98.37%   37000000  3.35%  main.B_expect_3_64
  18000000  1.63%   100%   18000000  1.63%  main.A_expect_1_82
         0     0%   100% 1105000000   100%  main.main
         0     0%   100% 1105000000   100%  runtime.main

serial.go/Run 2:

File: serial
Type: cycles
Time: Jan 27, 2020 at 4:54pm (PST)
Showing nodes accounting for 1105000000, 100% of 1105000000 total
      flat  flat%   sum%        cum   cum%
 200000000 18.10% 18.10%  200000000 18.10%  main.J_expect_18_18
 183000000 16.56% 34.66%  183000000 16.56%  main.I_expect_16_36
 159000000 14.39% 49.05%  159000000 14.39%  main.H_expect_14_546
 142000000 12.85% 61.90%  142000000 12.85%  main.G_expect_12_73
 119000000 10.77% 72.67%  119000000 10.77%  main.F_expect_10_91
 100000000  9.05% 81.72%  100000000  9.05%  main.E_expect_9_09
  82000000  7.42% 89.14%   82000000  7.42%  main.D_expect_7_27
  61000000  5.52% 94.66%   61000000  5.52%  main.C_expect_5_46
  40000000  3.62% 98.28%   40000000  3.62%  main.B_expect_3_64
  18000000  1.63% 99.91%   18000000  1.63%  main.A_expect_1_82
   1000000  0.09%   100% 1105000000   100%  main.main
         0     0%   100% 1105000000   100%  runtime.main

Dependence on the number of cores and length of test execution:

The results of the goroutine.go test depend on the number of CPU cores available. On a multi-core CPU, if you set GOMAXPROCS=1, goroutine.go will not show a huge variation, since each goroutine runs for several seconds. However, if you set GOMAXPROCS to a larger value, say 4, you will notice a significant measurement-attribution problem. One reason for this problem is that itimer samples on Linux are not guaranteed to be delivered to the thread whose timer expired.

The results of serial.go can change based on the execution time of each function. By passing -m=<int> to the program, you can make it run longer or shorter. By making it run longer, the profiles can be made more accurate; but when a function runs for less than 100ms, the accuracy is often low.

hyangah commented 4 years ago

Will it be available on non-Linux OSes? If so, that's great. Previously, we've suggested Linux perf (which can use PMU events, right?) as an alternative to runtime/pprof to work around the known limitations of runtime/pprof. The pprof tool can interpret Linux perf output. If this proposal makes it possible to help non-Linux users and to allow extra information such as labels, etc., while not introducing too much complexity to the runtime, that will be super.

@aclements @rsc @dvyukov @rhysh

chabbimilind commented 4 years ago

Hi Hyang-Ah Hana Kim, the design doc, which I will submit very soon, proposes a solution for Linux-based systems. If you have ideas on the Windows equivalent of perf_event_open, I will include the Windows extensions in the design doc. An important use case we (at Uber) have in mind is fetching PMU-based pprof profiles over HTTP, which is not possible via perf. Furthermore, profiling as a first party (the process samples itself, e.g., pprof) vs. as a second party (a different process samples a running service, e.g., perf) requires different security settings, and sampling as a first party has the upper hand. Hence, enabling perf_events as a sampling engine inside the Go runtime is what I am proposing. Wait for more design details to be posted soon.

-Milind

josharian commented 4 years ago

Related: https://github.com/golang/go/issues/21295

networkimprov commented 4 years ago

The pprof docs don't indicate that Linux perf delivers more accurate reports in general; is that the case? From https://golang.org/doc/diagnostics.html#profiling:

What other profilers can I use to profile Go programs?

On Linux, perf tools can be used for profiling Go programs. Perf can profile and unwind cgo/SWIG code and the kernel, so it can be useful to get insights into native/kernel performance bottlenecks. On macOS, the Instruments suite can be used to profile Go programs.

chabbimilind commented 4 years ago

Hi Liam, perf is a command-line wrapper around the underlying perf_event_open system call, which provides access to the CPU's PMUs. It is neither correct nor incorrect to say that perf delivers more accurate reports; rather, PMUs can be configured in a much better way to yield superior profiles. As I have shown above, OS timer interrupts have a low resolution; they are not guaranteed to fire at the same fixed interval (just some time after the specified time), and the timer interrupts can be delivered to any thread, regardless of which thread's timer expired.

chabbimilind commented 4 years ago

Related: #21295

Super useful to make this proposal more desired.

networkimprov commented 4 years ago

What's required to make Linux perf deliver the superior accuracy you propose? Is it hard to configure? Apologies if that's an FAQ...

chabbimilind commented 4 years ago

Hi Liam, Perf can be configured with higher resolution to provide richer profiles. The proposal is orthogonal to simply using perf in the following three aspects:

  1. It will fix the current measurement problems of pprof by making it more accurate and precise.
  2. It will not require using a separate tool like perf.
  3. It will allow first-party profiling (the process profiling itself rather than a separate agent like perf attaching to the process), which allows profile collection over an HTTP port, as sketched below. This latter aspect of profile collection over http/pprof is particularly important for our production services, where access to run the perf tool is restricted.
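For context, the existing net/http/pprof package already serves timer-based CPU profiles over HTTP in exactly this first-party fashion; a PMU-backed sampling engine would be fetched through the same endpoint. A minimal sketch of that existing pattern (the port is arbitrary):

```
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/ handlers on the default mux
)

func main() {
	// With this server running, a CPU profile can be fetched remotely, e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	// A PMU-backed sampling engine would be served through the same endpoint.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```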
gopherbot commented 4 years ago

Change https://golang.org/cl/219508 mentions this issue: proposal: hardware performance counters for CPU profiling.

rsc commented 4 years ago

/cc @aclements

rsc commented 4 years ago

Thanks for the proposal doc (https://go.googlesource.com/proposal/+/refs/changes/08/219508/2/design/36821-perf-counter-pprof.md).

Using hardware counters to do a better job than setitimer seems fairly unobjectionable.

No new API would be an implementation detail and fairly easy to accept (assuming the implementation were not too large).

The hard part is when the new API starts getting added. We don't want to overfit to Linux, nor to x86. How much new API is really needed here? Can the proposal be useful with zero new API?

chabbimilind commented 4 years ago

Thanks, Russ, for your comments. You can get a general sense of the amount of change by looking at this prototype code: https://github.com/uber-dev/go/commits/pmu_pprof. The changes are non-trivial, about 4000 lines of code; most of it comes from defining constants for each CPU architecture and some assembly code for each architecture.

The proposal suggests introducing a handful of new APIs in the runtime/pprof package, the most prominent being StartCPUProfileWithConfig. I imagine your "zero new API" is not referring to this one; if it is, please let me know if you have an idea of how this could be molded into zero new API.

Right now, we add a few PMU configurations that surface as public APIs, e.g., func CPUCycles(io.Writer, uint64) ProfilingOption and func CPUBranchInstructions(io.Writer, uint64) ProfilingOption, which can keep growing based on the demand.

If we don't want to keep this set growing, I have a solution that introduces only one API taking a string argument: func ConfigString(io.Writer, string) ProfilingOption. Here the string can be similar, in spirit, to the format used by Linux perf. For example, cycles:u@1000000 means profile CPU cycles in u=user space at a sampling interval of 1M cycles. Another example is cache-misses:ukppp@10000, which means: use cache-miss events to profile, in both u=user and k=kernel space, use the skid-free precision=3 (three p's) provided by the underlying platform (if any), and sample once every 10000 cache misses. Yet another example is r0423@1000: sample the r=raw event with mask=0x04 and event id=0x23 at a sampling interval of 1000. Let me know your thoughts on going with this one API instead of several.
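To make the shape of that single-API alternative concrete, here is a rough sketch of how such a config string could be decoded. The grammar follows the examples above; the type names, the default interval, and the error handling are illustrative assumptions, not the proposed implementation.

```
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pmuConfig is a hypothetical decoded form of strings such as
// "cycles:u@1000000", "cache-misses:ukppp@10000", or "r0423@1000".
type pmuConfig struct {
	event     string // e.g. "cycles", "cache-misses", or a raw event like "r0423"
	user      bool   // 'u' modifier: count in user space
	kernel    bool   // 'k' modifier: count in kernel space
	precision int    // number of 'p' modifiers: requested skid-freedom level
	interval  uint64 // event count between two consecutive samples
}

func parsePMUConfig(s string) (pmuConfig, error) {
	cfg := pmuConfig{interval: 10000000} // assumed default: one sample per 10M events
	parts := strings.SplitN(s, "@", 2)
	if len(parts) == 2 {
		n, err := strconv.ParseUint(parts[1], 10, 64)
		if err != nil {
			return cfg, fmt.Errorf("bad sampling interval %q: %v", parts[1], err)
		}
		cfg.interval = n
	}
	ev := strings.SplitN(parts[0], ":", 2)
	if ev[0] == "" {
		return cfg, fmt.Errorf("missing event name in %q", s)
	}
	cfg.event = ev[0]
	if len(ev) == 2 {
		for _, m := range ev[1] {
			switch m {
			case 'u':
				cfg.user = true
			case 'k':
				cfg.kernel = true
			case 'p':
				cfg.precision++
			default:
				return cfg, fmt.Errorf("unknown modifier %q in %q", m, s)
			}
		}
	}
	return cfg, nil
}

func main() {
	for _, s := range []string{"cycles:u@1000000", "cache-misses:ukppp@10000", "r0423@1000"} {
		cfg, err := parsePMUConfig(s)
		fmt.Println(s, "->", cfg, err)
	}
}
```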

Let me know which parts of the proposal seem overfitted to Linux or x86. I don't think there is x86 overfitting, because any architecture on which Linux runs will implement the perf_event_open system call. Luckily, implementations specific to Linux are in the Linux-specific files. At the time of this proposal, we do not have another OS with a PMU sampling implementation that we could borrow from to come up with a more canonical form and avoid any overfitting. Again, luckily, almost all of this is runtime-internal and can be modified if the need arises to incorporate another OS.

songlh commented 4 years ago

I have a very quick question. Are the performance numbers from PMU mapped to goroutines? Based on my understanding, numbers from PMU are for different threads, while how goroutines are mapped to threads is managed by the runtime.

chabbimilind commented 4 years ago

PMUs are configured per OS thread; PMUs are themselves unaware of goroutines. Just as with an OS timer sample, the signal handler attributes the sample to the goroutine currently running on the OS thread.
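To make the per-thread scope concrete, here is a rough user-space sketch of opening a CPU-cycles counter for the calling OS thread with perf_event_open via golang.org/x/sys/unix. This only illustrates the system-call interface in counting mode; the prototype configures sampling mode with signal delivery inside the runtime, which is not shown, and running it may require a permissive perf_event_paranoid setting.

```
package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"runtime"
	"unsafe"

	"golang.org/x/sys/unix"
)

func main() {
	// A perf_event fd is scoped to one OS thread (pid=0, cpu=-1 below means
	// "the calling thread, on any CPU"), so pin this goroutine to its thread.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	attr := unix.PerfEventAttr{
		Type:   unix.PERF_TYPE_HARDWARE,
		Config: unix.PERF_COUNT_HW_CPU_CYCLES,
		Size:   uint32(unsafe.Sizeof(unix.PerfEventAttr{})),
		Bits:   unix.PerfBitDisabled | unix.PerfBitExcludeKernel | unix.PerfBitExcludeHv,
	}
	fd, err := unix.PerfEventOpen(&attr, 0, -1, -1, unix.PERF_FLAG_FD_CLOEXEC)
	if err != nil {
		log.Fatalf("perf_event_open: %v (check /proc/sys/kernel/perf_event_paranoid)", err)
	}
	defer unix.Close(fd)

	if err := unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_ENABLE, 0); err != nil {
		log.Fatalf("enable: %v", err)
	}

	// Burn some cycles on this thread; work on other threads is not counted.
	sum := 0
	for i := 0; i < 100000000; i++ {
		sum += i
	}

	if err := unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_DISABLE, 0); err != nil {
		log.Fatalf("disable: %v", err)
	}
	buf := make([]byte, 8)
	if _, err := unix.Read(fd, buf); err != nil {
		log.Fatalf("read: %v", err)
	}
	fmt.Printf("CPU cycles on this thread: %d (sum=%d)\n", binary.LittleEndian.Uint64(buf), sum)
}
```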

songlh commented 4 years ago

@chabbimilind Thanks a lot for the reply! I feel the tool would be very helpful in performance optimization and performance failure diagnosis, since both of them rely on precise time measurement.

ianlancetaylor commented 4 years ago

@chabbimilind To a first approximation, no new API means no new API.

Is this proposal useful with no new API at all?

chabbimilind commented 4 years ago

@ianlancetaylor, the proposal will not work without a new API. That aspect is quite clearly stated in the proposal.

egonelbre commented 4 years ago

The closest thing I was able to find for Windows was https://docs.microsoft.com/en-us/windows/win32/api/_hcp/. But it requires a driver for configuring things. Windows Insider builds also have DTrace.

For OS X, Intel has special drivers https://www.campos.cc/post/using-intel-performance-counter-monitor-on-mac-os-x/.

Both platforms have separate programs to measure counters (xperf and Instruments). But there doesn't seem to be a nice way to invoke them programmatically.

I guess the canonical API is going to be processors themselves.

chabbimilind commented 4 years ago

@egonelbre, Windows and macOS, AFAIK, do not expose system calls for accessing PMUs in a clean way as Linux does. Depending on a library can be fragile because the library may not even exist, and there is also the question of whether we would ever make such cgo calls from inside the Go runtime.

chabbimilind commented 4 years ago

@aclements @dvyukov @rhysh please can you complete reviewing this proposal?

rsc commented 4 years ago

@chabbimilind, I really did mean zero new API, which would mean making the current CPU-based profiles more accurate but not enable profiling other events like mispredicted branches and so on. Once we add those I fear the API just gets arbitrarily large, since there are so many different counters. And they vary by architecture.

We will take some time to read your doc, but we're not working as fast as normal these days. Thanks for your patience.

chabbimilind commented 4 years ago

@rsc, the concern about API growth is understandable. Hence, earlier in this thread, I proposed an alternative API that does not grow: it simply takes a string input, and the string parsing remains internal to the implementation of the single exposed API. This allows all kinds of counter names/values/configs to be passed in as strings without needing any new APIs. Please do consider that alternative.

I cannot see how zero API change is possible. We need to keep the OS interval timer as the default sampling engine so that profiling works across all platforms in all environments. The hardware performance counter-based sampling engine has to be enabled by exposing at least one API.

chabbimilind commented 4 years ago

@ianlancetaylor @rsc @aclements @dvyukov @rhysh, this proposal is out for a while now. Please can we prioritize reviewing this?

chabbimilind commented 4 years ago

@ianlancetaylor @rsc @aclements @dvyukov @rhysh, another reminder to bring closure to this proposal.

networkimprov commented 4 years ago

There's a rather significant backlog of proposals, if you weren't aware.

In the meantime, feel free to post a link to your patchset on golang-dev & elsewhere and invite folks to try it. Also note that a lot of projects (including mine) patch the Go stdlib and/or runtime to make it work for their needs.

rasky commented 4 years ago

@chabbimilind thanks for your proposal and implementation. I'm excited to see this landing.

I don't understand why you think that a first version of the proposal cannot be landed with zero new API. The proposal doesn't explicitly state why (or I couldn't find a place where it does). It seems that the idea behind introducing the new API is to expose the full functionality of PMUs (eg: cache misses). If you drop that from the proposal (and leave it to a second iteration), it should well be possible to use the existing pprof API, and transparently use PMUs only on supported architectures. That would provide a real benefit for many users as-is, as far as I can tell.

In a previous comment, you say:

I cannot see how zero API change is possible. We need to maintain the OS interval timer as the default sampling engine to work across all platforms under all environments. The hardware performance counter-based sampling engine has to enabled by exposing at least one API.

I don't understand this point. I reckon you can keep the default sampling engine as a default for all architectures, and also use PMUs when they are supported (eg: Linux/x86). Is it technically impossible to keep both code paths in the runtime, and select the correct one at either compile-time or run-time? If so, why?

networkimprov commented 4 years ago

You may want to use either method; you can't meaningfully compare results from a PMU site and a non-PMU one.

But you probably could specify PMU config with a runtime environment variable, containing any number of parameters.

rasky commented 4 years ago

If you have a scenario where you absolutely need to be sure that PMU is used (and prefer the process to quit if it's not possible), you can use something like GODEBUG=pprof_backend=pmu, and the runtime could abort the process if it detects that PMU is not supported. The default of that variable would be GODEBUG=pprof_backend=auto, meaning "automatically select the best available backend on my system".

networkimprov commented 4 years ago

Yes, but see https://github.com/golang/go/issues/36821#issuecomment-597813810 re PMU config.

Default of "auto" sounds good, but we probably want an option to use the original method, instead of failing if a PMU isn't there.

chabbimilind commented 4 years ago

@rasky

If you drop that from the proposal (and leave it to a second iteration), it should well be possible to use the existing pprof API, and transparently use PMUs only on supported architectures. That would provide a real benefit for many users as-is, as far as I can tell.

you can use something like GODEBUG=pprof_backend=pmu

I concur with you on transparent PMU CPU-cycles profiling controlled by GODEBUG=pprof_backend=auto|pmu, with cache-miss and other PMU events exposed through the runtime/pprof API in the second iteration.

The detailed proposal and code changes were written with the entire horizon of both iterations in mind. However, if the reviewers' consensus is to do it piecemeal, I will be happy to do so.

rasky commented 4 years ago

@chabbimilind I'm not on the review committee, but both Russ and Ian have stated that the proposal would be easy to accept if it had zero API and not too much code. I think it's pretty clear that if you modify the proposal to reflect zero API, it would be accepted. I'd be glad if you did it.

networkimprov commented 4 years ago

@chabbimilind assuming the reviewers are OK with a GODEBUG=pprof parameter, the proposal should probably offer:

GODEBUG=pprof=nopmu // use existing method
        pprof=auto  // use pmu if available (default)
        pprof_config=... // pmu config settings

What would be the value of a parameter that halts the program if a pmu isn't available?

chabbimilind commented 4 years ago

Since CPU profiling is only a part of pprof, I suggest more explicit naming.

GODEBUG=pprof_cpuprofile=ostimer // use existing method
        pprof_cpuprofile=pmu  // use pmu, panic and fail otherwise
        pprof_cpuprofile=auto  // use pmu if available, ostimer otherwise (default)
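Purely as an illustration of the naming suggested above (this GODEBUG key does not exist; the parsing and the fallback behavior are assumptions), backend selection could be decoded roughly like this:

```
package main

import (
	"fmt"
	"os"
	"strings"
)

// cpuProfileBackend reads the hypothetical pprof_cpuprofile key out of GODEBUG.
// Accepted values, per the suggestion above: "ostimer", "pmu", "auto".
func cpuProfileBackend() string {
	for _, kv := range strings.Split(os.Getenv("GODEBUG"), ",") {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) == 2 && parts[0] == "pprof_cpuprofile" {
			switch parts[1] {
			case "ostimer", "pmu", "auto":
				return parts[1]
			}
			// An unparsable value falls through to the default here; whether it
			// should instead be an error is one of the open questions.
		}
	}
	return "auto" // assumed default: pmu if available, ostimer otherwise
}

func main() {
	fmt.Println("CPU profile backend:", cpuProfileBackend())
}
```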

@networkimprov, what are you thinking about when you suggested pprof_config=?

I would like to bring up another complication related to runtime.SetCPUProfileRate. What should someone expect when they call runtime.SetCPUProfileRate but the program is using PMU profiles? I really never want this API to have any influence on PMU profiling. @rsc seems to be in the process of deprecating this API, which is good.

But that leaves the question: how can someone set the CPU cycles interval (we don't talk in terms of 'rate' for PMU events)? Interval, in PMU jargon, is the number of event counts elapsed between two consecutive sample points. We can choose a default interval such as 10 million cycles, which is effectively 100 samples per second on a 1GHz clock. Are you expecting GODEBUG=pprof_config= to provide a means to offer these knobs? If so, GODEBUG=pprof_config= can grow into the default backdoor for avoiding a new API; one could set GODEBUG=pprof_config=cachemisses@10000 to tell the PMU to use cache-miss events for sampling at a 10K sampling interval. That also means we need to define the behavior of an unparsable config such as GODEBUG=pprof_config=junk.

However, how can a user override the interval/event once the program is already running (e.g., our production services run for several months without restarting)? During a single run, a performance investigation may want different sampling events/intervals depending on the event of interest and the granularity of observation. This use case was a key motivation for introducing an API.

In general, I encourage you all to think carefully about the GODEBUG=pprof environment variable and its consequences for the future, when we'll really want to open an API for PMU profiling, and about how that API will interoperate with the environment variable when we do. However, if Go's standards do not promise backward compatibility for the GODEBUG environment variable, we don't have to worry about the future in the first iteration, where we'll not introduce any new API per the ongoing conversation.

networkimprov commented 4 years ago

Re pprof_config= (maybe pprof_pmu=) you suggested ConfigString() in https://github.com/golang/go/issues/36821#issuecomment-597813810 -- same idea. Let it set whatever you believe is essential for the first iteration of PMU profiles.

If there are eventually stdlib APIs that affect the PMU, I'd expect any GODEBUG inputs to take precedence. But you wouldn't have to use GODEBUG at that point. I do think GODEBUG input should keep working thereafter, for at least two releases. It looks like you should move the APIs to a second proposal, to be considered if this one lands.

Re runtime.SetCPUProfileRate, it could affect the PMU; I don't have an opinion there.

chabbimilind commented 4 years ago

The idea of GODEBUG=pprof_pmu is great. I am happy to hide the API exposure behind this environment variable, along the lines of my earlier comment, now modified to pass the configuration via GODEBUG=pprof_pmu.

Also, honoring GODEBUG for two versions is a viable option.

I will wait for @rsc to weigh in on the deprecation of runtime.SetCPUProfileRate and on PMU profiling disregarding anything coming down from runtime.SetCPUProfileRate.

Regarding the proposal itself, that would be a sizable overhaul. Can I, instead, modify it in the following way, while retaining most of the current content?

  1. Make it explicit in the implementation that it will be a 2-stage iteration.
  2. Stage-1 will be zero API change with PMU configuration(s) being provided in the GODEBUG environment variable as discussed in the previous threads.
  3. Stage-2 will expose the API.

networkimprov commented 4 years ago

cc @ianlancetaylor re proposal modification suggested above.

rhysh commented 4 years ago

Using a file descriptor per M and having the kernel send SIGPROF when it's readable looks clever. In programs that do blocking i/o (on files, for instance), each blocking operation consumes an OS thread. In programs that do a lot of blocking i/o, would this approach to profiling significantly increase the number of file descriptors (ulimit -n) that the program would need?

What does this CPU profiling mechanism measure when a program does only a small amount of work on each of a large number of OS threads? When sampling at 100Hz, I'd expect that 5ms of work sequentially on each of 100 threads (with runtime.LockOSThread) would result in 50 samples.

How do the event counters based on number of CPU cycles respond to CPU frequency scaling (for power management)?

Focusing only on improving the quality of the existing feature set for CPU profiling: How do the results from perf_event_open compare with the per-thread timers available via timer_create (from https://github.com/golang/go/issues/14434#issuecomment-256649735)? How much of the benefit when using perf_event_open is from sampling faster than 100 Hz?

How much of the benefit can this provide with zero new API of any kind, including environment variables?

From part 6.3.5 of the proposal: As an alternative, what possible runtime APIs would enable building most of this as a third-party package?

We explored library-level facilities to collect performance counters such as perf-util [3]. However, no library support will provide good profiling for Go since the runtime can schedule a monitored function onto another thread, where the counters may not be enabled. The correct solution, hence, is to implement the hardware performance counter facility inside the go runtime into the scheduler as we have proposed here.

chabbimilind commented 4 years ago

@rhysh , great questions. I have tried to answer as elaborately as possible.

In programs that do a lot of blocking i/o, would this approach to profiling significantly increase the number of file descriptors (ulimit -n) that the program would need?

Yes, in extreme cases a file descriptor (FD) per OS thread can reach the ulimit -n limit. Failure to obtain an FD from perf_event_open will not kill the program; the profiles of such a thread would simply be missing. If there are too many transient IO threads, the profiles will capture only what is statistically significant. Moreover, an OS thread that is not active on the CPU will not invoke perf_event_open and hence will not consume a file descriptor. To me, this seems like a corner case with mild consequences.

What does this CPU profiling mechanism measure when a program does only a small amount of work on each of a large number of OS threads? When sampling at 100Hz, I'd expect that 5ms of work sequentially on each of 100 threads (with runtime.LockOSThread) would result in 50 samples.

PMU counters are not running when the thread is not active. In your example, with 100 * 5ms = .5s of work, and sampling at 100 samples/sec, we'll get .5 sec * 100 samples/sec = 50 samples as you correctly noted.

How do the event counters based on number of CPU cycles respond to CPU frequency scaling (for power management)?

Almost all PMU counters measure work or events, not time. When using CPU cycles, which is the most commonly used PMU counter, the cycles tick more slowly with a slower clock (e.g., under power management). Note that even the OS timer is not measuring wall-clock time -- it measures only the time when the process is running on the CPU.

Focusing only on improving the quality of the existing feature set for CPU profiling: How do the results from perf_event_open compare with the per-thread timers available via timer_create (from #14434 (comment))?

Using timer_create would eliminate one thread's elapsed time being attributed to another thread, by ensuring that the signal-handling thread is the same one whose timer expired. It would eliminate one source of "skid," but it won't eliminate the skid that comes from OS timer samples having only ~5ms granularity. Without an existing implementation, I do not have the means to quantify the expected improvements from timer_create in Go. Notice that #14434 also uses perf_events to assess the ground truth, which in turn uses the CPU cycles PMU event. The serial code example in the proposal should convince you that a per-thread timer alone will not fix the profile misattribution.

How much of the benefit when using perf_event_open is from sampling faster than 100 Hz?

The answer lies in the classic Nyquist criterion: to capture events, we should sample at least twice as fast as the rate at which they occur. Using PMUs allows us to sample at microsecond granularity, when desired, and not be limited by the OS timer's ~5ms resolution. It is fair to say that most functions take <5ms and hence get misattributed with a 5ms timer. The serial code example in the proposal is intended to exemplify this case.

From part 6.3.5 of the proposal: As an alternative, what possible runtime APIs would enable building most of this as a third-party package?

I considered this marginally but rejected it because I never thought the Go runtime would expose a callback to a third-party library. If the Go runtime is willing to expose callbacks to registered third-party libraries, the PMU sampling logic can live outside the runtime. This kind of callback hook is the very foundation of performance monitoring in other successful systems such as the OpenMP Tools Interface and the MPI Tools Interface (Chapter 14). To answer your question, we need callback hooks at every goroutine schedule point where the current profiler enables or disables the OS timer. The Go runtime should also expose some means of identifying an OS thread, because third-party libraries need to maintain thread-specific information to enable and disable PMUs and to aggregate samples. This direction would be orthogonal to the current proposal but worth giving some thought. Note that there will be a lot of pitfalls on this route: the third-party library may not be able to execute all kinds of code when called from inside the runtime; allocation, GC, and locking issues need to be carefully thought out; call stack unwinding from a third-party library will be tricky or will have to depend on the runtime; and, finally, profile data formatting into the pprof format, etc., will have to be duplicated by third-party libraries.
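To make the alternative concrete, here is a purely hypothetical sketch of the kind of hook interface the runtime would have to expose. None of these names or signatures exist in the Go runtime or are proposed here; they only illustrate the shape of what the paragraph above describes.

```
// Package profilehook is a purely hypothetical sketch of the runtime hooks a
// third-party PMU profiler would need. Nothing here exists today.
package profilehook

// ThreadID identifies an OS thread (an M), so that the profiler can keep
// per-thread state such as a perf_event file descriptor.
type ThreadID int64

// Hooks would be invoked at the points where the runtime currently starts and
// stops the OS interval timer for a thread.
type Hooks interface {
	// ThreadStartProfiling: the thread is about to run Go code; enable its counters.
	ThreadStartProfiling(tid ThreadID)
	// ThreadStopProfiling: the thread is going idle or into a blocking syscall;
	// disable its counters.
	ThreadStopProfiling(tid ThreadID)
	// Sample is delivered from the signal handler with the sampled call stack.
	// It must be async-signal-safe: no allocation, no locks, no cgo.
	Sample(tid ThreadID, stack []uintptr)
}

var registered Hooks

// SetProfilerHooks would hand the hooks to the runtime; here it is only a stub.
func SetProfilerHooks(h Hooks) { registered = h }
```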

rhysh commented 4 years ago

Thanks @chabbimilind

I don't understand the tie-in with the Nyquist criterion and the need for (in the discussion of an alternative) a callback hook at every goroutine schedule point. Are you using the profile results with go tool pprof, or with a different analysis mechanism that considers the order in which samples were observed? For example, I've used the --timestamp flag to perf record so I could find code that was executing sequentially every few minutes while holding particular locks, which would otherwise be lost in the noise of everything else the program was doing. Is that what you're working towards here, or is collecting a larger volume of data (over a longer period of time) good enough?

Failure to secure an FD with perf_event_open will not kill the program, the profiles of such thread would be silently missed.

It's probably good that the profiling code won't tear down the program, but if the program itself runs into the FD limit the outcome will be different. I don't know that it's a big risk, but it could make the release more difficult.

PMU counters are not running when the thread is not active. In your example, with 100 * 5ms = .5s of work, and sampling at 100 samples/sec, we'll get .5 sec * 100 samples/sec = 50 samples as you correctly noted.

Do the counters track the CPU spend on each thread independently? If they're configured to roll over at 10ms worth of CPU time (to get 100Hz sample rate) but each thread only consumes 5ms, it seems like the result would be zero samples reported.

When using CPU cycles, which is the most commonly used PMU counter, the cycles tick slower with a slower clock (e.g., power management).

The current pprof toolset works based on seconds of CPU. Will the new code adjust for that, will it output with a different unit, or will it risk mis-scaled reporting?

The serial code example in the proposal should convince you that per-thread timer alone will not fix the profile misattribution.

I ran the test with the default setting of -m=100000 with today's tip version (074f2d800f). The program ran for about 0.3 seconds, and the 100Hz sample rate resulted in 29 samples in the CPU profile. I got the same level of spotty results you reported in the original issue description. I ran it again with -m=10000000 to do 100 times more work. It took about 30 seconds to run, and the results are much better. The profile captured 2864 samples, and their distribution matches the function names pretty closely (around 1% error).

How does that compare against the results of perf_event_open collecting 29 samples, or 2900 samples? Is the benefit of perf_event_open that it is able to sample more quickly, or that the samples it collects are more fairly distributed?

(Details from go tool pprof -top follow)

```
File: serial.linux_amd64
Type: cpu
Time: Jul 24, 2020 at 12:42pm (PDT)
Duration: 401.37ms, Total samples = 290ms (72.25%)
Showing nodes accounting for 290ms, 100% of 290ms total
      flat  flat%   sum%        cum   cum%
      80ms 27.59% 27.59%       80ms 27.59%  main.E_expect_9_09
      50ms 17.24% 44.83%       50ms 17.24%  main.D_expect_7_27
      40ms 13.79% 58.62%       40ms 13.79%  main.B_expect_3_64
      40ms 13.79% 72.41%       40ms 13.79%  main.F_expect_10_91
      40ms 13.79% 86.21%       40ms 13.79%  main.J_expect_18_18
      20ms  6.90% 93.10%       20ms  6.90%  main.C_expect_5_46
      10ms  3.45% 96.55%       10ms  3.45%  main.A_expect_1_82
      10ms  3.45%   100%       10ms  3.45%  runtime.usleep
         0     0%   100%      280ms 96.55%  main.main
         0     0%   100%      280ms 96.55%  runtime.main
         0     0%   100%       10ms  3.45%  runtime.mstart
         0     0%   100%       10ms  3.45%  runtime.mstart1
         0     0%   100%       10ms  3.45%  runtime.sysmon
```

```
File: serial.linux_amd64
Type: cpu
Time: Jul 24, 2020 at 12:43pm (PDT)
Duration: 28.77s, Total samples = 28640ms (99.54%)
Showing nodes accounting for 28610ms, 99.90% of 28640ms total
Dropped 5 nodes (cum <= 143.20ms)
      flat  flat%   sum%        cum   cum%
    5230ms 18.26% 18.26%     5230ms 18.26%  main.J_expect_18_18
    4640ms 16.20% 34.46%     4640ms 16.20%  main.I_expect_16_36
    4160ms 14.53% 48.99%     4160ms 14.53%  main.H_expect_14_546
    3650ms 12.74% 61.73%     3650ms 12.74%  main.G_expect_12_73
    3140ms 10.96% 72.70%     3140ms 10.96%  main.F_expect_10_91
    2600ms  9.08% 81.77%     2600ms  9.08%  main.E_expect_9_09
    2080ms  7.26% 89.04%     2080ms  7.26%  main.D_expect_7_27
    1570ms  5.48% 94.52%     1570ms  5.48%  main.C_expect_5_46
    1030ms  3.60% 98.11%     1030ms  3.60%  main.B_expect_3_64
     510ms  1.78% 99.90%      510ms  1.78%  main.A_expect_1_82
         0     0% 99.90%    28610ms 99.90%  main.main
         0     0% 99.90%    28610ms 99.90%  runtime.main
```

chabbimilind commented 4 years ago

@rhysh, I have replied to your queries below.

I don't understand the tie-in with the Nyquist criterion and the need for (in the discussion of an alternative) a callback hook at every goroutine schedule point

The reference to Nyquist sampling was to indicate that no sampling engine can correctly capture patterns if the events occur at more than half the rate of observation. We can always write programs whose patterns will not be captured well by any sampling engine.

The reference to runtime callbacks/hooks was in response to your question, "what possible runtime APIs would enable building most of this as a third-party package?" I was saying that keeping the perf_events way of profiling in a third-party tool is similar to other successful performance analysis systems; however, I do not see an easy way to accomplish this in Go. Sorry for the digression, but I was drawing parallels with the richness of performance analysis technology used in successful high-performance systems where understanding performance is a critical factor. In full generality, exposing runtime hooks to a profiling tool is perhaps the right direction to allow a rich set of tools to develop independently from the language runtime; however, this proposal is not the right place to discuss that direction.

Are you using the profile results with go tool pprof, or with a different analysis mechanism that considers the order in which samples were observed?

Yes, I use go tool pprof for PMU profiles (also).

is collecting a larger volume of data (over a longer period of time) good enough?

It depends, case by case. Each application is unique, and developers have their own preferences for how long to monitor and what overhead is acceptable. With highly repetitive execution patterns, coarse-grained sampling suffices, whereas highly varying execution patterns need finer granularity. Using perf_events with PMUs will alleviate (not eliminate) several sources of imprecision and inaccuracy in the current sampling-based profiler in Go (ref. the initial theoretical foundations of a sampling profiler). Empirical evaluations show the need to collect one to two orders of magnitude more samples with the OS timer compared to PMU cycles to reach comparable accuracy.

if the program itself runs into the FD limit the outcome will be different. I don't know that it's a big risk, but it could make the release more difficult.

To quantitatively address this concern, I have created a pathological program, which creates a configurable number of goroutines (controlled via the -g=<int> command-line flag); each goroutine repeatedly opens, reads, and closes a unique file 1000 times (controlled via the -i=<int> flag), reading 100 bytes of data (controlled via the -b=<int> flag) from the file on each iteration.
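For reference, the stress program is along these lines. This is a reconstruction from the description above, with the prototype-only -p PMU flag omitted since it requires the modified runtime; the actual program may differ in details.

```
package main

import (
	"flag"
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"path/filepath"
	"sync"
	"sync/atomic"
)

func main() {
	g := flag.Int("g", 512, "number of goroutines")
	iters := flag.Int("i", 1000, "open/read/close iterations per goroutine")
	b := flag.Int("b", 100, "bytes to read per open")
	flag.Parse()

	dir, err := ioutil.TempDir("", "fdtest")
	if err != nil {
		log.Fatal(err)
	}
	defer os.RemoveAll(dir)

	// One unique file per goroutine, pre-populated with enough bytes to read.
	payload := make([]byte, *b)
	for i := 0; i < *g; i++ {
		if err := ioutil.WriteFile(filepath.Join(dir, fmt.Sprintf("f%d", i)), payload, 0644); err != nil {
			log.Fatal(err)
		}
	}

	var openErrors int64
	var wg sync.WaitGroup
	for i := 0; i < *g; i++ {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			buf := make([]byte, *b)
			for j := 0; j < *iters; j++ {
				f, err := os.Open(name)
				if err != nil {
					atomic.AddInt64(&openErrors, 1) // user-level FD exhaustion shows up here
					continue
				}
				f.Read(buf)
				f.Close()
			}
		}(filepath.Join(dir, fmt.Sprintf("f%d", i)))
	}
	wg.Wait()

	total := int64(*g) * int64(*iters)
	fmt.Printf("file-open errors: %d/%d (%.2f%%)\n", openErrors, total, 100*float64(openErrors)/float64(total))
}
```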

I ran this program with no profiling as the baseline and with PMU cycles profiling (the -p=true flag), and swept the number of goroutines over the range [512, 1024, 2048, 4096, 8192, 16384, 32768, 65536]. The system under test has a 24-core Intel Xeon Gold 6136 CPU @ 3.00GHz (aka Skylake) and runs a Linux 5.3.0 kernel. ulimit -n shows a 1024 FD limit. The Go version used is master from Feb 13, 2020 with the PMU additions.

The table below (numbers averaged over ten runs in each configuration) compares the percentage of failed file opens in the user code with PMU profiling enabled against the baseline with no profiling. In summary, perf_event_open file descriptors have negligible (0.13%) negative consequences even in the worst-case situation.

--------------------------------------------------------------------------------------
| % file-open errors  |                      NUMBER OF GOROUTINES                     |
|                     |   512 |  1024 |  2048 |  4096 |  8192 |   16K |   32K |   65K |
--------------------------------------------------------------------------------------
| Baseline (no PMU)   |  0.00 |  0.00 |  2.90 |  9.37 | 13.67 | 48.96 | 74.92 | 86.94 |
| With PMU            |  0.00 |  0.13 |  1.62 |  2.88 |  6.29 | 46.30 | 72.92 | 86.35 |
--------------------------------------------------------------------------------------

More specifically, at 512 goroutines there are no errors in either case; at 1024 goroutines, PMU profiling has 0.13% file-open errors whereas the baseline has none; and beyond 1024 goroutines, file-open errors grow, as expected, irrespective of whether the PMU is in use. Surprisingly, when failures happen in both settings, the file-open errors are fewer with PMU sampling enabled; this is because, with profiling, the rate at which files are opened is a bit lower, whereas without profiling it is higher (and hence there is more chance of running out of FDs).

If a program is already at the boundary of its file descriptor limit, it would be imprudent on the programmer's part not to have addressed failures to open a file. I am not aware of any official document dictating that Go shall not use a small fraction of file descriptors (or some other limited OS resource) for its internal use, especially when profiling.

As a final note, if it absolutely comes to that, which I highly doubt, we can set an internal limit on how many FDs the profiling system can use and on reaching the limit we can stop creating new FDs until the old perf_event_open FDs are closed.

Do the counters track the CPU spend on each thread independently?

Yes, the counters are per CPU core and virtualized for each OS thread.

If they're configured to roll over at 10ms worth of CPU time (to get 100Hz sample rate) but each thread only consumes 5ms, it seems like the result would be zero samples reported.

Yes, that is correct. Why should that be a problem? This situation has consequences only at thread termination time, and a sampling profiler is not designed to guarantee capturing such one-time events. I would trade a sampling engine (the OS timer) that has a skid of ~5ms on each sample for a sampling engine (the PMU) that misses its last sample.

The current pprof toolset works based on seconds of CPU. Will the new code adjust for that, will it output with a different unit, or will it risk mis-scaled reporting?

It will output a different unit: for example, CPU cycles will output a "cycles" count, cache misses will output a "cache-misses" count, and so on. pprof already handles this cleanly for the heap, mutex, block, cpu, etc. profile types. We should not try to re-scale cycles into seconds and confuse users by measuring one thing and reporting another. Proposal section 5 has several examples that show Type: cycles.

How does that compare against the results of perf_event_open collecting 29 samples, or 2900 samples? Is the benefit of perf_event_open that it is able to sample more quickly, or that the samples it collects are more fairly distributed?

On the serial.go example, using perf_event_open, PMU cycles profiling (this proposal) can produce the following highly accurate attribution with only 55 PMU CPU cycles samples, with the program's default -m 100000 flag.

(pprof) top
Showing nodes accounting for 1100000000, 100% of 1100000000 total
Showing top 10 nodes out of 12
      flat  flat%   sum%        cum   cum%
 200000000 18.18% 18.18%  200000000 18.18%  main.J_expect_18_18
 180000000 16.36% 34.55%  180000000 16.36%  main.I_expect_16_36
 160000000 14.55% 49.09%  160000000 14.55%  main.H_expect_14_546
 140000000 12.73% 61.82%  140000000 12.73%  main.G_expect_12_73
 120000000 10.91% 72.73%  120000000 10.91%  main.F_expect_10_91
 100000000  9.09% 81.82%  100000000  9.09%  main.E_expect_9_09
  80000000  7.27% 89.09%   80000000  7.27%  main.D_expect_7_27
  60000000  5.45% 94.55%   60000000  5.45%  main.C_expect_5_46
  40000000  3.64% 98.18%   40000000  3.64%  main.B_expect_3_64
  20000000  1.82%   100%   20000000  1.82%  main.A_expect_1_82

At an even coarser sampling granularity, with only 36 PMU CPU cycles samples, the profiles become less accurate and very short functions (main.A_expect_1_82, main.B_expect_3_64) don't appear at all (which is correct), as shown below.

(pprof) top
Showing nodes accounting for 1080000000, 100% of 1080000000 total
      flat  flat%   sum%        cum   cum%
 210000000 19.44% 19.44%  210000000 19.44%  main.I_expect_16_36
 180000000 16.67% 36.11%  180000000 16.67%  main.H_expect_14_546
 180000000 16.67% 52.78%  180000000 16.67%  main.J_expect_18_18
 120000000 11.11% 63.89%  120000000 11.11%  main.D_expect_7_27
 120000000 11.11% 75.00%  120000000 11.11%  main.G_expect_12_73
  90000000  8.33% 83.33%   90000000  8.33%  main.C_expect_5_46
  90000000  8.33% 91.67%   90000000  8.33%  main.E_expect_9_09
  90000000  8.33%   100%   90000000  8.33%  main.F_expect_10_91
         0     0%   100% 1080000000   100%  main.main
         0     0%   100% 1080000000   100%  runtime.main

In summary, using perf_events provided precise samples that the OS timer could not; the PMU approach did not require the nearly 3K samples (52x more) that the OS timer needed to reach the same accuracy.

networkimprov commented 4 years ago

each goroutine repeatedly opens, reads, and closes a unique file 1000 times

Perhaps a more likely scenario than unique-file-per-goroutine is all goroutines periodically reading or writing large payloads to a modest number of concurrently-open FDs, e.g. log files or TCP sockets. (Needing a ulimit-sized number of blocked OS threads for this case has independent problems, esp memory consumption.)

If profiling that scenario could have ulimit issues, is there any way for your scheme to share its FDs?

chabbimilind commented 4 years ago

@networkimprov, even in the worst-case scenario the profiler does not have any perceivable impact; should we still empirically assess the less severe case you are suggesting?

networkimprov commented 4 years ago

If I understood your "pathological program" correctly, it doesn't test the case that @rhysh raised, which is:

a) a few FDs opened by the app
b) many goroutines making blocking syscalls, created by the app
c) many OS threads waiting on syscalls, created by the runtime
d) an FD for each OS thread, created by the profiler

chabbimilind commented 4 years ago

@networkimprov, if I understand your scenario correctly, it can only create problems in successfully opening the perf_event file descriptors; it will not affect the application code, because the application's FDs are already open. My understanding of @rhysh's question was whether the application code itself could be impacted by the profiler consuming some file descriptors, which my example addressed. In any case, if you have another example in mind, feel free to share the code and I will run it with the PMU profiler.

networkimprov commented 4 years ago

I'm only trying to help you move this forward; it seems promising.

will not affect the application code because the application FDs are already open

Is it safe to assume that all such apps would never need any other FDs? If not, then my Q stands: is there any way for your scheme to share its FDs?

chabbimilind commented 4 years ago

@networkimprov, it is not possible to share perf_event_open FDs, but we can limit the number of FDs. Is there a way to distinguish whether an OS thread is an IO/syscall thread? If so, we can exclude IO-waiting threads from even being considered for perf_event monitoring. Do IO threads make it to the execute() point? If not, we don't have to worry about IO threads consuming FDs.

On a related topic, it is not uncommon to exclude certain threads from being monitored in other profiling systems -- e.g., MPI progress thread in MPI runtimes, CUDA runtime threads in CPU-GPU codes, to name a few.

rhysh commented 4 years ago

The cost of using an additional file descriptor per M is a risk, but it'll be up to a runtime package owner to decide how important it is. It might be a showstopper, or it might be a non-issue. @aclements, what's your view on this?

Threads waiting on syscalls (including IO) can still spend CPU cycles in the kernel, which means their stacks (at least the user-space portion that's visible to us) should be counted. The number of threads waiting on syscalls can easily outnumber the file descriptors a program uses. I've included a demo of this, centered around os.NewFile(os.Stdin.Fd(), "/dev/stdin") to allow multiple goroutines to wait for data in the kernel. (It's contrived, but there are certainly other ways to end up with lots of threads without spending many fds.) The demo also runs an http server; requests to that server should succeed even when profiling is active, even when accepting a new connection would result in creating a new fd. It's a demo to help show how the proposal can affect programs in a way it didn't before; it's not proof of how important that case is.
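A rough reconstruction of that kind of demo follows. This is not the actual program; the goroutine count, the listen address, and the use of net/http/pprof are assumptions, and the blocking behavior assumes stdin is a regular blocking fd (terminal or pipe).

```
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // serve /debug/pprof/ so profiles can be fetched over HTTP
	"os"
)

func main() {
	// Many goroutines block in read(2) on the same blocking fd (stdin).
	// Each blocked read occupies an OS thread in the kernel while using no
	// additional user-level fds, so a per-thread perf_event fd would scale
	// with the thread count rather than with the program's own fd usage.
	for i := 0; i < 1000; i++ {
		go func() {
			f := os.NewFile(os.Stdin.Fd(), "/dev/stdin")
			buf := make([]byte, 1)
			for {
				if _, err := f.Read(buf); err != nil {
					return
				}
			}
		}()
	}

	// Requests here (including new connections, each needing a fresh fd)
	// should keep succeeding while a CPU profile is being collected via
	// /debug/pprof/profile.
	log.Fatal(http.ListenAndServe("localhost:8080", nil))
}
```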

The current profiler also seems to under-sample work done on short-lived threads. I opened a separate issue for that, #40435. The main problem is that threads which do only half a sample's worth of work have that rounded down to zero.

How does the perf_event_open profiler deal with threads that aren't created by the Go runtime? I don't know much about this case, but my understanding is that if a program forks/clones a new thread and keeps the Go-installed signal handler in place (?), then CPU time spent on that thread would be reported in the profile that the runtime/pprof package writes. See for example https://github.com/golang/go/issues/35057#issuecomment-546546958 .

The reference to Nyquist sampling was to indicate that no sampling engine can capture patterns correctly if the events in the system occur at more than half the rate of observation. We can always write programs whose patterns will not be captured well with any sampling engine.

What patterns would the profiles capture? Most tools for working with CPU profiles use total counts, so any time-based patterns would be lost. I agree that Go's built-in profiler should report main.E_expect_9_09 as taking close to 9.09% of the program's CPU time. I don't think it's fair to expect the built-in profiler to show that calls to main.E... follow calls to main.D... and are followed by calls to main.F.... A very high-frequency profiler would be able to show that (if it meets the Nyquist criterion and samples faster than a call to main.D...), but that seems like a significantly different product than what runtime/pprof is today.

I agree that the current profiler (on Linux at least) needs to improve. Comparing 100 profiles from the serial.go test case, I see that for every function, there are a significant number of profiles that report zero samples for it. Each run of serial.go executes the functions in order 100 times, which should be enough to give each function a fair shot at catching a SIGPROF. I'd expect the number of samples of a function in a profile to be roughly a Poisson distribution.

Here are the results for the "E" function, which got 241 out of 2782 samples in total across 100 runs of the program. I'd expect the mean to be 2782*(5/55)/100 = 2.529 samples and, for a Poisson distribution, the variance to be the same, 2.529. Instead, the mean is 2.41 (ok) and the variance is 7.60 (big!). The distribution looks somewhat bimodal. Weird! The large variance seems like a bug, and might be the core of the need for "better" profiling.

$ for i in {1..100}; do ./serial.linux_amd64 && mv serial_prof serial_prof.for.$i ; done
[snip]

$ for i in {1..100} ; do (go tool pprof -focus E_expect -raw /tmp/serial_prof.for.$i 2> >(grep -q 'no samples' && echo -n -e 'Samples\n0\nLocations\n')) | sed -n -e '/Samples/,/Locations/ p' | awk '{if ($1~/[0-9]/) {s+=$1}} END {print s}' ; done | dist
N 100  sum 241  mean 2.41  std dev 2.75643  variance 7.59788

     min 0
   1%ile 0
   5%ile 0
  25%ile 0
  median 1
  75%ile 4
  95%ile 8
  99%ile 9
     max 9

⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠖⠒⠒⠒⢦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.301
⠀⠀⠀⠀⠀⠀⢀⡴⠋⠀⠀⠀⠀⠀⠀⠀⠈⠲⣀⣀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠴⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠙⠒⠒⠒⠒⠒⠋⠉⠉⠉⠉⠉⠁⠉⠉⠉⠉⠋⠉⠉⠉⠉⠓⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠦⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠁
           0                           5                         10
$ for i in {1..100} ; do (go tool pprof -focus E_expect -raw /tmp/serial_prof.for.$i 2> >(grep -q 'no samples' && echo -n -e 'Samples\n0\nLocations\n')) | sed -n -e '/Samples/,/Locations/ p' | awk '{if ($1~/[0-9]/) {s+=$1}} END {print s}' ; done | sort -n | uniq -c
  41 0
  13 1
   7 2
   4 3
  11 4
   6 5
   7 6
   4 7
   5 8
   2 9
many-threads demo code:

```
package main

import (
	"flag"
	"io"
	"log"
	"net/http"
	"os"
	"time"

	// Install some pages on the default mux
	_ "expvar"
	_ "net/http/pprof"
)

func main() {
	readers := flag.Int("readers", 10, "Number of file-reading goroutines")
	addr := flag.String("http", "localhost:8080", "Address to bind for debug server")
	flag.Parse()

	go func() {
		err := http.ListenAndServe(*addr, http.DefaultServeMux)
		if err != nil {
			log.Fatalf("ListenAndServe; err = %v", err)
		}
	}()

	go func() {
		for i := 0; i < *readers; i++ {
			go func() {
				stdin := os.NewFile(os.Stdin.Fd(), "/dev/stdin")
				b := make([]byte, 1)
				_, err := stdin.Read(b)
				if err != nil {
					log.Fatalf("Read; err = %v", err)
				}
			}()
		}
	}()

	task, err := os.Open("/proc/self/task/")
	if err != nil {
		log.Fatalf("Open; err = %v", err)
	}
	defer task.Close()

	fd, err := os.Open("/proc/self/fd/")
	if err != nil {
		log.Fatalf("Open; err = %v", err)
	}
	defer fd.Close()

	for i := 0; ; i++ {
		if i > 0 {
			time.Sleep(1000 * time.Millisecond)
		}

		tasks, err := task.Readdirnames(0)
		if err != nil {
			log.Fatalf("Readdirnames; err = %v", err)
		}
		fds, err := fd.Readdirnames(0)
		if err != nil {
			log.Fatalf("Readdirnames; err = %v", err)
		}
		log.Printf("task=%d fd=%d", len(tasks), len(fds))

		_, err = task.Seek(0, io.SeekStart)
		if err != nil {
			log.Fatalf("Seek; err = %v", err)
		}
		_, err = fd.Seek(0, io.SeekStart)
		if err != nil {
			log.Fatalf("Seek; err = %v", err)
		}
	}
}
```
chabbimilind commented 4 years ago

@rhysh , I ran your code example with the PMU profiler and I have made two observations.

First, no new perf_event FDs are created for threads that are already blocked on a syscall when profiling starts. Since such a thread is blocked, we cannot start its profiler.

$ go run manyThreads.go -readers=100
2020/07/27 23:30:19 task=7 fd=9
2020/07/27 23:30:20 task=105 fd=9
2020/07/27 23:30:21 task=105 fd=9
2020/07/27 23:30:22 task=105 fd=9
2020/07/27 23:30:23 task=105 fd=9
2020/07/27 23:30:24 task=105 fd=9
2020/07/27 23:30:25 task=106 fd=10 <=== PMU profiler started here via wget host:8080/debug/pprof/profile?seconds=5\&eventName=cycles\&eventPeriod=100000  (one FD for profile file)
2020/07/27 23:30:26 task=106 fd=10
2020/07/27 23:30:27 task=106 fd=10
2020/07/27 23:30:28 task=106 fd=10
2020/07/27 23:30:29 task=106 fd=10 <=== perf_event PMU profiler finished 
2020/07/27 23:30:30 task=111 fd=9 
2020/07/27 23:30:31 task=111 fd=9
2020/07/27 23:30:32 task=111 fd=9
2020/07/27 23:30:33 task=111 fd=9

However, if I change your code and inject pprof.StartCPUProfileWithConfig(pprof.CPUCycles(file, period)) into the program before the loop that creates the reader goroutines (the ones blocking in reads on os.NewFile(os.Stdin.Fd(), "/dev/stdin")), each OS thread gets a perf_event FD, as shown below (a sketch of the injected lines follows the log).

go run manyThreads.go -readers=100
2020/07/27 23:28:07 task=6 fd=13
2020/07/27 23:28:08 task=106 fd=112 <== one FD per OS thread.
2020/07/27 23:28:09 task=106 fd=113
2020/07/27 23:28:10 task=106 fd=113
2020/07/27 23:28:11 task=106 fd=113
2020/07/27 23:28:12 task=106 fd=113
2020/07/27 23:28:13 task=106 fd=113
2020/07/27 23:28:14 task=106 fd=113
2020/07/27 23:28:15 task=106 fd=113
2020/07/27 23:28:16 task=106 fd=113
2020/07/27 23:28:17 task=106 fd=113
2020/07/27 23:28:18 task=106 fd=113
2020/07/27 23:28:19 task=106 fd=113
2020/07/27 23:28:20 task=106 fd=113
2020/07/27 23:28:21 task=106 fd=113
2020/07/27 23:28:22 task=106 fd=113
2020/07/27 23:28:23 task=106 fd=113
2020/07/27 23:28:24 task=106 fd=113
2020/07/27 23:28:25 task=106 fd=113
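
For clarity, the injected lines looked roughly like the following; StartCPUProfileWithConfig and CPUCycles are the prototype API named above, and the file name, period, and stop call are illustrative:

```go
// Injected near the top of main() in the demo, before the reader goroutines
// are created. Profile file name and sampling period are placeholders.
f, err := os.Create("manyThreads_prof")
if err != nil {
	log.Fatalf("Create; err = %v", err)
}
if err := pprof.StartCPUProfileWithConfig(pprof.CPUCycles(f, 100000)); err != nil {
	log.Fatalf("StartCPUProfileWithConfig; err = %v", err)
}
defer pprof.StopCPUProfile()
```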

Finally, if I set -readers=1024 (ulimit -n is 1024) and then try to connect to the HTTP server, the following error occurs:

2020/07/27 23:37:23 http: Accept error: accept tcp 127.0.0.1:8080: accept4: too many open files; retrying in 5ms.

Formally, let N be the maximum number of file descriptors. If PMU profiling starts at time T_s and stops at time T_e, and the sum of the number of already-open file descriptors and the number of active OS threads between T_s and T_e exceeds N, then a) PMU-based profiling will not monitor the OS threads over and above N, and b) the act of profiling will alter the program's behavior if it attempts to create more file descriptors in the T_s - T_e window.

I do expect the net/http/pprof endpoint to be the preferred way of collecting profiles.

rhysh commented 4 years ago

Thanks for confirming, @chabbimilind . If an M is busy doing a syscall when profiling starts, will it start profiling itself when it returns, or will work on that M not be counted? Is there a way for a thread to enable/disable profiling on another thread, or does each thread need to do all of its own work? (It looks like some control is possible via ioctl and, in a pinch, prctl.) Compare with timer_create plus CLOCK_THREAD_CPUTIME_ID, where the initial setup must take place on the thread, but the timer can be enabled/disabled from any thread in the process (iirc).

I wonder also if the runtime could use perf_event_open with the inherit bit set when the process starts up (and has only one thread) to get a view into all of the process's threads. For one, it could use PERF_RECORD_FORK events to build a list of all threads, including those not created by the Go runtime. For another, maybe that enables all cpu usage data to be written to one data stream (one FD), and for the reader of that data stream to either do the profiling itself (including looking up the goroutine labels that were active at the time .. might be very tricky) or to dispatch a SIGPROF to the responsible thread for it to re-profile itself (might introduce unacceptable processing delays for each sample, even when done within a signal handler).

That might cost a file descriptor for the lifetime of the process, and might cost an additional thread whenever profiling is active. The benefit would be that it would observe all threads, and would not need O(M) or O(NumCPU) FDs. Does that show any promise? Can all threads on all CPUs in a process report to a single FD, and does it work well?

chabbimilind commented 4 years ago

@rhysh , first, I greatly appreciate your feedback and help in refining this work. I have replied below.

If an M is busy doing a syscall when profiling starts, will it start profiling itself when it returns, or will work on that M not be counted?

If a blocked thread gets unblocked and reaches the runtime.execute() function, and profiling was enabled in the intervening period, the thread will be configured with perf_event_open and will consume an FD. By the same logic, if profiling was disabled between the time the thread got blocked and the time it reaches execute(), we'll release its FD.
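
To make that lifecycle concrete, here is an illustrative pseudocode sketch; none of these names (onThreadResume, openPerfEventForThread, mThread) exist in the runtime, and the real hook would live in the scheduler's execute path:

```go
package main

import "sync/atomic"

// Illustrative stand-in for the runtime's per-thread (M) state.
type mThread struct {
	tid    int
	perfFD int // -1 when no perf_event is open for this thread
}

var profilingOn atomic.Bool

// onThreadResume models what would happen when a thread returns from a
// blocking syscall and is about to run Go code again.
func onThreadResume(m *mThread) {
	switch {
	case profilingOn.Load() && m.perfFD < 0:
		m.perfFD = openPerfEventForThread(m.tid) // consume one FD
	case !profilingOn.Load() && m.perfFD >= 0:
		closePerfEvent(m.perfFD) // release the FD once profiling has stopped
		m.perfFD = -1
	}
}

// Stubs standing in for the real perf_event_open / close plumbing.
func openPerfEventForThread(tid int) int { return -1 }
func closePerfEvent(fd int)              {}

func main() {}
```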

Is there a way for a thread to enable/disable profiling on another thread, or does each thread need to do all of its own work?

It is possible to start counters on another thread: first, set the target thread's id in the pid argument of perf_event_open; the event can then be enabled or disabled later using ioctl.
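
For illustration, a minimal sketch of that mechanism using golang.org/x/sys/unix; the event, period, and error handling are placeholders, not the proposal's actual implementation:

```go
// Linux-only sketch: open a sampling CPU-cycles event on a specific thread
// and drive it from elsewhere with ioctl. Values are illustrative.
package main

import (
	"log"
	"unsafe"

	"golang.org/x/sys/unix"
)

// openCyclesCounterFor opens a cycles event on the thread identified by tid.
// The event starts disabled; any thread holding the fd can enable/disable it.
func openCyclesCounterFor(tid int, period uint64) (int, error) {
	attr := unix.PerfEventAttr{
		Type:   unix.PERF_TYPE_HARDWARE,
		Config: unix.PERF_COUNT_HW_CPU_CYCLES,
		Sample: period, // sample_period: one sample every `period` cycles
		// Create disabled and measure user space only.
		Bits: unix.PerfBitDisabled | unix.PerfBitExcludeKernel,
	}
	attr.Size = uint32(unsafe.Sizeof(attr))
	// pid = tid, cpu = -1: follow that thread on whichever CPU it runs.
	return unix.PerfEventOpen(&attr, tid, -1, -1, unix.PERF_FLAG_FD_CLOEXEC)
}

func main() {
	fd, err := openCyclesCounterFor(unix.Gettid(), 1_000_000)
	if err != nil {
		log.Fatalf("perf_event_open: %v", err)
	}
	defer unix.Close(fd)

	// These ioctls can be issued from a different thread than the one monitored.
	if err := unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_ENABLE, 0); err != nil {
		log.Fatalf("enable: %v", err)
	}
	// ... profiled work runs on the target thread ...
	if err := unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_DISABLE, 0); err != nil {
		log.Fatalf("disable: %v", err)
	}
}
```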

I wonder also if the runtime could use perf_event_open with the inherit bit set when the process starts up (and has only one thread) to get a view into all of the process's threads.

It is possible; that is exactly what perf record does: it sets the inherit bit. However, it does not solve the problem. The inherit bit is meaningful only for threads created after the call to perf_event_open. For the threads that already exist in the monitored process, we need to call perf_event_open on each one of them; this is what perf record does in attach mode.

We can go with the option of calling perf_event_open during Go runtime initialization (which you mention below) and burn one FD, but it is not general enough, because cycles is not the only counter we want to measure going forward. I am thinking ahead to the next revision of the proposal, where we want to give developers the ability to choose the event to measure, which can be cache misses, etc.

For one, it could use PERF_RECORD_FORK events to build a list of all threads, including those not created by the Go runtime.

Nice idea, although it is not necessary: when profiling is switched on, we can look up /proc/self/task to get the list of TIDs.
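
A small illustrative sketch of that enumeration (simplified and user-level; in the proposal the per-thread perf_event setup happens inside the runtime):

```go
// Sketch: enumerate the kernel thread ids of all current threads in this
// process by reading /proc/self/task. Threads created after this snapshot
// would still need to be picked up separately.
package main

import (
	"fmt"
	"log"
	"os"
	"strconv"
)

func listTIDs() ([]int, error) {
	entries, err := os.ReadDir("/proc/self/task")
	if err != nil {
		return nil, err
	}
	tids := make([]int, 0, len(entries))
	for _, e := range entries {
		tid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a numeric tid; skip
		}
		tids = append(tids, tid)
	}
	return tids, nil
}

func main() {
	tids, err := listTIDs()
	if err != nil {
		log.Fatal(err)
	}
	// The PMU profiler would call perf_event_open once per tid here
	// (see the earlier sketch); this demo just prints them.
	fmt.Println(tids)
}
```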

or to dispatch a SIGPROF to the responsible thread for it to re-profile itself (might introduce unacceptable processing delays for each sample, even when done within a signal handler).

This would defeat the purpose of an accurate profiler. The dispatch would introduce enough skid (the thread in question would not have been paused yet) that the samples become less and less reliable. Signal dispatching also opens the Pandora’s box of one thread stopping another thread, unwinding another thread's call stack, stopping the world, etc.

That might cost a file descriptor for the lifetime of the process, and might cost an additional thread whenever profiling is active. The benefit would be that it would observe all threads, and would not need O(M) or O(NumCPU) FDs. Does that show any promise? Can all threads on all CPUs in a process report to a single FD, and does it work well?

I answered this question before. It works fine if a) profiling is ON from the program start and b) only one event type is monitored.

We should also consider the case where Go code is dynamically linked with another language, say C (like here), and the C code starts several threads before letting the Go runtime start its single initial profiler thread; those already-created threads will not be monitored.

PS:

FWIW, I am not trying to shoot down these ideas, just enumerating the possible issues with the alternatives.

I do want to let you know that in the single-profiler-thread approach, we don’t have to dispatch SIGPROF to other threads when perf_event_open is called with the inherit flag. We can get the call stack of the eventing thread from the kernel itself (via the PERF_SAMPLE_CALLCHAIN flag) through a ring buffer, which can also include the kernel stack if desired and security permissions allow. In my proposal and implementation, I did not use the kernel-provided backtrace because I wanted to dovetail as much as possible with the existing profiling system.
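
For concreteness, here is a minimal sketch of requesting kernel-collected call chains via PERF_SAMPLE_CALLCHAIN and a mmap'd ring buffer using golang.org/x/sys/unix; the values are illustrative, this is not the path my implementation takes, and record parsing is omitted:

```go
// Linux-only sketch: call chains delivered through the perf_event mmap ring
// buffer instead of a signal handler.
package main

import (
	"log"
	"os"
	"unsafe"

	"golang.org/x/sys/unix"
)

func main() {
	attr := unix.PerfEventAttr{
		Type:        unix.PERF_TYPE_HARDWARE,
		Config:      unix.PERF_COUNT_HW_CPU_CYCLES,
		Sample:      1_000_000, // sample_period
		Sample_type: unix.PERF_SAMPLE_TID | unix.PERF_SAMPLE_CALLCHAIN,
		Bits:        unix.PerfBitDisabled | unix.PerfBitExcludeKernel,
	}
	attr.Size = uint32(unsafe.Sizeof(attr))

	// pid = 0, cpu = -1: sample the calling thread wherever it runs. Combining
	// this with the inherit bit requires per-CPU events and buffers, as perf
	// record does, because an inherited event cannot be mmap'd with cpu = -1.
	fd, err := unix.PerfEventOpen(&attr, 0, -1, -1, unix.PERF_FLAG_FD_CLOEXEC)
	if err != nil {
		log.Fatalf("perf_event_open: %v", err)
	}
	defer unix.Close(fd)

	// Ring buffer: one metadata page plus a power-of-two number of data pages.
	pg := os.Getpagesize()
	buf, err := unix.Mmap(fd, 0, (1+8)*pg, unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
	if err != nil {
		log.Fatalf("mmap: %v", err)
	}
	defer unix.Munmap(buf)

	if err := unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_ENABLE, 0); err != nil {
		log.Fatalf("enable: %v", err)
	}
	// Once enabled, PERF_RECORD_SAMPLE records (TID + user call chain) accumulate
	// in buf; attributing them to goroutines is the hard part left out here.
}
```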

Finally, a lot of this design discussion seems to be heading in the direction of overfitting to Linux and may not translate to other systems, if we ever want to bring PMU profiling to other platforms.