dvyukov opened 8 years ago
That could work, and be much faster. My only worry is that at some point we're going to tackle the "inline non-leaf functions" problem and then just the list of PCs from the frame pointer walk won't be enough. We'll need to somehow expand PCs that correspond to call sites that have been inlined into other functions. I'm not sure how that would work without doing everything the generic gentraceback is doing (findfunc, mainly).
There is no need to handle inlining at traceback time. Today, each PC corresponds to a single file/line/function. When we can inline non-leaf functions, each PC corresponds to a list of file/line/function tuples. At traceback time, we only need that PC. At symbolization time, we need to expand that PC into a list of file/line/function tuples. There is already code for handling this in runtime.Frames.Next; we just need to move from FuncForPC to something that can return multiple file/line/function tuples.
The point is, traceback can be fast while still supporting non-leaf inlined functions when we interpret the traceback.
CL https://golang.org/cl/33754 mentions this issue.
CL https://golang.org/cl/33895 mentions this issue.
CL 43150 may also help speed up tracing; that list of hot functions looks familiar from when I was working on that CL.
Change https://golang.org/cl/33809 mentions this issue: runtime: use frame pointers for callers
@danscales
Change https://golang.org/cl/212301 mentions this issue: runtime: use frame pointers to implement runtime.callers()
@dvyukov I took a look at this, and it seems that if we want to get runtime.Callers() semantics fully correct (including proper skip handling based on any inlined frames due to mid-stack inlining), then we will lose much of the benefit of the optimization. See my message on my prototype code: https://go-review.googlesource.com/c/go/+/212301
But then I noticed that the tracer and mprof actually use a separate routine, gcallers. We could possibly optimize only gcallers(). In that case, would it be acceptable to not expand out the logical frames that are not physically there because of mid-stack inlining? Or to do the inline expansion only at CallerFrames()/Next() time, but do the early skip processing based on physical frames?
One really rough set of numbers for your benchmark (take it with a grain of salt): 40% overhead currently [similar to your 55% overhead], 24% overhead if we do the optimization but have to deal with inlined frames, and 6% overhead if we don't have to deal with inlined frames (just follow FP pointers and done).
Hi @danscales,
The most profitable and critical is to optimize tracer, maybe mprof. Yes, expanding later is perfectly fine. And I think we already do this in trace. Inline frames don't have PCs anyway. So even if we expand and apply skip, we can't possibly shrink it back to an array of PCs. I've added a bunch of tests for tracer that check that resulting stacks capture precisely what we want, including exact behavior of skip arguments. These should be useful, you could add more if you see other tricky cases.
> 40% overhead currently [like your 55% overhead]
Either standard unwinding became faster, or everything else became slower :)
> and 6% overhead if we don't have to deal with inlined frames
This sounds great. I would rely on tests that we correctly deal with inlining/skip.
Thanks!
> The most profitable and critical is to optimize tracer, maybe mprof. Yes, expanding later is perfectly fine. And I think we already do this in trace. Inline frames don't have PCs anyway. So even if we expand and apply skip, we can't possibly shrink it back to an array of PCs.
The problem is that the semantics of Callers (and presumably gcallers) is that the pcbuf slice must be filled up to its max (as far as possible) after skipping 'skip' frames, and the skipped frames must be counted including inlined frames. So, if we don't understand the inlining at the time we fill up pcbuf initially, we don't know exactly how many physical frames to skip, and we may not grab enough frames to fill up the max after the skip. (The easier thing would be to grab all physical frame PCs until we do the inlining later, but where do we store them all -- we only have pcbuf.) So, we need a slightly looser definition of skip and of filling in pcbuf for gcallers() if we are going to use the frame pointer optimization.
But I will proceed with optimizing gcallers and doing the inlining interpretation later, and see how things go with your tests.
We don't need these strict skip semantics here. We need to remove the non-interesting frames, how exactly does not matter. So instead of removing N inlined frames, we can remove M non-inlined frames. I would assume that the relevant runtime functions should not be inlined, so doing correct skip should be doable. Alternatively, we could consider changing what's being stripped a bit, if it makes things much simpler and faster on the unwinding side.
I uploaded CL 463835, which is similar to CL 33809, and tested it against BenchmarkPingPongHog.
Before:

```
goos: linux
goarch: amd64
pkg: runtime
cpu: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
                │ baseline.txt │       gentraceback.txt        │
                │    sec/op    │    sec/op      vs base        │
PingPongHog-72    642.8n ± 4%   5616.5n ± 3%  +773.82% (p=0.000 n=10)
```
After:

```
goos: linux
goarch: amd64
pkg: runtime
cpu: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
                │ baseline.txt │           fp.txt            │
                │    sec/op    │   sec/op     vs base        │
PingPongHog-72    642.8n ± 4%   834.8n ± 1%  +29.87% (p=0.000 n=10)
```
I've also tested it against the sweet benchmarks and also got very good results, see gist.
I suspect some changes from CL 212301 would have to be absorbed to make this more robust, but hopefully that could be done without losing performance (I missed this CL when I started this work; otherwise I would probably have used it as my starting point).
Anyway, I'm willing to continue this work towards a production-ready patch. As far as the discussion above is concerned, I'm voting to make runtime/trace unwinding best-effort: the more it can resemble gentraceback, the better, but small deviations are acceptable when they bring significant performance or simplicity benefits.
Some more information is available in this blog post (ignore the first half).
Change https://go.dev/cl/463835 mentions this issue: runtime: frame pointer unwinding for tracer
Change https://go.dev/cl/474916 mentions this issue: runtime: use regular unwinding for trace events from cgo callbacks
Change https://go.dev/cl/475795 mentions this issue: runtime: add a benchmark of FPCallers
@felixge it looks like one of these CLs is causing failures on solaris on https://build.golang.org/
https://build.golang.org/log/3ed4fadb9027af0908e4ca04ea104b69651ad68d
@prattmic tracking this in https://github.com/golang/go/issues/59350 now. Will try to fix ASAP.
Change https://go.dev/cl/481617 mentions this issue: runtime: improve getcallerfp comment
Change https://go.dev/cl/494857 mentions this issue: runtime: rename getcallerfp to getfp
Traceback is the main source of slowdown for the tracer. On net/http.BenchmarkClientServerParallel4:

```
BenchmarkClientServerParallel4-6    200000    10627 ns/op    4482 B/op    57 allocs/op
```

with tracer:

```
BenchmarkClientServerParallel4-6    200000    16444 ns/op    4482 B/op    57 allocs/op
```

That's a +55% slowdown. Top functions of the profile are:

```
 6.09%  http.test  http.test  [.] runtime.pcvalue
 5.88%  http.test  http.test  [.] runtime.gentraceback
 5.41%  http.test  http.test  [.] runtime.readvarint
 4.31%  http.test  http.test  [.] runtime.findfunc
 2.98%  http.test  http.test  [.] runtime.step
 2.12%  http.test  http.test  [.] runtime.mallocgc
```
runtime.callers/gcallers/Callers are not interested in frame/func/sp/args/etc for each frame; they only need PC values. PC values can be obtained using frame pointers, which should be much faster. Note that these calls are always synchronous (they can't happen during a function prologue or in the middle of a goroutine switch), so they should be much simpler to handle.
We should use frame pointers in runtime.callers.
@aclements @ianlancetaylor @hyangah