proposal: runtime/pprof: cross system stack transitions in the heap profiler

Proposal Details

Summary

I propose that the heap profiler cross system stack transitions in tracebacks, to be consistent with the other profilers.

The user-visible changes would be:

- For heap allocations on the system stack, we would see the user stack leading to the allocation
- For the same allocations, we would no longer see the runtime frames (since the heap profiler hides them)

Background

The runtime profilers are inconsistent in how they handle system stack transitions in tracebacks. Given a sequence of calls like this:

main.main                  <--+
main.foo                      +-- User portion
runtime.bar                   |
runtime.systemstack_switch <--+

runtime.systemstack        <--+
runtime.bar.func1             +-- System portion
runtime.interestingEvent      |
runtime.recordEvent        <--+

The profilers report a traceback like so:

- The CPU profiler shows both the system and user portions of the traceback
- The block and mutex profilers show the user portion of the traceback
  - The recently-added runtime lock profiling shows the system and user portions
- The runtime execution tracer shows the user portion of the traceback
- The heap profiler shows only the system portion of the traceback, when the sampled allocation happens on a system stack
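To make the differences concrete, here is a toy model of the list above, applied to the example call sequence. It is only a sketch and does not use any runtime code: the function names (crossing, userOnly, heapToday, heapProposed, trimRuntimeTail, render) are made up for illustration, and trimRuntimeTail is a simplified stand-in for the way the heap profiler hides runtime frames at the end of a traceback.

```go
package main

import (
	"fmt"
	"strings"
)

// The example call sequence from the issue, outermost call first.
var (
	userPortion = []string{
		"main.main", "main.foo", "runtime.bar", "runtime.systemstack_switch",
	}
	systemPortion = []string{
		"runtime.systemstack", "runtime.bar.func1",
		"runtime.interestingEvent", "runtime.recordEvent",
	}
)

// crossing models a traceback that crosses the system stack transition:
// the user portion followed by the system portion.
func crossing(user, system []string) []string {
	out := make([]string, 0, len(user)+len(system))
	out = append(out, user...)
	return append(out, system...)
}

// userOnly models a traceback that stops at the transition.
func userOnly(user, system []string) []string { return user }

// heapToday models the current heap profiler for an allocation made on a
// system stack: only the system portion is reported.
func heapToday(user, system []string) []string { return system }

// trimRuntimeTail drops runtime.* frames from the end of the sequence.
// This is a simplification of how runtime frames are hidden.
func trimRuntimeTail(frames []string) []string {
	end := len(frames)
	for end > 0 && strings.HasPrefix(frames[end-1], "runtime.") {
		end--
	}
	return frames[:end]
}

// heapProposed models the proposal: cross the transition, then hide the
// runtime frames at the end of the traceback.
func heapProposed(user, system []string) []string {
	return trimRuntimeTail(crossing(user, system))
}

// render formats a frame sequence for printing.
func render(frames []string) string { return strings.Join(frames, " -> ") }

func main() {
	fmt.Println("CPU, runtime lock:   ", render(crossing(userPortion, systemPortion)))
	fmt.Println("block, mutex, tracer:", render(userOnly(userPortion, systemPortion)))
	fmt.Println("heap (today):        ", render(heapToday(userPortion, systemPortion)))
	fmt.Println("heap (proposed):     ", render(heapProposed(userPortion, systemPortion)))
}
```

In this model the proposed heap traceback reduces to main.main -> main.foo, which is the user-visible change described in the summary.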

As a rule of thumb, I think we want the entire sequence of calls leading up to the event of interest, possibly excluding implementation details at the end of the sequence. More often than not, the user portion of the traceback is the most informative to a developer.

The heap profiler is the only one that doesn't consistently show the user portion of the stack. We see this in practice, for example, when starting a new goroutine requires allocating a new g. Today we'd see a traceback leading from runtime.systemstack to runtime.malg, but we wouldn't see the user portion of the call stack leading to the go statement. Note that under this proposal we wouldn't see the system stack frames after the go statement, because the heap profiler elides runtime frames from the end of tracebacks. (Source)

This is in part motivated by trying to use frame pointer unwinding for more of the runtime profilers, see https://go.dev/cl/540476. Naive frame pointer unwinding isn't going to know whether or not it's crossing the systemstack transition. Either crossing the transition or capturing only the user portion of the call stack would be much more straightforward to match with frame pointer unwinding than capturing only the system portion.

cc @golang/runtime @prattmic
