felixge opened this issue 7 months ago
I like the overall idea, thanks for filing the proposal.
One minor "knee-jerk" reaction I have from looking at the example image is that the display of the `_FreeStack` frame is a bit odd. At first glance, it looks like the `chanrecv1` path is "using" a lot of stack space, when in reality it isn't really related to stack use; it just happens to be where the goroutine is parked, and thus where the frame gets attached.
That said, I can't immediately think of how to display it better. The fundamental problem here seems to be the aggregation by common stack frames complicating things.
Ways I can think of to "fix" that would be:

- Force the tooling to avoid stack aggregation. I.e., every goroutine is displayed on its own with a fake root-level "goroutine N" frame. Under that frame would be two nodes: one with the actual stack frames, and one for `_FreeStack`. I think this would make individual goroutines easier to comprehend, at the expense of making it more difficult to see aggregate effects (e.g., I have 128 duplicate worker goroutines that in aggregate use lots of stack in a certain frame).
- Avoid the per-goroutine `_FreeStack` frame altogether and group extra stack space in the root-level `_FreeStack` frame. This is a loss of data, though it's not totally clear to me how useful that data is. One useful aspect I can imagine would be learning that you have a decent amount of headroom to increase stack size before actually growing the stack.
Thanks for the feedback @prattmic. I've also added this as a potential agenda item for tomorrow's runtime diagnostics sync as it might be easier to hash out some of these ideas in real time.
> One minor "knee-jerk" reaction I have from looking at the example image is that the display of the `_FreeStack` frame is a bit odd. At first glance, it looks like the `chanrecv1` path is "using" a lot of stack space, when in reality it isn't really related to stack use; it just happens to be where the goroutine is parked, and thus where the frame gets attached.
Yeah, I agree. It is confusing.

(But I disagree about this being misleading about reality. The free stack memory truly sits below `gopark` and the self size of `gopark` is correct. But yeah, reality is kinda unintuitive here, I agree with that : p)
> That said, I can't immediately think of how to display it better. The fundamental problem here seems to be the aggregation by common stack frames complicating things.
>
> Ways I can think of to "fix" that would be:
>
> - Force the tooling to avoid stack aggregation. I.e., every goroutine is displayed on its own with a fake root-level "goroutine N" frame. Under that frame would be two nodes: one with the actual stack frames, and one for `_FreeStack`. I think this would make individual goroutines easier to comprehend, at the expense of making it more difficult to see aggregate effects (e.g., I have 128 duplicate worker goroutines that in aggregate use lots of stack in a certain frame).
> - Avoid the per-goroutine `_FreeStack` frame altogether and group extra stack space in the root-level `_FreeStack` frame. This is a loss of data, though it's not totally clear to me how useful that data is. One useful aspect I can imagine would be learning that you have a decent amount of headroom to increase stack size before actually growing the stack.
Good ideas. What about a variant of version 1 where the `_FreeStack` is added below the root frame of the goroutine, but without having one root node for every goroutine? Below is a screenshot of what this would look like:
What do you think?
Hm, I do like that better, but the ordering still feels slightly inverted to me (it seems odd that the free stack is the second frame, when the first frame is using real stack space). But this gives me another idea:
What if we invert the idea of "free stack space" and instead display total allocated stack size? Concretely, each stack ends at a "[total]" frame, whose value is the total allocated stack size. In a flame graph, you'd end up with something that looks like this (pardon the ASCII art):
```
------------------------------------------------
| root                                          |
------------------------------------------------
| [total]                                       |
------------------------------------------------
          | threeThousand      | twoThousand   |
          --------------------------------------
```
The empty space in the bottom left is the total free stack space in the program.
This sounds a bit like my suggestion (2) above because we can't see the free space per goroutine type, but I'm not certain we have to lose this information. In theory, running with `-focus threeThousand` should give us just the `threeThousand` goroutine portion, and I would want that to display a "[total]" entry with the total only from the `threeThousand` goroutines.

I'm not 100% sure this would work out of the box though, given the slightly strange way we are exploding samples. We may need to use unique fake PCs per goroutine (`0x1d0000001`, `0x1d0000002`, etc.), and even then I'm not certain that `-focus` would behave the way we want.
I think if I saw a profile like that, I'd soon want to travel back in time to see which call stack the goroutine had when the runtime determined the stack needed to be grown. But it looks like the current proposal is to tie information to the goroutine's current stack.

I'd expect a CPU profile looking at `morestack` would give some hints, but that seems similar to the work that's been required to hunt down runtime-internal lock contention: you need to catch it in the act, and you need to make assumptions about the recent behavior being representative of the historical behavior.

To be concrete, consider an HTTP server where: each request is handled quickly, some requests result in deep call stacks, and all arrive on often-idle http/1.1 keep-alive connections. It sounds like the current proposal would show a lot of goroutines with `net/http.(*conn).serve` but not `...ServeHTTP` on the stack, some of which have large amounts of `runtime._FreeStack`, but without a clear pointer to which part of the `http.Handler` was responsible for the growth.
> I'm not 100% sure this would work out of the box though given the slightly strange way we are exploding samples.

@prattmic I like this idea, but I don't think `-focus` will do what we need it to do. Even with fake PCs.
The problem is that the "empty space" below a node is created by a stack trace that only goes up to that node. In my example from above, the "empty space" below `a` is created by a stack trace that just contains the frame `a`.

So given the definition of focus below, I think `-focus b` will match the stacks `a;b;c` and `a;b`, but not `a` (which would be equivalent to the `total` node in your example).

> `-focus`: Restricts to samples going through a node matching regexp
Looking at the other pprof options, neither `-show` (only applicable to graph view?) nor `-show_from` would do what we need. In theory we could implement a new `-focus_parents` option that suits our needs. But I'm still unsure if this could work with your fake pc idea, because by default the flame graph is aggregated on a function level, so I'm not sure the individual pc values would be considered for any operations 🤔.
Anyway, I also realized a few other areas where stack memory can hide: `g0` and `gsignal` stacks on `M`s. Tracking this as `runtime._FreeStack` on the root level would be misleading. So we would probably need at least `runtime._StackFree` (for the stack pool) and `runtime._StackSystem` to account for these areas of stack usage separately.
Of course we could also consider just not dealing with any of this, and only show stack memory usage by the frames that are currently on the stack. But this would leave an annoyingly large gap with `/memory/classes/heap/stacks:bytes` unless we decide to split this into two or more metrics. However, the advantage of this is that it would be more similar to the `heap_inuse` profile where we also only show the live memory usage without the free space (dead objects + free heap + unused heap). This would have been good enough for my analysis in #65532.
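For reference, the metric mentioned above can be read via the runtime/metrics package. A minimal sketch (not part of the proposal) of checking it, e.g. to compare against what a stack profile adds up to:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// Read the total amount of memory currently reserved for goroutine
	// stacks; the proposed profile is meant to add up to this value.
	samples := []metrics.Sample{{Name: "/memory/classes/heap/stacks:bytes"}}
	metrics.Read(samples)
	fmt.Printf("stack memory: %d bytes\n", samples[0].Value.Uint64())
}
```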
@rhysh I see your point. I'm implicitly biased towards continuous profiling where I'm assuming the user has access to many stack profiles and the ability to aggregate them. In practice this should allow the user to see stack profiles that include the frames below `morestack` in the CPU profile, thus breaking down their contribution to triggering the growth.
Do you think that's good enough?
I mean, I could also see the value in a dedicated `morestack` profile, but I guess that's a secondary use case for me. The primary use case is analyzing changes in stack memory usage that don't correspond to a change in the number of goroutines.
The CPU profile can show what triggers stack growth in the same way that the `alloc_*` variants of the heap profile can for heap memory. But for heap profiles, the `inuse_*` variants give a more direct answer to which of those are retained.
I've got three kinds of memory profiles in mind:
For heap memory, we get the first from the `alloc_*` views and the second from the `inuse_*` views. The third would come from building a dominator tree from a full heap dump / core file, or similar.

For stack memory, we're able to skip right to the third style (it's not a graph, so calculating the dominators is easy). And a CPU profile focused on `morestack` would give a view on the first. But it seems to me that the second style is the one that would be most helpful for the problem as you've described it, and would be a closer conceptual match to what the other memory profile (the heap profile) provides.
Proposal Details
Summary
I'm proposing to implement a new profile type that allows breaking down goroutine stack space usage by frame.

- The data comes from the existing `goroutine` profile.
- A `runtime._FreeStack` leaf node accounts for allocated but currently unused stack space.
- The resulting profile adds up to `/memory/classes/heap/stacks:bytes`.
Given the above, perhaps this is small and simple enough to skip the official proposal process. But since the design includes a few opinionated choices, it's probably best to have some discussions and consensus upfront.
Motivation
My main motivation for this came from debugging a case of dramatic stack space growth while deploying PGO to production (#65532) which I was able to root cause using a hacky stack space profiler that I implemented in userland (repo).
Additionally, I imagine this profile type will be useful for other scenarios, e.g. high CPU usage in `morestack` (#18138).

Implementation
I'm proposing to implement this profile type by taking each stack trace in the goroutine profile and looking up its frame sizes (❶ shows this for a single stack trace). Then each stack trace is broken into one stack trace per prefix (from the root), and these stack traces are assigned the frame size of their leaf frames as values (❷). This will produce a flame graph where the "self value" of each frame corresponds to its frame size, and its total value corresponds to its frame size plus the frame sizes of its children (❸).
These values are then multiplied by the number of goroutines that were captured for the given stack trace, resulting in the sum of stack space usage.
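To illustrate the explosion step described above, here is a minimal sketch (not the prototype's code; `frameSize` and `record` are hypothetical helpers) of how a single goroutine stack could be turned into per-prefix samples:

```go
// explodeStack turns one goroutine stack (ordered root-first) into one sample
// per prefix. Each sample is valued at the frame size of its leaf frame,
// multiplied by the number of goroutines sharing this stack, so that a frame's
// "self value" in the flame graph equals its frame size.
//
// frameSize and record are assumed helpers, not real runtime/pprof APIs.
func explodeStack(
	stack []uintptr,
	goroutines int64,
	frameSize func(pc uintptr) int64,
	record func(prefix []uintptr, bytes int64),
) {
	for i := range stack {
		prefix := stack[:i+1]
		leaf := prefix[len(prefix)-1]
		record(prefix, frameSize(leaf)*goroutines)
	}
}
```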
Last but not least, a `runtime._FreeStack` leaf node is added to capture the delta between the stack space used by frames and the total size of the stack allocated for the goroutine. Additionally, a root-level `runtime._FreeStack` frame is used to show the amount of memory reserved for goroutine stacks that is currently not in use. These virtual frames are motivated by producing a profile that adds up to `/memory/classes/heap/stacks:bytes`, as well as giving the user the ability to reason about potential `morestack` issues.
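As a rough sketch of that bookkeeping (hypothetical helper, not the prototype's code), the per-goroutine `runtime._FreeStack` value would simply be the allocated stack size minus the frame sizes that are currently in use:

```go
// freeStackBytes computes the value of the per-goroutine runtime._FreeStack
// leaf node: the portion of the allocated stack that is not covered by the
// frame sizes of the frames currently on the stack.
func freeStackBytes(allocatedStack int64, frameSizes []int64) int64 {
	used := int64(0)
	for _, size := range frameSizes {
		used += size
	}
	return allocatedStack - used
}
```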
Prototype
I have uploaded a rough CL for a prototype here: https://go-review.googlesource.com/c/go/+/574795 (200 LoC excluding tests).
Using this prototype we can look at a real stack profile for a program with the following goroutines:
Code Snippet
```go
func launchGoroutinesWithKnownStacks() func() {
	c1 := make(chan struct{})
	c2 := make(chan struct{})
	c3 := make(chan struct{})
	c4 := make(chan struct{})
	go oneThousand(c1)
	go twoThousand(c2)
	go threeThousand(c3)
	go threeThousand(c4)
	<-c1
	<-c2
	<-c3
	<-c4
	// hacky way to ensure all goroutines reach the same <-ch statement
	// TODO(fg) make caller retry in the rare case this could go wrong
	time.Sleep(10 * time.Millisecond)
	return func() {
		c1 <- struct{}{}
		c2 <- struct{}{}
		c3 <- struct{}{}
		c4 <- struct{}{}
	}
}

//go:noinline
func oneThousand(ch chan struct{}) [1000]byte {
	var a [1000]byte
	ch <- struct{}{}
	<-ch
	return a
}

//go:noinline
func twoThousand(ch chan struct{}) [2000]byte {
	var a [2000]byte
	ch <- struct{}{}
	<-ch
	return a
}

//go:noinline
func threeThousand(ch chan struct{}) [3000]byte {
	var a [3000]byte
	ch <- struct{}{}
	<-ch
	return a
}
```

Note: The prototype doesn't implement the proposed root-level `runtime._FreeStack` frame yet.

Performance
I have not measured this yet, but I suspect all of this can be done with negligible impact on the overhead of the goroutine profile.
Next Steps
Please let me know what you think. cc @prattmic @mknyszek @nsrip-dd @rhysh (this was previously discussed in a recent runtime diagnostics sync, see notes).