felixge opened this issue 7 months ago
I like the overall idea, thanks for filing the proposal.
One minor "knee-jerk" reaction I have from looking at the example image is that the display of the `_FreeStack` frame is a bit odd. At first glance, it looks like the `chanrecv1` path is "using" a lot of stack space, when in reality it isn't really related to stack use; it just happens to be where the goroutine is parked, and thus where the frame gets attached.
That said, I can't immediately think of how to display it better. The fundamental problem here seems to be the aggregation by common stack frames complicating things.
Ways I can think of to "fix" that would be:

- Force the tooling to avoid stack aggregation. I.e., every goroutine is displayed on its own with a fake root-level "goroutine N" frame. Under that frame would be two nodes: one with the actual stack frames, and one for `_FreeStack`. I think this would make individual goroutines easier to comprehend, at the expense of making it more difficult to see aggregate effects (e.g., I have 128 duplicate worker goroutines that in aggregate use lots of stack in a certain frame).
- Avoid the per-goroutine `_FreeStack` frame altogether and group extra stack space in the root-level `_FreeStack` frame. This is a loss of data, though it's not totally clear to me how useful that data is. One useful aspect I can imagine would be learning that you have a decent amount of headroom to increase stack size before actually growing the stack.
Thanks for the feedback @prattmic. I've also added this as a potential agenda item for tomorrow's runtime diagnostics sync as it might be easier to hash out some of these ideas in real time.
> One minor "knee-jerk" reaction I have from looking at the example image is that the display of the `_FreeStack` frame is a bit odd. At first glance, it looks like the `chanrecv1` path is "using" a lot of stack space, when in reality it isn't really related to stack use; it just happens to be where the goroutine is parked, and thus where the frame gets attached.
Yeah, I agree. It is confusing.

(But I disagree about this being misleading about reality. The free stack memory truly sits below `gopark` and the self size of `gopark` is correct. But yeah, reality is kinda unintuitive here, I agree with that : p)
> That said, I can't immediately think of how to display it better. The fundamental problem here seems to be the aggregation by common stack frames complicating things.
>
> Ways I can think of to "fix" that would be:
>
> - Force the tooling to avoid stack aggregation. I.e., every goroutine is displayed on its own with a fake root-level "goroutine N" frame. Under that frame would be two nodes: one with the actual stack frames, and one for `_FreeStack`. I think this would make individual goroutines easier to comprehend, at the expense of making it more difficult to see aggregate effects (e.g., I have 128 duplicate worker goroutines that in aggregate use lots of stack in a certain frame).
> - Avoid the per-goroutine `_FreeStack` frame altogether and group extra stack space in the root-level `_FreeStack` frame. This is a loss of data, though it's not totally clear to me how useful that data is. One useful aspect I can imagine would be learning that you have a decent amount of headroom to increase stack size before actually growing the stack.
Good ideas. What about a variant of version 1 where the `_FreeStack` is added below the root frame of the goroutine, but without having one root node for every goroutine? Below is a screenshot of what this would look like:
What do you think?
Hm, I do like that better, but the ordering still feels slightly inverted to me (it seems odd that the free stack is the second frame, when the first frame is using real stack space). But this gives me another idea:
What if we invert the idea of "free stack space" and instead display total allocated stack size? Concretely, each stack ends at a "[total]" frame, whose value is the total allocated stack size. In a flame graph, you'd end up with something that looks like this (pardon the ASCII art):
```
------------------------------------------------
| root                                          |
------------------------------------------------
| [total]                                       |
------------------------------------------------
          | threeThousand      | twoThousand   |
          --------------------------------------
```
The empty space in the bottom left is the total free stack space in the program.
This sounds a bit like my suggestion (2) above because we can't see the free space per goroutine type, but I'm not certain we have to lose this information. In theory, running with `-focus threeThousand` should give us just the `threeThousand` goroutine portion, and I would want that to display a "[total]" entry with the total only from the `threeThousand` goroutines.

I'm not 100% sure this would work out of the box though, given the slightly strange way we are exploding samples. We may need to use unique fake PCs per goroutine (`0x1d0000001`, `0x1d0000002`, etc.), and even then I'm not certain that `-focus` would behave the way we want.
I think if I saw a profile like that, I'd soon want to travel back in time to see which call stack the goroutine had when the runtime determined the stack needed to be grown. But it looks like the current proposal is to tie information to the goroutine's current stack.

I'd expect a CPU profile looking at `morestack` would give some hints, but that seems similar to the work that's been required to hunt down runtime-internal lock contention: you need to catch it in the act, and you need to make assumptions about the recent behavior being representative of the historical behavior.

To be concrete, consider an HTTP server where: each request is handled quickly, some requests result in deep call stacks, and all arrive on often-idle http/1.1 keep-alive connections. It sounds like the current proposal would show a lot of goroutines with `net/http.(*conn).serve` but not `...ServeHTTP` on the stack, some of which have large amounts of `runtime._FreeStack`, but without a clear pointer to which part of the `http.Handler` was responsible for the growth.
> I'm not 100% sure this would work out of the box though given the slightly strange way we are exploding samples.

@prattmic I like this idea, but I don't think `-focus` will do what we need it to do. Even with fake PCs.
The problem is that the "empty space" below a node is created by a stack trace that only goes up to that node. In my example from above, the "empty space" below `a` is created by a stack trace that just contains the frame `a`.

So given the definition of focus below, I think `-focus b` will match the stacks `a;b;c` and `a;b`, but not `a` (which would be equivalent to the `total` node in your example).

> `-focus`: Restricts to samples going through a node matching regexp
Looking at the other pprof options, neither `-show` (only applicable to graph view?) nor `-show_from` would do what we need. In theory we could implement a new `-focus_parents` option that suits our needs. But I'm still unsure if this could work with your fake pc idea, because by default the flame graph is aggregated on a function level, so I'm not sure the individual pc values would be considered for any operations 🤔.
Anyway, I also realized a few other areas where stack memory can hide: `g0` and `gsignal` stacks on `M`s. Tracking this as `runtime._FreeStack` on the root level would be misleading. So we would probably need at least `runtime._StackFree` (for the stack pool) and `runtime._StackSystem` to account for these areas of stack usage separately.
Of course we could also consider just not dealing with any of this, and only show stack memory usage by the frames that are currently on the stack. But this would leave an annoyingly large gap with `/memory/classes/heap/stacks:bytes` unless we decide to split this into two or more metrics. However, the advantage of this is that it would be more similar to the `heap_inuse` profile where we also only show the live memory usage without the free space (dead objects + free heap + unused heap). This would have been good enough for my analysis in #65532.
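For reference, the metric mentioned above can be read via the runtime/metrics package. A minimal sketch (not part of the proposal) of checking it, e.g. to compare against what a stack profile adds up to:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// Read the total amount of memory currently reserved for goroutine
	// stacks; the proposed profile is meant to add up to this value.
	samples := []metrics.Sample{{Name: "/memory/classes/heap/stacks:bytes"}}
	metrics.Read(samples)
	fmt.Printf("stack memory: %d bytes\n", samples[0].Value.Uint64())
}
```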
@rhysh I see your point. I'm implicitly biased towards continuous profiling where I'm assuming the user has access to many stack profiles and the ability to aggregate them. In practice this should allow the user to see stack profiles that include the frames below `morestack` in the CPU profile, thus breaking down their contribution to triggering the growth.
Do you think that's good enough?
I mean, I could also see the value in a dedicated `morestack` profile, but I guess that's a secondary use case for me. The primary use case is analyzing changes in stack memory usage that don't correspond to a change in the number of goroutines.
The CPU profile can show what triggers stack growth in the same way that the `alloc_*` variants of the heap profile can for heap memory. But for heap profiles, the `inuse_*` variants give a more direct answer to which of those are retained.
I've got three kinds of memory profiles in mind:
For heap memory, we get the first from the `alloc_*` views and the second from the `inuse_*` views. The third would come from building a dominator tree from a full heap dump / core file, or similar.

For stack memory, we're able to skip right to the third style (it's not a graph, so calculating the dominators is easy). And a CPU profile focused on `morestack` would give a view on the first. But it seems to me that the second style is the one that would be most helpful for the problem as you've described it, and would be a closer conceptual match to what the other memory profile (the heap profile) provides.
Proposal Details
Summary
I'm proposing to implement a new profile type that allows breaking down goroutine stack space usage by frame.

- The data comes from the existing `goroutine` profile.
- A `runtime._FreeStack` leaf node accounts for allocated but currently unused stack space.
- The resulting profile adds up to `/memory/classes/heap/stacks:bytes`.
Given the above, perhaps this is small and simple enough to skip the official proposal process. But since the design includes a few opinionated choices, it's probably best to have some discussions and consensus upfront.
Motivation
My main motivation for this came from debugging a case of dramatic stack space growth while deploying PGO to production (#65532) which I was able to root cause using a hacky stack space profiler that I implemented in userland (repo).
Additionally, I imagine this profile type will be useful for other scenarios, e.g. high CPU usage in `morestack` (#18138).

Implementation
I'm proposing to implement this profile type by taking each stack trace in the goroutine profile and looking up its frame sizes (❶ shows this for a single stack trace). Then each stack trace is broken into one stack trace per prefix (from the root), and these stack traces are assigned the frame size of their leaf frames as values (❷). This will produce a flame graph where the "self value" of each frame corresponds to its frame size, and its total value corresponds to its frame size plus the frame sizes of its children (❸).
These values are then multiplied by the number of goroutines that were captured for the given stack trace, resulting in the sum of stack space usage.
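To illustrate the explosion step described above, here is a minimal sketch (not the prototype's code; `frameSize` and `record` are hypothetical helpers) of how a single goroutine stack could be turned into per-prefix samples:

```go
// explodeStack turns one goroutine stack (ordered root-first) into one sample
// per prefix. Each sample is valued at the frame size of its leaf frame,
// multiplied by the number of goroutines sharing this stack, so that a frame's
// "self value" in the flame graph equals its frame size.
//
// frameSize and record are assumed helpers, not real runtime/pprof APIs.
func explodeStack(
	stack []uintptr,
	goroutines int64,
	frameSize func(pc uintptr) int64,
	record func(prefix []uintptr, bytes int64),
) {
	for i := range stack {
		prefix := stack[:i+1]
		leaf := prefix[len(prefix)-1]
		record(prefix, frameSize(leaf)*goroutines)
	}
}
```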
Last but not least, a `runtime._FreeStack` leaf node is added to capture the delta between the stack space used by frames and the total size of the stack allocated for the goroutine. Additionally, a root-level `runtime._FreeStack` frame is used to show the amount of memory reserved for goroutine stacks that is currently not in use. These virtual frames are motivated by producing a profile that adds up to `/memory/classes/heap/stacks:bytes`, as well as giving the user the ability to reason about potential `morestack` issues.
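As a rough sketch of that bookkeeping (hypothetical helper, not the prototype's code), the per-goroutine `runtime._FreeStack` value would simply be the allocated stack size minus the frame sizes that are currently in use:

```go
// freeStackBytes computes the value of the per-goroutine runtime._FreeStack
// leaf node: the portion of the allocated stack that is not covered by the
// frame sizes of the frames currently on the stack.
func freeStackBytes(allocatedStack int64, frameSizes []int64) int64 {
	used := int64(0)
	for _, size := range frameSizes {
		used += size
	}
	return allocatedStack - used
}
```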
Prototype
I have uploaded a rough CL for a prototype here: https://go-review.googlesource.com/c/go/+/574795 (200 LoC excluding tests).
Using this prototype we can look at a real stack profile for a program with the following goroutines:
Code Snippet
```go
func launchGoroutinesWithKnownStacks() func() {
	c1 := make(chan struct{})
	c2 := make(chan struct{})
	c3 := make(chan struct{})
	c4 := make(chan struct{})
	go oneThousand(c1)
	go twoThousand(c2)
	go threeThousand(c3)
	go threeThousand(c4)
	<-c1
	<-c2
	<-c3
	<-c4
	// hacky way to ensure all goroutines reach the same <-ch statement
	// TODO(fg) make caller retry in the rare case this could go wrong
	time.Sleep(10 * time.Millisecond)
	return func() {
		c1 <- struct{}{}
		c2 <- struct{}{}
		c3 <- struct{}{}
		c4 <- struct{}{}
	}
}

//go:noinline
func oneThousand(ch chan struct{}) [1000]byte {
	var a [1000]byte
	ch <- struct{}{}
	<-ch
	return a
}

//go:noinline
func twoThousand(ch chan struct{}) [2000]byte {
	var a [2000]byte
	ch <- struct{}{}
	<-ch
	return a
}

//go:noinline
func threeThousand(ch chan struct{}) [3000]byte {
	var a [3000]byte
	ch <- struct{}{}
	<-ch
	return a
}
```

Note: The prototype doesn't implement the proposed root-level `runtime._FreeStack` frame yet.

Performance
I have not measured this yet, but I suspect all of this can be done with negligible impact on the overhead of the goroutine profile.
Next Steps
Please let me know what you think. cc @prattmic @mknyszek @nsrip-dd @rhysh (this was previously discussed in a recent runtime diagnostics sync, see notes).