golang / go


proposal: runtime/pprof: make the CPU profile maximum stack size configurable #56029

Open nsrip-dd opened 2 years ago

nsrip-dd commented 2 years ago

CPU profiles currently have a hard-coded maximum of 64 frames per call stack. However, this limit can be too low, especially when programs use middleware libraries or deep recursion. Call stacks deeper than 64 frames get truncated, which makes profiles difficult to interpret.

I propose making the maximum stack size configurable. Specifically, we can build on the API accepted in #42502 and add the following method to configure the maximum stack size:

```go
// SetMaximumStackSize limits call stacks in the profile to no more than n frames.
// If no limit is set, the default is 64 frames.
func (*CPUProfile) SetMaximumStackSize(n int)
```

(Should n <= 0 mean no limit, or fall back to the default? Or should it be considered invalid and cause a panic, or an error from CPUProfile.Start?)
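
For concreteness, here is a rough sketch of how the method might be used. It assumes the CPUProfile configuration type and Start/Stop methods from the API accepted in #42502; those names, and SetMaximumStackSize itself, don't exist in runtime/pprof today, so treat this as illustrative only.

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Hypothetical: CPUProfile and Start/Stop are the configuration API
	// from #42502; SetMaximumStackSize is the method proposed here.
	var p pprof.CPUProfile
	p.SetMaximumStackSize(256) // record up to 256 frames per sample
	if err := p.Start(f); err != nil {
		log.Fatal(err)
	}
	defer p.Stop()

	// ... workload to profile ...
}
```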

Alternatively, the hard-coded limit could be increased. This could be implemented with no new public API. One reason for making the limit configurable, as opposed to just increasing it, would be to manage the overhead of CPU profiling due to collecting call stacks. Users who want to reduce CPU profiling overhead can keep the limit low. Users who want more detailed profiles can raise the limit.

See also #43669. I've limited this proposal to CPU profiles since this change is easier given the current profile implementations. Ideally the limit for all profile types could be increased, especially for the heap, mutex, and block profiles where the current limit is even lower.

dominikh commented 2 years ago

It'd at least be nice to standardize on a number. CPU profiles use 64 frames, some profiles use 32 frames, and runtime/trace uses 128 frames, which can be particularly jarring because runtime/trace can include CPU profiling samples as events, mixing 64- and 128-frame stacks.

felixge commented 2 years ago

@dominikh +1 on standardizing the default. FWIW runtime.Stack() (aka pprof.Lookup("goroutine").WriteTo(w, 2)) uses a limit of 100 frames.

But as outlined in #43669 this will be a bigger effort for the profiles with public API surface. Our hope for this proposal (I'm working with @nsrip-dd) is to break up the work into smaller pieces, with this proposal being a nice self-contained step forward.

prattmic commented 2 years ago

I also agree that we should at least be consistent in the various interfaces.

Stepping back on the proposal, I'm not sure that we need to add an explicit API vs just significantly increasing or eliminating the limit.

Others that remember the history better can correct me, but I don't think that avoiding poor performance is the primary reason for the fixed 64-frame limit. The internal traceback API, gentraceback, is a single monolithic function that writes all frames to an output slice (see also #54466). CPU profile stack samples are collected in the signal handler, where dynamic memory allocation is tricky. Putting a fixed-size [64]uintptr array on the signal handler stack and passing that to gentraceback is a simple way to avoid complexity and have things 'just work'.
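
As a user-level analogy (this is not the runtime's code, just an illustration of the same pattern): runtime.Callers writes into a caller-provided, fixed-size buffer and silently drops any frames that don't fit, much as the profiler's fixed 64-entry array does.

```go
package main

import (
	"fmt"
	"runtime"
)

// recurse builds a call stack much deeper than the capture buffer below.
func recurse(depth int) int {
	if depth == 0 {
		// Fixed-size buffer, analogous to the signal handler's
		// [64]uintptr array: frames beyond its length are dropped.
		var pcs [64]uintptr
		return runtime.Callers(0, pcs[:])
	}
	return recurse(depth - 1)
}

func main() {
	n := recurse(200) // the stack here is roughly 200 frames deep
	fmt.Printf("captured %d frames; deeper frames were truncated\n", n)
}
```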

If we put in the effort to refactor the traceback code so that we could write the frames to the CPU profile buffer in batches, then we could theoretically support an arbitrary number of frames. I think #54466 would get most of the way there.
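
A rough sketch of what batched collection could look like with an iterator-style unwinder follows; none of these names correspond to real runtime APIs. It only illustrates the idea of flushing frames to the profile buffer in small chunks so that no single large array is needed.

```go
package sketch

// frameSource stands in for an iterator-style unwinder (hypothetical).
type frameSource interface {
	// next fills pcs with up to len(pcs) program counters and reports how
	// many it wrote; 0 means the traceback is finished.
	next(pcs []uintptr) int
}

// profileSink stands in for the CPU profile buffer (hypothetical).
type profileSink interface {
	appendFrames(pcs []uintptr)
}

// collectStack copies an arbitrarily deep stack in fixed-size batches, so
// only a small scratch array is needed at any one time.
func collectStack(src frameSource, sink profileSink) {
	var batch [32]uintptr
	for {
		n := src.next(batch[:])
		if n == 0 {
			return
		}
		sink.appendFrames(batch[:n])
	}
}
```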

Of course, extremely deep stacks can still cause slowdown, as traceback cost is proportional to the number of frames. The question I'd have is whether programs with deep stacks would ever prefer truncation + bounded latency over visibility into all frames? Given that CPU profiling is opt-in, I suspect that the extra visibility would be nearly universally preferred [1].

Thus, I'd propose we don't add an API and simply eliminate the limit (or use a very high bound, like 1024 frames).

[1] If we do keep an API, it seems like it could just be a binary DisableStackTruncation API. I don't think anyone has a specific reason to want to set a specific number. e.g., why would someone pass 128 vs 512?

cc @golang/runtime

nsrip-dd commented 2 years ago

Thanks @prattmic. It makes sense that the current implementation is simple because of the constraint of collecting stacks from a signal handler.

If we put in the effort to refactor the traceback code so that we could write the frames to the CPU profile buffer in batches, then we could theoretically support an arbitrary number of frames. I think #54466 would get most of the way there.

That would be great! Perhaps the current limit could be raised as an incremental improvement until such a change is made?

The question I'd have is whether programs with deep frames would ever prefer to have truncation + bounded latency vs visibility into all frames?

Personally, I'd prefer getting as much information as possible. I suspect that the overhead increase after changing or eliminating the limit would be very small for most applications. That said, increasing the limit would have some performance impact on some applications. Right now CPU profiling has low enough overhead with the default configuration that we've been able to run it all the time in production at Datadog with no problems. It would be great if it stays that way.

Going off your suggestion of a binary API, maybe unlimited/big stack traces could be opt-out rather than opt-in? Maybe through a GODEBUG variable? Or make the limit configurable through a GODEBUG variable (cpuprofilestackdepth=N?), with the default being no limit? That way profiles default to having more information, but there's still the option of retaining the current low, predictable overhead.

rsc commented 2 years ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

rsc commented 2 years ago

Part of the reason for the limit is that the implementation stores that many frames for every sample. We are planning to remove that bad implementation decision, which will make it easier to handle a much larger default limit, like maybe 256 or 1024 frames. It sounds like we should try that higher limit first before deciding if we need more configurability.

nsrip-dd commented 2 years ago

Thanks @rsc. Fortunately, I think you've already addressed that implementation decision with https://go.dev/cl/36712 :)

It makes sense to focus on improving the implementation before worrying about configurability. My overhead concerns are probably outweighed by the usefulness of having fewer truncated stacks. Given that, I'd be fine withdrawing this proposal since the motivating problem (truncated stacks) can be addressed without API changes.

I can send a small CL to change the current limit (maxCPUProfStack), which can be increased a good deal without requiring any other changes to the code. I think that would be a good incremental improvement until the implementation is changed.

gopherbot commented 2 years ago

Change https://go.dev/cl/445375 mentions this issue: runtime: increase CPU profile stack size limit

rsc commented 2 years ago

@nsrip-dd As @prattmic mentioned above, the problem is that the [64]uintptr (now [512]uintptr in your CL) is stack allocated, and that's kind of a lot to zero during the signal handler. (At least we're on the signal stack so there's no worry about overflow.) We may need to find a better way to save more.

Maybe if profiling is enabled we make sure every m has a slice the signal handler can write to?
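
A very rough sketch of that idea, with hypothetical names and outside the real runtime's constraints (the actual code would have to deal with signal safety, m lifecycle, and so on):

```go
package sketch

// m is a stand-in for the runtime's per-OS-thread structure (hypothetical).
type m struct {
	// profStack is preallocated when CPU profiling is enabled, so the
	// signal handler can write frames into it without allocating or
	// zeroing a large array on the signal stack.
	profStack []uintptr
}

// prepareProfStacks allocates the per-m buffers up front, outside any
// signal handler, where allocation is safe.
func prepareProfStacks(ms []*m, depth int) {
	for _, mp := range ms {
		if mp.profStack == nil {
			mp.profStack = make([]uintptr, depth)
		}
	}
}
```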

nsrip-dd commented 2 years ago

I hadn't considered that implication, thank you for pointing it out. I've done some quick benchmarks to see what the latency increase would be of going from a [64]uintptr to a [512]uintptr. On my Intel MacBook, going from 64 to 512 adds roughly 20 nanoseconds of latency to zeroing the array. So making the array bigger does come with a cost. However, call stack unwinding can take several microseconds (more benchmarks). Assuming the relative difference between zeroing the array and unwinding a call stack is consistent across other platforms, I believe the relative latency increase from making the array bigger would be small.
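
The benchmarks referenced above aren't reproduced here, but a comparison of that kind can be sketched with a standard Go micro-benchmark; the numbers will of course vary by platform and compiler version.

```go
package zeroing_test

import "testing"

// Cost of zero-initializing a stack array of 64 vs. 512 uintptrs, which is
// roughly what the signal handler pays per sample for the fixed-size buffer.

//go:noinline
func use64(p *[64]uintptr) {}

//go:noinline
func use512(p *[512]uintptr) {}

func BenchmarkZero64(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var stk [64]uintptr // must be zeroed before use
		use64(&stk)         // keep the compiler from removing the array
	}
}

func BenchmarkZero512(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var stk [512]uintptr
		use512(&stk)
	}
}
```

Run with `go test -bench=Zero` to compare the two sizes.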

That said, I've sketched out an implementation of per-m slices for CPU profile call stacks. It seems pretty simple, but on the other hand I think that approach would trade a small amount of overhead for the cognitive load of determining whether the slice will be properly initialized for the right m before it's needed by the CPU profiler. If that tradeoff is OK, I can submit another CL.

rsc commented 2 years ago

It sounds like we all agree that we should make the pprof handler record many more stack frames by default, one way or another, and that therefore we don't need to make the CPU profile maximum stack size configurable. Since the configuration has disappeared, we can remove this from the proposal process.

rsc commented 2 years ago

Removed from the proposal process. This was determined not to be a “significant change to the language, libraries, or tools” or otherwise of significant importance or interest to the broader Go community. — rsc for the proposal review group

gopherbot commented 1 year ago

Change https://go.dev/cl/458218 mentions this issue: runtime: implement traceback iterator

aktau commented 9 months ago

AFAIK the new traceback generator is in. Can this issue be reconsidered?