grafana / pyroscope

Continuous Profiling Platform. Debug performance issues down to a single line of code
https://grafana.com/oss/pyroscope/
GNU Affero General Public License v3.0
9.65k stars 576 forks source link

Fix truncated Go CPU profiles #3344

Closed simonswine closed 3 weeks ago

simonswine commented 4 weeks ago

This extends the truncation fixing of heap profiles to also cover CPU profiles.

I am also adding an mimir-querier profile, but that seems to be not fixable, as it is doens't contain the connecting stacks.

image
kolesnikovae commented 4 weeks ago

Found a good sample – CloudWatch exporter. AWS Smithy Go code has always been very "deep" because of the countless middlewares, callbacks, hooks, etc.

image

flamegraph_2024-06-07_1220-to-2024-06-07_1320.pb.gz


RE: connecting / overlapping frames – the easiest way to examine truncated stack traces I found is to switch to the sandwich view, where the "ladder" at the top indicates presence of the issue (this is mimir querier in the screenshot, btw)

image
kolesnikovae commented 4 weeks ago

There's one thing that may require an adjustment: a very conservative recursion check that may cause a situation where part of the profiles are not repaired, which makes the whole idea somewhat ineffective

https://github.com/grafana/pyroscope/blob/4837be642ac3521fa30a9faaf7bc7f4a357f853d/pkg/pprof/fix_go_heap_truncated.go#L171-L174

I suggest that we double it regardless of the profile/sample type to relax the restriction.

If it doesn't help and CPU stack traces are still not repaired, we may need to play with constants – token size, suffix length, overlap size, etc.

simonswine commented 3 weeks ago

RE: connecting / overlapping frames – the easiest way to examine truncated stack traces I found is to switch to the ? sandwich view, where the "ladder" at the top indicates presence of the issue (this is mimir querier in the screenshot, btw)

My strategy is using pprof's new sandwich view, it just takes a bit more than doing it in Pyroscope itself.

There's one thing that may require an adjustment: a very conservative recursion check that may cause a situation where part of the profiles are not repaired, which makes the whole idea somewhat ineffective

https://github.com/grafana/pyroscope/blob/4837be642ac3521fa30a9faaf7bc7f4a357f853d/pkg/pprof/fix_go_heap_truncated.go#L171-L174

The first time I see a difference in cloud-watch-exporter's profiles is if I am using values fairly high 56 or higher.

kolesnikovae commented 3 weeks ago

The first time I see a difference in cloud-watch-exporter's profiles is if I am using values fairly high 56 or higher.

Yeah, this is more like a safety check. I also haven't seen too many examples. I'm suggesting increasing it because I see that the repair does not happen in some cases where it should have worked; and this limit is the only meaningful explanation I can find. Actually, I added this limit after testing it in one of our internal deployments, due to the observed CPU burn