grafana / pyroscope

Continuous Profiling Platform. Debug performance issues down to a single line of code
https://grafana.com/oss/pyroscope/
GNU Affero General Public License v3.0
9.86k stars 586 forks source link

ebpf: Dropping samples without a stack distorts flamegraphs #2026

Open luisgerhorst opened 1 year ago

luisgerhorst commented 1 year ago

In noticed that samples without a stack are simply dropped instead of being processed (thereby accounting the CPU time spent to the process):

https://github.com/grafana/phlare/blob/11fcb2415e50839b06a827eee70b5a3ccc6f1834/ebpf/session.go#L173

In my opinion, it would be better to still process these comm-only samples as you otherwise get a quite distorted picture when looking at profiles.

For example, kernel threads frequently perform work asynchronously on behalf of user processes. If those samples are simply dropped (if someone only collects user stacks), a user might end up wondering why the CPU time reported from other tools is much larger than the CPU time reported in Pyroscope. Even if this work can not be easily traced back to the application (unless you just run one application, which is common), it is still useful for context and to prevent confusion.

Specifically, you can not currently

Also, can we assume that the kernel helper functions always give you at least one stack frame if the process is currently in user/kernel space? If this is not the case, the picture is further distorted. Even if you not know the function, it is still useful to know that the CPU time was spent in that process.

To be sure: If samples are dropped here, does this still allow for an accurate calculation of the CPU time information in the flamegraphs? I.e., is the CPU time per sample calculated as count(total_time/total_count) or is the amount of CPU time for a sample calculated as count(1s/sampling_rate)? Because of https://github.com/grafana/agent/blob/e3cbf93f47311a4044b9a190adae23f74eb6d654/component/pyroscope/ebpf/ebpf_linux.go#L160 I assume it's the latter.

Other relevant code locations:

korniltsev commented 1 year ago

Hi. Thanks for feedback. The reason I decided to drop them is because "comm-only" stack traces were taking some horizontal space and I felt it was not very useful to stare into void. I'd rather see more stack traces that I can influence, rather than mysterious emptiness. Do you think configuration option to drop/keep them will be sufficient to solve the issue? I'd rather keep them dropped by default and have an option to not drop them?

luisgerhorst commented 1 year ago

Sure, a configuration option would be totally fine. Thanks!