Support non interleaving multi-event profiling

grafana / pyroscope-java

pyroscope java integration

Apache License 2.0

82 stars 32 forks source link

Support non interleaving multi-event profiling #107

Closed tomershafir closed 1 year ago

tomershafir commented 1 year ago

Concurrent multi-event collection can interfere each other and skew the results, for example, a bad case, where an app in steady state allocates and computes correspondingly with each other with respect to sampling config, threads performing tlab sampling may be captured by cpu profiler, skewing the flame graph.

A solution may be a fixed wall time window mechanism. Then, you can for example on a cpu-bound app sample cpu 1m then sample tlab 1m then sleep 3m. I may be able to help if you think its useful

korniltsev commented 1 year ago

I dont necessarily agree on skewing but this may be useful for other reasons.

We have a ProfilingScheduler interface and default ContinuousProfilingScheduler implementation. If you only want this scheduler only for your services, I highly recommend implement it on your side(not in the library)

If you want this scheduler to be in the library : There's another community submited scheduler SamplingProfilingScheduler which is disabled by default. It should go there, it preserve capabilities it currently has. Bear in mind it is experimental, we do not commit supporting it, it and may be changed or removed or broken in the future.

tomershafir commented 1 year ago

indeed, it may also be useful to manage multi-event aspects like overhead.

Btw, I think SamplingProfilingScheduler should definitely be supported, I think it is the most useful in real productions.

tomershafir commented 1 year ago

@korniltsev why not merge the 2 schedulers?

korniltsev commented 1 year ago

why not merge the 2 schedulers?

Sampling will make it not continuous profiling but "almost continuous" profiling and it may miss something important. The overhead is small if configured properly. We also have no sampling in our other sdks, java sdk should be aligned with others, we should discuss introducing sampling not for java but for all sdks.
I dont know how to implement it in a way that will satisfy everyone. Like the previous contributor wanted one thing, and now you want a slightly different thing. And next week someone else will come with some other ideas. Thus as I said earlier it is experimental, we do not commit supporting it, it and may be changed or removed or broken in the future.

korniltsev commented 1 year ago

сс @Rperry2174

tomershafir commented 1 year ago

Continuous profiling is inherently sampling based, so even without sleep() you miss. Also, if you try to configure no sleep policy to match sleep based one by distributing the same load over larger time window to manage overhead, the probability to catch steady state paths of first option is lower than the second.
You should decide if a feature is valuable. If so, I think you should do eventual consistency, rather than block a sdk. I also dont see such strict consistency in the oss wild