jhawthorn / vernier

📏 next generation CRuby profiler
https://vernier.prof/
MIT License

[feat] :cpu mode for on-CPU collection? #29

Open dalehamel opened 8 months ago

dalehamel commented 8 months ago

Is it feasible to add a :cpu mode similar to what stackprof supports?

Looking at the code, it appears vernier's approach of simply sleeping between samples would not lend itself to this:

https://github.com/jhawthorn/vernier/blob/fe8e509d0a96376acace6f121c010bb04fe69b32/ext/vernier/vernier.cc#L1253-L1261
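
(For context, the general shape I mean is something like this simplified sketch, not the actual vernier code: sleep a fixed wall-time interval, then take a sample, regardless of how much CPU time was used in between.)

```cpp
// Simplified sketch of a sleep-based wall-clock sampler (NOT vernier's
// actual code): sleep a fixed wall-time interval, then sample, regardless
// of how much CPU time was consumed during that interval.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

static std::atomic<bool> running{true};

static void record_sample() {
    // Stand-in for the real capture step.
    std::puts("sample");
}

static void sampler_loop(std::chrono::microseconds interval) {
    while (running.load()) {
        std::this_thread::sleep_for(interval);  // wall time, not CPU time
        record_sample();
    }
}

int main() {
    std::thread sampler(sampler_loop, std::chrono::microseconds(500));
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
    running.store(false);
    sampler.join();
    return 0;
}
```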

stackprof makes use of the setitimer syscall to have the OS provide this facility for it, and receives SIGPROF only after the requested amount of on-CPU time has actually elapsed.
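
For illustration, the mechanism looks roughly like this (a generic sketch of setitimer/SIGPROF, not stackprof's actual code):

```cpp
// Generic sketch of the setitimer/SIGPROF mechanism: the kernel delivers
// SIGPROF only after the process has consumed the requested amount of CPU
// time, so sleeping or blocking on I/O produces no signals.
#include <csignal>
#include <cstdio>
#include <sys/time.h>

static volatile sig_atomic_t sigprof_count = 0;

static void on_sigprof(int) {
    // Only async-signal-safe work is allowed here; a real profiler would
    // capture the stack or defer to a safe point.
    sigprof_count = sigprof_count + 1;
}

int main() {
    struct sigaction sa = {};
    sa.sa_handler = on_sigprof;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, nullptr);

    // Fire every 10ms of CPU time consumed by the process.
    struct itimerval timer = {};
    timer.it_interval.tv_usec = 10000;
    timer.it_value.tv_usec = 10000;
    setitimer(ITIMER_PROF, &timer, nullptr);

    // Busy work: consumes CPU, so SIGPROF fires repeatedly. If this were
    // a sleep() instead, no signals would be delivered.
    for (volatile long i = 0; i < 300000000L; i++) {}

    std::printf("SIGPROF deliveries: %d\n", (int)sigprof_count);
    return 0;
}
```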

Perhaps this is an intentional decision on the part of vernier to avoid needing to deal with signal safety?

Perhaps an alternative approach could be to approximate this by having vernier wake up, check the amount of on-CPU time since the last interval, and go back to sleep unless it has met the desired threshold? E.g., say we want 10ms of CPU time before taking a sample: we could sleep for 1ms of wall time, accumulate the on-CPU delta, and keep sleeping until the threshold is hit.
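
Roughly something like the following, purely as a hypothetical sketch (it uses the process CPU clock; the 1ms/10ms numbers are just the example above):

```cpp
// Hypothetical sketch of the approximation suggested above (nothing vernier
// does today): wake every 1ms of wall time, accumulate the process CPU-time
// delta, and only take a sample once 10ms of CPU time has passed.
#include <chrono>
#include <cstdio>
#include <thread>
#include <time.h>

static long process_cpu_ns() {
    struct timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return ts.tv_sec * 1000000000L + ts.tv_nsec;
}

int main() {
    // Busy worker so the process actually accumulates CPU time; a purely
    // idle process would never hit the threshold (which is the point).
    std::thread worker([] { for (volatile unsigned long i = 0;; i++) {} });
    worker.detach();

    const long cpu_interval_ns = 10 * 1000 * 1000;  // 10ms on-CPU per sample
    long last_cpu = process_cpu_ns();

    for (int samples = 0; samples < 5;) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        long now_cpu = process_cpu_ns();
        if (now_cpu - last_cpu >= cpu_interval_ns) {
            samples++;  // stand-in for taking an actual stack sample
            std::printf("sample %d (%ldms of CPU since last)\n",
                        samples, (now_cpu - last_cpu) / 1000000);
            last_cpu = now_cpu;
        }
    }
    return 0;  // exiting main also terminates the detached worker
}
```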

jhawthorn commented 8 months ago

What do you want CPU profiling for? There are ways we could measure and incorporate that information, but it would help to know where that information would be more useful than periodic sampling and GVL information.

I only ever used CPU sampling with stackprof because either I wanted to see data from all threads, or I wanted to focus on when the program was active. Vernier records all threads all the time, and shows where the GVL is held (which is usually what I actually wanted out of "CPU" time). So far that has been better for me than what I would have used mode: :cpu for on stackprof. I also feel like cpu mode has been the cause of a lot of confusion for folks: until recently, looking at a stackprof CPU flamegraph on speedscope was nonsense because it still used the realtime timestamps as weights, and the resolution of ITIMER_PROF is really poor in a way that I think has made people misunderstand their programs. I'd like to help avoid making the same mistakes with vernier, but if we find a good way to use CPU utilization I'm open to adding it!

It looks like the Firefox profiler will (when enabled in a configuration setting) record CPU time for when a thread is awake and display it as a marker (the checkbox made it sound like it records it between each sample, but I don't see how that data is used in the profile viewer; more investigation needed). I think it could make a lot of sense for us to include that in our GVL markers ("Thread running", etc.). It will cause some overhead, but probably not too much.
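
Per thread, the measurement could be as simple as the following sketch (hypothetical, nothing implemented; it just reads the thread's CPU clock and takes a delta between samples):

```cpp
// Hypothetical sketch of measuring per-thread CPU time between samples
// (the data that could back a "Thread running" marker): read the thread's
// CPU clock via pthread_getcpuclockid (POSIX; available on Linux) and take
// the delta since the previous sample.
#include <cstdio>
#include <pthread.h>
#include <time.h>

static long thread_cpu_ns(pthread_t tid) {
    clockid_t cid;
    struct timespec ts;
    if (pthread_getcpuclockid(tid, &cid) != 0) return -1;
    if (clock_gettime(cid, &ts) != 0) return -1;
    return ts.tv_sec * 1000000000L + ts.tv_nsec;
}

int main() {
    pthread_t self = pthread_self();
    long before = thread_cpu_ns(self);

    // Work done between two "samples"; only the on-CPU portion shows up in
    // the delta, time spent blocked or asleep does not.
    for (volatile long i = 0; i < 50000000L; i++) {}

    long after = thread_cpu_ns(self);
    std::printf("on-CPU time between samples: %ld us\n",
                (after - before) / 1000);
    return 0;
}
```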

It looks like they also have an experimental feature to record and show CPU utilization as a track in the timeline. However, it requires running a line in the console to enable, so I wouldn't rush to support this until the profiler UI includes it by default.

dalehamel commented 8 months ago

Thanks for your reply @jhawthorn ! Really cool project :)

> What do you want CPU profiling for?

The short answer is that I'd like to see if we can completely replace stackprof with something new, and we have already been using CPU profiling for continuous profiling data. It lets us treat CPU time as a resource (which it is, and one we're paying for in the cloud) and analyze where it is going. In wall-mode profiles, a lot of the profile tends to be just waiting on I/O.

I'll also add that we don't usually use it for one-off profiles. Rather, we tend to aggregate large numbers of CPU profiles across many instances and requests. This helps identify "hot spots" in the code that spend a lot of time on-CPU, and excludes the time spent waiting on I/O. Our current pipeline for this is: we submit stackprof dumps to GCS, have a pub/sub worker convert them into pprof, and then submit them to Grafana Pyroscope.

I'll probably need to write a Gecko-to-pprof converter for the existing timer-based profiles either way, and I already have some experience with the Gecko format from writing a converter that went the other way. The extra thread and marker information can be added as labels on pprof samples, so tools like Pyroscope may still be able to query it and filter the samples.

Outside of our continuous profiling use case, though, I can see why the on-CPU time in a single profile, e.g. for a single request, probably isn't that interesting, especially for mostly short, I/O-bound requests. In a viewer like the Firefox profiler, what you described seems much more useful and detailed.

> Vernier records all threads all the time, and shows where the GVL is held (which is usually what I actually wanted out of "CPU" time). So far that has been better for me than what I would have used mode: :cpu for on stackprof.

Very interesting, I can see why that could be better. If a thread is blocked on I/O, then maybe filtering the profile down to the thread holding the GVL is a better way to get what we're after here. When converting to pprof, perhaps we could just make sure to label which samples hold the GVL, and discard or filter out the samples from the threads that don't...

> until recently, looking at a stackprof CPU flamegraph on speedscope was nonsense because it still used the realtime timestamps as weights

We ran into this too, and it was indeed very disappointing to see that folks (including myself) were reasoning from false premises because of it. I'm not sure whether the flamegraphs stackprof generates itself have the same problem.

On our internal instance of speedscope, we fixed this by weighting by the number of periods instead. It looks like my colleague Manuel is the one who upstreamed the fix in https://github.com/jlfwong/speedscope/pull/425

> and the resolution of ITIMER_PROF is really poor in a way that I think has made people misunderstand their programs

Can you elaborate on this? Is it due to the delayed capturing of the sample in recent versions, or something in the OS implementation itself? Or something else?

My understanding is that if a stack comes up often enough in on-CPU profiling based on SIGPROF, it should still be statistically reasonable to infer that it is a heavy user of CPU. I agree that abstracting from periods to time is a bit iffy, but for the stacks that actually do have a heavy weight, I'd still expect that weighting by samples * period length is approximately correct, if a bit of a lie. On the flip side, for infrequent stacks it is usually wrong to infer "I saw this stack once, it must have used 10ms of CPU time", as it could just be a blip, and we have no way of accounting for what was actually on-CPU for most of that period. Is this not the case?

> It looks like the Firefox profiler will (when enabled in a configuration setting) record CPU time for when a thread is awake and display it as a marker (the checkbox made it sound like it records it between each sample, but I don't see how that data is used in the profile viewer; more investigation needed). I think it could make a lot of sense for us to include that in our GVL markers ("Thread running", etc.). It will cause some overhead, but probably not too much.

This seems cool 👍, I'll need to investigate this more too.