plasma-umass / scalene

Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals
Apache License 2.0

potential enhancement: stack-awareness #496

Open RhysU opened 1 year ago

RhysU commented 1 year ago

@daniel-shields and I were both surprised that Scalene isn't stack-aware as @emeryberger confirms in #33. Daniel has a use case where a multiprocessing application using remote endpoints reports that all the wall-time is spent in synchronization primitives but not what stack is waiting on those primitives. Adding stack awareness would expand Scalene's applicability to use cases like Daniel's.

Please consider:

  1. Expanding https://raw.githubusercontent.com/plasma-umass/scalene/master/docs/images/profiler-comparison.png to include stack-awareness. It's an important dimension when choosing a profiler.
  2. Adding stack-awareness to Scalene

I realize the second item isn't a small ask. I mainly wanted to capture it so that others might find this detail.

sternj commented 1 year ago

So, for a bit of context: this was a tradeoff that we made in the initial development of Scalene. Adding a flamegraph option could alleviate that (I think it's feasible, since we already have access to most of the information while stack walking), but outside of a flamegraph, stacks are hard to represent in the UI we initially wanted to provide. Stack awareness may well be in Scalene's future (and you're both welcome and encouraged to implement it yourself!), but its use cases and ergonomics are both relatively limited.

If you make a PR I'll be happy to work with you to merge it in! If you want to do it, I'd advise starting here, since it is where we process CPU samples.

From a design perspective, I would recommend not storing this information unless a flamegraph is certainly going to be built; extra overhead in a non-flamegraph run is unacceptable.

Implementing a flamegraph for memory samples would be difficult, unwieldy, and useless. The cases where stack context is intimately related to memory consumption are few and far between, and unlike the cases in which stack context matters for CPU sampling, we just haven't seen any of them in our work. As such, if you're putting in a flamegraph option, I'd advise forcing a --cpu-only run.

RhysU commented 1 year ago

Thank you @sternj, that's a lot of nice context.

@daniel-shields, would a --cpu-only --flamegraph-like option have given you the information you wanted for your use case? Given @sternj's context, I'd like to be sure the feature would actually have improved Scalene's utility for you.

emeryberger commented 1 year ago

Just a note: it is already possible to use Scalene for the specified use case of finding who is waiting on a synchronization operation via a command-line option. As long as the waits are in a separate file from the rest of the code, you can simply --profile-exclude the file in question, and Scalene will, in this case, only report the parent callers. @sternj also proposed extending the --profile-exclude syntax to support exclusion of specific functions (e.g., --profile-exclude somecode.py:wait_for_something).
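Concretely, that workflow looks something like this (the file and script names here are hypothetical, and --profile-exclude matches filename substrings, so check `scalene --help` for the exact semantics in your version):

```shell
# Suppose the blocking synchronization waits live in sync_helpers.py
# (hypothetical name). Excluding that file from profiling means the
# wall-time is attributed to its callers instead of to the wait itself:
scalene --profile-exclude sync_helpers.py my_app.py
```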

frankier commented 1 year ago

Thanks for this tool! It's very neat, and complements py-spy (a low-overhead sampling profiler that provides nice flamegraphs) very well.

One thing I would like to attach to the feature request, acknowledging that any implementation is probably quite a lot of extra work: the proposed --cpu-only --flamegraph might not be enough of an improvement over the current situation of running Scalene and then py-spy (apart from having fewer tools to install). A couple of small extras would push it into providing real additional utility:

  1. Capturing both a flamegraph profile and a full Scalene profile, including (non-stack) memory information, in the same run. This would be more convenient and would provide assurance that the two views really are views of the same data.
  2. A GPU flamegraph, which could be useful in some circumstances, e.g. in deep learning, for seeing how much time is spent in training versus evaluation; without stack awareness, both can appear as time inside the model code.

In the short term, it could be an idea to note in the README that py-spy is currently a complementary tool.

battaglia01 commented 1 year ago

Hi, I was trying to get some kind of stack trace output similar to py-spy and just saw this. Scalene does have --stacks:

--stacks collect stack traces

but I haven't seen the output change when this is set. What does --stacks do, and is it something different from what is being discussed here?

ggoretkin-bdai commented 1 year ago

From searching the code, it seems that --stacks has no effect.

mcarans commented 1 month ago

I can see --stacks output in the JSON, but it's not clear to me how to use it
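For anyone else poking at that data, here is a minimal sketch of pulling it out of the JSON report. The top-level "stacks" key name and its layout are assumptions based on the comments above; check the actual JSON your Scalene version emits and adjust accordingly.

```python
import json


def load_stacks(profile_path):
    """Return whatever Scalene stored under the top-level "stacks" key.

    NOTE: the "stacks" key name and its structure are assumptions here;
    inspect your own profile JSON to confirm them for your version.
    """
    with open(profile_path) as f:
        profile = json.load(f)
    return profile.get("stacks", [])
```

Assuming the profile was produced with something like `scalene --stacks --json --outfile profile.json my_script.py`, you could then call `load_stacks("profile.json")` and explore the result interactively.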