Feature request: color-coded graphs for performance visualization

Gathering data from https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html

...it would be fantastic if there were a library with a one-line API, comparable to what TensorBoard previously offered with TensorFlow, for color-coded graph visualization of performance metrics per computational graph element: namely runtime, though memory metrics would also be of interest. E.g., see https://branyang.gitbooks.io/tfdocs/content/get_started/graph_viz.html

The problem is that TensorBoard's PyTorch support is apparently a mess right now... Please ping me if this is of interest to develop. I think it would greatly help ML developers to be able to visualize both the graphs and the performance bottlenecks within them...

PS probably it's obvious, by color-coding, I mean, e.g.

A gradient from blue to red, where the darkest blue == the most seconds of processing time and the darkest red == the fewest seconds, based on a simple min/max normalization over all computed graph elements and a best-estimate allocation of the profiler runtimes per element, shown on your graph viz...
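To make the min/max normalization concrete, here is a minimal sketch in plain Python. The function names (`blue_red_color`, `color_code`) and the runtime numbers are made up for illustration; they are not part of any existing library. The direction of the gradient follows the description above (slowest element is pure blue, fastest is pure red).

```python
def blue_red_color(value, lo, hi):
    """Map a value in [lo, hi] to an (r, g, b) tuple on a blue-red gradient."""
    # Min/max normalization; fall back to the midpoint if all values are equal.
    t = 0.5 if hi == lo else (value - lo) / (hi - lo)
    # As described above: t == 1 (slowest) is pure blue, t == 0 (fastest) is pure red.
    red = round(255 * (1.0 - t))
    blue = round(255 * t)
    return (red, 0, blue)


def color_code(runtimes):
    """Assign a color to every graph element from its runtime in seconds."""
    lo, hi = min(runtimes.values()), max(runtimes.values())
    return {name: blue_red_color(sec, lo, hi) for name, sec in runtimes.items()}


# Hypothetical per-element runtimes in seconds (illustrative numbers only).
runtimes = {"conv1": 0.004, "relu": 0.001, "fc": 0.007}
colors = color_code(runtimes)
```

A perceptually tuned colormap could replace the linear two-color ramp without changing the normalization step.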

This would be a good feature and should be possible. However, displaying profiler statistics as a color coding would clash with the highlight colors that show which node is being visualized in the right-hand panel. I'm instead envisioning something like the following mockup (image: draftprofile). The bars on the left-hand side are runtime statistics; the right-hand side would be memory. For runtime, black is the forward pass and gray is the backward pass. For memory, black is the memory consumed by activations and gray is the memory consumed by parameters (possibly including gradients for the backward pass?). Heights are normalized as you described.

Mouseover would provide a tooltip explaining each bar and giving the concrete value (e.g., "Fwd pass runtime: 3e-4").

In terms of implementation, I don't know much about profiling but was thinking the following.

Runtime

Add a pre-forward hook and a post-forward hook that capture the start and end times with cuda.Event (if on GPU) or a regular timer (if on CPU). I'll have to look into what to do for MPS.
After everything is run, throw in a torch.cuda.synchronize and compute all the normalized time deltas.
Similar logic for backwards pass.
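The forward-pass part of this could look roughly like the sketch below. It is CPU-only and uses `time.perf_counter`; the helper name `attach_forward_timers` and the toy model are assumptions for illustration, not torchexplorer code. On GPU, the timer calls would presumably be replaced by pairs of `torch.cuda.Event(enable_timing=True)` that are recorded in the hooks and read with `elapsed_time` after a `torch.cuda.synchronize()`.

```python
import time
from collections import defaultdict

import torch
import torch.nn as nn


def attach_forward_timers(model):
    """Attach pre/post forward hooks that accumulate wall-clock seconds per module."""
    starts = {}
    totals = defaultdict(float)
    handles = []

    def make_hooks(name):
        def pre_hook(module, args):
            starts[name] = time.perf_counter()

        def post_hook(module, args, output):
            totals[name] += time.perf_counter() - starts.pop(name)

        return pre_hook, post_hook

    for name, module in model.named_modules():
        if name == "":
            continue  # skip the root module; only submodules are timed here
        pre, post = make_hooks(name)
        handles.append(module.register_forward_pre_hook(pre))
        handles.append(module.register_forward_hook(post))
    return totals, handles


# Toy model, for illustration only.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
totals, handles = attach_forward_timers(model)
with torch.no_grad():
    model(torch.randn(2, 8))
for handle in handles:
    handle.remove()  # detach the hooks once the measurement is done

# Normalize against the slowest module so bar heights (or colors) fall in [0, 1].
slowest = max(totals.values()) or 1.0
normalized = {name: seconds / slowest for name, seconds in totals.items()}
```

A module that is called twice in one forward pass simply accumulates both calls under the same name here; showing each invocation separately would need a per-call record instead.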

Memory

This is a little trickier, as there are three ways that a module uses memory: parameters (persisted), activations (allocated on the forward pass, freed on the backward pass), and parameter gradients (persisted if zero_grad is called with set_to_none=False; note that the default has been set_to_none=True since PyTorch 2.0).
Another wrinkle is how to handle modules that are reused in the computational graph. For example, if we use the same Conv2d submodule twice in one forward call from the parent module, the Conv2d node will show twice in the explorer graph. The forwards activation memory is pretty clear, that happens on a per-invocation basis. But the parameter memory usage is "shared" between the nodes. I'm inclined to just show the full memory usage for each repeated node, but this could have the unfortunate side effect that the sum of the memory usages of all nodes in the graph doesn't actually add up to the total memory usage.
To make the presentation similar to the runtime bars, the black bar should be the memory consumed by activations only (during the forward pass) and the gray bar should be the memory used by the parameters (as this exact amount of memory is what is doubled in the backward pass to store the gradients).
For measuring the activation memory, I was thinking about using cuda.memory_allocated in the pre-forward hook and post-forward hook and measuring the difference. I expect that this should work without running into asynchronous issues? (correct me if I'm wrong)
Memory will have to include the memory used by submodules. This is a little inconsistent with the fact that we don't visualize the parameter tensors of nested submodules, but it'll have to do.
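Here is a rough sketch of the bookkeeping described in the points above, under a few assumptions: the helper names (`parameter_bytes`, `attach_memory_hooks`) and the `Twice` toy module are invented for illustration, only CUDA allocations are read (so on a CPU-only machine the activation delta is reported as 0), and each invocation of a reused module gets its own record with the full parameter size.

```python
import torch
import torch.nn as nn


def parameter_bytes(module):
    """Bytes held by a module's parameters, including those of nested submodules."""
    return sum(p.numel() * p.element_size() for p in module.parameters())


def attach_memory_hooks(model):
    """Record (module type, parameter bytes, activation delta) per invocation."""
    records = []  # one entry per invocation, so a reused module appears twice
    stack = []    # allocation readings taken in the pre-forward hooks
    handles = []

    def allocated():
        # Assumption: only CUDA allocations are tracked; on CPU this reports 0.
        return torch.cuda.memory_allocated() if torch.cuda.is_available() else 0

    def pre_hook(module, args):
        stack.append(allocated())

    def post_hook(module, args, output):
        delta = allocated() - stack.pop()
        records.append((type(module).__name__, parameter_bytes(module), delta))

    for module in model.modules():
        handles.append(module.register_forward_pre_hook(pre_hook))
        handles.append(module.register_forward_hook(post_hook))
    return records, handles


class Twice(nn.Module):
    """Toy parent module that calls the same Linear submodule twice."""

    def __init__(self):
        super().__init__()
        self.shared = nn.Linear(4, 4)

    def forward(self, x):
        return self.shared(self.shared(x))


model = Twice()
records, handles = attach_memory_hooks(model)
model(torch.randn(2, 4))
for handle in handles:
    handle.remove()

true_total = parameter_bytes(model)
per_node_sum = sum(param for name, param, _ in records if name == "Linear")
```

In this toy case the two Linear records together report twice the parameter memory that the model actually holds, which is exactly the "doesn't add up" side effect mentioned above.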

Shared details

Profiling will only happen every log_freq'th iteration. Profiling results will be updated in some kind of a running average.
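One possible reading of "some kind of a running average" is an exponential moving average that is only updated on every log_freq'th step. The sketch below assumes exactly that; the class name `RunningProfile`, the `momentum` parameter, and the stand-in measurement are hypothetical.

```python
class RunningProfile:
    """Keep an exponential moving average of metrics sampled every log_freq steps."""

    def __init__(self, log_freq, momentum=0.9):
        self.log_freq = log_freq
        self.momentum = momentum
        self.averages = {}

    def should_profile(self, step):
        # Profile only on every log_freq'th iteration.
        return step % self.log_freq == 0

    def update(self, name, value):
        if name not in self.averages:
            # The first sample initializes the average.
            self.averages[name] = value
        else:
            old = self.averages[name]
            self.averages[name] = self.momentum * old + (1.0 - self.momentum) * value


profile = RunningProfile(log_freq=10, momentum=0.5)
profiled_steps = []
for step in range(25):
    if profile.should_profile(step):
        profiled_steps.append(step)
        # Stand-in for a real measurement such as a forward-pass runtime.
        profile.update("fwd_runtime", float(step))
```

A plain arithmetic mean over all profiled steps would work just as well; the moving average simply weights recent iterations more heavily.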

Question: Does the above make sense to you / match your expectations of what you'd like to see profiled? Any other suggestions are also welcome.

Thanks @spfrommer for the super fast, detailed, and visionary reply!

I'm very happy to affirm a lot of what you describe where it makes sense to me, and also to critically propose changes where I think it would result in a more useful UX.

"...displaying profiler statistics as a color coding would clash with the highlight colors to show which node is being visualized in the right-hand panel."

I see your concern. Consider the following...

TorchExplorer as a multi-purpose tool, which could support multiple "analysis modes".
All of your work on graphical network analysis, flow, etc. + UI elements for deep diving into the graphical network, could serve as foundational tech for other analysis modes...
Visualization of weights and biases is a primary analysis mode currently. It's not really explicitly labeled as an option with e.g. a toggle/switch/dropdown selector, because currently it's the only main analysis mode.
Imagine a user selecting a different analysis mode: besides the default of "Weights & Biases Analysis", the user could instantly select Runtime Analysis - Forward + Backward Pass, ... , Memory Analysis - CPU RAM, or Memory Analysis - GPU VRAM.
Benefits of this kind of wrapper around the entire project:
1. TorchExplorer's sub-module View, now visualizing the detailed Weights & Biases charts / metrics, could be reused for different dedicated charts / metrics for different analysis modes. For starters, for runtime/memory analysis, the area would be usefully and easily filled up with high-level metrics like TOTAL RUNTIME: X.XX milliseconds, TOTAL CPU RAM: Y.Y GB, TOTAL GPU VRAM: Z.Z GB...
2. TorchExplorer's graphical network coloring, now linking together the different sub-views of module weights/biases with the overall graphical network, could be customized for whatever is the selected analysis mode. This makes a ton of sense, because it is unlikely that I'll be debugging weights/biases issues at the exact same time as debugging runtime. Similarly, I'd probably like to focus on debugging/visualizing just one of runtime or memory, but not both simultaneously. This makes for a cleaner UI and follows TensorBoard. Another nice upshot of this modularization: first we can solve development of runtime graphical visualization, and then after deploying that, can look to solve memory graphical visualization...

"...The bars on the left hand side are runtime statistics, right hand side would be memory. For runtime, black is forwards pass and gray is backwards pass. For memory, black is the memory consumed by activations and gray is memory consumed by parameters (possibly including gradients for the backwards pass?). Heights are normalized as you described."

It's a neat graphical design that you propose. The reason I push back and suggest doing extra gymnastics for color is that, in my research and decade of work in data visualization, I've found that colors are by far the most valuable tool for dealing with complexity. I think it would be difficult to quickly spot performance bottlenecks at a high level across a very large computational graph with the design you propose, because the many small black/gray bars would be hard to see when zoomed out. I also figured it wouldn't be too hard to go with the "gold standard" of a fully color-coded view of the graphical network...

Design proposal visualization

Above, we use the full color spectrum for visualization, following the science behind the Turbo colormap designed by Google.

Above is based on Python code that I just made for demo purposes here: neural_network_turbo_color_coding.py.zip

"In terms of implementation..."

As I suggested, I would recommend just going after one of the profiling challenges first and trying to deploy on just that. Between runtime vs. memory profiling, there's some pretty nice work in the right direction for memory profiling recently published by Torch maintainers, detailed here. So, I think when it comes time to implement memory profiling, their approach should help.

For starters, the simplest and most essential task is to get runtimes for every single computational element (I figure trying to parse and post-process outputs from the Torch Profiler might get there, but not sure how well it matches up with your graphical modeling).

"...Profiling will only happen every log_freq'th iteration. Profiling results will be updated in some kind of a running average."

While logging every n-th iteration works, just doing the profile for a single forward and backward pass would already be super useful; e.g. could save that data to an export file, and allow the user to dynamically explore the details of that.

There's a lot more I could get into here, but hopefully the above is helpful and sufficiently inspiring! I wish I had it today for an actual profiling project I need to move forward on now...

Really appreciate the detailed feedback! I agree the colors would be better for usability, and having different analysis modes is the better UI. My suggestion was mostly motivated by practical considerations with Vega. Vega has a really focused grammar: it is not designed to support fancy GUIs, and I'm really pushing the limitations of what it can do. Nevertheless, it's still probably better to just do it properly than to keep hacking additional features into one interface.

To select between the different analysis modes, I'd need an equivalent of a drop down box or a radio button. Vega supports binding inputs to html elements, but these are outside of the visualization pane and by default are ugly and uncustomizable within the json spec. Customizing them to be appropriately positioned and formatted involves custom CSS external to the spec, which would be possible on the standalone backend but not the wandb interface. I'd probably have to hack it together with a Vega legend as a radio button.
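For reference, the kind of input binding I mean looks roughly like the fragment below, written as a Python dict and serialized to JSON. This is only a sketch of a Vega signal bound to a radio input, not the actual torchexplorer spec; the signal name `analysisMode` is made up, and the option labels are taken from the analysis modes suggested earlier in this thread.

```python
import json

# Hypothetical fragment of a Vega spec: one signal bound to an HTML radio input.
spec_fragment = {
    "signals": [
        {
            "name": "analysisMode",  # assumed name, for illustration only
            "value": "Weights & Biases Analysis",
            "bind": {
                "input": "radio",
                "options": [
                    "Weights & Biases Analysis",
                    "Runtime Analysis - Forward + Backward Pass",
                    "Memory Analysis - CPU RAM",
                    "Memory Analysis - GPU VRAM",
                ],
            },
        }
    ]
}

# Serialize so the fragment can be merged into a JSON spec.
encoded = json.dumps(spec_fragment, indent=2)
```

The rendered radio buttons live in the HTML around the chart rather than inside the visualization pane, which is why positioning and styling them needs CSS outside the spec.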

Having the panels on the right hand side be custom for each analysis mode is probably the right design choice, but would involve a significant rearchitecting of how I handle the panels in Vega.

The pytorch memory profiling tools you've linked seem to largely involve examining memory allocation over time to detect memory leaks. I think the natural thing to display on the right-hand panels for the "memory analysis" mode would be the per-module line chart for the memory usage over the forwards pass and backwards pass, with time on the x axis (how to even get this information is another complication). This would help narrow down if a specific module is leaking memory in the forwards pass (e.g., by saving intermediate tensors to the module attributes). But my feeling is that most memory leaks happen outside of the module forward invocations (i.e., with metrics logged on the module output without calling detach() first).

In summary, I really like your suggestions and I'll probably refer to our discussions when I end up implementing this. But it involves a big architectural rework for what's essentially a side project to my research--it's probably not something that I'll get around to in the near future.

spfrommer / torchexplorer