Open legel opened 8 months ago
PS: it's probably obvious, but by color-coding I mean, e.g.,
Gradient from Blue to Red, where the darkest blue == max seconds of processing time, darkest red == least seconds, based on a simple min/max normalization from all computed graph elements, and a best-estimate allocation of the Profile runtimes per element, shown on your graph viz...
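The min/max normalization and blue-to-red gradient described above could be sketched roughly as follows. The function names and the runtime numbers are made up for illustration, and the linear blend between pure red and pure blue is only one possible reading of "darkest blue" and "darkest red".

```python
def normalize(values):
    """Min/max-normalize a dict of {element: seconds} to the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    # If every element took the same time, map everything to the midpoint.
    return {k: 0.5 if span == 0 else (v - lo) / span for k, v in values.items()}


def blue_red_hex(t):
    """Map t in [0, 1] to a hex color: 0 -> pure red (fastest), 1 -> pure blue (slowest)."""
    red = round(255 * (1.0 - t))
    blue = round(255 * t)
    return f"#{red:02x}00{blue:02x}"


# Hypothetical per-element runtimes in seconds (made-up numbers).
runtimes = {"conv1": 0.0040, "relu1": 0.0002, "fc": 0.0021}
colors = {name: blue_red_hex(t) for name, t in normalize(runtimes).items()}
```

With these inputs the slowest element (conv1) ends up pure blue and the fastest (relu1) pure red; everything else falls somewhere in between.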
This would be a good feature and should be possible. However, displaying profiler statistics as a color coding would clash with the highlight colors that show which node is being visualized in the right-hand panel. I'm instead envisioning something like the following mockup: The bars on the left-hand side are runtime statistics; the right-hand side would be memory. For runtime, black is the forwards pass and gray is the backwards pass. For memory, black is the memory consumed by activations and gray is the memory consumed by parameters (possibly including gradients for the backwards pass?). Heights are normalized as you described.
Mouseover would provide a tooltip explaining each bar and giving the concrete value (e.g., "Fwd pass runtime: 3e-4").
In terms of implementation, I don't know much about profiling but was thinking the following.
Runtime
... torch.cuda.synchronize and compute all the normalized time deltas.
Memory
... zero_grad is called with set_to_none=False (default))
... Conv2d submodule twice in one forward call from the parent module, the Conv2d node will show twice in the explorer graph. The forwards activation memory is pretty clear, that happens on a per-invocation basis. But the parameter memory usage is "shared" between the nodes. I'm inclined to just show the full memory usage for each repeated node, but this could have the unfortunate side effect that the sum of the memory usages of all nodes in the graph doesn't actually add up to the total memory usage.
... cuda.memory_allocated in the pre-forward hook and post-forward hook and measuring the difference. I expect that this should work without running into asynchronous issues? (correct me if I'm wrong)
Shared details
... log_freq'th iteration. Profiling results will be updated in some kind of a running average.
Question
Does the above make sense to you / match your expectations of what you'd like to see profiled? Any other suggestions are also welcome.
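As a rough illustration of the log_freq idea above (profile only every log_freq'th iteration and fold the results into a running average), here is a minimal sketch. The class name, the `momentum` parameter, and the choice of an exponential moving average are assumptions; the thread only says "some kind of a running average".

```python
class RunningProfile:
    """Keeps an exponential moving average of per-node measurements (hypothetical helper)."""

    def __init__(self, log_freq=10, momentum=0.9):
        self.log_freq = log_freq
        self.momentum = momentum  # weight given to the previous average
        self.iteration = 0
        self.averages = {}

    def should_profile(self):
        """True on iterations 0, log_freq, 2 * log_freq, ..."""
        return self.iteration % self.log_freq == 0

    def update(self, measurements):
        """Fold a dict of {node: value} into the running averages."""
        for node, value in measurements.items():
            if node not in self.averages:
                self.averages[node] = value  # first sample seeds the average
            else:
                old = self.averages[node]
                self.averages[node] = self.momentum * old + (1 - self.momentum) * value

    def step(self):
        """Advance the iteration counter; call once per training iteration."""
        self.iteration += 1


# Usage with made-up timings: only iterations 0 and 2 are profiled.
profile = RunningProfile(log_freq=2, momentum=0.5)
for seconds in [4.0, 9.0, 2.0, 9.0]:
    if profile.should_profile():
        profile.update({"conv1": seconds})
    profile.step()
```

Skipping most iterations keeps the profiling overhead (for example, extra synchronization) off the common path.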
Thanks @spfrommer for the super fast, detailed, and visionary reply!
I'm very happy to affirm a lot of what you describe where it makes sense to me, and also to critically propose changes where I think it would result in a more useful UX.
I see your concern. Consider the following...
... Runtime Analysis - Forward + Backward Pass, ... , Memory Analysis - CPU RAM, or Memory Analysis - GPU VRAM.
... TOTAL RUNTIME: X.XX milliseconds, TOTAL CPU RAM: Y.Y GB, TOTAL GPU VRAM: Z.Z GB
... It's a neat graphical design that you propose. The reason why I push back and suggest doing extra gymnastics for color is that, in my research and decade of work in data visualization, I've found that colors are by far the most valuable tool for dealing with complexity. I think it would be difficult to quickly spot performance bottlenecks at a high level across a very large computational graph with the design you propose, because the many small black/grey bars would be hard to see when zoomed out. I also figured it wouldn't be too hard to go with the "gold standard" of a fully color-coded view of the graphical network...
Above, we use the full color spectrum for visualization, following the science of the Turbo colormap designed by Google.
Above is based on Python code that I just made for demo purposes here: neural_network_turbo_color_coding.py.zip
As I suggested, I would recommend just going after one of the profiling challenges first and trying to deploy on just that. Between runtime vs. memory profiling, there's some pretty nice work in the right direction for memory profiling recently published by Torch maintainers, detailed here. So, I think when it comes time to implement memory profiling, their approach should help.
For starters, the simplest and most essential task is to get runtimes for every single computational element (I figure trying to parse and post-process outputs from the Torch Profiler might get there, but not sure how well it matches up with your graphical modeling).
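One way to get a per-element forward runtime without parsing profiler output is the hook-based approach mentioned earlier in this thread: a forward pre-hook and a forward hook per submodule, with a `torch.cuda.synchronize` before each clock read when CUDA is available. The sketch below is an illustration under those assumptions, not the project's implementation; `attach_timing_hooks` is a made-up name, and it covers the forward pass only.

```python
import time
from collections import defaultdict

import torch
import torch.nn as nn


def attach_timing_hooks(model):
    """Record wall-clock forward time for each named submodule call (hypothetical helper)."""
    use_cuda = torch.cuda.is_available()
    starts = {}
    timings = defaultdict(list)  # submodule name -> list of seconds, one entry per call
    handles = []

    def sync():
        # CUDA kernels run asynchronously, so wait for them before reading the clock.
        if use_cuda:
            torch.cuda.synchronize()

    def make_hooks(name):
        def pre_hook(module, inputs):
            sync()
            starts[name] = time.perf_counter()

        def post_hook(module, inputs, output):
            sync()
            timings[name].append(time.perf_counter() - starts[name])

        return pre_hook, post_hook

    for name, module in model.named_modules():
        if name == "":
            continue  # skip the root module itself
        pre, post = make_hooks(name)
        handles.append(module.register_forward_pre_hook(pre))
        handles.append(module.register_forward_hook(post))
    return timings, handles


# Usage on a tiny made-up model; submodules of nn.Sequential are named "0", "1", "2".
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
timings, handles = attach_timing_hooks(model)
model(torch.randn(2, 8))
for handle in handles:
    handle.remove()  # detach the hooks once profiling is done
```

A submodule that is called twice in one forward pass simply gets two entries in its list, which lines up with the per-invocation view discussed above.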
While logging every n-th iteration works, just doing the profile for a single forward and backward pass would already be super useful; e.g. could save that data to an export file, and allow the user to dynamically explore the details of that.
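Saving a single forward and backward profile to an export file could be as simple as dumping a dictionary to JSON. The schema below (a `unit` field plus per-node `forward` and `backward` values) and the numbers are invented for illustration only.

```python
import json
import os
import tempfile


def export_profile(path, runtimes_ms):
    """Write per-node runtimes to a JSON file (hypothetical schema)."""
    record = {"unit": "ms", "nodes": runtimes_ms}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)


def load_profile(path):
    """Read a profile written by export_profile."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Made-up measurements for one forward and backward pass.
path = os.path.join(tempfile.mkdtemp(), "profile.json")
export_profile(path, {"conv1": {"forward": 0.31, "backward": 0.62}})
loaded = load_profile(path)
```

A plain file like this would let a viewer explore the results later without rerunning the model.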
There's a lot more I could get into here, but hopefully the above is helpful and sufficiently inspiring! I wish I had it today for an actual profiling project I need to move forward on now...
Really appreciate the detailed feedback! I agree the colors would be better for usability, and having different analysis modes is the better UI. My suggestion was mostly motivated by practical considerations with Vega. Vega has a really focused grammar: it is not designed to support fancy GUIs, and I'm really pushing the limitations of what it can do. Nevertheless, it's still probably better to just do it properly than to keep hacking additional features into one interface.
To select between the different analysis modes, I'd need an equivalent of a drop-down box or a radio button. Vega supports binding inputs to HTML elements, but these are outside of the visualization pane and by default are ugly and uncustomizable within the JSON spec. Customizing them to be appropriately positioned and formatted involves custom CSS external to the spec, which would be possible on the standalone backend but not the wandb interface. I'd probably have to hack it together with a Vega legend as a radio button.
Having the panels on the right hand side be custom for each analysis mode is probably the right design choice, but would involve a significant rearchitecting of how I handle the panels in Vega.
The pytorch memory profiling tools you've linked seem to largely involve examining memory allocation over time to detect memory leaks. I think the natural thing to display on the right-hand panels for the "memory analysis" mode would be the per-module line chart for the memory usage over the forwards pass and backwards pass, with time on the x axis (how to even get this information is another complication). This would help narrow down if a specific module is leaking memory in the forwards pass (e.g., by saving intermediate tensors to the module attributes). But my feeling is that most memory leaks happen outside of the module forward invocations (i.e., with metrics logged on the module output without calling detach() first).
In summary, I really like your suggestions and I'll probably refer to our discussions when I end up implementing this. But it involves a big architectural rework for what's essentially a side project to my research--it's probably not something that I'll get around to in the near future.
Gathering data from https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
...it would be fantastic if there was a library with a one-line API comparable to what Tensorboard previously offered with TensorFlow, for color-coded graph visualization of performance metrics per computational graph element -- namely, runtime, but also of interest, would be memory metrics... e.g. see https://branyang.gitbooks.io/tfdocs/content/get_started/graph_viz.html
The problem with Tensorboard PyTorch support is apparently it's a mess right now... Please ping me if this is of interest to develop, I think it would greatly help ML developers to be able to both visualize graphs and visualize performance bottlenecks of the graphs...