Add a per-operation run time breakdown

geoffxy commented 4 years ago

We need to add the per-operation run time breakdown back in. Previously we used a custom profiling solution. This time we should explore existing profiling tools instead (e.g., pyprof, the PyTorch profiler that ships with NVIDIA's apex).

- [x] Investigate existing profiling solutions (primarily apex/pyprof)
- [x] Implement the needed changes in the server
- [x] Add a visualization to the plugin

geoffxy commented 4 years ago

I've looked through apex/pyprof and decided that it is not well suited to our use case, for several reasons:

- It relies on nvprof, which is not future-proof: NVIDIA is moving to its Nsight tools (e.g., for Turing architecture GPUs).
- The authors have said on Twitter that it does not work past PyTorch 1.1 (https://twitter.com/marekinfo/status/1187761747412668416).
- We would need to modify pyprof ourselves to extract the information we care about (its console-based output is not suitable for our needs).
- pyprof's "monkey patching" is destructive: there is no way to remove the hooks it adds, which is problematic for how we currently run our analysis.
- As far as I can tell, pyprof doesn't provide run times at the operation level.
- I'm worried about the profiling overhead introduced by nvprof.
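
To make the "destructive monkey patching" concern concrete, here is a minimal sketch of the opposite approach: patches that record per-operation run times and are always undone afterwards. It is an illustration only, not pyprof's or Skyline's actual code. The name `timed_patches` is hypothetical, and `math` merely stands in for a module whose operations we want to time.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager


@contextmanager
def timed_patches(namespace, op_names):
    """Temporarily wrap namespace.<name> to record total wall-clock time per op.

    Hypothetical helper for illustration; not part of pyprof or Skyline.
    """
    run_times = defaultdict(float)  # op name -> accumulated seconds
    originals = {}

    def make_wrapper(name, fn):
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                run_times[name] += time.perf_counter() - start
        return wrapper

    try:
        for name in op_names:
            originals[name] = getattr(namespace, name)
            setattr(namespace, name, make_wrapper(name, originals[name]))
        yield run_times
    finally:
        # Restore every original, even on error, so no hooks are left behind.
        for name, fn in originals.items():
            setattr(namespace, name, fn)


# Usage: `math` stands in for the library being profiled.
original_sqrt = math.sqrt
with timed_patches(math, ["sqrt", "floor"]) as run_times:
    for i in range(1000):
        math.floor(math.sqrt(i))

patches_removed = math.sqrt is original_sqrt  # originals are back in place
breakdown = dict(run_times)                   # per-operation totals in seconds
```

Note that wall-clock timing like this only measures host-side time. GPU kernels launch asynchronously, so a real per-operation breakdown for PyTorch would also need device synchronization or CUDA events.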

geoffxy commented 4 years ago

Completed as of commit 111661a90a7da909ada66d63a57c11b34d477216.

skylineprof / skyline

Add a per-operation run time breakdown #28