pyutils / line_profiler

Line-by-line profiling for Python

Profiling a multiline statement with Pytorch/CUDA usage results in very low time allocation #223

Open fnobis opened 1 year ago

fnobis commented 1 year ago

I run a multiline statement that calls into a PyTorch network, which runs in about 50 ms. When measuring this with line_profiler, the reported time is strangely low. When I put the code on a single line, the measurement looks correct. In the multiline case, the number of hits is reported as 2 even though the model and all other lines in the code are only run once.

Wrong time calculated, multiline

```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   194         2          1.5      0.8      0.0              output_1, output_2 = self.model(
   195         2          0.7      0.3      0.0                  tensor_in, **parameters
   196                                                       )
```

Correct time and hit count, single line

```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   194         1      50721.4  50721.4      4.5              output_1, output_2 = self.model(tensor_in, **parameters)
```
Erotemic commented 1 year ago

Strange, can you make an MWE that reproduces this with a simple torch model?
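Pending the reporter's MWE, a torch-free sketch of the two call shapes involved might look like this (the `Model` class and the sleep-based timing are hypothetical stand-ins for the real network; decorate the enclosing function with `@profile` and run it under `kernprof -lv` to compare the two forms):

```python
import time

# Hypothetical stand-in for the reporter's network: a callable whose
# "forward pass" takes about 50 ms, matching the shape of the call site.
class Model:
    def __call__(self, tensor_in, **parameters):
        time.sleep(0.05)  # stand-in for the ~50 ms forward pass
        return tensor_in, tensor_in

model = Model()
tensor_in = 1.0
parameters = {}

# Multiline call: the form that showed 2 hits and near-zero time.
output_1, output_2 = model(
    tensor_in, **parameters
)

# Single-line call: the form that was measured correctly.
output_1, output_2 = model(tensor_in, **parameters)
```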

Theelx commented 1 year ago

That is indeed interesting. I was using line_profiler on PyTorch models a few weeks ago and I didn't notice this issue, but I could easily have been doing something wrong.

Theelx commented 1 year ago

Also, in addition to an MWE, can you give us your platform (Windows, Mac, or Linux), Python version, and line_profiler version? This may be related to #210, which is fixed if you install from this git repo, but the fix hasn't been officially released yet. So running your code against the version of line_profiler in this repo might help.
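For reference, installing the development version straight from this repository would typically look like this (assuming pip and git are available):

```shell
# Install line_profiler from the git repo instead of the PyPI release
pip install --upgrade git+https://github.com/pyutils/line_profiler.git
```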

tmm1 commented 1 year ago

You need to make sure to set CUDA_LAUNCH_BLOCKING=1 for accurate results. Otherwise the CUDA kernels run asynchronously, and all their time accumulates on the wrong line, wherever a CUDA sync happens.
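A sketch of what that setup could look like at the top of the profiled script (the ordering matters: the variable must be set before torch is imported):

```python
import os

# Force each CUDA kernel launch to block until the kernel finishes,
# so line_profiler charges GPU time to the line that launched the
# kernel instead of to whichever later line happens to trigger a
# CUDA synchronization. Must run before `import torch`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # only import torch after the variable is set
```

An alternative, at the cost of editing the code, is to insert explicit `torch.cuda.synchronize()` calls around the region of interest so pending kernels finish before the next line's timer starts.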