plasma-umass / scalene

Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals
Apache License 2.0

Extra slow pytorch imports (~30s) #889

Open kpister opened 1 week ago

kpister commented 1 week ago

Describe the bug PyTorch-related imports take far longer to resolve (roughly 15x longer) when running under scalene vs. plain python.

To Reproduce I have a simple test.py file, which is just import torch.

1. Run scalene test.py and wait ~30s for the report to finish.
2. Run python test.py and wait ~2s.

Screenshots

(screenshot attached)

Versions

I enabled GPU support with scalene.

emeryberger commented 1 week ago

Thanks for the report. We've been able to reproduce this locally and are looking into it.

emeryberger commented 1 week ago

In the interim, as a workaround, you can specify --cpu --gpu (e.g., scalene --cpu --gpu test.py); the culprit at the moment appears to be the memory/copy profiling.

sternj commented 1 week ago

Likewise reproduced, also with around a 50x slowdown.

sternj commented 6 days ago

On torch==2.5.1, disabling the settrace callback reduces the runtime from ~100s to ~4s. I'm going to see whether the regression was introduced by a particular Scalene commit or by a particular PyTorch version.

sternj commented 6 days ago

I do not see this performance degradation at 5e457916606b1ebc. Bisecting.

sternj commented 6 days ago

Problem was introduced in b9ad0a56582cf4d

sternj commented 5 days ago

I've been looking into this more; the root problem is how and when the interpreter calls the tracing function.

At the moment, it seems that the C logic that decides when to disable the PyEval_SetTrace callback isn't properly disabling it for library calls. This incurs function-call overhead unconditionally for every single event (opcode, line, call, and return) in every part of every library executed anywhere in the program. Since importing PyTorch executes a lot of code, that overhead adds up incredibly quickly.
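The per-event overhead is easy to demonstrate from pure Python with a do-nothing sys.settrace callback. This is only a minimal sketch of the mechanism, not Scalene's actual hooks:

```python
import sys
import time

def helper(i):
    return i * 2

def busy(n):
    # Many small function calls, so call/line/return trace events dominate.
    total = 0
    for i in range(n):
        total += helper(i)
    return total

def noop_trace(frame, event, arg):
    # A do-nothing callback: the cost measured below is purely the
    # interpreter invoking it on every call, line, and return event.
    return noop_trace

N = 200_000

t0 = time.perf_counter()
busy(N)
t_plain = time.perf_counter() - t0

sys.settrace(noop_trace)
t0 = time.perf_counter()
busy(N)
t_traced = time.perf_counter() - t0
sys.settrace(None)

print(f"plain:  {t_plain:.3f}s")
print(f"traced: {t_traced:.3f}s ({t_traced / t_plain:.1f}x slower)")
```

Even with a callback that does nothing, the traced run is several times slower, which is consistent with the slowdown seen on the torch import.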

I'm making headway on figuring out precisely how to do this disabling. CPython actually checks in several different places, and in several different ways, whether a trace callback exists and whether to execute it; this is governed by both the PyThreadState struct and the _PyCFrame struct, with much of that logic in ceval.c. I think I'll have a solution by the end of tomorrow.
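A pure-Python analogue of the intended behavior: returning None from the global trace callback on a frame's 'call' event disables local (line/return) tracing for that frame, which is the effect the C-level check should achieve for library code. Note the interpreter still invokes the global callback on every call, which is exactly the residual overhead described above. The function names here are hypothetical, chosen for illustration:

```python
import sys

events = []

def selective_trace(frame, event, arg):
    # Global callback: the interpreter fires this on every 'call' event.
    # Returning None disables local (per-line, per-return) tracing for
    # that frame, so its body runs without further trace overhead.
    if frame.f_code.co_name == "library_code":
        return None  # skip per-line events inside this frame
    events.append((frame.f_code.co_name, event))
    return selective_trace

def library_code():
    x = 1
    y = 2
    return x + y

def profiled_code():
    return library_code() + 1

sys.settrace(selective_trace)
profiled_code()
sys.settrace(None)

# No events were recorded inside library_code; profiled_code is fully traced.
assert not any(name == "library_code" for name, _ in events)
```

This mirrors the fix direction: the check just has to happen early enough, and in the right struct, so library frames never pay the per-event cost.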