Open kpister opened 1 week ago
Thanks for the report. We've been able to reproduce this locally and are looking into it.
In the interim, as a work-around, you can specify --cpu --gpu (the culprit at the moment appears to be the memory/copy profiling).
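For example, with the test.py from the report:

```
scalene --cpu --gpu test.py
```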
Likewise reproduced, also with around a 50x slowdown. On torch==2.5.1, disabling the settrace call reduces the runtime from ~100s to ~4s. I'm going to see whether the slowdown was introduced in a particular Scalene commit or with a particular Pytorch version.
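For context on why the trace hook is so expensive, here is a rough, standalone sketch (not Scalene's code): even a no-op trace function installed with sys.settrace costs a Python function call for every line/call/return event the interpreter reports, which is enough to slow a tight loop dramatically.

```python
# Rough illustration (not Scalene's code) of per-event trace overhead:
# a no-op callback installed via sys.settrace still costs a function call
# for every line/call/return event the interpreter generates.
import sys
import time


def noop_trace(frame, event, arg):
    return noop_trace  # keep local (per-line) tracing enabled everywhere


def workload():
    total = 0
    for i in range(1_000_000):
        total += i
    return total


start = time.perf_counter()
workload()
untraced = time.perf_counter() - start

sys.settrace(noop_trace)
start = time.perf_counter()
workload()
traced = time.perf_counter() - start
sys.settrace(None)

print(f"untraced: {untraced:.2f}s  traced: {traced:.2f}s")
```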
I do not see this performance degradation at 5e457916606b1ebc. Bisecting.
The problem was introduced in b9ad0a56582cf4d.
I've been looking into this more, and the root problem has to do with how and when the interpreter calls the tracing function. At the moment, it seems that the C logic that decides when to disable the PyEval_SetTrace callback isn't properly disabling the callback in library calls. This incurs a function call overhead unconditionally for every single event (opcode, line, call, and return) in every part of every library executed anywhere in the program. Since importing Pytorch invokes a lot of code, that overhead adds up incredibly quickly.
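For a sense of what that disabling looks like, here is a Python-level analogy only (Scalene's actual hook is installed in C via PyEval_SetTrace): a global trace function can return None for frames it doesn't care about, which stops line/return events for those frames, though the per-call event still fires. The site-packages check below is just a stand-in for however library code would actually be identified.

```python
# Python-level sketch of selectively disabling tracing for library frames.
import sys

events_seen = 0


def selective_trace(frame, event, arg):
    global events_seen
    filename = frame.f_code.co_filename
    if "site-packages" in filename or filename.startswith("<"):
        # Returning None from the 'call' event turns off line/return events
        # for this frame, but the 'call' event itself has already fired.
        return None
    events_seen += 1
    # Keep tracing this frame: the interpreter calls back for every
    # line/return event it generates.
    return selective_trace


def profiled_code():
    return sum(range(10))


sys.settrace(selective_trace)
profiled_code()
sys.settrace(None)
print(f"events seen in user code: {events_seen}")
```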
I'm making headway on figuring out precisely how to do this disabling. CPython checks in several different places, and in several different ways, whether a trace callback exists and whether to execute it; this is governed by both the PyThreadState struct and the _PyCFrame struct, with much of that logic living in ceval.c. I think I'll have a solution by the end of tomorrow.
Describe the bug
Pytorch related imports are taking an extra long time to resolve (15x longer?) when using scalene vs python.

To Reproduce
I have a simple test.py file, which is just import torch.
1. Run scalene test.py
2. Wait ~30s for the report to finish
3. Run python test.py
4. Wait ~2s

Screenshots

Versions
I enabled gpu with scalene