Closed brandtbucher closed 1 year ago
If it proves too complex to track multiple local variable states at the same time, another solution might be to only scan for the first hundred local variables or something like that.
It's not clear to me how easy tracking everything would be. Do we need potentially unbounded stack space, since blocks might need to be processed more than once? Would the approach still be linear in the face of many different `del` statements in unpredictable places? Maybe we don't care.
But if you manage to do it all linearly, that sounds great.
Here's an example of a "hard" case:
```python
def f():
    a1 = a2 = a3 = a4 = a5 = 1
    for i in range(2):
        if cond1(): use(a1)
        else: del a1
        if cond2(): use(a2)
        else: del a2
        if cond3(): use(a3)
        else: del a3
        if cond4(): use(a4)
        else: del a4
        if cond5(): use(a5)
        else: del a5
        # (generalize to more than 5 locals...)
```
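As a reminder of why this analysis matters at runtime, here is a minimal demo (function names are my own, not from the issue) showing that loading a conditionally deleted local raises `UnboundLocalError` — exactly the situation `LOAD_FAST_CHECK` has to detect:

```python
# Minimal demo: a local that is deleted on one path may be unbound at a
# later load, so the compiler must emit LOAD_FAST_CHECK for it.
def f(cond):
    a = 1
    if not cond:
        del a
    return a  # 'a' is only conditionally bound here

f(True)   # the defined path works fine
try:
    f(False)
except UnboundLocalError:
    pass  # the deleted path raises at the load
```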
For example, the fact that `del a3` may have happened has to propagate to `if cond4()`, to `if cond5()`, to `for i`, to `if cond1()`, to `if cond2()`, to `if cond3()`, and finally to `use(a3)`. Combining that all together with the effects of all the other `del` statements, there are n deletions that we have to move over at least n basicblock linkages each.
That's not a proof that linear time is impossible, just that some cleverness may be required (moving info across multiple basicblock linkages at once?) if we want linear time in all cases. However, some sort of heuristic may be possible that takes care of most typical cases, which have few or no `del` statements.
By the way, the reason to propagate "possible undefinedness" rather than "certain definedness" is that a single path with undefinedness leading into a LOAD_FAST is enough to require LOAD_FAST_CHECK.
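The propagation described above can be sketched as a standard forward dataflow fixpoint. This is only an illustrative Python model — the block/edge representation and all names are invented for the example, not the compiler's actual data structures:

```python
# Sketch: propagate "maybe undefined" locals forward over a CFG to a fixpoint.
# Each block is a list of (name, effect) pairs, where effect is "def"
# (assignment kills undefinedness) or "del" (deletion introduces it).
def maybe_undefined(blocks, edges, entry, all_locals):
    # At entry, every local is possibly undefined (the real compiler would
    # exclude parameters; ignored here for simplicity).
    in_sets = {b: set() for b in blocks}
    in_sets[entry] = set(all_locals)
    work = list(blocks)  # process every block at least once
    while work:
        b = work.pop()
        out = set(in_sets[b])
        for name, effect in blocks[b]:
            if effect == "def":
                out.discard(name)
            elif effect == "del":
                out.add(name)
        for succ in edges.get(b, []):
            # Any predecessor path carrying undefinedness is enough,
            # so the join is set union; re-queue successors that grew.
            if not out <= in_sets[succ]:
                in_sets[succ] |= out
                work.append(succ)
    return in_sets

# B0 defines 'a'; B1 deletes it; B2 is the join point where 'a' is
# maybe-undefined (reachable via B1).
blocks = {"B0": [("a", "def")], "B1": [("a", "del")], "B2": []}
edges = {"B0": ["B1", "B2"], "B1": ["B2"]}
result = maybe_undefined(blocks, edges, "B0", ["a"])
```

Since the sets only grow and are bounded, the loop terminates; the cost question in the comments above is how many times each block gets re-queued.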
Here's a script that gathered `co_nlocals` data from Sympy, which I figured might have some long code objects:
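The script itself isn't reproduced in the thread; a sketch of how such a `co_nlocals` survey could be written (helper names are my own) is:

```python
# Sketch: compile every .py file under a directory, walk all nested code
# objects, and report what fraction have more than N locals.
import os

def iter_code_objects(code):
    # A code object's nested functions/classes live in co_consts.
    yield code
    for const in code.co_consts:
        if hasattr(const, "co_code"):
            yield from iter_code_objects(const)

def nlocals_stats(root):
    counts = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    top = compile(f.read(), path, "exec")
            except SyntaxError:
                continue  # matches "ignoring SyntaxError" in the output below
            counts.extend(c.co_nlocals for c in iter_code_objects(top))
    if not counts:
        return
    for n in (1, 2, 5, 10, 20, 50, 100):
        frac = 100.0 * sum(1 for c in counts if c > n) / len(counts)
        print(f"{frac:.2f}% of code objects have >{n} locals")
```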
Summarized output:
For .py files in Sympy distribution:
------------------------------------
66.08% of code objects have >1 locals
42.96% of code objects have >2 locals
22.44% of code objects have >5 locals
5.32% of code objects have >10 locals
0.79% of code objects have >20 locals
0.08% of code objects have >50 locals
0.04% of code objects have >100 locals
0.03% of code objects have >200 locals
0.02% of code objects have >500 locals
0.02% of code objects have >1000 locals
0.00% of code objects have >2000 locals
Data from other places:
For .py files in Numpy distribution:
------------------------------------
66.77% of code objects have >1 locals
44.70% of code objects have >2 locals
15.82% of code objects have >5 locals
3.37% of code objects have >10 locals
0.38% of code objects have >20 locals
0.03% of code objects have >50 locals
0.00% of code objects have >100 locals
For .py files in Lib/test/ (ignoring SyntaxError):
-------------------------------------------------
55.63% of code objects have >1 locals
33.39% of code objects have >2 locals
8.09% of code objects have >5 locals
1.10% of code objects have >10 locals
0.06% of code objects have >20 locals
0.00% of code objects have >50 locals
For .py files in Django distribution:
-------------------------------------
61.79% of code objects have >1 locals
38.18% of code objects have >2 locals
11.70% of code objects have >5 locals
2.68% of code objects have >10 locals
0.32% of code objects have >20 locals
0.00% of code objects have >50 locals
Thanks! The biggest one is https://github.com/sympy/sympy/blob/0e4bab73562d6f90fcdf5fa079731f26a455347c/sympy/integrals/rubi/rules/sine.py#L148, which looks terrifying. It is noticeably slower to compile on 3.12 than 3.11:
```
% time ./python.exe -m compileall sine.py
Compiling 'sine.py'...
./python.exe -m compileall sine.py  2.66s user 0.09s system 99% cpu 2.763 total
% time ~/py/venvs/311-jun14/bin/python3 -m compileall sine.py
Compiling 'sine.py'...
~/py/venvs/311-jun14/bin/python3 -m compileall sine.py  0.98s user 0.09s system 98% cpu 1.082 total
```
One speedup might come from doing some precomputation for the "first pass": start by storing a mapping `i -> array of basicblocks where DELETE_FAST(i) occurs` (often empty). This wouldn't guarantee linear time overall, but it would greatly speed up almost every first pass. A parameter variable with no `del` statements could be skipped entirely for free.
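That precomputation could look something like this illustrative Python sketch (the instruction representation is invented for the example):

```python
# Sketch: one linear scan over all instructions builds an index from local
# variable number to the basicblocks containing a DELETE_FAST of it.
# Locals absent from the index can skip the per-local analysis entirely.
from collections import defaultdict

def deletes_by_local(blocks):
    # blocks: mapping block id -> list of (opname, oparg) pairs
    index = defaultdict(list)
    for bid, instrs in blocks.items():
        for opname, oparg in instrs:
            if opname == "DELETE_FAST":
                index[oparg].append(bid)
    return index
```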
The `b->b_visited = 0;` loop is also quadratic, but faster. We could instead widen `b_visited` to more than one bit (or add another field) and do something like

```c
#define MAYBE_PUSH(B) do { \
        if ((B)->b_visited <= target) { \
            *(*stack_top)++ = (B); \
            (B)->b_visited = target + 1; \
        } \
    } while (0)
```
Though again, I'm not sure whether it would be better to just limit the number of locals to analyze and set the rest to use `LOAD_FAST_CHECK`. Or maybe do both.
If someone else hasn't already started a patch, I can work on this.
It won't reduce the big-O complexity, but handling 64 locals at once (using an `int64_t` as a vector of 64 booleans) would reduce the constant factor by ~60×.
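An illustrative Python model of the 64-at-a-time idea (Python ints standing in for `int64_t`; the function names are mine):

```python
# Sketch: track 64 locals' "maybe unbound" flags in one integer, where bit i
# set means local i is possibly undefined. Merging predecessor states is then
# a single bitwise OR instead of 64 separate boolean merges.
MASK64 = (1 << 64) - 1

def kill(state, i):
    # Local i was assigned: definitely bound afterwards.
    return state & ~(1 << i) & MASK64

def gen(state, i):
    # Local i was deleted: possibly unbound afterwards.
    return (state | (1 << i)) & MASK64

def merge(*preds):
    # Join of predecessor states: unbound on any path means maybe unbound.
    out = 0
    for s in preds:
        out |= s
    return out
```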
This should be fixed now. Thanks for finding this!
(feel free to re-open if I missed something)
@JelleZijlstra discovered something interesting yesterday: bytecode compile times on `main` are currently quadratic in the number of local variables (3.11 is still linear). For example, a function with 100,000 assignments to unique local variables takes well over a minute to compile (but less than 2 seconds on 3.11).

I believe the culprit is the new `LOAD_FAST`/`LOAD_FAST_CHECK` stuff. The current implementation appears to loop over every local variable and, for each one, loop over all bytecode instructions. This can probably be refactored to run in one pass for all locals.

CC: @markshannon @sweeneyde