python / cpython


Bytecode compile times are `O(nlocals**2)` #97912

Closed brandtbucher closed 1 year ago

brandtbucher commented 1 year ago

@JelleZijlstra discovered something interesting yesterday: bytecode compile times on main are currently quadratic in the number of local variables (3.11 is still linear). For example, a function with 100,000 assignments to unique local variables takes well over a minute to compile, but takes less than 2 seconds on 3.11.
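
A minimal sketch to reproduce this (the generator and timing harness here are illustrative, not the exact benchmark we used):

```python
import time

def time_compile(n):
    # Build a function body with n assignments to n unique locals and time compile().
    body = "\n".join(f"    x{i} = {i}" for i in range(n))
    src = f"def f():\n{body}\n"
    start = time.perf_counter()
    compile(src, "<generated>", "exec")
    return time.perf_counter() - start

for n in (1_000, 10_000, 100_000):
    print(n, f"{time_compile(n):.2f}s")
```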

I believe the culprit is the new LOAD_FAST/LOAD_FAST_CHECK stuff. The current implementation appears to loop over every local variable, and for each one, loop over all bytecode instructions. This can probably be refactored to run in one pass for all locals.

CC: @markshannon @sweeneyde

sweeneyde commented 1 year ago

If it proves too complex to track multiple local variable states at the same time, another solution might be to only scan for the first hundred local variables or something like that.

It's not clear to me how easy tracking everything would be. Do we need potentially unbounded stack space, since blocks might need to be processed more than once? Would the approach still be linear in the face of many different del statements in unpredictable places? Maybe we don't care.

But if you manage to do it all linearly, that sounds great.

sweeneyde commented 1 year ago

Here's an example of a "hard" case:

```python
def f():
    a1 = a2 = a3 = a4 = a5 = 1
    for i in range(2):
        if cond1(): use(a1)
        else: del a1

        if cond2(): use(a2)
        else: del a2

        if cond3(): use(a3)
        else: del a3

        if cond4(): use(a4)
        else: del a4

        if cond5(): use(a5)
        else: del a5

# (generalize to more than 5 locals...)
```

For example, the fact that `del a3` may have happened has to propagate to the `if cond4():`, to the `if cond5():`, to the `for i`, to the `if cond1():`, to the `if cond2():`, to the `if cond3():`, and finally to `use(a3)`. Combining that with the effects of all the other `del` statements, there are n deletions that each have to be moved across at least n basicblock linkages.

That's not a proof that linear time is impossible, just that some cleverness may be required (moving information across multiple basicblock linkages at once?) if we want linear time in all cases. However, some sort of heuristic may be enough to handle the typical cases, which have few or no `del` statements.

By the way, the reason to propagate "possible undefinedness" rather than "certain definedness" is that any path with undefinedness leading into a `LOAD_FAST` is enough to require `LOAD_FAST_CHECK`.
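
For instance, in a small sketch like the following, the `del` on one path is enough to force the later load to be a `LOAD_FAST_CHECK`, even though the other path leaves `x` bound:

```python
def g(cond):
    x = 1
    if cond():
        del x
    return x   # x may be unbound on one path, so the load needs the runtime check
```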

sweeneyde commented 1 year ago

Here's a script that gathered some co_nlocals data from Sympy, which I figured might have some long code objects:

The script:

```python
from pathlib import Path
from types import CodeType

def all_code_recursive(code):
    # based on https://github.com/faster-cpython/tools/blob/main/scripts/count_opcodes.py
    yield code
    for x in code.co_consts:
        if isinstance(x, CodeType):
            yield from all_code_recursive(x)

def all_code(root):
    for path in root.glob("**/*.py"):
        text = path.read_text("utf-8")
        code = compile(text, path.name, "exec")
        yield from all_code_recursive(code)

from collections import Counter
import sympy

var_counts = Counter()
init = Path(sympy.__file__)
for code in all_code(init.parent):
    n = code.co_nlocals
    var_counts[code.co_nlocals] += 1
    if code.co_nlocals > 50:
        print(code.co_qualname, n)

print(dict(sorted(var_counts.items())))
total = var_counts.total()
for bound in [2, 5, 10, 20, 50, 100, 200, 500, 1000]:
    bigger = sum(count for n, count in var_counts.items() if n > bound)
    print(f"{bigger/total:.2%} of code objects have >{bound} locals")
```

The output:

```
test_DiagramGrid 53
_FixSimplify 144
_SimplifyAntiderivative 57
_TrigSimplifyAux 61
_ExpandIntegrand 200
binomial_products 1044
exponential 330
hyperbolic 1069
integrand_simplification 143
inverse_hyperbolic 1546
inverse_trig 1442
linear_products 485
logarithms 466
miscellaneous_algebraic 1204
miscellaneous_integration 221
miscellaneous_trig 877
piecewise_linear 64
quadratic_products 1291
secant 1727
sine 2836
special_functions 477
tangent 1208
trinomial_products 1022
test_binary_operators 60
test_var_decl 64
test_TransferFunction_functions 58
test_aux_dep 58
test_non_central_inertia 65
test_sub_qdot 59
test_bicycle 109
test_linearize_rolling_disc_kane 59
modgcd_bivariate 52
roots_quintic 67
_vertical_bisection 68
_horizontal_bisection 68
dup_isolate_complex_roots_sqf 71
powsimp 53
_solve 58
_solve_system 55
unrad 63
BinaryQuadratic.solve 57
classify_ode 51
test__classify_linear_system 56
test_sysode_linear_neq_order1_type1 53
test_sysode_linear_neq_order1_type1_slow 103
test_sysode_linear_neq_order1_type2 75
test_DiscreteMarkovChain 52
test_M38 56
{0: 10914, 1: 8656, 2: 13341, 3: 5899, 4: 3445, 5: 2496, 6: 2478, 7: 2358, 8: 2162, 9: 1656, 10: 1221, 11: 904, 12: 515, 13: 329, 14: 215, 15: 177, 16: 133, 17: 117, 18: 95, 19: 66, 20: 64, 21: 55, 22: 33, 23: 35, 24: 40, 25: 33, 26: 27, 27: 22, 28: 22, 29: 19, 30: 17, 31: 12, 32: 16, 33: 10, 34: 8, 35: 5, 36: 7, 37: 8, 38: 8, 39: 7, 40: 4, 41: 3, 42: 4, 43: 2, 44: 2, 45: 3, 46: 1, 47: 2, 49: 3, 50: 1, 51: 1, 52: 2, 53: 3, 55: 1, 56: 2, 57: 2, 58: 3, 59: 2, 60: 1, 61: 1, 63: 1, 64: 2, 65: 1, 67: 1, 68: 2, 71: 1, 75: 1, 103: 1, 109: 1, 143: 1, 144: 1, 200: 1, 221: 1, 330: 1, 466: 1, 477: 1, 485: 1, 877: 1, 1022: 1, 1044: 1, 1069: 1, 1204: 1, 1208: 1, 1291: 1, 1442: 1, 1546: 1, 1727: 1, 2836: 1}
66.08% of code objects have >1 locals
42.96% of code objects have >2 locals
22.44% of code objects have >5 locals
5.32% of code objects have >10 locals
0.79% of code objects have >20 locals
0.08% of code objects have >50 locals
0.04% of code objects have >100 locals
0.03% of code objects have >200 locals
0.02% of code objects have >500 locals
0.02% of code objects have >1000 locals
0.00% of code objects have >2000 locals
0.00% of code objects have >5000 locals
```

Summarized output:

For .py files in Sympy distribution:
------------------------------------
66.08% of code objects have >1 locals
42.96% of code objects have >2 locals
22.44% of code objects have >5 locals
5.32% of code objects have >10 locals
0.79% of code objects have >20 locals
0.08% of code objects have >50 locals
0.04% of code objects have >100 locals
0.03% of code objects have >200 locals
0.02% of code objects have >500 locals
0.02% of code objects have >1000 locals
0.00% of code objects have >2000 locals

Data from other places:

For .py files in Numpy distribution:
------------------------------------
66.77% of code objects have >1 locals
44.70% of code objects have >2 locals
15.82% of code objects have >5 locals
3.37% of code objects have >10 locals
0.38% of code objects have >20 locals
0.03% of code objects have >50 locals
0.00% of code objects have >100 locals

For .py files in Lib/test/ (ignoring SyntaxError):
-------------------------------------------------
55.63% of code objects have >1 locals  
33.39% of code objects have >2 locals  
8.09% of code objects have >5 locals   
1.10% of code objects have >10 locals  
0.06% of code objects have >20 locals  
0.00% of code objects have >50 locals

For .py files in Django distribution:
-------------------------------------
61.79% of code objects have >1 locals
38.18% of code objects have >2 locals
11.70% of code objects have >5 locals
2.68% of code objects have >10 locals
0.32% of code objects have >20 locals
0.00% of code objects have >50 locals

JelleZijlstra commented 1 year ago

Thanks! The biggest one is https://github.com/sympy/sympy/blob/0e4bab73562d6f90fcdf5fa079731f26a455347c/sympy/integrals/rubi/rules/sine.py#L148, which looks terrifying. It is noticeably slower to compile on 3.12 than 3.11:

```
% time ./python.exe -m compileall sine.py
Compiling 'sine.py'...
./python.exe -m compileall sine.py  2.66s user 0.09s system 99% cpu 2.763 total
% time ~/py/venvs/311-jun14/bin/python3 -m compileall sine.py
Compiling 'sine.py'...
~/py/venvs/311-jun14/bin/python3 -m compileall sine.py  0.98s user 0.09s system 98% cpu 1.082 total
```

sweeneyde commented 1 year ago

One speedup might come from doing some precomputation for the "first pass": start by building a mapping from each local index `i` to the array of basicblocks where `DELETE_FAST(i)` occurs (often empty). This wouldn't guarantee linear time overall, but it would greatly speed up almost every first pass. A parameter variable with no `del` statements could be skipped entirely, for free.
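
A rough Python sketch of that precomputation (the block and instruction representation here are made up for illustration; the real compiler would do this in C over its basicblocks):

```python
from collections import defaultdict

def blocks_with_delete_fast(blocks):
    """Map each local index to the blocks containing a DELETE_FAST for it.

    Assumes each block has an `instructions` list of (opcode, oparg) pairs.
    """
    deletes = defaultdict(list)
    for block in blocks:
        for opcode, oparg in block.instructions:
            if opcode == "DELETE_FAST":
                deletes[oparg].append(block)
    return dict(deletes)

# A local that never appears in the resulting map (e.g. a parameter with no
# `del`) can skip the expensive per-local scan entirely.
```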

The `b->b_visited = 0;` reset loop is also quadratic, though with a much smaller constant. We could instead widen `b_visited` to more than one bit (or add another field) and do something like:

```c
#define MAYBE_PUSH(B) do {                          \
        if ((B)->b_visited > target) {              \
            *(*stack_top)++ = (B);                  \
            (B)->b_visited = target + 1;            \
        }                                           \
    } while (0)
```

Though again, I'm not sure whether it would be better to just limit the number of locals to analyze and set the rest to use LOAD_FAST_CHECK. Or maybe do both.

If someone else hasn't already started a patch, I can work on this.

markshannon commented 1 year ago

It won't reduce the big-O complexity, but handling 64 locals at once (using an `int64_t` as a vector of 64 booleans) would reduce the constant factor by ~60.
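
A rough Python sketch of that idea, using a plain integer as the bit vector over a simplified instruction representation (the names here are illustrative, not the actual compiler code):

```python
def scan_block(instructions, maybe_unbound):
    """Rewrite loads in one block, tracking up to 64 locals at once.

    instructions: list of (opcode, local_index) pairs (hypothetical IR).
    maybe_unbound: bitmask; bit i set means local i may be unbound on entry.
    """
    rewritten = []
    for op, i in instructions:
        bit = 1 << i
        if op == "STORE_FAST":
            maybe_unbound &= ~bit           # definitely bound afterwards
        elif op == "DELETE_FAST":
            maybe_unbound |= bit            # may be unbound afterwards
        elif op == "LOAD_FAST":
            if maybe_unbound & bit:
                op = "LOAD_FAST_CHECK"      # needs the runtime check
            maybe_unbound &= ~bit           # a successful load proves it is bound
        rewritten.append((op, i))
    return rewritten, maybe_unbound
```

Every instruction updates the status of all tracked locals in O(1), instead of one scan per local.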

sweeneyde commented 1 year ago

This should be fixed now. Thanks for finding this!

(feel free to re-open if I missed something)