python / cpython

The Python programming language
https://www.python.org

slowdown in 3.12 iteration microbenchmark #113041

Open ajoino opened 9 months ago

ajoino commented 9 months ago

Bug report

Bug description:

I decided to install 3.12 from deadsnakes to play around and test how much faster some simple code would be with inlined comprehensions (I got inspired after listening to Carl's appearance on the core.py podcast).

I tried running this code with both 3.11.7 and 3.12.1

import timeit

res = timeit.timeit("[a * 2 for a in range(1_000_000)]", number=100)

print(f"{res = }")

and I consistently get a time of around 2.8 seconds for 3.11 and around 2.9 to 3.1 seconds for 3.12. The run-to-run variability was around 0.06 seconds for 3.11 and around 0.2 seconds for 3.12.

Not sure if this is to be expected given the large changes between versions, but intuitively I would expect this example to at least not get slower.

CPython versions tested on:

3.11, 3.12

Operating systems tested on:

Linux

itamaro commented 9 months ago

I am definitely not an expert on comprehension performance, but if my understanding is correct, your microbenchmark is dominated by the 1,000,000 iterations. The mechanics of the comprehension itself (whether inlined or implemented as a single call to an anonymous function created once) would be negligible compared to the cost of doing anything 1,000,000 times.
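This point can be illustrated by timing the same comprehension at two very different sizes (a sketch, not part of the original report; the sizes and repeat counts are arbitrary choices):

```python
import timeit

# The fixed per-call overhead (setting up and entering the comprehension)
# can be estimated by timing a tiny comprehension many times; everything
# else scales with the iteration count, so at 1,000,000 elements the loop
# body dominates and the setup cost is lost in the noise.
small = timeit.timeit("[a * 2 for a in range(2)]", number=1000)
large = timeit.timeit("[a * 2 for a in range(1_000_000)]", number=10)
per_iter = large / 10 / 1_000_000
print(f"1000 tiny comprehensions: {small:.6f} s total")
print(f"per-element cost at 1,000,000 elements: {per_iter:.2e} s")
```

Whatever the per-call setup cost is, it is paid once per comprehension, while the per-element cost is paid a million times.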

Check out the microbenchmark from the PEP (https://peps.python.org/pep-0709/#reference-implementation) for reference.

I'm not sure why 3.12 is slower for this microbenchmark - I agree this is unexpected - but I don't think it's related to comprehension inlining. cc @carljm

carljm commented 9 months ago

Yes, Itamar is right; I would not expect to see any improvement from comprehension inlining in this microbenchmark, because the comprehension-execution overhead is dominated by the cost of the comprehension loop itself. I also wouldn't expect to see any slowdown from comprehension inlining; there may be an unexpected effect, or the slowdown may be coming from some other change in 3.12 (immortal objects?). I will do some profiling to see if I can trace it. Thanks for the report!

carljm commented 9 months ago

I created this microbenchmark module:

➜ cat bench.py
def comp():
    return [a * 2 for a in range(1_000_000)]

def loop():
    l = []
    for a in range(1_000_000):
        l.append(a * 2)
    return l

And then I compared the performance of these four commands:

➜ python3.11 -m timeit -s "import bench" "bench.comp()"
➜ python3.12 -m timeit -s "import bench" "bench.comp()"
➜ python3.11 -m timeit -s "import bench" "bench.loop()"
➜ python3.12 -m timeit -s "import bench" "bench.loop()"
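For a single interpreter, the same comparison can also be scripted with the timeit API (a sketch; it inlines the bench.py functions rather than importing them, and only compares comp() against loop() within one Python version, not across versions):

```python
import timeit

def comp():
    return [a * 2 for a in range(1_000_000)]

def loop():
    l = []
    for a in range(1_000_000):
        l.append(a * 2)
    return l

# Best-of-three timing for each variant; taking the minimum reduces
# scheduler and warm-up noise, as `python -m timeit` does.
results = {}
for fn in (comp, loop):
    per_call = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    results[fn.__name__] = per_call
    print(f"{fn.__name__}: {per_call * 1000:.1f} ms per call")
```

Running this under each interpreter gives the same four numbers as the shell commands above.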

Like you, I found a small slowdown from 3.11 to 3.12 in the comprehension variant. But the same slowdown also reproduced in the for-loop variant, and in both variants if I replaced a * 2 with simply a.

My conclusion is that there is a small regression in the performance of tight loops (which is really what this microbenchmark tests) from 3.11 to 3.12, but it doesn't seem to be related to comprehensions.

I'm currently on a Mac; when I have access to a Linux host again I'll use perf to see if I can narrow down the source of the 3.12 regression. My initial guess remains immortal objects, which were known to have a performance cost in the range of 1-2%.

gvanrossum commented 9 months ago

I see it too, on Mac M3 -- 3.11 gives about 23 msec, 3.12 about 26, and 3.13 about 25. (I only checked loop())

If I replace the body of the for-loop with pass I get 7.6, 8.6, and 9.3 msec respectively. This makes me think that 3.13 has some extra speedup for either l.append() or a * 2, but that the looping constructs themselves are slower. An empty loop has only three bytecodes:

3.11

             36 FOR_ITER                 2 (to 42)
             38 STORE_FAST               1 (a)

  8          40 JUMP_BACKWARD_QUICK      3 (to 36)

3.12

        >>   28 FOR_ITER_RANGE           2 (to 36)
             32 STORE_FAST               1 (a)

  8          34 JUMP_BACKWARD            4 (to 28)

3.13

        >>   28 FOR_ITER_RANGE           3 (to 38)
             32 STORE_FAST               1 (a)

  8          34 JUMP_BACKWARD            5 (to 28)

Of these, the JUMP_BACKWARD opcode has probably changed the most; in particular, the CHECK_EVAL_BREAKER() macro is different in each release. Maybe this provides a hint.
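For anyone reproducing this, the listings above can be regenerated with the dis module; a minimal sketch (the exact opcode names vary by version, e.g. the specialized FOR_ITER_RANGE only appears in runtime-quickened code):

```python
import dis

def loop():
    l = []
    for a in range(1_000_000):
        l.append(a * 2)
    return l

# Render the function's bytecode as text; the loop shows up as a
# FOR_ITER / STORE_FAST / JUMP_* sequence whose details differ
# between 3.11, 3.12, and 3.13.
text = dis.Bytecode(loop).dis()
print(text)
```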

gvanrossum commented 9 months ago

Of course, opcode dispatch itself could also have slowed down.