
Generator finalization is slower in 3.11 vs 3.10 #100762

Open · Onlooker2186 opened this issue 1 year ago

Onlooker2186 commented 1 year ago

I found that the Python 3.11.1 implementation of all() is 30% slower compared to Python 3.10.9.

any() also seems to be around 4% slower on my device.

Code to test all():

import timeit
from functools import partial
import platform

def find_by_keys(
    keys: list[str],
    table: list[dict[str, str | int]],
    match_data: dict[str, str | int],
) -> dict[str, str | int] | None:
    for item in table:
        if all(item[k] == match_data[k] for k in keys):
            return item
    return None

def main():
    keys: list[str] = ["id", "key_1", "key_2"]
    table: list[dict[str, str | int]] = [
        {
            "id": i,
            "key_1": "val_1",
            "key_2": "val_2",
        }
        for i in range(1, 5001)
    ]
    match_data: dict[str, str | int] = {
        "id": 3000,
        "key_1": "val_1",
        "key_2": "val_2",
    }

    timeit_output = timeit.repeat(
        partial(find_by_keys, keys, table, match_data), repeat=10000, number=1
    )

    average_time = sum(timeit_output) / len(timeit_output)
    best_time = min(timeit_output)
    tps_output = 1 / average_time

    print(f"Python version = {platform.python_version()}")
    print(f"Average time = {average_time}")
    print(f"Best time = {best_time}")
    print(f"Average transactions per second = {tps_output}")

if __name__ == "__main__":
    main()

Results for all()

Console output using Python 3.10.9:

Python version = 3.10.9
Average time = 0.0008256170657486655
Best time = 0.0007106999401003122
Average transactions per second = 1211.2152733824669

Console output using Python 3.11.1:

Python version = 3.11.1
Average time = 0.0011819988898839802
Best time = 0.001033599954098463
Average transactions per second = 846.0244832363215

Code to test any():

import timeit
from functools import partial
import platform

def find_by_keys(
    keys: list[str],
    table: list[dict[str, str | int]],
    match_data: dict[str, str | int],
) -> dict[str, str | int] | None:
    for item in table:
        if any(item[k] == match_data[k] for k in keys):
            return item
    return None

def main():
    keys: list[str] = ["id", "key_1", "key_2"]
    table: list[dict[str, str | int]] = [
        {
            "id": i,
            "key_1": "foo_1",
            "key_2": "foo_2",
        }
        for i in range(1, 5001)
    ]
    match_data: dict[str, str | int] = {
        "id": 3000,
        "key_1": "val_1",
        "key_2": "val_2",
    }

    timeit_output = timeit.repeat(
        partial(find_by_keys, keys, table, match_data), repeat=10000, number=1
    )

    average_time = sum(timeit_output) / len(timeit_output)
    best_time = min(timeit_output)
    tps_output = 1 / average_time

    print(f"Python version = {platform.python_version()}")
    print(f"Average time = {average_time}")
    print(f"Best time = {best_time}")
    print(f"Average transactions per second = {tps_output}")

if __name__ == "__main__":
    main()

Results for any()

Console output using Python 3.10.9:

Python version = 3.10.9
Average time = 0.0009300541202770546
Best time = 0.00090230000205338
Average transactions per second = 1075.206246817238

Console output using Python 3.11.1:

Python version = 3.11.1
Average time = 0.0009645268900785595
Best time = 0.000930099980905652
Average transactions per second = 1036.7777303943815

Linked PRs

- python/cpython#101013
- python/cpython#101316

eendebakpt commented 1 year ago

I can confirm this issue (comparing 3.10.8 to 3.12.0a3+). A shorter minimal example is:

import timeit
import platform

setup="""
def go( ) :
    item=[1,2,3]
    for ii in range(3000):
        all(item[k] == 2 for k in range(3))
""" 

cmd='go()'
timeit_output = timeit.repeat(
   cmd, setup=setup , repeat=500, number=10,
)

average_time = sum(timeit_output) / len(timeit_output)
best_time = min(timeit_output)
tps_output = 1 / average_time

print(f"Python version = {platform.python_version()}")
print(f"Average time = {average_time}")
print(f"Best time = {best_time}")
print(f"Average transactions per second = {tps_output}")

The expression inside the all(...) is a generator expression. Replacing it with a list comprehension, all([item[k] == 2 for k in range(3)]), also shows a performance regression, but a much smaller one. Some more tests show:

        all(k for k in item) # has no regression
        all(k == 2 for k in item) # has a regression

With git bisect I tried to identify the commit where the regression was introduced, but I could not pinpoint it (there seem to be multiple regressions, and not all commits build on my system).
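
A plausible reading of those two cases (a sketch; it anticipates the finalization analysis further down this thread):

```py
# all() short-circuits on the first falsy value, leaving the generator
# *suspended*; closing a suspended generator later means throwing
# GeneratorExit into its frame, which turns out to be the slow path.
item = [1, 2, 3]

g1 = (k for k in item)       # every value is truthy
all(g1)                      # True: g1 runs to exhaustion and closes itself

g2 = (k == 2 for k in item)  # the first value is falsy
all(g2)                      # False after one item: g2 is left suspended
del g2                       # destruction must close the suspended frame
```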

pochmann commented 1 year ago

What makes you blame all() instead of the generator and all it involves?

Do you see a difference with all((False,))? That's an iterable likewise starting with False that eliminates almost everything else.
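
For reference, a minimal way to time that baseline with pyperf (a sketch, using the same Runner API as the next comment):

```py
import pyperf

# all((False,)) starts with a falsy value but involves no generator,
# so it isolates all() itself from generator creation/finalization.
runner = pyperf.Runner()
runner.timeit(name="all((False,))", stmt="all((False,))")
```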

@eendebakpt What times did you get?

eendebakpt commented 1 year ago

@pochmann I executed the following code:

import pyperf
runner = pyperf.Runner()

for stmt in ['g=(k == 2 for k in item)', 'g=all(k for k in item)', 'g=all(k == 2 for k in item)']:
    time = runner.timeit(name=stmt, stmt=stmt, setup="item=[1,2,3]")

Results 3.10.9:

.....................
g=(k == 2 for k in item): Mean +- std dev: 321 ns +- 10 ns
.....................
g=(k == 2 for k in item); h=list(g): Mean +- std dev: 504 ns +- 7 ns
.....................
g=all(k for k in item): Mean +- std dev: 388 ns +- 9 ns
.....................
all((False,)): Mean +- std dev: 74.1 ns +- 2.3 ns
.....................
g=all(k == 2 for k in item): Mean +- std dev: 434 ns +- 8 ns

Results 3.12.0a3+:

.....................
g=(k == 2 for k in item): Mean +- std dev: 351 ns +- 20 ns
.....................
g=(k == 2 for k in item); h=list(g): Mean +- std dev: 442 ns +- 5 ns
.....................
g=all(k for k in item): Mean +- std dev: 325 ns +- 3 ns
.....................
all((False,)): Mean +- std dev: 45.5 ns +- 1.5 ns
.....................
g=all(k == 2 for k in item): Mean +- std dev: 567 ns +- 7 ns

I am not sure what to conclude from the numbers, but the last statement tested, g=all(k == 2 for k in item), shows the regression reported in this issue.

thatbirdguythatuknownot commented 1 year ago

There doesn't seem to be any difference in the implementation code of all() and any() between 3.10 and main. Maybe the generators are the issue?

mdboom commented 1 year ago

Yes, indeed signs point to generators (or maybe generator expressions specifically) more than all or any.

mdboom commented 1 year ago

EDIT: I wasn't measuring what I thought I was here. See below for corrected results.

I tried to see whether the new overhead is in the setup/starting of the generator or in the actual iteration (is the overhead constant per generator used, or does it scale with the number of items iterated over?), and it seems it's maybe both...?

I modified this script so that the number of keys in the dictionary is configurable (using only 3 here seems like a hard case if the regression is in generator creation), and then ran it for powers of 2 from 2 to 2048. It definitely gets better as the number of keys goes up, but it never gets on par with 3.10.

[graph: average time per call vs. number of dictionary keys, Python 3.10 vs. 3.11]

Not sure what any of this means, but maybe some other Faster CPython folks may have insight.

Code example with a variable number of keys:

```py
import timeit
from functools import partial
import sys

def find_by_keys(
    keys: list[str],
    table: list[dict[str, str | int]],
    match_data: dict[str, str | int],
) -> dict[str, str | int] | None:
    for item in table:
        if all(item[k] == match_data[k] for k in keys):
            return item
    return None

def main():
    nkeys = int(sys.argv[-1])
    keys: list[str] = ["id"] + [f"key_{i}" for i in range(nkeys)]
    table: list[dict[str, str | int]] = []
    for i in range(1, 5001):
        d = {f"key_{j}": f"val_{j}" for j in range(nkeys)}
        d["id"] = i
        table.append(d)
    match_data = table[2999].copy()

    timeit_output = timeit.repeat(
        partial(find_by_keys, keys, table, match_data), repeat=10000, number=1
    )

    average_time = sum(timeit_output) / len(timeit_output)
    print(average_time)

if __name__ == "__main__":
    main()
```

mdboom commented 1 year ago

Some additional information -- here's the differences in the bytecode.

3.10 bytecode:

```
Disassembly of <code object <genexpr> at 0x7f019adad630, file "/home/mdboom/Work/builds/tmp-generator-regression/test.py", line 13>:
              0 GEN_START                0

 13           2 LOAD_FAST                0 (.0)
        >>    4 FOR_ITER                11 (to 28)

 14           6 STORE_FAST               1 (k)
              8 LOAD_DEREF               0 (item)
             10 LOAD_FAST                1 (k)
             12 BINARY_SUBSCR
             14 LOAD_DEREF               1 (match_data)
             16 LOAD_FAST                1 (k)
             18 BINARY_SUBSCR
             20 COMPARE_OP               2 (==)

 13          22 YIELD_VALUE
             24 POP_TOP
             26 JUMP_ABSOLUTE            2 (to 4)
        >>   28 LOAD_CONST               0 (None)
             30 RETURN_VALUE
```

3.11 bytecode:

```
Disassembly of <code object <genexpr> at 0x7fc1302f9530, file "/home/mdboom/Work/builds/tmp-generator-regression/test.py", line 19>:
              0 COPY_FREE_VARS           2

 19           2 RETURN_GENERATOR
              4 POP_TOP
              6 RESUME_QUICK             0
              8 LOAD_FAST                0 (.0)
        >>   10 FOR_ITER                22 (to 56)
             12 STORE_FAST               1 (k)
             14 LOAD_DEREF               2 (item)
             16 LOAD_FAST                1 (k)
             18 BINARY_SUBSCR_DICT
             28 LOAD_DEREF               3 (match_data)
             30 LOAD_FAST                1 (k)
             32 BINARY_SUBSCR_DICT
             42 COMPARE_OP               2 (==)
             48 YIELD_VALUE
             50 RESUME_QUICK             1
             52 POP_TOP
             54 JUMP_BACKWARD_QUICK     23 (to 10)
        >>   56 LOAD_CONST               0 (None)
             58 RETURN_VALUE
```

brandtbucher commented 1 year ago

I think @mdboom's example is still dominated by generator creation, since the first 2999 loops will only advance the generator once, and the 3000th will advance it n times, then return.

Perhaps put "id" last in the list of keys, so that all will walk the full list of keys every time?

mdboom commented 1 year ago

Thanks, @brandtbucher. Indeed I was not measuring what I thought I was measuring. By putting id at the end and getting more iterations, I get a graph that makes a lot more sense.

As the number of iterations in the generator goes up, it trends toward "at par" between 3.10 and 3.11. So I think it's fair to say that whatever regression exists is due to generator startup, not the time spent iterating over it.

[graph: average time per call vs. number of dictionary keys; 3.11 trends toward parity with 3.10 as the iteration count grows]

From the bytecode, it definitely is doing "more work" to start the generator, but I don't know if I could say what room there is for improvement.

Code to test with a variable number of keys:

```py
import timeit
from functools import partial
import sys

def find_by_keys(
    keys: list[str],
    table: list[dict[str, str | int]],
    match_data: dict[str, str | int],
) -> dict[str, str | int] | None:
    for item in table:
        if all(item[k] == match_data[k] for k in keys):
            break
    return None

def main():
    nkeys = int(sys.argv[-1])
    keys: list[str] = [f"key_{i}" for i in range(nkeys)] + ["id"]
    table: list[dict[str, str | int]] = []
    for i in range(1, 5001):
        d = {f"key_{j}": f"val_{j}" for j in range(nkeys)}
        d["id"] = str(i)
        table.append(d)
    match_data = table[2999].copy()

    timeit_output = timeit.repeat(
        partial(find_by_keys, keys, table, match_data), repeat=10000, number=1
    )

    average_time = sum(timeit_output) / len(timeit_output)
    print(average_time)

if __name__ == "__main__":
    main()
```

brandtbucher commented 1 year ago

This was bugging me, so I dug into it a bit more by running some nano-benchmarks to measure the performance of different parts of the generator life-cycle. Here are the times to...

Create a new generator:

3.9:  1.57 us +- 0.15 us
3.10: 1.46 us +- 0.15 us
3.11:  357 ns +-   14 ns
3.12:  368 ns +-   18 ns

Yield from a generator:

3.9:  65.3 ns +- 4.1 ns
3.10: 59.0 ns +- 3.4 ns
3.11: 36.2 ns +- 5.5 ns
3.12: 32.7 ns +- 7.0 ns

Return from a generator:

3.9:  79.0 ns +- 8.8 ns
3.10: 61.8 ns +- 4.4 ns
3.11: 35.3 ns +- 5.0 ns
3.12: 33.7 ns +- 5.6 ns

Destroy a completed generator:

3.9:  45.8 ns +- 6.8 ns
3.10: 39.8 ns +- 6.2 ns
3.11: 34.1 ns +- 4.8 ns
3.12: 34.8 ns +- 4.5 ns

Destroy a suspended generator:

3.9:  182 ns +-  7 ns
3.10: 149 ns +- 10 ns
3.11: 181 ns +- 11 ns
3.12: 264 ns +- 11 ns

Code here:

```py
import pyperf

LOOPS = 1 << 18

def g():
    yield

def bench_gen_create(loops: int) -> float:
    it = range(loops)
    start = pyperf.perf_counter()
    gens = [g() for _ in it]
    return pyperf.perf_counter() - start

def bench_gen_yield(loops: int) -> float:
    it = range(loops)
    gens = [g() for _ in it]
    start = pyperf.perf_counter()
    for gen in gens:
        for _ in gen:
            break
    return pyperf.perf_counter() - start

def bench_gen_return(loops: int) -> float:
    it = range(loops)
    gens = [g() for _ in it]
    for gen in gens:
        for _ in gen:
            break
    start = pyperf.perf_counter()
    for gen in gens:
        for _ in gen:
            break
    return pyperf.perf_counter() - start

def bench_gen_destroy_completed(loops: int) -> float:
    it = range(loops)
    gens = [g() for _ in it]
    for gen in gens:
        for _ in gen:
            break
    for gen in gens:
        for _ in gen:
            break
    start = pyperf.perf_counter()
    del gens
    return pyperf.perf_counter() - start

def bench_gen_destroy_suspended(loops: int) -> float:
    it = range(loops)
    gens = [g() for _ in it]
    for gen in gens:
        for _ in gen:
            break
    start = pyperf.perf_counter()
    del gens
    return pyperf.perf_counter() - start

if __name__ == "__main__":
    runner = pyperf.Runner(loops=LOOPS)
    runner.bench_time_func("gen_create", bench_gen_create)
    runner.bench_time_func("gen_yield", bench_gen_yield)
    runner.bench_time_func("gen_return", bench_gen_return)
    runner.bench_time_func("gen_destroy_completed", bench_gen_destroy_completed)
    runner.bench_time_func("gen_destroy_suspended", bench_gen_destroy_suspended)
```

The conclusion seems pretty clear: everything about generators has gotten significantly faster, except for finalizing suspended generators (or, in other words, gen.close()). This seems pretty consistent with the results in earlier comments.

While it's certainly a relief that generators in general haven't gotten slower, I still think the finalization issue deserves a closer look (especially since the situation seems much worse in 3.12).

Off the top of my head, I would guess it's due to our changes to exception handling (and/or frames) - the cost of raising exceptions has gotten higher in recent versions. Generators finalize themselves by throwing GeneratorExit into their suspended frame, which could be why we're seeing this slowdown.
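
For anyone unfamiliar with that mechanism, here is a minimal sketch of it in plain Python (the generator and names are illustrative):

```py
def gen():
    try:
        yield 1
        yield 2
    except GeneratorExit:
        # close() -- or garbage collection of a suspended generator --
        # throws GeneratorExit into the frame right here.
        raise  # re-raising lets close() complete normally

g = gen()
next(g)    # suspend the generator at the first yield
g.close()  # throws GeneratorExit into the suspended frame
```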

markshannon commented 1 year ago

@brandtbucher Thanks for the analysis. However, it can't be finalization that's the problem. You can't finalize an object more than once, and the combined creation and finalization time is less in 3.11 than in 3.10:

Times in ns:

| Version | Creation | Finalization | Total |
| ------- | -------- | ------------ | ----- |
| 3.10    | 1450     | 150          | 1600  |
| 3.11    | 360      | 180          | 540   |

So there must be something else going on here.

markshannon commented 1 year ago

In the above example, the code is creating generator expressions, not plain generators. This adds the extra overhead of creating a function object just to create the generator. Creating and deallocating a function hasn't changed much since 3.10, but we should probably check that as well.
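
The extra step is visible in the disassembly of a generator expression (a quick sketch; exact opcodes vary by version):

```py
import dis

# The enclosing code loads the <genexpr> code object, makes a function
# from it, and immediately calls that function to produce the generator.
dis.dis(compile("(x for x in data)", "<demo>", "eval"))
```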

brandtbucher commented 1 year ago

> So there must be something else going on here.

Looking back on the "create" benchmark, I think that the results are heavily skewed by the time taken to build a list of hundreds of thousands of new generators (benchmarking the creation, but not destruction, of generators is actually pretty tricky), as well as the time taken to look up the global name g.

Here are some new numbers that avoid these issues:

Create and destroy:

3.9:  168 ns +- 25 ns
3.10: 165 ns +- 25 ns
3.11: 217 ns +- 20 ns
3.12: 181 ns +- 25 ns

Create and destroy (expression):

3.9:  245 ns +- 29 ns
3.10: 257 ns +- 44 ns
3.11: 323 ns +- 26 ns
3.12: 299 ns +- 50 ns

Create, exhaust, and destroy:

3.9:  143 ns +- 17 ns
3.10: 140 ns +- 22 ns
3.11: 131 ns +- 33 ns
3.12: 119 ns +- 32 ns

Create, exhaust, and destroy (expression):

3.9:  188 ns +- 42 ns
3.10: 192 ns +- 44 ns
3.11: 209 ns +- 48 ns
3.12: 189 ns +- 36 ns

Code:

```py
import pyperf

def g():
    """Equivalent to: (_ for _ in ())"""
    for _ in ():
        yield _

def bench_gen_create_destroy(loops: int) -> float:
    it = range(loops)
    g_fast = g
    start = pyperf.perf_counter()
    for _ in it:
        g_fast()
    return pyperf.perf_counter() - start

def bench_gen_expr_create_destroy(loops: int) -> float:
    it = range(loops)
    start = pyperf.perf_counter()
    for _ in it:
        (_ for _ in ())
    return pyperf.perf_counter() - start

def bench_gen_create_exhaust_destroy(loops: int) -> float:
    it = range(loops)
    g_fast = g
    start = pyperf.perf_counter()
    for _ in it:
        for _ in g_fast():
            pass
    return pyperf.perf_counter() - start

def bench_gen_expr_create_exhaust_destroy(loops: int) -> float:
    it = range(loops)
    start = pyperf.perf_counter()
    for _ in it:
        for _ in (_ for _ in ()):
            pass
    return pyperf.perf_counter() - start
```

You can clearly see that the time taken to create and destroy an unexhausted generator got much worse in 3.11, in a way that isn't reflected in the time taken to create, exhaust, and destroy the same generator. Also, the results are consistent across both "normal" generators and generator expressions... so I think that generator expressions themselves are a red herring here, and the issue is indeed finalization.

(The 3.12 improvements are likely irrelevant, and due to our specialization of the benchmarks' for loops over range iterators.)

eendebakpt commented 1 year ago

Profiling with valgrind shows that the difference between (_ for _ in ()) and

for _ in (_ for _ in ()):
    pass

is in gen_close. A screenshot of the relevant part for the (_ for _ in ()) case:

[valgrind call graph: gen_close dominates the cost of the bare (_ for _ in ()) case]

If we could somehow avoid this expensive part in the common case, that would be great. While analyzing the code I noticed that in the check

    if (PyErr_ExceptionMatches(PyExc_StopIteration)
        || PyErr_ExceptionMatches(PyExc_GeneratorExit)) {
    ...

at the end of gen_close, PyErr_ExceptionMatches(PyExc_StopIteration) is almost never true (across the entire Python test suite it happens only a few times, in test_generators and test_asyncgen). Swapping the order of the two checks gives a small performance improvement.

Onlooker2186 commented 1 year ago

If this issue does get resolved via some patches to the main branch, is there any chance they can be applied to the publicly released Python 3.11 branch as well? Hoping not to wait at least another 12 months to upgrade from Python 3.10.

gvanrossum commented 1 year ago

> If this issue does get resolved via some patches to the main branch, is there any chance they can be applied to the publicly released Python 3.11 branch as well? Hoping not to wait at least another 12 months to upgrade from Python 3.10.

There's definitely a chance. It just depends on how complicated the fix is. We don't have a solution yet, so I can't comment further.

brandtbucher commented 1 year ago

The proposed fix in https://github.com/python/cpython/pull/101013 changes the bytecode, but only because it's necessary to handle other changes we made to how generators work in 3.12.

The basic idea (don't call close() if there aren't any active exception table entries) can probably be backported to 3.11 with minimal fuss.

brandtbucher commented 1 year ago

...that is, if we consider this a bad enough performance regression to call it a "bug".

I'm on the fence, myself, since this is just a consequence of all exceptions becoming more expensive to raise in Python code, not anything specific to generators. But, it may impact async codebases in a real way... I've not done enough async programming to know how common it is for apps to abandon "unfinished" coroutines, for instance.

gvanrossum commented 1 year ago

Abandoning an unfinished coroutine is likely a bug -- most coroutines are wrapped in tasks and we usually complain about abandoned tasks, so users are used to seeing abandonment as an anti-pattern.

Throwing exceptions into coroutines must definitely work (e.g. cancellation) but I don't think it's a major speed concern.

I'm curious if the OP (@Onlooker2186) found this while debugging a real perf regression (in production code)? Especially since the scope of the regression is much narrower than the original comment implied -- it's not that all() as a whole is 30% slower; it's slower in the specific case where it returns False after a small number of iterations. (Or, for any(), returning True.)

Onlooker2186 commented 1 year ago

@gvanrossum Yeah, it was production code for a websocket client implementation that keeps track of a large number of order books. The code in the opening post identifies key-value pairs after an "update" or "delete" event is received from the websocket server, and it is the main cause of the regression in the overall implementation's performance.

gvanrossum commented 1 year ago

Got it. That must have been frustrating.

If you want a backport, could you help?

Onlooker2186 commented 1 year ago

> Got it. That must have been frustrating.
>
> If you want a backport, could you help?

For a backport I probably won't be of much use, since I don't know any C, and building Python etc. assumes that knowledge. Purely just a user of Python so far :)

markshannon commented 1 year ago

I believe this is fixed. @Onlooker2186 can you confirm?

Onlooker2186 commented 1 year ago

> I believe this is fixed. @Onlooker2186 can you confirm?

Thanks for the update, @markshannon. Should I be using 3.11.3 to test this?

Testing the all() function, the performance regression still seems to be there on 3.11.3. I've also put the timings for 3.12.0a7 below.

Python version = 3.10.9
Average time = 0.0008305954898474738
Best time = 0.0007323999889194965
Average transactions per second = 1203.9554900347878
Python version = 3.11.3
Average time = 0.0011983903799438849
Best time = 0.0010432000271975994
Average transactions per second = 834.4526264027799
Python version = 3.12.0a7
Average time = 0.0012928771821549162
Best time = 0.0011587999761104584
Average transactions per second = 773.4686742117607

brandtbucher commented 1 year ago

This isn't fixed in 3.11... the PR (#101316) is failing for some reason.

It should be fixed in 3.12, though, since that's where the change landed.

brandtbucher commented 1 year ago

(I think it landed in a5, so the comparison to test would be between 3.12.0a4 and 3.12.0a5.)

But I can see from your comment that things appear to have gotten worse in 3.12. Are these all release builds with the same build options (PGO, etc.)?

brandtbucher commented 1 year ago

I can confirm that there still appears to be a regression:

Python version = 3.10.8+
Average time = 0.0010208233151759486
Best time = 0.0009102780022658408
Average transactions per second = 979.6014502545335

Python version = 3.11.1+
Average time = 0.0013533004376047758
Best time = 0.0012006000033579767
Average transactions per second = 738.934217571017

Python version = 3.12.0a7+
Average time = 0.0015747777904500254
Best time = 0.0013656059745699167
Average transactions per second = 635.0102256104522

markshannon commented 1 year ago

What are those times for?

brandtbucher commented 1 year ago

I just used the script from the very first comment on the issue.

markshannon commented 1 year ago

There are a few things going on here.

First of all, the workaround is to inline the generator code. This is much faster on all versions:

def find_by_keys2(
    keys: list[str],
    table: list[dict[str, str | int]],
    match_data: dict[str, str | int],
) -> dict[str, str | int] | None:
    for item in table:
        for k in keys:
            if item[k] != match_data[k]:
                break
        else:
            return item
    return None

The overhead of destroying the generator is much reduced in 3.12, but the cost of switching between C and Python code is increased in 3.11 compared to 3.10, and seems to be even worse in 3.12.

This will be much improved in the future, as we optimize larger regions and convert all, any, etc. into bytecode.
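
For completeness, another generator-free workaround in the same spirit (my sketch, not from the thread; find_by_keys3 and the use of operator.itemgetter are illustrative):

```py
from operator import itemgetter

def find_by_keys3(
    keys: list[str],
    table: list[dict[str, str | int]],
    match_data: dict[str, str | int],
) -> dict[str, str | int] | None:
    # itemgetter with several keys returns a tuple, so each row costs
    # one C-level call plus one tuple comparison; no Python-level
    # generator is created, suspended, or finalized.
    getter = itemgetter(*keys)
    target = getter(match_data)
    for item in table:
        if getter(item) == target:
            return item
    return None
```

Like the inlined loop above, this avoids creating a generator per row, so it sidesteps the finalization cost on every Python version.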