I can confirm this issue (comparing 3.10.8 to 3.12.0a3+). A shorter minimal example is:

```python
import timeit
import platform

setup = """
def go():
    item = [1, 2, 3]
    for ii in range(3000):
        all(item[k] == 2 for k in range(3))
"""
cmd = "go()"

timeit_output = timeit.repeat(cmd, setup=setup, repeat=500, number=10)
average_time = sum(timeit_output) / len(timeit_output)
best_time = min(timeit_output)
tps_output = 1 / average_time

print(f"Python version = {platform.python_version()}")
print(f"Average time = {average_time}")
print(f"Best time = {best_time}")
print(f"Average transactions per second = {tps_output}")
```
The expression inside the `all(...)` evaluates to a generator. Replacing it with `all([item[k] == 2 for k in range(3)])` also shows a performance regression, but a much smaller one.
Some more tests show:

```python
all(k for k in item)       # has no regression
all(k == 2 for k in item)  # has a regression
```
With `git bisect` I tried to identify the commit where the regression was introduced, but I could not pinpoint it (there seem to be multiple regressions, and not all commits build on my system).
What makes you blame `all()` instead of the generator and all it involves? Do you see a difference with `all((False,))`? That's likewise an iterable starting with `False`, and it eliminates almost everything else.
@eendebakpt What times did you get?
@pochmann I executed the following code:

```python
import pyperf

runner = pyperf.Runner()
for stmt in [
    'g=(k == 2 for k in item)',
    'g=(k == 2 for k in item); h=list(g)',
    'g=all(k for k in item)',
    'all((False,))',
    'g=all(k == 2 for k in item)',
]:
    runner.timeit(name=stmt, stmt=stmt, setup="item=[1,2,3]")
```
Results 3.10.9:

```
g=(k == 2 for k in item): Mean +- std dev: 321 ns +- 10 ns
g=(k == 2 for k in item); h=list(g): Mean +- std dev: 504 ns +- 7 ns
g=all(k for k in item): Mean +- std dev: 388 ns +- 9 ns
all((False,)): Mean +- std dev: 74.1 ns +- 2.3 ns
g=all(k == 2 for k in item): Mean +- std dev: 434 ns +- 8 ns
```

Results 3.12.0a3+:

```
g=(k == 2 for k in item): Mean +- std dev: 351 ns +- 20 ns
g=(k == 2 for k in item); h=list(g): Mean +- std dev: 442 ns +- 5 ns
g=all(k for k in item): Mean +- std dev: 325 ns +- 3 ns
all((False,)): Mean +- std dev: 45.5 ns +- 1.5 ns
g=all(k == 2 for k in item): Mean +- std dev: 567 ns +- 7 ns
```
I am not sure what to conclude from the numbers, but the last statement tested, `g=all(k == 2 for k in item)`, is the regression reported in this issue.
There doesn't seem to be any difference in the implementation of `all()` and `any()` between 3.10 and main. Maybe the generators are the issue?
Yes, indeed the signs point to generators (or maybe generator expressions specifically) more than to `all` or `any`.
EDIT: I wasn't measuring what I thought I was here. See below for corrected results.
I tried to see whether the new overhead is in the setup/startup of the generator or in the actual iteration (i.e., is the overhead constant per generator used, or does it scale with the number of items iterated over), and it seems it may be both.
I modified this script so that the number of keys in the dictionary is configurable (using only 3 here seems like a hard case if the regression is in generator creation), and then ran it for powers of 2 from 2 to 2048. It definitely gets better as the number of keys goes up, but it never gets on par with 3.10.
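As an illustration of that kind of parameterized run, here is a sketch of what such a benchmark could look like; this is my own guess at the shape of the modified script (the `make_setup` helper and the workload are assumptions, not the actual code used):

```python
import timeit

def make_setup(n_keys: int) -> str:
    # Build a find_by_keys-style workload with a configurable number
    # of keys; all values match, so the genexpr never short-circuits.
    return f"""
keys = [f"k{{i}}" for i in range({n_keys})]
item = {{k: 0 for k in keys}}
match_data = {{k: 0 for k in keys}}

def go():
    for _ in range(1000):
        all(item[k] == match_data[k] for k in keys)
"""

for n in (2, 8, 32, 128, 512, 2048):
    best = min(timeit.repeat("go()", setup=make_setup(n), repeat=20, number=10))
    print(f"{n:5d} keys: best of 20 = {best:.6f} s")
```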
Not sure what any of this means, but maybe some of the other Faster CPython folks may have insight.
Some additional information -- here are the differences in the bytecode (the old bytecode first, then the new):
```
Disassembly of <code object <genexpr> at 0x7f019adad630, file "/home/mdboom/Work/builds/tmp-generator-regression/test.py", line 13>:
0 GEN_START 0
13 2 LOAD_FAST 0 (.0)
>> 4 FOR_ITER 11 (to 28)
14 6 STORE_FAST 1 (k)
8 LOAD_DEREF 0 (item)
10 LOAD_FAST 1 (k)
12 BINARY_SUBSCR
14 LOAD_DEREF 1 (match_data)
16 LOAD_FAST 1 (k)
18 BINARY_SUBSCR
20 COMPARE_OP 2 (==)
13 22 YIELD_VALUE
24 POP_TOP
26 JUMP_ABSOLUTE 2 (to 4)
>> 28 LOAD_CONST 0 (None)
30 RETURN_VALUE
```
```
Disassembly of <code object <genexpr> at 0x7fc1302f9530, file "/home/mdboom/Work/builds/tmp-generator-regression/test.py", line 19>:
0 COPY_FREE_VARS 2
19 2 RETURN_GENERATOR
4 POP_TOP
6 RESUME_QUICK 0
8 LOAD_FAST 0 (.0)
10 FOR_ITER 22 (to 56)
12 STORE_FAST 1 (k)
14 LOAD_DEREF 2 (item)
16 LOAD_FAST 1 (k)
18 BINARY_SUBSCR_DICT
28 LOAD_DEREF 3 (match_data)
30 LOAD_FAST 1 (k)
32 BINARY_SUBSCR_DICT
42 COMPARE_OP 2 (==)
48 YIELD_VALUE
50 RESUME_QUICK 1
52 POP_TOP
54 JUMP_BACKWARD_QUICK 23 (to 10)
>> 56 LOAD_CONST 0 (None)
58 RETURN_VALUE
```
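For reference, dumps like these can be regenerated with the `dis` module; here is a small sketch (the function below is a guess at the shape of the test code, not the actual script):

```python
import dis

def find_by_keys(keys, item, match_data):
    # The generator expression closes over item and match_data,
    # which is why the dumps show COPY_FREE_VARS / LOAD_DEREF.
    return all(item[k] == match_data[k] for k in keys)

# dis recursively disassembles the nested <genexpr> code object.
dis.dis(find_by_keys)
```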
I think @mdboom's example is still dominated by generator creation, since the first 2999 loops will only advance the generator once, and the 3000th will advance it n times and then return. Perhaps put `"id"` last in the list of keys, so that `all` will walk the full list of keys every time?
Thanks, @brandtbucher. Indeed I was not measuring what I thought I was measuring. By putting `id` at the end and getting more iterations, I get a graph that makes a lot more sense.
As the number of iterations in the generator goes up, it trends toward "at par" between 3.10 and 3.11. So I think it's fair to say that whatever regression exists is due to generator startup, not the time spent iterating over it.
From the bytecode, it definitely is doing "more work" to start the generator, but I don't know if I could say what room there is for improvement.
This was bugging me, so I dug into it a bit more by running some nano-benchmarks to measure the performance of different parts of the generator life-cycle. Here are the times for each stage:

| Time to... | 3.9 | 3.10 | 3.11 | 3.12 |
|---|---|---|---|---|
| Create a new generator | 1.57 us +- 0.15 us | 1.46 us +- 0.15 us | 357 ns +- 14 ns | 368 ns +- 18 ns |
| Yield from a generator | 65.3 ns +- 4.1 ns | 59.0 ns +- 3.4 ns | 36.2 ns +- 5.5 ns | 32.7 ns +- 7.0 ns |
| Return from a generator | 79.0 ns +- 8.8 ns | 61.8 ns +- 4.4 ns | 35.3 ns +- 5.0 ns | 33.7 ns +- 5.6 ns |
| Destroy a completed generator | 45.8 ns +- 6.8 ns | 39.8 ns +- 6.2 ns | 34.1 ns +- 4.8 ns | 34.8 ns +- 4.5 ns |
| Destroy a suspended generator | 182 ns +- 7 ns | 149 ns +- 10 ns | 181 ns +- 11 ns | 264 ns +- 11 ns |
Code here:
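The linked code isn't shown above; as a rough sketch, life-cycle nano-benchmarks like these could be written with pyperf as follows (the `gen` helper and statement strings are my assumptions, not necessarily the code actually used):

```python
import pyperf

runner = pyperf.Runner()
SETUP = "def gen():\n    yield None"

# Destroy a suspended generator: started, then freed mid-iteration,
# which throws GeneratorExit into the frame (i.e. gen.close()).
runner.timeit(
    name="destroy a suspended generator",
    stmt="g = gen(); next(g); del g",
    setup=SETUP,
)

# Destroy a completed generator: the frame has already returned,
# so no exception needs to be thrown at finalization time.
runner.timeit(
    name="destroy a completed generator",
    stmt="g = gen()\nfor _ in g:\n    pass\ndel g",
    setup=SETUP,
)
```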
The conclusion seems pretty clear: everything about generators has gotten significantly faster, except for finalizing suspended generators (in other words, `gen.close()`). This seems pretty consistent with the results in earlier comments.

While it's certainly a relief that generators in general haven't gotten slower, I still think the finalization issue deserves a closer look (especially since the situation seems much worse in 3.12).

Off the top of my head, I would guess it's due to our changes to exception handling (and/or frames); the cost of raising exceptions has gotten higher in recent versions. Generators finalize themselves by throwing `GeneratorExit` into their suspended frame, which could be why we're seeing this slowdown.
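That mechanism is easy to observe directly; a minimal demonstration of `close()` throwing `GeneratorExit` into a suspended frame:

```python
def gen():
    try:
        yield 1
    except GeneratorExit:
        # close() (or deallocation of a suspended generator) arrives
        # here as an exception thrown into the frame.
        print("GeneratorExit was thrown into the suspended frame")
        raise

g = gen()
next(g)    # suspend at the yield
g.close()  # prints the message above
```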
@brandtbucher Thanks for the analysis. However, it can't be finalization that's the problem: you can't finalize an object more than once, and the combined creation and finalization time is lower in 3.11 than in 3.10.
| Times in ns | Creation | Finalization | Total |
|---|---|---|---|
| 3.10 | 1450 | 150 | 1600 |
| 3.11 | 360 | 180 | 540 |
So there must be something else going on here.
In the above example, the code is creating generator expressions, not plain generators. This adds the extra overhead of creating a function, only to create the generator. Creating and deallocating a function hasn't changed much since 3.10, but we should probably check that as well.
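To make that extra step visible, one can disassemble the creation site of a generator expression (a small illustration; the exact opcodes vary by version):

```python
import dis

# The compiled expression first builds a function object from the
# <genexpr> code object, then calls it to obtain the generator:
dis.dis("(x for x in item)")
# Roughly: LOAD_CONST <code object <genexpr>>, MAKE_FUNCTION,
# LOAD_NAME item, GET_ITER, CALL
```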
> So there must be something else going on here.
Looking back on the "create" benchmark, I think the results are heavily skewed by the time taken to build a list of hundreds of thousands of new generators (benchmarking the creation, but not the destruction, of generators is actually pretty tricky), as well as by the time taken to look up the global name `g`.
Here are some new numbers that avoid these issues:
| Time to... | 3.9 | 3.10 | 3.11 | 3.12 |
|---|---|---|---|---|
| Create and destroy | 168 ns +- 25 ns | 165 ns +- 25 ns | 217 ns +- 20 ns | 181 ns +- 25 ns |
| Create and destroy (expression) | 245 ns +- 29 ns | 257 ns +- 44 ns | 323 ns +- 26 ns | 299 ns +- 50 ns |
| Create, exhaust, and destroy | 143 ns +- 17 ns | 140 ns +- 22 ns | 131 ns +- 33 ns | 119 ns +- 32 ns |
| Create, exhaust, and destroy (expression) | 188 ns +- 42 ns | 192 ns +- 44 ns | 209 ns +- 48 ns | 189 ns +- 36 ns |
Code:
You can clearly see that the time taken to create and destroy an unexhausted generator got much worse in 3.11, in a way that isn't reflected in the time taken to create, exhaust, and destroy the same generator. Also, the results are consistent across both "normal" generators and generator expressions... so I think that generator expressions themselves are a red herring here, and the issue is indeed finalization.
(The 3.12 improvements are likely irrelevant, and due to our specialization of the benchmarks' `for` loops over `range` iterators.)
Profiling with valgrind shows that the difference between `(_ for _ in ())` and `for _ in (_ for _ in ()): pass` is in `gen_close`. A screenshot of the relevant part for the `(_ for _ in ())` case:
If we could somehow avoid this expensive part for the common case, that would be great. While analyzing the code I noticed that in the check

```c
if (PyErr_ExceptionMatches(PyExc_StopIteration)
    || PyErr_ExceptionMatches(PyExc_GeneratorExit)) {
    ...
```

at the end of `gen_close`, we almost never have `PyErr_ExceptionMatches(PyExc_StopIteration) == 1` (in the entire Python test suite this happens only a few times, in `test_generators` and `test_asyncgen`). Changing the order of the two checks gives a small performance improvement.
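The comparison being profiled can be reproduced without valgrind by timing the two forms directly (a rough sketch):

```python
import timeit

# Genexpr created and dropped while never started: destruction goes
# through gen_close, which the profile above shows to be the hot spot.
t_unstarted = timeit.timeit("g = (_ for _ in ()); del g", number=10**6)

# Same genexpr, but exhausted before destruction, so finalization
# finds an already-completed frame.
t_exhausted = timeit.timeit(
    "g = (_ for _ in ())\nfor _ in g:\n    pass\ndel g", number=10**6
)
print(f"unstarted: {t_unstarted:.3f} s, exhausted: {t_exhausted:.3f} s")
```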
If this issue does get resolved via some patches to the main branch, is there any chance they can be applied to the publicly released Python 3.11 branch as well? Hoping not to wait a minimum of another 12 months to upgrade from Python 3.10.
> If this issue does get resolved via some patches to the main branch, is there any chance they can be applied to the publicly released Python 3.11 branch as well? Hoping not to wait a minimum of another 12 months to upgrade from Python 3.10.
There's definitely a chance. It just depends on how complicated the fix is. We don't have a solution yet, so I can't comment further.
The proposed fix in https://github.com/python/cpython/pull/101013 changes the bytecode, but only because it's necessary to handle other changes we made to how generators work in 3.12.
The basic idea (don't call `close()` if there aren't any active exception table entries) can probably be backported to 3.11 with minimal fuss.
...that is, if we consider this a bad enough performance regression to call it a "bug".
I'm on the fence, myself, since this is just a consequence of all exceptions becoming more expensive to raise in Python code, not anything specific to generators. But, it may impact async codebases in a real way... I've not done enough async programming to know how common it is for apps to abandon "unfinished" coroutines, for instance.
Abandoning an unfinished coroutine is likely a bug -- most coroutines are wrapped in tasks and we usually complain about abandoned tasks, so users are used to seeing abandonment as an anti-pattern.
Throwing exceptions into coroutines must definitely work (e.g. cancellation) but I don't think it's a major speed concern.
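For illustration, cancellation delivers an exception into the suspended coroutine frame much like `gen.close()` throws `GeneratorExit`; a small self-contained sketch:

```python
import asyncio

async def worker():
    try:
        await asyncio.sleep(10)
    except asyncio.CancelledError:
        # Cancellation arrives as an exception thrown into the
        # suspended coroutine frame, analogous to GeneratorExit.
        raise

async def main():
    task = asyncio.create_task(worker())
    await asyncio.sleep(0)  # let worker() start and suspend
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        print("task cancelled")

asyncio.run(main())
```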
I'm curious whether the OP (@Onlooker2186) found this while debugging a real perf regression (in production code)? Especially since the scope of the regression is much narrower than the original comment implied: it's not that `all()` as a whole is 30% slower, it's slower in the specific case where it returns `False` after a small number of iterations (or, for `any()`, returns `True`).
@gvanrossum Yeah, it was production code for a websocket client implementation that keeps track of a large number of order books. The code in the opening post is used to identify key-value pairs after an "update" or "delete" event is received from the websocket server, and it is the main cause of the regression in the overall implementation's total performance.
Got it. That must have been frustrating.
If you want a backport, could you help?
> Got it. That must have been frustrating.
>
> If you want a backport, could you help?
For a backport, I probably won't be of much use since I don't know any C, and building Python etc. assumes that knowledge. Purely just a user of Python so far :)
I believe this is fixed. @Onlooker2186 can you confirm?
> I believe this is fixed. @Onlooker2186 can you confirm?
Thanks for the update, @markshannon. Should I be using 3.11.3 to test this?

Testing the `all()` function, the performance regression still seems to be there on 3.11.3. I've also put the timings for 3.12.0a7 below.
```
Python version = 3.10.9
Average time = 0.0008305954898474738
Best time = 0.0007323999889194965
Average transactions per second = 1203.9554900347878

Python version = 3.11.3
Average time = 0.0011983903799438849
Best time = 0.0010432000271975994
Average transactions per second = 834.4526264027799

Python version = 3.12.0a7
Average time = 0.0012928771821549162
Best time = 0.0011587999761104584
Average transactions per second = 773.4686742117607
```
This isn't fixed in 3.11... the PR (#101316) is failing for some reason.
It should be fixed in 3.12, though, since that's where the change landed.
(I think it landed in a5, so the comparison to test would be between 3.12.0a4 and 3.12.0a5.)
But I can see from your comment that things appear to have gotten worse in 3.12. Are these all release builds with the same build options (PGO, etc.)?
I can confirm that there still appears to be a regression:
```
Python version = 3.10.8+
Average time = 0.0010208233151759486
Best time = 0.0009102780022658408
Average transactions per second = 979.6014502545335

Python version = 3.11.1+
Average time = 0.0013533004376047758
Best time = 0.0012006000033579767
Average transactions per second = 738.934217571017

Python version = 3.12.0a7+
Average time = 0.0015747777904500254
Best time = 0.0013656059745699167
Average transactions per second = 635.0102256104522
```
What are those times for?
I just used the script from the very first comment on the issue.
There are a few things going on here.
First of all, the workaround is to inline the generator code. This is much faster on all versions:
```python
def find_by_keys2(
    keys: list[str],
    table: list[dict[str, str | int]],
    match_data: dict[str, str | int],
) -> dict[str, str | int] | None:
    for item in table:
        for k in keys:
            if item[k] != match_data[k]:
                break
        else:
            return item
    return None
```
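For example, with some made-up order-book rows (hypothetical data, just to show the call shape):

```python
table = [
    {"id": 1, "price": 100, "side": "buy"},
    {"id": 2, "price": 101, "side": "sell"},
]
# Returns the first row whose values match on every listed key.
print(find_by_keys2(["id", "side"], table, {"id": 2, "side": "sell"}))
# -> {'id': 2, 'price': 101, 'side': 'sell'}
```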
The overhead of destroying the generator is much reduced in 3.12, but the cost of switching between C and Python code is increased in 3.11 compared to 3.10, and seems to be even worse in 3.12.
This will be much improved in the future, as we optimize larger regions and convert `all`, `any`, etc. into bytecode.
I found that the Python 3.11.1 implementation of `all()` is 30% slower compared to Python 3.10.9. `any()` also seems to be around 4% slower on my device.