I suspect it could be related to the work done on PEP 701, which landed in 3.12.
Yeah, we are now using a completely different backend (the actual parser), so we need to investigate a bit what's going on here.
A data point that might help: when calling list(tokenize.generate_tokens(readline)) (or tuple()), memory blows up very fast with a large single-line dict, while the same dict broken into multiple lines doesn't cause memory issues.
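A minimal sketch of what I mean (the dict here is synthetic and much smaller than largedict.py, and the key names are made up, but the single-line vs. multi-line asymmetry shows up on affected versions):

import io
import tokenize
import tracemalloc

# Synthetic stand-in for largedict.py: the same items once on a single line
# and once spread across many lines.
items = [f"'key_{i}': {i}" for i in range(2_000)]
one_line = "d = {" + ", ".join(items) + "}\n"
many_lines = "d = {\n" + ",\n".join(items) + "\n}\n"

for label, source in (("single line", one_line), ("multi line", many_lines)):
    tracemalloc.start()
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{label}: {len(tokens)} tokens, peak ~{peak / 1e6:.1f} MB")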
Oh wow that's very intriguing. I may be able to investigate this next week after PyCon but thanks for the hint!
Ok, I think I have a guess (I need time to test my theory), but I am 99% sure I know the problem. The problem is this line:
We are creating unique strings per token with the entire line. If the dict spans the entire line, this is a very expensive operation that we are doing over and over and over with the same result. The fix is to keep reusing the same result until there is a new line.
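To illustrate why that blows up (a pure-Python analogy, not the actual C code; decoding stands in for creating the line string, and the sizes are arbitrary): building a fresh string of the whole line for every token is O(tokens × line length), whereas reusing a cached one is linear.

import time

# A big single-line dict, roughly like the one in the report (much smaller here).
line = "{" + ", ".join(f"'k{i}': {i}" for i in range(20_000)) + "}"
line_bytes = line.encode("utf-8")
n_tokens = 50_000  # rough order of magnitude of tokens such a dict produces

# Stand-in for what the tokenizer was doing: create a new str of the
# whole line for every single token.
start = time.perf_counter()
for _ in range(n_tokens):
    s = line_bytes.decode("utf-8")
print(f"new line string per token: {time.perf_counter() - start:.2f}s")

# The proposed fix: create the string once and reuse it until a new line starts.
start = time.perf_counter()
cached = line_bytes.decode("utf-8")
for _ in range(n_tokens):
    s = cached
print(f"cached line string:        {time.perf_counter() - start:.2f}s")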
Confirming that, Pablo; if I cache the line[^1] on the tokenizer iter object instance, I no longer see the performance penalty. Forget that, it's not that simple.
[^1]: the PyObject *line object
Well, yeah, it's not that simple. Doing some scrappy caching seems to have brought the performance penalty down, but it's still much worse than 3.11.
cpython on main +16 -1 [!] via C v15.0.0-clang via 🐍 v3.11.3
❯ ./python.exe tmp/t.py
cpython darwin 3.14.0a0 (heads/main-dirty:65de194dd80, May 17 2024, 15:17:08) [Clang 15.0.0 (clang-1500.3.9.4)]
0 1 a_large_dict_literal 1 0 0.07759904861450195
0.07767128944396973
500 3 ':tombol_a_(golongan_darah):' 1 2675 0.08086895942687988
1000 3 ':flagge_anguilla:' 1 5261 0.10818815231323242
1500 3 ':флаг_Армения:' 1 7879 0.1338660717010498
2000 3 ':grupo_sanguíneo_b:' 1 10470 0.1593310832977295
2500 3 ':bayrak_belarus:' 1 13074 0.18597793579101562
3000 3 ':旗_ボスニア・ヘルツェゴビナ:' 1 15514 0.2100069522857666
3500 3 ':Bulgaria:' 1 18735 0.23893284797668457
4000 3 ':bandiera_camerun:' 1 21240 0.2700800895690918
4500 2 2 1 23919 0.2944760322570801
5000 3 ':bendera_pulau_natal:' 1 26767 0.32537102699279785
2.0853629112243652
5500 3 'es' 1 29643 0.3519558906555176
6000 3 'ar' 1 32240 0.38205909729003906
6500 3 'ko' 1 34917 0.411639928817749
7000 2 2 1 37580 0.45705175399780273
7500 3 'fa' 1 40214 0.4636220932006836
8000 55 , 1 43013 0.49083518981933594
8500 3 ':加纳:' 1 45615 0.5203256607055664
9000 3 ':bandera_guatemala:' 1 48073 0.5623579025268555
...
So we get about 0.1-0.5 seconds per 500 tokens, instead of the original ~9 seconds, but it's still very far from the 3.11 numbers.
Oh, found it. It's not just that line, but also the calls to _PyPegen_byte_offset_to_character_offset made a bit further down, which eventually call PyUnicode_* APIs as well. After additionally caching those, I get:
cpython darwin 3.14.0a0 (heads/main-dirty:65de194dd80, May 17 2024, 16:37:58) [Clang 15.0.0 (clang-1500.3.9.4)]
0 1 a_large_dict_literal 1 0 0.08269691467285156
0.08281183242797852
500 3 ':tombol_a_(golongan_darah):' 1 0 0.004801034927368164
1000 3 ':flagge_anguilla:' 1 0 0.003465890884399414
1500 3 ':флаг_Армения:' 1 0 0.004238128662109375
2000 3 ':grupo_sanguíneo_b:' 1 0 0.0038290023803710938
2500 3 ':bayrak_belarus:' 1 0 0.005483865737915039
3000 3 ':旗_ボスニア・ヘルツェゴビナ:' 1 0 0.005335807800292969
3500 3 ':Bulgaria:' 1 0 0.0034456253051757812
4000 3 ':bandiera_camerun:' 1 0 0.0051038265228271484
4500 2 2 1 0 0.003175020217895508
5000 3 ':bendera_pulau_natal:' 1 0 0.0032072067260742188
0.1256239414215088
5500 3 'es' 1 0 0.003175973892211914
[...]
351500 3 'es' 1 0 0.0032269954681396484
352000 3 'ko' 1 0 0.003362894058227539
352500 3 'pt' 1 0 0.003036022186279297
Time taken: 2.6852049827575684
I'll push a PR shortly.
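For context, here is a rough pure-Python analogy of why those offset conversions hurt on one huge line (this is an illustration, not what _PyPegen_byte_offset_to_character_offset does line-for-line, and the caching scheme below is only a sketch of the idea): converting a byte offset to a character offset by decoding the line prefix is O(line length) per token, so quadratic overall, while remembering the previous token's offsets keeps it incremental.

import time

# A long single line containing non-ASCII text, so byte offsets and
# character offsets actually differ and a real conversion is needed.
line = "{" + ", ".join(f"'ключ_{i}': {i}" for i in range(3_000)) + "}"
line_bytes = line.encode("utf-8")
# Use comma positions as stand-ins for token start offsets; commas are
# ASCII, so these always fall on valid UTF-8 boundaries.
byte_offsets = [i for i, b in enumerate(line_bytes) if b == ord(",")]

# Quadratic: decode the whole prefix of the line for every token.
start = time.perf_counter()
for off in byte_offsets:
    col = len(line_bytes[:off].decode("utf-8"))
print(f"decode prefix per token: {time.perf_counter() - start:.3f}s")

# Incremental: only decode the bytes since the previous token's offset.
start = time.perf_counter()
prev_byte = prev_col = 0
for off in byte_offsets:
    prev_col += len(line_bytes[prev_byte:off].decode("utf-8"))
    prev_byte = off
print(f"cached previous offset:  {time.perf_counter() - start:.3f}s")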
@lysnikolaou any news on this? We have other people reporting a problem like this in 3.12.
Thanks for the reminder @nedbat! Turns out that this was even trickier than I thought and the fix I showed in https://github.com/python/cpython/issues/119118#issuecomment-2118355032 was buggy. My patch is almost there, but I didn't get the time to fix some final details. I'll have a PR up by early next week.
Tell me if you want me to help, or I can pick it up if you don't have time, @lysnikolaou.
Bug report
Bug description:
There seems to be a significant performance regression in tokenize.generate_tokens() between 3.11 and 3.12 when tokenizing a (very) large dict on a single line. I searched the existing issues but couldn't find anything about this.
To reproduce, rename the file largedict.py.txt to largedict.py in the same directory as the script below, then run the script. That file comes from https://github.com/nedbat/coveragepy/issues/1785.
For Python 3.12, this results in:
For Python 3.11, this results in:
That is, each 500 tokens in Python 3.12 takes over 9 seconds to process, while all 352500 tokens in Python 3.11 take a bit over 2 seconds in total.
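The original timing script is not reproduced here; a hypothetical sketch along the same lines (tokenize largedict.py and report progress every 500 tokens) could look roughly like this:

import time
import tokenize

# Hypothetical reconstruction of the timing script, not the exact one used:
# tokenize largedict.py and report how long each chunk of 500 tokens takes.
start = chunk_start = time.time()
with open("largedict.py") as f:
    for i, tok in enumerate(tokenize.generate_tokens(f.readline)):
        if i % 500 == 0:
            now = time.time()
            print(i, tok.type, tok.string[:40], tok.start[0], tok.start[1],
                  now - chunk_start)
            chunk_start = now
print("Time taken:", time.time() - start)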
I can reproduce this on Linux (WSL) and Windows. Also seems to affect 3.13.
CPython versions tested on:
3.9, 3.10, 3.11, 3.12
Operating systems tested on:
Linux, Windows
Linked PRs