I suspect it could be related to the work done on PEP 701, which landed in 3.12.
Yeah, we are now using a completely different backend (the actual parser), so we need to investigate a bit what's going on here.
A data point that might help: when calling list(tokenize.generate_tokens(readline)) (or tuple()), memory blows up very fast with a large single-line dict, while the same dict broken into multiple lines doesn't cause memory issues.
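A minimal sketch of what I mean (the dict here is synthetic and much smaller than largedict.py, and the key names are made up, but the single-line vs. multi-line asymmetry shows up on affected versions):

import io
import tokenize
import tracemalloc

# Synthetic stand-in for largedict.py: the same items once on a single line
# and once spread across many lines.
items = [f"'key_{i}': {i}" for i in range(2_000)]
one_line = "d = {" + ", ".join(items) + "}\n"
many_lines = "d = {\n" + ",\n".join(items) + "\n}\n"

for label, source in (("single line", one_line), ("multi line", many_lines)):
    tracemalloc.start()
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{label}: {len(tokens)} tokens, peak ~{peak / 1e6:.1f} MB")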
Oh wow that's very intriguing. I may be able to investigate this next week after PyCon but thanks for the hint!
Ok, I think I have a guess (I need time to test my theory), but I am 99% sure I know the problem. The problem is this line:
We are creating unique strings per token with the entire line. If the dict spans the entire line, this is a very expensive operation that we are doing over and over and over with the same result. The fix is to keep reusing the same result until there is a new line.
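To illustrate why that blows up (a pure-Python analogy, not the actual C code; decoding stands in for creating the line string, and the sizes are arbitrary): building a fresh string of the whole line for every token is O(tokens × line length), whereas reusing a cached one is linear.

import time

# A big single-line dict, roughly like the one in the report (much smaller here).
line = "{" + ", ".join(f"'k{i}': {i}" for i in range(20_000)) + "}"
line_bytes = line.encode("utf-8")
n_tokens = 50_000  # rough order of magnitude of tokens such a dict produces

# Stand-in for what the tokenizer was doing: create a new str of the
# whole line for every single token.
start = time.perf_counter()
for _ in range(n_tokens):
    s = line_bytes.decode("utf-8")
print(f"new line string per token: {time.perf_counter() - start:.2f}s")

# The proposed fix: create the string once and reuse it until a new line starts.
start = time.perf_counter()
cached = line_bytes.decode("utf-8")
for _ in range(n_tokens):
    s = cached
print(f"cached line string:        {time.perf_counter() - start:.2f}s")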
Confirming that, Pablo; if I cache the line[^1] on the tokenizer iter object instance, I no longer see the performance penalty. Forget that, it's not that simple.
[^1]: the PyObject *line object
Well, yeah, it's not that simple. Doing some scrappy caching seems to have brought the performance penalty down, but it's still much worse than 3.11.
cpython on main +16 -1 [!] via C v15.0.0-clang via 🐍 v3.11.3
❯ ./python.exe tmp/t.py
cpython darwin 3.14.0a0 (heads/main-dirty:65de194dd80, May 17 2024, 15:17:08) [Clang 15.0.0 (clang-1500.3.9.4)]
0 1 a_large_dict_literal 1 0 0.07759904861450195
0.07767128944396973
500 3 ':tombol_a_(golongan_darah):' 1 2675 0.08086895942687988
1000 3 ':flagge_anguilla:' 1 5261 0.10818815231323242
1500 3 ':флаг_Армения:' 1 7879 0.1338660717010498
2000 3 ':grupo_sanguíneo_b:' 1 10470 0.1593310832977295
2500 3 ':bayrak_belarus:' 1 13074 0.18597793579101562
3000 3 ':旗_ボスニア・ヘルツェゴビナ:' 1 15514 0.2100069522857666
3500 3 ':Bulgaria:' 1 18735 0.23893284797668457
4000 3 ':bandiera_camerun:' 1 21240 0.2700800895690918
4500 2 2 1 23919 0.2944760322570801
5000 3 ':bendera_pulau_natal:' 1 26767 0.32537102699279785
2.0853629112243652
5500 3 'es' 1 29643 0.3519558906555176
6000 3 'ar' 1 32240 0.38205909729003906
6500 3 'ko' 1 34917 0.411639928817749
7000 2 2 1 37580 0.45705175399780273
7500 3 'fa' 1 40214 0.4636220932006836
8000 55 , 1 43013 0.49083518981933594
8500 3 ':加纳:' 1 45615 0.5203256607055664
9000 3 ':bandera_guatemala:' 1 48073 0.5623579025268555
...
So we get about 0.1-0.5 seconds per 500 tokens, instead of the original ~9 seconds, but it's still very far from the 3.11 numbers.
Oh, found it. It's not just that line, but also the calls to _PyPegen_byte_offset_to_character_offset made a bit further down, which eventually call PyUnicode_* APIs as well. After additionally caching those, I get:
cpython darwin 3.14.0a0 (heads/main-dirty:65de194dd80, May 17 2024, 16:37:58) [Clang 15.0.0 (clang-1500.3.9.4)]
0 1 a_large_dict_literal 1 0 0.08269691467285156
0.08281183242797852
500 3 ':tombol_a_(golongan_darah):' 1 0 0.004801034927368164
1000 3 ':flagge_anguilla:' 1 0 0.003465890884399414
1500 3 ':флаг_Армения:' 1 0 0.004238128662109375
2000 3 ':grupo_sanguíneo_b:' 1 0 0.0038290023803710938
2500 3 ':bayrak_belarus:' 1 0 0.005483865737915039
3000 3 ':旗_ボスニア・ヘルツェゴビナ:' 1 0 0.005335807800292969
3500 3 ':Bulgaria:' 1 0 0.0034456253051757812
4000 3 ':bandiera_camerun:' 1 0 0.0051038265228271484
4500 2 2 1 0 0.003175020217895508
5000 3 ':bendera_pulau_natal:' 1 0 0.0032072067260742188
0.1256239414215088
5500 3 'es' 1 0 0.003175973892211914
[...]
351500 3 'es' 1 0 0.0032269954681396484
352000 3 'ko' 1 0 0.003362894058227539
352500 3 'pt' 1 0 0.003036022186279297
Time taken: 2.6852049827575684
I'll push a PR shortly.
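For context, here is a rough pure-Python analogy of why those offset conversions hurt on one huge line (this is an illustration, not what _PyPegen_byte_offset_to_character_offset does line-for-line, and the caching scheme below is only a sketch of the idea): converting a byte offset to a character offset by decoding the line prefix is O(line length) per token, so quadratic overall, while remembering the previous token's offsets keeps it incremental.

import time

# A long single line containing non-ASCII text, so byte offsets and
# character offsets actually differ and a real conversion is needed.
line = "{" + ", ".join(f"'ключ_{i}': {i}" for i in range(3_000)) + "}"
line_bytes = line.encode("utf-8")
# Use comma positions as stand-ins for token start offsets; commas are
# ASCII, so these always fall on valid UTF-8 boundaries.
byte_offsets = [i for i, b in enumerate(line_bytes) if b == ord(",")]

# Quadratic: decode the whole prefix of the line for every token.
start = time.perf_counter()
for off in byte_offsets:
    col = len(line_bytes[:off].decode("utf-8"))
print(f"decode prefix per token: {time.perf_counter() - start:.3f}s")

# Incremental: only decode the bytes since the previous token's offset.
start = time.perf_counter()
prev_byte = prev_col = 0
for off in byte_offsets:
    prev_col += len(line_bytes[prev_byte:off].decode("utf-8"))
    prev_byte = off
print(f"cached previous offset:  {time.perf_counter() - start:.3f}s")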
@lysnikolaou any news on this? We have other people reporting a problem like this in 3.12.
Thanks for the reminder @nedbat! Turns out that this was even trickier than I thought and the fix I showed in https://github.com/python/cpython/issues/119118#issuecomment-2118355032 was buggy. My patch is almost there, but I didn't get the time to fix some final details. I'll have a PR up by early next week.
Tell me if you want me to help, or I can pick it up if you don't have time, @lysnikolaou.
Bug report
Bug description:
There seems to be a significant performance regression in tokenize.generate_tokens() between 3.11 and 3.12 when tokenizing a (very) large dict on a single line. I searched the existing issues but couldn't find anything about this.
To reproduce, rename the file largedict.py.txt to largedict.py in the same directory as the script below, then run the script. That file comes from https://github.com/nedbat/coveragepy/issues/1785.
For Python 3.12, this results in:
For Python 3.11, this results in:
That is, each 500 tokens in Python 3.12 takes over 9 seconds to process, while all 352500 tokens in Python 3.11 take a bit over 2 seconds in total.
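The original timing script is not reproduced here; a hypothetical sketch along the same lines (tokenize largedict.py and report progress every 500 tokens) could look roughly like this:

import time
import tokenize

# Hypothetical reconstruction of the timing script, not the exact one used:
# tokenize largedict.py and report how long each chunk of 500 tokens takes.
start = chunk_start = time.time()
with open("largedict.py") as f:
    for i, tok in enumerate(tokenize.generate_tokens(f.readline)):
        if i % 500 == 0:
            now = time.time()
            print(i, tok.type, tok.string[:40], tok.start[0], tok.start[1],
                  now - chunk_start)
            chunk_start = now
print("Time taken:", time.time() - start)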
I can reproduce this on Linux (WSL) and Windows. Also seems to affect 3.13.
CPython versions tested on:
3.9, 3.10, 3.11, 3.12
Operating systems tested on:
Linux, Windows
Linked PRs