Open TheTesla opened 3 weeks ago
cc @DrTodd13
I tried it on an AMD EPYC now:
$ python3 pardictimpl_slow.py
/home/ubuntu/py-par-dict/pardictimpl_slow.py:137: NumbaTypeSafetyWarning: unsafe cast from uint64 to int64. Precision may be lost.
tmp += par_dict_getitem(pdict, i)
- threads: 1 - time: 11.008 cputime: 11.01
- threads: 1 - time: 2.369 cputime: 2.37
- threads: 2 - time: 6.454 cputime: 12.91
- threads: 3 - time: 6.427 cputime: 19.28
- threads: 4 - time: 6.578 cputime: 26.31
- threads: 5 - time: 4.277 cputime: 21.38
- threads: 6 - time: 3.504 cputime: 21.02
- threads: 7 - time: 2.902 cputime: 20.32
- threads: 8 - time: 2.401 cputime: 19.21
- threads: 9 - time: 2.504 cputime: 22.53
- threads: 10 - time: 2.159 cputime: 21.59
- threads: 11 - time: 2.055 cputime: 22.61
- threads: 12 - time: 1.898 cputime: 22.77
- threads: 13 - time: 1.716 cputime: 22.31
- threads: 14 - time: 1.697 cputime: 23.75
- threads: 15 - time: 1.609 cputime: 24.14
- threads: 16 - time: 1.603 cputime: 25.64
(.venv) ubuntu@numbatester:~/py-par-dict$ python3 pardictimpl_fast.py
/home/ubuntu/py-par-dict/pardictimpl_fast.py:137: NumbaTypeSafetyWarning: unsafe cast from uint64 to int64. Precision may be lost.
tmp += par_dict_getitem(pdict, i)
- threads: 1 - time: 9.165 cputime: 9.17
- threads: 1 - time: 1.966 cputime: 1.97
- threads: 2 - time: 3.651 cputime: 7.30
- threads: 3 - time: 2.768 cputime: 8.30
- threads: 4 - time: 2.641 cputime: 10.57
- threads: 5 - time: 1.896 cputime: 9.48
- threads: 6 - time: 1.455 cputime: 8.73
- threads: 7 - time: 1.275 cputime: 8.92
- threads: 8 - time: 1.203 cputime: 9.63
- threads: 9 - time: 1.161 cputime: 10.45
- threads: 10 - time: 1.126 cputime: 11.26
- threads: 11 - time: 0.943 cputime: 10.37
- threads: 12 - time: 0.923 cputime: 11.08
- threads: 13 - time: 0.799 cputime: 10.39
- threads: 14 - time: 0.833 cputime: 11.66
- threads: 15 - time: 0.745 cputime: 11.18
- threads: 16 - time: 0.880 cputime: 14.08
__Time Stamp__
Report started (local time) : 2024-06-08 10:04:18.406650
UTC start time : 2024-06-08 10:04:18.406662
Running time (s) : 0.395069
__Hardware Information__
Machine : x86_64
CPU Name : znver2
CPU Count : 16
Number of accessible CPUs : 16
List of accessible CPUs cores : 0-15
CFS Restrictions (CPUs worth of runtime) : None
CPU Features : 64bit adx aes avx avx2 bmi bmi2
clflushopt clwb clzero cmov crc32
cx16 cx8 f16c fma fsgsbase fxsr
lzcnt mmx movbe mwaitx pclmul
popcnt prfchw rdpid rdrnd rdseed
sahf sha sse sse2 sse3 sse4.1
sse4.2 sse4a ssse3 wbnoinvd xsave
xsavec xsaveopt xsaves
Memory Total (MB) : 64167
Memory Available (MB) : 62516
__OS Information__
Platform Name : Linux-6.8.0-31-generic-x86_64-with-glibc2.39
Platform Release : 6.8.0-31-generic
OS Name : Linux
OS Version : #31-Ubuntu SMP PREEMPT_DYNAMIC Sat Apr 20 00:40:06 UTC 2024
OS Specific Version : ?
Libc Version : glibc 2.39
__Python Information__
Python Compiler : GCC 13.2.0
Python Implementation : CPython
Python Version : 3.12.3
Python Locale : C.UTF-8
__Numba Toolchain Versions__
Numba Version : 0.60.0rc1
llvmlite Version : 0.43.0rc1
__LLVM Information__
LLVM Version : 14.0.6
__CUDA Information__
CUDA Device Initialized : False
CUDA Driver Version : ?
CUDA Runtime Version : ?
CUDA NVIDIA Bindings Available : ?
CUDA NVIDIA Bindings In Use : ?
CUDA Minor Version Compatibility Available : ?
CUDA Minor Version Compatibility Needed : ?
CUDA Minor Version Compatibility In Use : ?
CUDA Detect Output:
None
CUDA Libraries Test Output:
None
__NumPy Information__
NumPy Version : 1.26.4
NumPy Supported SIMD features : ('MMX', 'SSE', 'SSE2', 'SSE3', 'SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2')
NumPy Supported SIMD dispatch : ('SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2', 'AVX512F', 'AVX512CD', 'AVX512_KNL', 'AVX512_KNM', 'AVX512_SKX', 'AVX512_CLX', 'AVX512_CNL', 'AVX512_ICL')
NumPy Supported SIMD baseline : ('SSE', 'SSE2', 'SSE3')
NumPy AVX512_SKX support detected : False
__SVML Information__
SVML State, config.USING_SVML : False
SVML Library Loaded : False
llvmlite Using SVML Patched LLVM : True
SVML Operational : False
__Threading Layer Information__
TBB Threading Layer Available : False
+--> Disabled due to Unknown import problem.
OpenMP Threading Layer Available : True
+-->Vendor: GNU
Workqueue Threading Layer Available : True
+-->Workqueue imported successfully.
__Numba Environment Variable Information__
None found.
__Conda Information__
Conda not available.
__Installed Packages__
Package Version
-------- ---------
llvmlite 0.43.0rc1
numba 0.60.0rc1
numpy 1.26.4
pip 24.0
No errors reported.
__Warning log__
Warning (cuda): CUDA driver library cannot be found or no CUDA enabled devices are present.
Exception class: <class 'numba.cuda.cudadrv.error.CudaSupportError'>
Warning: Conda not available.
Error was [Errno 2] No such file or directory: 'conda'
Warning (psutil): psutil cannot be imported. For more accuracy, consider installing it.
Warning (no file): /sys/fs/cgroup/cpuacct/cpu.cfs_quota_us
Warning (no file): /sys/fs/cgroup/cpuacct/cpu.cfs_period_us
I did additional debugging. I found a workaround:
https://github.com/TheTesla/py-par-dict/blob/master/pardictimpl_slow_fix.py
$ python3 pardictimpl_slow_fix.py
/home/stefan/testing/py-par-dict/pardictimpl_slow_fix.py:142: NumbaTypeSafetyWarning: unsafe cast from uint64 to int64. Precision may be lost.
tmp += par_dict_getitem(pdict, i)
- threads: 1 - time: 8.455 cputime: 8.46
- threads: 1 - time: 2.395 cputime: 2.40
- threads: 2 - time: 2.844 cputime: 5.69
- threads: 3 - time: 2.213 cputime: 6.64
- threads: 4 - time: 2.181 cputime: 8.72
- threads: 5 - time: 1.622 cputime: 8.11
- threads: 6 - time: 1.524 cputime: 9.15
- threads: 7 - time: 1.291 cputime: 9.04
- threads: 8 - time: 1.175 cputime: 9.40
- threads: 9 - time: 1.014 cputime: 9.13
- threads: 10 - time: 0.973 cputime: 9.73
- threads: 11 - time: 0.868 cputime: 9.55
- threads: 12 - time: 0.901 cputime: 10.81
- threads: 13 - time: 0.761 cputime: 9.89
- threads: 14 - time: 0.724 cputime: 10.14
- threads: 15 - time: 0.714 cputime: 10.72
- threads: 16 - time: 0.741 cputime: 11.86
If the never-executed code is commented out, the variable dict is optimized away, so it is not unpacked from the state tuple. But if the code is in, it is pulled in and unpacked somehow, even though it is never accessed. Normally the optimizer should move the "unpack" step into the if condition, as in my new implementation.
What is also a bit weird: just pulling in the list of dictionaries, dicts, needs so many extra resources. Isn't it just a pointer to the list?
I did some profiling, and it seems it does 308,789 extra allocations, which is the same number of allocations as running the code (so you basically run the function almost twice) and roughly aligns with the timings of the benchmarking you're doing. I also recommend not benchmarking with time.time(); it's better to use time.perf_counter() or a dedicated timeit script.
Command line: /home/user/.local/lib/python3.10/site-packages/memray/__main__.py run fast.py
Start time: 2024-06-11 20:55:50.468000
End time: 2024-06-11 20:56:03.760000
Total number of allocations: 5758033
Total number of frames seen: 7712
Peak memory usage: 379.6 MiB
Python allocator: pymalloc

Command line: /home/user/.local/lib/python3.10/site-packages/memray/__main__.py run slow.py
Start time: 2024-06-11 20:58:21.480000
End time: 2024-06-11 20:58:39.027000
Total number of allocations: 6138822
Total number of frames seen: 7710
Peak memory usage: 381.0 MiB
Python allocator: pymalloc
I did the profiling again, now with my "fix"/workaround. The total memory is nearly doubled. I don't know why.
I found out that inline='always' doesn't change much:
https://github.com/TheTesla/py-par-dict/blob/master/pardictimpl_slow_inline.py
$ python3 pardictimpl_slow_inline.py
/home/stefan/testing/py-par-dict/pardictimpl_slow_inline.py:117: NumbaTypeSafetyWarning: unsafe cast from uint64 to int64. Precision may be lost.
return dicts[hash(key)%nothrds][key]
- threads: 1 - time: 12.926 cputime: 12.93
- threads: 1 - time: 2.512 cputime: 2.51
- threads: 2 - time: 4.094 cputime: 8.19
- threads: 3 - time: 3.623 cputime: 10.87
- threads: 4 - time: 3.549 cputime: 14.20
- threads: 5 - time: 3.185 cputime: 15.93
- threads: 6 - time: 3.039 cputime: 18.24
- threads: 7 - time: 2.629 cputime: 18.40
- threads: 8 - time: 2.830 cputime: 22.64
- threads: 9 - time: 2.664 cputime: 23.97
- threads: 10 - time: 2.725 cputime: 27.25
- threads: 11 - time: 2.397 cputime: 26.37
- threads: 12 - time: 2.635 cputime: 31.62
- threads: 13 - time: 2.716 cputime: 35.31
- threads: 14 - time: 2.428 cputime: 33.99
- threads: 15 - time: 2.557 cputime: 38.36
- threads: 16 - time: 2.522 cputime: 40.35
I thought it would optimize the state tuple away.
I implemented an approach for lock-free parallel dictionary writes, but I found weird performance issues: the total time needed to complete increased with the number of threads. I found out that this issue can be solved by removing code that is never executed in some applications.
This is with the never-executed code commented in:
https://github.com/TheTesla/py-par-dict/blob/master/pardictimpl_slow.py
This is with the never-executed code commented out:
https://github.com/TheTesla/py-par-dict/blob/master/pardictimpl_fast.py
Normally there shouldn't be any difference; both should behave like the fast one.
The system report above is the output of numba -s.