Open TheTesla opened 3 weeks ago
cc @DrTodd13
I tried it on an AMD EPYC now:
$ python3 pardictimpl_slow.py
/home/ubuntu/py-par-dict/pardictimpl_slow.py:137: NumbaTypeSafetyWarning: unsafe cast from uint64 to int64. Precision may be lost.
tmp += par_dict_getitem(pdict, i)
- threads: 1 - time: 11.008 cputime: 11.01
- threads: 1 - time: 2.369 cputime: 2.37
- threads: 2 - time: 6.454 cputime: 12.91
- threads: 3 - time: 6.427 cputime: 19.28
- threads: 4 - time: 6.578 cputime: 26.31
- threads: 5 - time: 4.277 cputime: 21.38
- threads: 6 - time: 3.504 cputime: 21.02
- threads: 7 - time: 2.902 cputime: 20.32
- threads: 8 - time: 2.401 cputime: 19.21
- threads: 9 - time: 2.504 cputime: 22.53
- threads: 10 - time: 2.159 cputime: 21.59
- threads: 11 - time: 2.055 cputime: 22.61
- threads: 12 - time: 1.898 cputime: 22.77
- threads: 13 - time: 1.716 cputime: 22.31
- threads: 14 - time: 1.697 cputime: 23.75
- threads: 15 - time: 1.609 cputime: 24.14
- threads: 16 - time: 1.603 cputime: 25.64
(.venv) ubuntu@numbatester:~/py-par-dict$ python3 pardictimpl_fast.py
/home/ubuntu/py-par-dict/pardictimpl_fast.py:137: NumbaTypeSafetyWarning: unsafe cast from uint64 to int64. Precision may be lost.
tmp += par_dict_getitem(pdict, i)
- threads: 1 - time: 9.165 cputime: 9.17
- threads: 1 - time: 1.966 cputime: 1.97
- threads: 2 - time: 3.651 cputime: 7.30
- threads: 3 - time: 2.768 cputime: 8.30
- threads: 4 - time: 2.641 cputime: 10.57
- threads: 5 - time: 1.896 cputime: 9.48
- threads: 6 - time: 1.455 cputime: 8.73
- threads: 7 - time: 1.275 cputime: 8.92
- threads: 8 - time: 1.203 cputime: 9.63
- threads: 9 - time: 1.161 cputime: 10.45
- threads: 10 - time: 1.126 cputime: 11.26
- threads: 11 - time: 0.943 cputime: 10.37
- threads: 12 - time: 0.923 cputime: 11.08
- threads: 13 - time: 0.799 cputime: 10.39
- threads: 14 - time: 0.833 cputime: 11.66
- threads: 15 - time: 0.745 cputime: 11.18
- threads: 16 - time: 0.880 cputime: 14.08
__Time Stamp__
Report started (local time) : 2024-06-08 10:04:18.406650
UTC start time : 2024-06-08 10:04:18.406662
Running time (s) : 0.395069
__Hardware Information__
Machine : x86_64
CPU Name : znver2
CPU Count : 16
Number of accessible CPUs : 16
List of accessible CPUs cores : 0-15
CFS Restrictions (CPUs worth of runtime) : None
CPU Features : 64bit adx aes avx avx2 bmi bmi2
clflushopt clwb clzero cmov crc32
cx16 cx8 f16c fma fsgsbase fxsr
lzcnt mmx movbe mwaitx pclmul
popcnt prfchw rdpid rdrnd rdseed
sahf sha sse sse2 sse3 sse4.1
sse4.2 sse4a ssse3 wbnoinvd xsave
xsavec xsaveopt xsaves
Memory Total (MB) : 64167
Memory Available (MB) : 62516
__OS Information__
Platform Name : Linux-6.8.0-31-generic-x86_64-with-glibc2.39
Platform Release : 6.8.0-31-generic
OS Name : Linux
OS Version : #31-Ubuntu SMP PREEMPT_DYNAMIC Sat Apr 20 00:40:06 UTC 2024
OS Specific Version : ?
Libc Version : glibc 2.39
__Python Information__
Python Compiler : GCC 13.2.0
Python Implementation : CPython
Python Version : 3.12.3
Python Locale : C.UTF-8
__Numba Toolchain Versions__
Numba Version : 0.60.0rc1
llvmlite Version : 0.43.0rc1
__LLVM Information__
LLVM Version : 14.0.6
__CUDA Information__
CUDA Device Initialized : False
CUDA Driver Version : ?
CUDA Runtime Version : ?
CUDA NVIDIA Bindings Available : ?
CUDA NVIDIA Bindings In Use : ?
CUDA Minor Version Compatibility Available : ?
CUDA Minor Version Compatibility Needed : ?
CUDA Minor Version Compatibility In Use : ?
CUDA Detect Output:
None
CUDA Libraries Test Output:
None
__NumPy Information__
NumPy Version : 1.26.4
NumPy Supported SIMD features : ('MMX', 'SSE', 'SSE2', 'SSE3', 'SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2')
NumPy Supported SIMD dispatch : ('SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2', 'AVX512F', 'AVX512CD', 'AVX512_KNL', 'AVX512_KNM', 'AVX512_SKX', 'AVX512_CLX', 'AVX512_CNL', 'AVX512_ICL')
NumPy Supported SIMD baseline : ('SSE', 'SSE2', 'SSE3')
NumPy AVX512_SKX support detected : False
__SVML Information__
SVML State, config.USING_SVML : False
SVML Library Loaded : False
llvmlite Using SVML Patched LLVM : True
SVML Operational : False
__Threading Layer Information__
TBB Threading Layer Available : False
+--> Disabled due to Unknown import problem.
OpenMP Threading Layer Available : True
+-->Vendor: GNU
Workqueue Threading Layer Available : True
+-->Workqueue imported successfully.
__Numba Environment Variable Information__
None found.
__Conda Information__
Conda not available.
__Installed Packages__
Package Version
-------- ---------
llvmlite 0.43.0rc1
numba 0.60.0rc1
numpy 1.26.4
pip 24.0
No errors reported.
__Warning log__
Warning (cuda): CUDA driver library cannot be found or no CUDA enabled devices are present.
Exception class: <class 'numba.cuda.cudadrv.error.CudaSupportError'>
Warning: Conda not available.
Error was [Errno 2] No such file or directory: 'conda'
Warning (psutil): psutil cannot be imported. For more accuracy, consider installing it.
Warning (no file): /sys/fs/cgroup/cpuacct/cpu.cfs_quota_us
Warning (no file): /sys/fs/cgroup/cpuacct/cpu.cfs_period_us
I did additional debugging. I found a workaround:
https://github.com/TheTesla/py-par-dict/blob/master/pardictimpl_slow_fix.py
$ python3 pardictimpl_slow_fix.py
/home/stefan/testing/py-par-dict/pardictimpl_slow_fix.py:142: NumbaTypeSafetyWarning: unsafe cast from uint64 to int64. Precision may be lost.
tmp += par_dict_getitem(pdict, i)
- threads: 1 - time: 8.455 cputime: 8.46
- threads: 1 - time: 2.395 cputime: 2.40
- threads: 2 - time: 2.844 cputime: 5.69
- threads: 3 - time: 2.213 cputime: 6.64
- threads: 4 - time: 2.181 cputime: 8.72
- threads: 5 - time: 1.622 cputime: 8.11
- threads: 6 - time: 1.524 cputime: 9.15
- threads: 7 - time: 1.291 cputime: 9.04
- threads: 8 - time: 1.175 cputime: 9.40
- threads: 9 - time: 1.014 cputime: 9.13
- threads: 10 - time: 0.973 cputime: 9.73
- threads: 11 - time: 0.868 cputime: 9.55
- threads: 12 - time: 0.901 cputime: 10.81
- threads: 13 - time: 0.761 cputime: 9.89
- threads: 14 - time: 0.724 cputime: 10.14
- threads: 15 - time: 0.714 cputime: 10.72
- threads: 16 - time: 0.741 cputime: 11.86
If the never-executed code is commented out, the variable dict is optimized away, so it is not unpacked from the state tuple. But if the code is in, it is pulled in and unpacked somehow, even though it is never accessed. Normally the optimizer should move the "unpack" step into the if condition, as in my new implementation.
What is also a bit weird: just pulling in the list of dictionaries, dicts, needs so many extra resources. Isn't it just a pointer to the list?
I did some profiling, and it seems it does 308,789 extra allocations, which is the same number of allocations as running the code (so you basically run the function almost twice) and roughly aligns with the timings of the benchmarking you're doing. I also recommend not benchmarking with time.time(); it's better to use time.perf_counter() or a dedicated timeit script.
Command line: /home/user/.local/lib/python3.10/site-packages/memray/__main__.py run fast.py
Start time: 2024-06-11 20:55:50.468000
End time: 2024-06-11 20:56:03.760000
Total number of allocations: 5758033
Total number of frames seen: 7712
Peak memory usage: 379.6 MiB
Python allocator: pymalloc

Command line: /home/user/.local/lib/python3.10/site-packages/memray/__main__.py run slow.py
Start time: 2024-06-11 20:58:21.480000
End time: 2024-06-11 20:58:39.027000
Total number of allocations: 6138822
Total number of frames seen: 7710
Peak memory usage: 381.0 MiB
Python allocator: pymalloc
I did the profiling again, now with my "fix"/workaround. The total memory is nearly doubled. I don't know why.
I found out that inline='always' doesn't change much:
https://github.com/TheTesla/py-par-dict/blob/master/pardictimpl_slow_inline.py
$ python3 pardictimpl_slow_inline.py
/home/stefan/testing/py-par-dict/pardictimpl_slow_inline.py:117: NumbaTypeSafetyWarning: unsafe cast from uint64 to int64. Precision may be lost.
return dicts[hash(key)%nothrds][key]
- threads: 1 - time: 12.926 cputime: 12.93
- threads: 1 - time: 2.512 cputime: 2.51
- threads: 2 - time: 4.094 cputime: 8.19
- threads: 3 - time: 3.623 cputime: 10.87
- threads: 4 - time: 3.549 cputime: 14.20
- threads: 5 - time: 3.185 cputime: 15.93
- threads: 6 - time: 3.039 cputime: 18.24
- threads: 7 - time: 2.629 cputime: 18.40
- threads: 8 - time: 2.830 cputime: 22.64
- threads: 9 - time: 2.664 cputime: 23.97
- threads: 10 - time: 2.725 cputime: 27.25
- threads: 11 - time: 2.397 cputime: 26.37
- threads: 12 - time: 2.635 cputime: 31.62
- threads: 13 - time: 2.716 cputime: 35.31
- threads: 14 - time: 2.428 cputime: 33.99
- threads: 15 - time: 2.557 cputime: 38.36
- threads: 16 - time: 2.522 cputime: 40.35
I thought it would optimize the state tuple away.
I implemented an approach for lock-free parallel dictionary writes, but I found weird performance issues: the total time needed to complete increased with the number of threads. I found out that this issue can be solved by removing code that is never executed in some applications.
This is with the never-executed code commented in:
https://github.com/TheTesla/py-par-dict/blob/master/pardictimpl_slow.py
This is with the never-executed code commented out:
https://github.com/TheTesla/py-par-dict/blob/master/pardictimpl_fast.py
Normally there shouldn't be any difference; both should behave like the fast one.
The system report above is the output of numba -s.