hey @mahieyin-rahmun - really cool :) The only issue I have here is that there are (again) two associations methods (something I removed in previous versions). There's a lot of shared code between the two, especially all the preprocessing at the beginning.
What I suggest is taking the multi-core part out into a private method, and triggering it with a flag (for example, when the requested number of cores is more than 1).
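In other words, something along these lines (a rough sketch; the helper names and the `num_cores` flag are illustrative placeholders, not the actual dython API):

```python
def associations(dataset, num_cores=1, **kwargs):
    # the shared preprocessing happens exactly once, here
    prepared = _preprocess(dataset, **kwargs)  # hypothetical helper
    if num_cores > 1:
        # the multiprocessing path lives in its own private method
        return _associations_parallel(prepared, num_cores, **kwargs)
    return _associations_serial(prepared, **kwargs)
```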
Also, one thing I'm missing from your (really cool) time analysis is how long it takes for the current version to compute (meaning, what's `benchmark_function(associations, dataframe)`)?
@shakedzy, thank you very much for your feedback.
I initially wanted to make the changes in the `associations()` function, but it has some nested inner functions which the `multiprocessing` module cannot handle (since inner functions cannot be pickled). If I have to do as you suggested, I would also have to refactor the inner functions into class-level or utility functions. Is that okay with you?
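A minimal standalone illustration of that limitation (not dython code):

```python
import multiprocessing as mp

def outer(items):
    def inner(x):  # nested function, defined only inside outer()
        return x * 2
    with mp.Pool(2) as pool:
        # fails: `inner` is a local object, so it cannot be pickled
        # to be sent to the worker processes
        return pool.map(inner, items)

if __name__ == "__main__":
    outer([1, 2, 3])
```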
I agree about the time analysis part; I assumed that the performance of the multiprocessing version restricted to a single core would match that of the current version, but that may not be the case. I will update the PR.
@mahieyin-rahmun you don't have to force everything into one method - you can have a separate private method for the multiprocessing, and call it from within `associations()`. My goal here is to reduce duplicated code as much as possible, but still keep everything simple and maintainable
Hi, @shakedzy, I have updated the PR with the changes requested. Kindly let me know if additional changes are required.
Here's the benchmark for the current code:
I have also updated the benchmark code and chart in the original comment above.
This is superb work, @mahieyin-rahmun! The only issue now is that I just merged your other PR that enforces using Black. This branch needs to merge `master`, as there are conflicts
Merged 🎉 Great job, @mahieyin-rahmun!
@mahieyin-rahmun: Thank you! I have a Core i9-10900k with 10 cores / 20 threads and a Core i9-12900k with 16 cores (8 performance, 8 efficiency) and 24 threads. If you are still interested in benchmarks, I could run them on my machines, or is that no longer necessary?
@mahieyin-rahmun @shakedzy I copied your benchmark code into benchmark.py and tried to run it, but it failed. Am I doing something wrong?
```
$ pip install git+https://github.com/shakedzy/dython.git@master
[...]
$ python benchmark.py
Run: 0
Traceback (most recent call last):
  File "/home/konrad/tmp/test/benchmark.py", line 91, in <module>
    run_benchmark()
  File "/home/konrad/tmp/test/benchmark.py", line 54, in run_benchmark
    times_vanialla = benchmark_function(
  File "/home/konrad/tmp/test/benchmark.py", line 28, in benchmark_function
    results = func(dataframe, compute_only=True, **kwargs)
TypeError: associations() got an unexpected keyword argument 'use_multiprocessing'
```
Hi @KonradHoeffner, I believe the keyword arguments were shortened a bit before the PR was merged. You can see from the source code that the keyword argument is `multiprocessing` (drop the `use_` prefix). I will update the benchmark code as well.
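For reference, a minimal call against the merged API (a sketch: `multiprocessing` and `compute_only` are the keywords seen in this thread; check your installed version for the full signature):

```python
from dython.nominal import associations

# `multiprocessing` replaces the old `use_multiprocessing` keyword
results = associations(dataframe, compute_only=True, multiprocessing=True)
```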
@mahieyin-rahmun thanks, that works! Results from my Intel Core i9-10900k with 10 cores and 20 threads:
It seems as if hyperthreading doesn't help in this case. 48 GB DDR4 RAM at 3000 MHz, Arch Linux with the default kernel version 5.19.7, Python 3.10.7.
Yes, the current parallelization attempt does not benefit from hyperthreading, so only physical cores are utilized. Your 16- and 20-core results are actually still using just the 10 physical cores.
I believe the next steps could be to parallelize/vectorize the helper functions. I think someone is/was already working on it.
Results below from an Intel Core i9-12900k with 8 performance and 8 efficiency cores, 32 GB DDR5 RAM. Interestingly, the time goes up again when using more cores than the number of performance cores.
Another run:
I do not have much knowledge about the performance and efficiency cores of the Intel SKUs, but it seems like the efficiency cores end up bottlenecking the process somehow. Perhaps starting a separate process on an efficiency core (which probably has a lower base clock speed than the performance cores) and then collecting the results from that process carries enough overhead to slow down the computation.
This is a really interesting discussion, @mahieyin-rahmun and @KonradHoeffner. I'm thinking of perhaps adding a "find the optimal number of cores" utility to the library, which basically runs this test. What do you think?
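A rough sketch of what such a helper could look like (`find_optimal_num_cores` is hypothetical, and the `max_cpu_cores` keyword is an assumption about the merged API, so verify it against the actual signature):

```python
import time
from dython.nominal import associations

def find_optimal_num_cores(dataframe, candidates=(1, 2, 4, 8)):
    # Hypothetical helper: time associations() at several core counts
    # on the given data and return the fastest configuration.
    best_n, best_t = 1, float("inf")
    for n in candidates:
        start = time.perf_counter()
        # `max_cpu_cores` is an assumption, not a confirmed keyword
        associations(dataframe, compute_only=True,
                     multiprocessing=(n > 1), max_cpu_cores=n)
        elapsed = time.perf_counter() - start
        if elapsed < best_t:
            best_n, best_t = n, elapsed
    return best_n
```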
@shakedzy some thoughts:
Ultimately, limiting the worker count to the number of physical performance cores seems like a good approximation to me; if there is no easy way to get the number of non-efficiency cores on a hybrid architecture, a good heuristic may be to just cap it at 8 or 10 (see the sketch below). On the other hand, if not clearly documented, this may create problems in the future when we all have 100-core machines and wonder why it doesn't scale to the number of cores with gigantic input sizes. But I am neither a developer nor a heavy user of this library, so I don't feel qualified to make such a decision.
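A sketch of that heuristic, assuming the third-party `psutil` package (it reports physical cores, though it cannot distinguish performance from efficiency cores, hence the arbitrary cap):

```python
import psutil  # third-party; os.cpu_count() only reports logical cores

def pick_worker_count(requested=None, cap=8):
    # Physical core count; psutil returns None when it cannot tell,
    # in which case we fall back to the conservative cap discussed above.
    physical = psutil.cpu_count(logical=False) or cap
    return physical if requested is None else min(requested, physical)
```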
@KonradHoeffner I see your points. Perhaps it will be best to simply add this test as an external script in this repo. We can refer to it in the documentation.
Why
This PR attempts to address #117 by adding a separate function called `associations_parallel()`, which utilizes the `multiprocessing` module to parallelize computations.

How
If the dataset has more nominal (categorical) columns than numerical columns, the `associations()` function slows down considerably, since it performs all the calculations in nested for loops on a single CPU core. We can exploit the fact that the column index pairs used by the nested loops can be precomputed (essentially the Cartesian product of the indices of the columns of the dataset), and the computation for each pair can then be delegated to multiple processes.
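A simplified sketch of that idea (illustrative only; `_compute_pair` is a placeholder for the actual per-pair association computation):

```python
import multiprocessing as mp
from itertools import product

def _compute_pair(pair):
    # placeholder for the real per-pair association computation
    i, j = pair
    return pair, float(i == j)

def parallel_associations(n_columns, n_processes=2):
    # precompute all column index pairs (the Cartesian product)...
    pairs = list(product(range(n_columns), repeat=2))
    # ...and hand the per-pair work to a pool of worker processes
    with mp.Pool(n_processes) as pool:
        return dict(pool.map(_compute_pair, pairs))

if __name__ == "__main__":
    print(parallel_associations(3))
```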
Benchmarking

I used the following script to run benchmarks. My system information can be found at the bottom of the post.
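A hedged reconstruction of the script: only the call structure visible in the traceback later in the thread (`benchmark_function`, `run_benchmark`, `compute_only=True`) is taken from there; the data generation, timing, and run counts are illustrative:

```python
import time
import numpy as np
import pandas as pd
from dython.nominal import associations

def benchmark_function(func, dataframe, n_runs=3, **kwargs):
    # time `func` over several runs; compute_only=True skips plotting
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        results = func(dataframe, compute_only=True, **kwargs)  # mirrors the traceback
        times.append(time.perf_counter() - start)
    return times

def run_benchmark():
    # toy dataset dominated by nominal columns, the slow case
    rng = np.random.default_rng(42)
    dataframe = pd.DataFrame(
        {f"cat_{i}": rng.choice(list("abcd"), size=2000) for i in range(10)}
    )
    for run in range(3):
        print(f"Run: {run}")
        times_vanilla = benchmark_function(associations, dataframe)
        times_parallel = benchmark_function(
            associations, dataframe, multiprocessing=True
        )
        print(min(times_vanilla), min(times_parallel))

if __name__ == "__main__":
    run_benchmark()
```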
System Information

- CPU
- Memory
- OS
- Python
- Anaconda
Caveats
Using `m` cores doesn't necessarily cut down the computation time by a factor of `m` (process startup and result collection add overhead, and the shared preprocessing step remains serial), but some performance gains are achieved.

What issue does this fix
Closes #117