paulbrodersen / entropy_estimators

Estimators for the entropy and other information theoretic quantities of continuous distributions
GNU General Public License v3.0

Parallelization in query* calls and update to KDTree #8

Closed daesungc closed 3 years ago

daesungc commented 3 years ago

Hello, thanks for creating this package.

There have been updates in SciPy that allow parallel processing in the various tree query calls used in this package, which I have found to be quite beneficial. Additionally, SciPy seems to prefer KDTree over cKDTree going forward.

I have switched to KDTree accordingly and added a `workers` parameter (defaulting to 1, as in SciPy) that can be used to speed up entropy estimation. The tests still pass, although they show little speed improvement at their small sample sizes. On my own sample data, however, the difference with get_h is significant. The parallel timings below were done with 18 workers on an i9-10900K.
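For context, the change amounts to threading a `workers` argument through to `KDTree.query`. Below is a minimal, hypothetical sketch of a Kozachenko-Leonenko-style nearest-neighbour entropy estimate with that argument; the function name and signature are illustrative only, not the package's exact API:

```python
import numpy as np
from scipy.spatial import KDTree  # preferred over cKDTree in recent SciPy
from scipy.special import digamma

def get_h(x, k=3, workers=1):
    """Hypothetical sketch: Kozachenko-Leonenko entropy estimate (in nats)
    for samples x of shape (n, d), parallelizing the neighbour query via
    the `workers` argument added to KDTree.query in SciPy 1.6."""
    n, d = x.shape
    tree = KDTree(x)
    # Query k+1 neighbours because each point's nearest neighbour is itself;
    # workers=-1 uses all available cores.
    distances, _ = tree.query(x, k=k + 1, p=np.inf, workers=workers)
    r = distances[:, -1]  # max-norm distance to the k-th neighbour
    # Kozachenko-Leonenko estimate with the max norm (unit-ball volume term
    # reduces to d * log 2, since the ball of radius r has volume (2r)^d).
    return -digamma(k) + digamma(n) + d * np.log(2.0) + d * np.mean(np.log(r))
```

Passing `workers=-1` (or any value > 1) changes only how the query is scheduled, not its result, so estimates are bit-identical to the serial case; the speed-up shows up for large inputs like the (90000, 31) array above.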

| Test | Single | Parallel (18 workers) |
| --- | --- | --- |
| test_get_h | 1.44 ms ± 59.7 µs (1000 loops) | 1.58 ms ± 8.87 µs (1000 loops) |
| test_get_h_1d | 706 µs ± 5.58 µs (1000 loops) | 932 µs ± 3.6 µs (1000 loops) |
| test_get_mi | 358 ms ± 22.8 ms (1 loop) | 366 ms ± 23.7 ms (1 loop) |
| test_get_pmi | 811 ms ± 27.2 ms (1 loop) | 810 ms ± 72.9 ms (1 loop) |
| my sample data: float64 np.array of shape (90000, 31) | 1 min 55 s ± 4.6 s (1 loop) | 17.9 s ± 157 ms (1 loop) |

All timings are per loop, mean ± std. dev. of 7 runs.

Hope this can be useful, and any feedback is welcome. Thanks!

paulbrodersen commented 3 years ago

Hey, what a great PR! Thank you very much. If you have any other suggestions, do let me know!