diegodoimo commented 2 years ago

I added a ''faster'' version of pak, vectorizing all the for loops and adding a cython version to optimize the likelihood of the full dataset and not just that of a single point. Unfortunately, the performance improvement can be appreciated only for relatively large data sizes:

15% faster for 90k points 50% faster for 250k points

codecov-commenter commented 2 years ago

Codecov Report

Merging #72 (17c7f24) into main (f290feb) will decrease coverage by 1.81%. The diff coverage is 76.19%.

@@            Coverage Diff             @@
##             main      #72      +/-   ##
==========================================
- Coverage   80.61%   78.79%   -1.82%     
==========================================
  Files          10       10              
  Lines        1145     1160      +15     
==========================================
- Hits          923      914       -9     
- Misses        222      246      +24

Impacted Files	Coverage Δ
dadapy/density_estimation.py	`76.29% <75.00%> (-0.40%)`	:arrow_down:
dadapy/_utils/density_estimation.py	`54.83% <76.47%> (-34.96%)`	:arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update f290feb...17c7f24. Read the comment docs.

AldoGl commented 2 years ago

Thanks @diegodoimo this is a change that me and @imacocco had in mind for a long time. Apart from the small code improvements suggested I have the following curiosity: for small datasets do we see a near zero improvement or do we actually see worse results using the full Cython implementation?

diegodoimo commented 2 years ago

Better in any case. It can be easily tested as I left the flag 'optimized' to select the original implementation, when set to false. (I left it as I didn't want to remove a big piece of the code until we are sure the 'optimized' one can be fine)

sissa-data-science / DADApy

added vectorized pak version #72

Codecov Report