seung-lab / euclidean-distance-transform-3d

Euclidean distance & signed distance transform for multi-label 3D anisotropic images using marching parabolas.
GNU General Public License v3.0
239 stars 37 forks source link

Moving iteration over 2nd axis under function submitted to a thread pool for 3d EDT speeds up execution time to up to 3x #50

Closed Pavlik1400 closed 8 months ago

Pavlik1400 commented 8 months ago

Hello, I just noticed that in 3d EDT, code iterates over two axes and submits iteration over the 3rd axis to the thread pool. It seems like squared_edt_1d_multi_seg and squared_edt_1d_parabolic_multi_seg are quick, so thread pool overhead is very noticeable. I have quickly moved iteration over the second axis in the function submitted to the thread pool and updated cpp/test.cpp to iterative over a number of workers, and got the following results:

Before: Took 4.584 sec. with nw=1 Took 3.602 sec. with nw=2 Took 3.066 sec. with nw=3 Took 2.972 sec. with nw=4 Took 2.946 sec. with nw=5 Took 2.739 sec. with nw=6 Took 2.499 sec. with nw=7 Took 2.351 sec. with nw=8 Took 2.264 sec. with nw=9 Took 2.133 sec. with nw=10 Took 2.043 sec. with nw=11 Took 2.161 sec. with nw=12 Took 2.169 sec. with nw=13 Took 2.040 sec. with nw=14 Took 1.968 sec. with nw=15 Took 1.935 sec. with nw=16

After: 3.915 sec. with nw=1 Took 2.126 sec. with nw=2 Took 1.425 sec. with nw=3 Took 1.127 sec. with nw=4 Took 0.971 sec. with nw=5 Took 0.840 sec. with nw=6 Took 0.755 sec. with nw=7 Took 0.691 sec. with nw=8 Took 0.679 sec. with nw=9 Took 0.684 sec. with nw=10 Took 0.667 sec. with nw=11 Took 0.655 sec. with nw=12 Took 0.667 sec. with nw=13 Took 0.637 sec. with nw=14 Took 0.638 sec. with nw=15 Took 0.608 sec. with nw=16

Which is 3x speed up for 512^3 image from the test. Scaling over a number of threads is better, and CPU utilization is better for a bigger number of threads.

Also, I have noticed that the first run always takes more time (almost twice as much). At first, I thought that the issue was that memory allocation was done the first time, and then OS reused the same pages, but when making only one allocation at the beginning, issues seemed to persist. That's why I have added "warm up" before running the test. Let me know if you have any ideas why is that.

william-silversmith commented 8 months ago

Thanks so much for this PR. Sorry the initial notifications got lost in my inbox. Thank you for your persistence! I tried this on some real data and got some nice results, including the speedup for 1 thread that you saw as well.

# old results
12.29  sec. t=1
 6.51  sec. t=2
 5.17  sec. t=3
 4.73  sec. t=4
 4.41  sec. t=5
 4.02  sec. t=6
 3.78  sec. t=7
 3.68  sec. t=0
# new results
11.98  sec. t=1
 6.04  sec. t=2
 4.78  sec. t=3
 4.29  sec. t=4
 3.72  sec. t=5
 3.32  sec. t=6
 3.02  sec. t=7
 2.94  sec. t=0

Your rationale makes sense to me as well. I knew the units of work were small, but for some reason it didn't occur to me to try this.

Thank you for a great contribution!