imglib / imglib2-algorithm

Image processing algorithms for ImgLib2
http://imglib2.net/

Speed up Gauss3.gauss() by using the imglib2 "Parallelization" class #93

Closed maarzt closed 2 years ago

maarzt commented 2 years ago

This PR changes how multithreading is done in the class Gauss3 and the package net.imglib2.algorithm.convolution. The code now simply uses the multithreading that's built into LoopBuilder.

This noticeably improves performance, because Gauss3.gauss(sigma, source, target) no longer creates its own ExecutorService. Instead it uses imglib2's Parallelization class, which by default uses the global ForkJoinPool. This reduces the number of threads that the code creates.

For Labkit's pixel classification the speed difference is roughly 30%. (Labkit makes heavy use of imglib2 convolutions.) A simple benchmark also shows the difference: blurring a small 100x100 pixel image with Gauss3.gauss now takes 0.22 ms on my machine. The old code required 0.58 ms.
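The difference between the two threading patterns can be sketched with the plain JDK, leaving imglib2 out. This is only an illustration, not the Gauss3 code: the class `SharedPoolSketch`, its methods, and the array-summing workload are hypothetical stand-ins for the convolution. The first variant creates and shuts down a fixed thread pool on every call; the second reuses the common ForkJoinPool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;

class SharedPoolSketch {

    // Hypothetical stand-in for one chunk of convolution work: sums a slice of an array.
    static Callable<Long> chunk(int[] data, int from, int to) {
        return () -> {
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        };
    }

    // Splits the array into roughly numChunks tasks and runs them on the given executor.
    static long sumOn(ExecutorService executor, int[] data, int numChunks) {
        List<Callable<Long>> tasks = new ArrayList<>();
        int step = (data.length + numChunks - 1) / numChunks;
        for (int from = 0; from < data.length; from += step)
            tasks.add(chunk(data, from, Math.min(from + step, data.length)));
        try {
            long total = 0;
            for (Future<Long> f : executor.invokeAll(tasks)) total += f.get();
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Old pattern: every call creates its own pool and shuts it down again.
    static long sumWithOwnPool(int[] data) {
        int n = Runtime.getRuntime().availableProcessors();
        ExecutorService own = Executors.newFixedThreadPool(n);
        try {
            return sumOn(own, data, n);
        } finally {
            own.shutdown();
        }
    }

    // New pattern: reuse the JVM-wide common ForkJoinPool, so no threads are created per call.
    static long sumWithCommonPool(int[] data) {
        int n = Runtime.getRuntime().availableProcessors();
        return sumOn(ForkJoinPool.commonPool(), data, n);
    }

    public static void main(String[] args) {
        int[] data = new int[10_000];
        Arrays.fill(data, 1);
        System.out.println(sumWithOwnPool(data));
        System.out.println(sumWithCommonPool(data));
    }
}
```

Both variants compute the same result; they differ only in where the worker threads come from.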

tpietzsch commented 2 years ago

Very nice!

Could you put a bit more work into Gauss3? I would

tpietzsch commented 2 years ago

I played a bit with it, and the speedup does not seem to be due only to the creation/shutdown of the ExecutorService. There is overhead that also scales to bigger images (e.g. 100x100x100), where I would otherwise expect the creation/shutdown overhead to disappear relative to the actual computations.

It also seems to be due to using the common ForkJoinPool instead of a FixedThreadPool. I can basically get the same performance by passing Parallelization.getExecutorService() to the old code. Just passing a FixedThreadPool with the number of threads == parallelism level of the ForkJoinPool does not achieve this. I can get closer by using a FixedThreadPool with more threads, but it is still not as good as the common pool. Very interesting.

maarzt commented 2 years ago

@tpietzsch Thanks for the review. I updated the javadoc as you suggested.

Strange that the ForkJoinPool is overall faster than the FixedThreadPool. I have no idea why.

maarzt commented 2 years ago

Is there something else to do or is the PR ready for merge?

tpietzsch commented 2 years ago

> Strange that the ForkJoinPool is overall faster than the FixedThreadPool. I have no idea why.

If I understand correctly, the ForkJoinPool will start new threads to maintain its parallelism level if other threads are stalled. I think that must be it, but I don't really see where that would help the convolution.

maarzt commented 2 years ago

Cool it's merged :+1: :sparkles:

I have a slightly different understanding of the ForkJoinPool. Usually, when using a thread pool, it goes like this: all the tasks are submitted to the pool, and the thread that submitted the tasks waits for them to finish. During this time, that thread is basically blocked.

This is different when the ForkJoinPool is used together with ForkJoinTasks. Here the thread submitting the tasks will not only wait for the tasks to finish, it will also help process them. That's the trick.
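The "waiting thread helps" behavior can be made visible in an extreme case with the plain JDK. This is a sketch under stated assumptions, not imglib2 code: the class `HelpingJoinSketch`, its methods, and the 500 ms timeout are hypothetical. A ForkJoinPool with a single worker can fork subtasks and join them, because join() lets the worker run the pending subtasks itself. The same structure on a fixed pool with a single thread stalls, because the only thread is blocked in get().

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;
import java.util.concurrent.RecursiveTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class HelpingJoinSketch {

    // Trivial subtask that just returns a constant.
    static class Constant extends RecursiveTask<Integer> {
        final int value;
        Constant(int value) { this.value = value; }
        @Override protected Integer compute() { return value; }
    }

    // Parent task: forks two subtasks, then joins them.
    static class Parent extends RecursiveTask<Integer> {
        @Override protected Integer compute() {
            Constant a = new Constant(1);
            Constant b = new Constant(2);
            a.fork();
            b.fork();
            // join() does not merely block: the joining worker executes
            // the forked subtasks that are still waiting in its queue.
            return b.join() + a.join();
        }
    }

    // Completes even though the pool has only one worker thread.
    static int forkJoinWithOneWorker() {
        ForkJoinPool pool = new ForkJoinPool(1);
        try {
            return pool.invoke(new Parent());
        } finally {
            pool.shutdown();
        }
    }

    // Same structure on a fixed pool with one thread: the outer task occupies
    // the only thread while blocked in get(), so the inner task never starts.
    // Returns true if the outer task did not finish within the timeout.
    static boolean fixedPoolWithOneThreadStalls() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            Future<Integer> outer = pool.submit(() -> {
                Future<Integer> inner = pool.submit(() -> 1);
                return inner.get(); // blocks: no free thread is left to run inner
            });
            try {
                outer.get(500, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException e) {
                return true;
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            pool.shutdownNow(); // interrupts the stuck outer task
        }
    }

    public static void main(String[] args) {
        System.out.println("ForkJoinPool(1) result: " + forkJoinWithOneWorker());
        System.out.println("FixedThreadPool(1) stalled: " + fixedPoolWithOneThreadStalls());
    }
}
```

With more threads than tasks the fixed pool would of course finish too; the single-thread case only isolates the difference between a blocking wait and a helping join.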

tpietzsch commented 2 years ago

> This is different when the ForkJoinPool is used together with ForkJoinTasks. Here the thread submitting the tasks will not only wait for the tasks to finish, it will also help process them. That's the trick.

I don't think that's the only thing. The waiting thread doesn't consume any CPU really, so it would be just a matter of adding more threads to the FixedThreadPool. That actually helps. On my computer, the common pool parallelism is 11, numProcessors is 12. If I use a FixedThreadPool with 24 threads, I get much closer to common pool performance than with a FixedThreadPool of 12 threads. But common pool is still a bit faster. It could be that this is due to other effects like different number of tasks, etc.
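The numbers involved can be read from the JDK directly. By default the common pool's parallelism is typically one less than the number of available processors, which is consistent with the 11 versus 12 above; it can be overridden with the system property `java.util.concurrent.ForkJoinPool.common.parallelism`. A minimal sketch follows; the class `PoolSizingSketch` and its `fixedPool` helper are hypothetical names used only for illustration.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolSizingSketch {

    // Number of processors the JVM reports.
    static int processors() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Target parallelism level of the common ForkJoinPool.
    static int commonParallelism() {
        return ForkJoinPool.getCommonPoolParallelism();
    }

    // A fixed-size pool with factor * processors() threads, configured like
    // Executors.newFixedThreadPool but built directly so the size can be read back.
    static ThreadPoolExecutor fixedPool(int factor) {
        int n = factor * processors();
        return new ThreadPoolExecutor(n, n, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
    }

    public static void main(String[] args) {
        System.out.println("availableProcessors:     " + processors());
        System.out.println("common pool parallelism: " + commonParallelism());
        ThreadPoolExecutor doubled = fixedPool(2);
        System.out.println("fixed pool (2x) threads: " + doubled.getCorePoolSize());
        doubled.shutdown();
    }
}
```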

maarzt commented 2 years ago

On my computer the story is completely different. The CPU has 4 cores and does Hyperthreading, so it can actively run 8 threads in parallel. There's not much of a performance difference between a FixedThreadPool with 4 or 8 threads, and ForkJoinPool also shows similar performance...