imglib / imglib2-algorithm

Image processing algorithms for ImgLib2
http://imglib2.net/

Unify multithreading handling #81

Open ctrueden opened 5 years ago

ctrueden commented 5 years ago

Many of the implementations in imglib2-algorithm have an int numThreads parameter for controlling how many threads to spawn. Others take an ExecutorService. Accepting an ExecutorService is more flexible and would mesh better with SciJava Common’s ThreadService and therefore ImageJ Ops. See also imagej/imagej-ops#599. On the other hand, an ExecutorService alone is not enough to identify the intended number of threads to use when partitioning the task.

Note that currently, we often infer numTasks from Runtime.availableProcessors() by default, which is not good practice in general: multiple tasks may be ongoing simultaneously, resulting in more threads than processors.
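To make the oversubscription problem concrete, here is a small sketch (class and method names are made up for illustration, not code from the library): when several algorithms each independently default to one thread per processor, the total worker count multiplies.

```java
public class Oversubscription {
    // Mirrors the common default in numThreads-style APIs:
    // one worker thread per available processor.
    static int defaultNumThreads() {
        return Runtime.getRuntime().availableProcessors();
    }

    // With N concurrent callers each applying that default independently,
    // the machine ends up with N * processors worker threads, i.e. the CPU
    // is oversubscribed by a factor of N.
    static int totalWorkerThreads(int concurrentCallers) {
        return concurrentCallers * defaultNumThreads();
    }

    public static void main(String[] args) {
        int p = defaultNumThreads();
        System.out.println(p + " processors, but "
                + totalWorkerThreads(2) + " threads with 2 concurrent callers");
    }
}
```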

@tpietzsch points out:

Based on a chat on Gitter.

maarzt commented 5 years ago

I see major disadvantages with the suggested solutions:

My idea is to not use numThreads, ExecutorService or ForkJoinPool directly, but instead use an interface:

public interface MultiThreadSetting
{

    boolean useMultiThreading(); // returns true if multi-threading should be used

    int suggestNumberOfTasks(); // returns 1 for single-threading, or e.g. 4 * availableProcessors() for multi-threading; this value may also be specified manually

    <T> void forEach( Collection< T > values, Consumer< T > action ); // Executes the action for each value, sequentially for single-threading, in parallel for multi-threading.
}

See PR https://github.com/imglib/imglib2/pull/250

For this interface I have three implementations:

  1. Default implementation that uses the common ForkJoinPool.
  2. An implementation based on ExecutorService.
  3. An implementation that uses no multi-threading. This is single-threaded, and completely avoids the overhead that you get from using a ForkJoinPool or ExecutorService.
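The third variant could look roughly like this (a sketch: the interface is copied from the proposal above, but the class name and comments are mine, not the actual PR code):

```java
import java.util.Collection;
import java.util.function.Consumer;

// The interface from the proposal, repeated here so the sketch is self-contained.
interface MultiThreadSetting {
    boolean useMultiThreading();
    int suggestNumberOfTasks();
    <T> void forEach(Collection<T> values, Consumer<T> action);
}

// Hypothetical single-threaded implementation: everything runs in the
// calling thread, with no executor or pool involved at all.
public class SequentialSetting implements MultiThreadSetting {
    @Override
    public boolean useMultiThreading() {
        return false; // never spawn threads
    }

    @Override
    public int suggestNumberOfTasks() {
        return 1; // a single chunk covers the whole job
    }

    @Override
    public <T> void forEach(Collection<T> values, Consumer<T> action) {
        values.forEach(action); // plain sequential loop, zero overhead
    }
}
```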
ctrueden commented 5 years ago

Thanks a lot @maarzt for digging into this issue!

I am 100% on board with using our own interface. However, the tricky question is: where shall this interface live? Because I want to use it in both SciJava Common and ImgLib2. Do you have a preference? @tpietzsch what do you think? Maybe a new scijava-threads component?

And @maarzt: have you looked in detail at the SciJava ThreadService? How would you propose we update it to slot in nicely with this new approach?

tpietzsch commented 5 years ago

@maarzt some quick comments

tpietzsch commented 5 years ago

@ctrueden @axtimwalde @hanslovsky @StephanPreibisch We discussed this and @maarzt came up with a pretty fantastic (IMO) proposal. You can have a look at it here (multi-threading branch). The gist of it is that parallelization settings are available as ThreadLocal inside algorithms without passing any ExecutorService or similar argument in. In an algorithm you just call, for example, Parallelization.forEach( elements, operation ); and it runs parallel or sequential according to context.

This can mean either

The decision is made by the caller at the top level, by running Parallelization.singleThreaded( () -> myAlgorithm() ); or Parallelization.multiThreaded( () -> myAlgorithm() ); Note that you would (almost?) never call this inside any algorithm. This is done outside, by the user.
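The ThreadLocal mechanism behind this can be sketched in a few lines. The class and method names below mirror the quoted proposal (Parallelization.singleThreaded / multiThreaded), but this is a simplified illustration of the idea, not the actual branch code:

```java
import java.util.function.Supplier;

public class Parallelization {
    // The parallelization setting travels with the thread, so algorithms
    // need no ExecutorService (or similar) parameter.
    private static final ThreadLocal<Boolean> MULTI_THREADED =
            ThreadLocal.withInitial(() -> true); // default: parallel

    // Called by the top-level user, wrapping the whole algorithm.
    public static <T> T singleThreaded(Supplier<T> algorithm) {
        return runWith(false, algorithm);
    }

    public static <T> T multiThreaded(Supplier<T> algorithm) {
        return runWith(true, algorithm);
    }

    // Called inside algorithms to query the ambient context.
    public static boolean isMultiThreaded() {
        return MULTI_THREADED.get();
    }

    private static <T> T runWith(boolean multi, Supplier<T> algorithm) {
        Boolean previous = MULTI_THREADED.get();
        MULTI_THREADED.set(multi);
        try {
            return algorithm.get();
        } finally {
            MULTI_THREADED.set(previous); // restore the caller's setting
        }
    }
}
```

A nested call inherits the enclosing setting automatically, because both run on the same thread.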

It ties nicely into existing stuff. For example, if you call elements.parallelStream().forEach( operation ); then that also triggers parallelisation inside operation.

It supports recursive tasks like ForkJoinPool. See https://github.com/maarzt/sandbox/blob/multi-threading/src/main/java/sandbox/parallization/SumAlgorithm.java for an example that recursively splits a RAI by dimension.

@ctrueden This brings us to a remaining question: The "user" would be, for example, the OpService, CommandService, etc. If I call ops.bla(), is that run with the caller's parallelisation settings, or does the OpService have its own threads and parallelisation settings? (For OpService it should probably run in the calling thread, but for other services it might not be so clear.)

Any thoughts, comments?

acardona commented 5 years ago

Hi @maarzt & @tpietzsch,

First: thanks very much for trying. Getting abstracted parallelization right is perhaps impossible. It has its uses, but for the general case I have yet to stumble upon a good solution.

The first major issue is that algorithms have different memory requirements and different contention issues. There are of course groups of algorithms that share such properties (e.g. filtering an image), and therefore there can be well-suited solutions for them.

An example: suppose I am running a memory-heavy algorithm that uses e.g. 25% of my desktop's RAM. I can only effectively run 4 threads. But, some steps within each parallel processing branch can be done with more threads. Therefore each would use 1/4th of all available threads. When done with this subprocess, then the main, memory-heavy thread would resume (for example, moving on to the next 3D image to process in a 4D series).

An example of a subtask would be to copy a resulting image into an ArrayImg. That's why e.g. I'd need to specify the number of threads in ImgUtil.copy.

If the abstracted parallelization framework assumes that allocating as many free threads as possible to a task is a good idea, the above will fail, or result in suboptimal performance.
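The budget arithmetic in the scenario above can be sketched like this (all names are illustrative; nothing here comes from the library):

```java
public class ThreadBudget {
    // Memory limits how many top-level branches may run at once
    // (4 in the example above, since each uses ~25% of RAM). Each branch
    // then gets an equal share of the machine's threads for its subtasks,
    // e.g. copying a result into an ArrayImg.
    static int threadsPerBranch(int memoryBoundBranches, int totalThreads) {
        return Math.max(1, totalThreads / memoryBoundBranches);
    }

    public static void main(String[] args) {
        int total = Runtime.getRuntime().availableProcessors();
        System.out.println("each of 4 branches gets "
                + threadsPerBranch(4, total) + " threads");
    }
}
```

A framework that simply hands every task "as many threads as possible" cannot express this two-level budget.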

Another example: suppose there are multiple data structures that have to be accessed for every iteration in the loop. If one naively uses all available threads to process the next available loop index, contention is high: all threads try to access all data structures at once. Instead, one can chunk the data structures and have each thread process a chunk that the others never access. Is chunking included in the abstract parallelization? For some cases, it would be preferable, because it is cumbersome, and could be done automatically via e.g. Views.interval. But in many cases it cannot.
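The chunking described above, where each thread owns a contiguous slice that the others never touch, can be sketched as follows (a minimal index-range version; for images this would correspond to partitioning via e.g. Views.interval):

```java
import java.util.ArrayList;
import java.util.List;

public class Chunking {
    // Split the index range [0, size) into at most numChunks contiguous
    // chunks as {start, end} pairs. Each worker processes one chunk, so
    // no two workers contend on the same region of the data structure.
    static List<int[]> chunks(int size, int numChunks) {
        List<int[]> result = new ArrayList<>();
        int chunkSize = (size + numChunks - 1) / numChunks; // ceiling division
        for (int start = 0; start < size; start += chunkSize)
            result.add(new int[]{start, Math.min(start + chunkSize, size)});
        return result;
    }

    public static void main(String[] args) {
        for (int[] c : chunks(10, 3))
            System.out.println("[" + c[0] + ", " + c[1] + ")");
    }
}
```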

All that said: "Those who think something can't be done should not stop those who are doing it from doing it." So carry on and let's see how far this goes. Looking forward to the results.

Specific comments on the code: statically instantiating both an ExecutorService and a ForkJoinExecutorService by default is not a good idea. These should be instantiated only if/when needed.

On the "problem" that an ExecutorService doesn't tell you how many threads it is set up to use: that's only true for the excessively abstract ExecutorService interface. Consider using e.g. ThreadPoolExecutor directly, which is the parent class of even ScheduledThreadPoolExecutor. With ThreadPoolExecutor, you have methods like getPoolSize and others that let you find out how many threads are available, how many are actively running at present, etc.
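For example (a sketch; the helper name is made up): the standard Executors factory methods return ThreadPoolExecutor instances, so the configured thread count is recoverable with a downcast.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class PoolSize {
    // The ExecutorService interface exposes no thread count, but
    // ThreadPoolExecutor (the usual concrete type) does.
    static int configuredThreads(ExecutorService es) {
        if (es instanceof ThreadPoolExecutor)
            return ((ThreadPoolExecutor) es).getCorePoolSize();
        return -1; // unknown for other implementations
    }

    public static void main(String[] args) {
        ExecutorService es = Executors.newFixedThreadPool(4);
        System.out.println(configuredThreads(es)); // prints 4
        es.shutdown();
    }
}
```

Note the fallback branch: an arbitrary ExecutorService implementation still gives no answer, which is the limitation being discussed.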

ctrueden commented 5 years ago

@ctrueden This brings us to a remaining question

Let's discuss when I visit in June. Added to my list for that week.

tpietzsch commented 5 years ago

@ctrueden

Let's discuss when I visit in June. Added to my list for that week.

Sounds good.

@acardona I agree that a general solution is basically impossible. There will always be cases where it is advisable to roll your own multithreading solution. This proposal is more about the 80% of cases where you just want to chunk your image and get on with it... :-)

Specifically, the first scenario you mentioned would be possible with a ForkJoinPool (or the proposed Parallelization). You could just split the whole task into 4 parts, and those parts could then submit many subtasks and wait for completion. It gets ugly if you do not want to simply split into 4 equal parts, for example because it cannot be predicted whether the parts would run for about the same time. Even then, it could probably be shoehorned in somehow, but it would be ugly, and a special-case solution would be preferable.

Regarding the second scenario: chunking is not included in the abstract parallelization. The idea is that you call Parallelization.getParallelism() and split accordingly into subtasks, then submit those. The problem of contention on "hot" data structures is not solved by the proposed framework. It is only about specifying which tasks can run concurrently and which dependencies exist between tasks. How the tasks are distributed onto threads is not controllable. To some extent this is true even if you implement a special-case solution, unless you have full control over which machine this runs on and what else is running at the same time. If you create 8 threads, you can only hope that they will actually all be running at the same time. Anyway, the proposal offers no solution there. You could do striping of data structures as you proposed, but by task, not by thread. This will work but is probably over-cautious.
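The "query the parallelism, split into subtasks, submit" pattern can be sketched like this (Parallelization.getParallelism() is the proposed API; in this standalone sketch the task count is a plain parameter and the common ForkJoinPool stands in for the framework):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class SplitAndSum {
    // Sum an array by splitting it into `tasks` contiguous chunks and
    // summing the chunks as parallel subtasks in the common ForkJoinPool.
    static long sum(long[] data, int tasks) {
        int chunk = (data.length + tasks - 1) / tasks; // ceiling division
        return ForkJoinPool.commonPool().submit(() ->
                IntStream.range(0, tasks).parallel()
                        .mapToLong(t -> {
                            long s = 0;
                            int end = Math.min((t + 1) * chunk, data.length);
                            for (int i = t * chunk; i < end; i++)
                                s += data[i]; // each subtask reads a disjoint range
                            return s;
                        })
                        .sum()
        ).join();
    }

    public static void main(String[] args) {
        long[] data = {1, 2, 3, 4, 5};
        System.out.println(sum(data, 2)); // prints 15
    }
}
```

Note that the subtasks own disjoint index ranges (striping by task, not by thread), exactly as discussed above; which worker thread runs which chunk remains up to the pool.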

Specific comments on the code: statically instantiating both an ExecutorService and a ForkJoinExecutorService by default is not a good idea. These should be instantiated only if/when needed.

Hmm, why? These are two super light-weight objects. Neither the SequentialExecutorService nor the ForkJoinExecutorService create threads or hold references to any data. (SequentialExecutorService just immediately runs everything in the calling thread, ForkJoinExecutorService submits to the common pool or whatever pool the calling thread is running in.)