Thanks again. Extra kudos for also providing the test code.
I need to run off to Excel to chart this. I would imagine that the knee of the
curve is above 256 but below 4096. I would argue that Aparapi is really not
suitable for global sizes < 1k-2k...
As you alluded to, this is actually an artifact of the global size and the default
method for choosing group sizes. It is also a problem caused by choosing a
default group size before we know which device will be picked.
By default (when executing kernel.execute(int)) Aparapi creates an interim
Range object, but it does not *know* where the code will actually be executed
(OpenCL GPU, CPU, JTP or SEQ), so we pick a range which is optimal for GPUs.
This means that we try to get a group size as close to 256 as we can. For JTP
this actually means we will spawn 256 threads! For very large global sizes
this works out well (especially for regular and predictable compute loads);
for smaller sizes it turns out to be an anti-pattern.
Clearly we need a better approach.
You will notice that more recently I am pushing people towards choosing a
device, creating a Range for that device and then dispatching using a specific
Range.
Device device = ...; // get device
Range range = device.createRange(globalSize);
kernel.execute(range);
At present, for GPU devices in the main trunk, this ensures that the Range is
'ideal' for the chosen device. My hope is to use this pattern for JTP as well;
the code below is not complete or even fleshed out.
Device device = Device.JTP(); // no such API yet
Range range = device.createRange(globalSize);
kernel.execute(range);
This would allow the range to match the # of cores in the case of JTP.
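For illustration, a rough sketch of what a core-sized JTP range might look like today (this assumes the existing Range.create(globalSize, localSize) overload and that the global size is a multiple of the core count):
int cores = Runtime.getRuntime().availableProcessors();
// e.g. globalSize = 8192 on an 8-core machine -> local size 8, 1024 groups
Range range = Range.create(globalSize, cores);
kernel.execute(range);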
There will still be the issue of 'fall-back' for when the bytecode cannot be
converted to OpenCL. In this case JTP is just a safety net and performance may
well always lag SEQ for small (<4k) global sizes.
I will keep this open and will try to come up with a better 'default' strategy.
Gary
Original comment by frost.g...@gmail.com
on 9 Aug 2012 at 2:13
Why is one CPU thread spawned for each work-item in a work group? It is more
natural to execute one group on one CPU core. This mirrors how groups are
executed on a GPU. If a kernel is optimized for memory locality (i.e. it uses
shared memory or L1 cache), it should be faster. In any case, setting group
size to 256 should be [mostly] optimal for any OpenCL device. I hear HotSpot 8
will auto-vectorize, in which case, again, all work-items of a group should be
in one thread.
Original comment by adubin...@almson.net
on 15 Feb 2013 at 11:53
We have to spawn one thread per work item, otherwise a barrier() across the
group would deadlock. It is the closest way to emulate OpenCL behaviour.
I am not a fan of this; if you can envision a better way I would love to try
it.
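Roughly, the current JTP emulation boils down to something like this (a simplified sketch, not the actual KernelRunner code; groupSize and the kernel-body comments are placeholders):
final java.util.concurrent.CyclicBarrier groupBarrier = new java.util.concurrent.CyclicBarrier(groupSize);
for (int id = 0; id < groupSize; id++) {
    final int workItemId = id;
    new Thread(new Runnable(){
        public void run(){
            // ... first half of the kernel body for workItemId ...
            try {
                groupBarrier.await();   // Kernel.localBarrier() corresponds to an await() here
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
            // ... second half of the kernel body for workItemId ...
        }
    }).start();
}
If fewer than groupSize threads ever reach the await(), the barrier never trips and the group deadlocks, which is why one thread per work item is spawned.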
Original comment by frost.g...@gmail.com
on 16 Feb 2013 at 12:03
Well how do you implement SEQ, then?
A work group should be realized as a for() loop going over each work item.
Every time there is a barrier, you start a new for() loop. Do not use real
thread synchronization primitives.
E.g.:
for (int i = 0; i < groupSize; i++) {
    // do stuff per work item
    // any call to getGlobalId() returns i
}
// barrier() was executed here
for (int i = 0; i < groupSize; i++) {
    // continue our kernel
}
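To make this concrete, any value that is live across the barrier would have to be kept in an array indexed by the work-item id; something like the following sketch (computeFirstHalf and emitResult are just placeholder names):
int[] partial = new int[groupSize];
for (int i = 0; i < groupSize; i++) {
    partial[i] = computeFirstHalf(i);   // first half of the kernel for work item i
}
// barrier() is simply the boundary between the two loops
for (int i = 0; i < groupSize; i++) {
    emitResult(i, partial[i]);          // second half of the kernel for work item i
}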
The big drawback of this is that it hides concurrency bugs. For debugging, I
suppose, the code can be executed the way it is now (although, in reality,
emulating the concurrent idiosyncrasies of real GPUs is a huge task in
itself... better to imagine some sort of real in-hardware debugger).
Original comment by adubin...@almson.net
on 16 Feb 2013 at 12:53
adubinsky,
Thank you for the suggestion. Are you available and/or interested in looking at
the latest Trunk code and implementing your suggested fix in a branch that we
can test?
Original comment by ryan.lam...@gmail.com
on 22 Apr 2013 at 5:11
I got a roughly 15x speed-up in one app in JTP mode by modifying KernelRunner
to use a standard thread pool (java.util.concurrent.Executors/ExecutorService).
There's a tremendous amount of overhead in creating and destroying threads
rapidly.
I added one field:
private final ExecutorService threadPool = Executors.newCachedThreadPool();
I removed threadArray since it wasn't really used, and instead of new
Thread().start():
threadPool.submit(new Runnable(){....});
Without changing workgroup size/dimensions, this was a very effective speedup.
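Putting those pieces together, the change is roughly this (a sketch of the idea, not the exact patch; workItemRunnable is just a placeholder name):
private final ExecutorService threadPool = Executors.newCachedThreadPool();

// was: new Thread(workItemRunnable).start();
threadPool.submit(new Runnable(){
    public void run(){
        // unchanged per-work-item body; the existing await(joinBarrier)
        // in KernelRunner still waits for every work item to finish
    }
});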
Original comment by paul.mi...@gmail.com
on 12 Jun 2013 at 1:42
(forgot to add, need to do a threadPool.shutdownNow() within the dispose()
method)
Original comment by paul.mi...@gmail.com
on 13 Jun 2013 at 7:22
paul:
A threadpool can't be used. Your threads will deadlock if they try to
synchronize.
ryan:
I'm not able to help. But I can say the strategy is to use continuations.
There are some libraries available
(http://stackoverflow.com/questions/2846428/available-coroutine-libraries-in-java)
but they seem pretty old and unmaintained.
Anyway, it proves it is possible in Java. A custom implementation could perhaps
be simpler and faster. (E.g., exploit the fact that there is no recursion in OpenCL,
so the required stack size can be pre-computed.)
Original comment by adubin...@almson.net
on 21 Jun 2013 at 5:47
Why would a threadpool cause a deadlock? The only difference is that the
threadpool will re-use threads. A thread is not "tainted" from running a
kernel, and so should be re-usable.
Original comment by paul.mi...@gmail.com
on 21 Jun 2013 at 8:30
I think the concern is that, unlike most multithreaded applications, Aparapi apps
must map to the 'work group' model used by OpenCL.
This is required so that Kernel.localBarrier() is honored.
My take is that provided the pool of threads is equal to the width of a group
(which I think is what we have), then we are safe.
If the pool were smaller than a group, we would indeed deadlock if a kernel
contained
Kernel k = new Kernel(){
    public void run(){
        // do something
        localBarrier();
        // do something else
    }
};
'adubinsky' (apologies, I do not know your name), is it your understanding that
by accepting this patch we may now deadlock? If so, can you elaborate? I still
think we are good.
BTW, continuations would be very cool indeed. I have seen some work attempting
to do this in Java; I must admit it is something I am glad I did not take on ;)
Gary
Original comment by frost.g...@gmail.com
on 21 Jun 2013 at 9:15
The threadpool uses the same safety mechanism already in place for the new
Thread() approach, the join barrier in KernelRunner.java: await(joinBarrier);
Without the barrier, there would be concurrency problems either way, whether
the threads are newly constructed or re-used.
Original comment by paul.mi...@gmail.com
on 21 Jun 2013 at 9:21
Paul, I agree. I just want to make sure that we are not missing something and
want to give 'adubinsky' a chance to elaborate. There may be a corner case we
have missed.
Gary
Original comment by frost.g...@gmail.com
on 21 Jun 2013 at 9:57
Sorry, newCachedThreadPool() should indeed work. I double-checked the docs, and
it guarantees that all submitted tasks are run immediately.
I mixed it up with the more common newFixedThreadPool, figuring you were trying
to reduce the total concurrent threads. newCachedThreadPool solves the issue of
kernel launch overhead for short-running kernels, but shouldn't speed up
long-running kernels. Using continuations should help in the latter case by
getting rid of the OS overhead.
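To spell out why the pool type matters here, a small illustration of the failure mode a bounded pool would have (the sizes are hypothetical):
ExecutorService pool = Executors.newFixedThreadPool(8);   // only 8 threads ever run
CyclicBarrier barrier = new CyclicBarrier(256);           // but 256 parties must arrive
// The first 8 work items block in barrier.await(); the other 248 tasks sit in the
// queue and never start, so the barrier never trips. newCachedThreadPool() avoids
// this by giving every submitted task its own (possibly reused) thread.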
Original comment by adubin...@almson.net
on 22 Jun 2013 at 6:29
Original issue reported on code.google.com by
oliver.c...@gmail.com
on 9 Aug 2012 at 6:07