At present there are no mechanisms for avoiding buffer transfers between
kernels. Actually, even multiple instances of the same Kernel will cause a
transfer. The 'cached' information (regarding what has been transferred to the
device) is kept on a per-kernel-instance basis.
Actually this can also be tricky from OpenCL itself, unless both kernels are
in the same program file (and share the same context, I think).
Maybe you are being forced to create two different Kernels because we don't
currently offer a mechanism for having multiple entrypoints. Is this the case?
If, for example, we allowed
    Kernel k = new Kernel(){
        run1(){}
        run2(){}
    }
and your algorithms were expressed using run1() and run2(), would this work for
you? At present this does not work, but something like this has been proposed
and is possible.
Is this something that would help you if it existed?
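Until something like that exists, the closest approximation with the current API is a single run() that branches on a selector field set from the host before each execute(), keeping the shared arrays resident via explicit buffer management. A minimal sketch (the class and field names below are made up for illustration):

    import com.amd.aparapi.Kernel;
    import com.amd.aparapi.Range;

    // Sketch only: one Kernel, one shared buffer, two logical entrypoints
    // chosen by a flag that the host sets before each execute().
    public class TwoPhaseKernel extends Kernel {
        final int[] shared = new int[1024];
        int phase = 0; // 0 = first pass, 1 = second pass

        @Override
        public void run() {
            int i = getGlobalId();
            if (phase == 0) {
                shared[i] = i;             // "run1": initialise the buffer
            } else {
                shared[i] = shared[i] * 2; // "run2": reuse the same buffer
            }
        }

        public void runBoth() {
            setExplicit(true);             // keep 'shared' on the device between passes
            put(shared);
            phase = 0;
            execute(Range.create(shared.length));
            phase = 1;
            execute(Range.create(shared.length));
            get(shared);
        }
    }

This works because scalar fields like phase are passed to the kernel on every execute(), while setExplicit(true) keeps the arrays from being copied back and forth between the two passes; the obvious downside is that both logical entrypoints have to live in one Kernel class and share its buffers.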
Gary
Original comment by frost.g...@gmail.com
on 14 Jul 2012 at 9:49
Hello Gary,
Thank you for the answer! In pure OpenCL I can transfer a buffer to device
memory and then use it as an argument to two (or more) kernels (it is not very
tricky).
My algorithm needs several kernels that all work on some arrays without any
involvement of the CPU. These arrays may be very large, so I have to avoid too
many useless transfers. If multiple entrypoints were the only way to have
common variables, then it could be a useful help, but I think that Aparapi
needs more control over memory transfers, because in many algorithms memory is
the key.
I hope I helped you,
Good work,
Egidio
Original comment by egidio.d...@gmail.com
on 14 Jul 2012 at 10:04
Is it possible for you to attach example code demonstrating this capability?
Original comment by ryan.lam...@gmail.com
on 2 Aug 2012 at 3:47
I second the motion for being able to share buffers between multiple kernels!
Somehow I got the impression that this was already possible with Aparapi, and
I've started implementation using this assumption... gah. :)
I don't think multiple entry points for the same kernel would be workable for
me.
Finally, in the initial issue comment, it is noted that the arrays being put()
to each kernel actually reference the same array. Following the "write once,
run anywhere" mantra, the same code should behave the same way regardless of
execution platform, so perhaps it makes sense that when "multiple" arrays that
are actually references to a single array are put(), that array should be
shared between all the kernels to which it is put()?
Of course, explicit memory management combined with multiple kernels throws a
spanner in the works of "write once, run anywhere": if one kernel ends up
executing in JTP or SEQ mode and another in GPU mode while explicit memory
management is in use, then sharing the arrays/buffers between these kernels
isn't going to work without transfers (which we've explicitly said should be
handled explicitly...). Perhaps in this case transfers could be forced, or the
GPU kernel forced to run in JTP mode, or an exception thrown and execution
aborted? None of these seem like great options, but perhaps they would be
better than the program not working as expected/intended.
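To make the identity-based suggestion concrete, here is a rough sketch of the bookkeeping such sharing might need on the runtime side. None of these classes exist in Aparapi; DeviceBuffer is a placeholder, and the point is only that "same array object" can be detected cheaply:

    import java.util.IdentityHashMap;
    import java.util.Map;

    // Illustrative sketch: a registry keyed by array identity (==, not equals()),
    // so that two kernels which put() the very same Java array would resolve to a
    // single device-side buffer instead of two copies.
    final class SharedBufferRegistry {
        private final Map<float[], DeviceBuffer> buffers =
            new IdentityHashMap<float[], DeviceBuffer>();

        DeviceBuffer acquire(float[] array) {
            DeviceBuffer buf = buffers.get(array);
            if (buf == null) {
                buf = DeviceBuffer.allocate(array); // hypothetical upload to the device
                buffers.put(array, buf);
            }
            return buf;
        }
    }

    // Placeholder for a device allocation; not a real Aparapi type.
    final class DeviceBuffer {
        static DeviceBuffer allocate(float[] array) {
            return new DeviceBuffer();
        }
    }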
Original comment by oliver.c...@gmail.com
on 14 Aug 2012 at 1:16
This is pretty important for pipeline-based approaches like ours. We pass data
through stages frequently; without buffer sharing, shuttling data back and
forth is a major bottleneck (for us, anyway). I have been using GLSL, CUDA,
etc. for years and it's super easy over there.
Could we have a call something like Kernel.linkData(kernel1.outputFloats,
kernel2.inputFloats), to be used similarly to put()?
Original comment by kris.woo...@gmail.com
on 2 Oct 2012 at 10:51
Any example code or test cases we can use in development to understand and
address this issue?
Original comment by ryan.lam...@gmail.com
on 3 Oct 2012 at 1:08
For a real-world project you could look at https://github.com/OliverColeman/bain
This is a neural network simulator framework where the neurons and synapses are
encapsulated in their own collections that implement the Kernel interface. The
idea is that people can easily create and plug in different models of neurons
or synapses by extending the neuron and synapse collection super classes (it's
designed to support biophysical models, or models somewhere between biophysical
and classical computer science models). Thus there is a need to share neuron
output and synapse output (and perhaps other data) between these two kernels.
In theory the framework could be modified to use a multiple entry point kernel,
however Aparapi doesn't quite support what would be required. I had a
discussion with Gary about it a while ago but just realised I never heard back
from him. The discussion, quoted below, also contains more details about the
simulator:
Oliver Coleman wrote:
For the project I'm working on I either need functionality for shared buffers
between kernels, or kernels with multiple entry points (I realised that
multiple entry points could perhaps work for my case when I realised that
different entry points could be run with different execute ranges: my project
is a neural network simulator and needs to execute separate computations for
the neurons and synapses, and there are typically many more synapses than
neurons). I've attempted to use the method to emulate multiple entry points
described at
https://code.google.com/p/aparapi/wiki/EmulatingMultipleEntrypointsUsingCurrentAPI,
but so far have had no success, and am wondering if there's any hope for
the approach I'm trying; I hope someone can offer some insight!
All neurons and synapses are contained in their own respective "collection"
objects, which consist of single dimension arrays of primitives for the state
variables (containing an element for each neuron or synapse in the collection).
Initially I had it set-up so that a collection extended Kernel, which worked
fine, except without being able to use shared buffers required transferring
some buffers back and forth for every simulation step. The neuron and synapse
collections are contained in a Simulation object, which for every simulation
step simulates the neurons and then the synapses by calling their respective
step() methods (which initially would put() the relevant buffers call execute()
and then get() the relevant buffers).
I've now modified it so that the Simulation class extends Kernel, and the
collections provide a step(int index) method, which replaces the original run()
method in the collections. In the Simulation run() method I try to call either
neurons.step(getGlobalId()) or synapses.step(getGlobalId()), but get the error
"Using java objects inside kernels is not supported". Is this because the
primitive arrays being accessed in step(int index) are inside the collection
objects?
I can't pull the primitive arrays out of the neuron and synapse collection
objects and into the Simulation object as these collections extend a base
collection class and add their own primitive arrays depending on the neuron or
synapse model being used. Below are some snippets of relevant code.
public class Simulation extends Kernel {
    protected NeuronCollection neurons;
    protected SynapseCollection synapses;
    protected ComponentCollection kernelEntryPoint;

    public synchronized void step() {
        neurons.preStep();
        synapses.preStep();
        kernelEntryPoint = neurons;
        execute(executeRangeNeurons);
        kernelEntryPoint = synapses;
        execute(executeRangeSynapses);
        neurons.postStep();
        synapses.postStep();
        step++;
    }

    @Override
    public void run() {
        kernelEntryPoint.step(this.getGlobalId());
    }
}

public abstract class NeuronCollection extends ConfigurableComponentCollection {
    protected double[] neuronOutputs;
    protected boolean[] neuronSpikings;
    protected double[] neuronInputs;

    @Override
    public void preStep() {
        super.preStep();
        if (inputsModified) {
            simulation.put(neuronInputs);
        }
    }

    @Override
    public void step(int neuronID) {
        neuronInputs[neuronID] = 0;
        neuronSpikings[neuronID] = neuronOutputs[neuronID] >= 1;
    }
}

public class LinearNeuronCollection extends NeuronCollection {
    @Override
    public void step(int neuronID) {
        if (neuronID >= size)
            return;
        neuronOutputs[neuronID] = neuronInputs[neuronID];
        super.step(neuronID);
    }
}
gfrost frost.gary@gmail.com via googlegroups.com 17 Aug to aparapi-discuss
Oliver thanks for reading the proposal! ;)
A few notes inline below.
>On Thursday, August 16, 2012 12:47:34 AM UTC-5, Oliver Coleman wrote:
>For the project I'm working on I either need functionality for shared buffers
between kernels, or kernels with multiple entry points (I realised that
multiple entry points could perhaps work for my case when I realised that
different entry points could be run with different execute ranges) [...]
Two of the main reasons for supporting multiple entrypoints are:
1) so that different entrypoints could be executed over different ranges.
2) so that different entrypoints can operate on different sets of buffers.
>All neurons and synapses are contained in their own respective "collection"
objects, which consist of single dimension arrays of primitives for the state
variables [...] In the Simulation run() method I try to call either
neurons.step(getGlobalId()) or synapses.step(getGlobalId()), but get the error
"Using java objects inside kernels is not supported". Is this because the
primitive arrays being accessed in step(int index) are inside the collection
objects?
Indeed the collections in this case are treated as 'other objects' and Aparapi
is refusing to look inside these objects. In this case, BTW, it would be
possible to make your code work (with Aparapi changes) because you really do
have parallel primitive arrays; they just happen to be held in nested
containers.
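A minimal illustration of that restriction (the class names here are made up; the behaviour is as described in this thread): a primitive array that is a field of the Kernel subclass itself can be used from run(), but reaching the same kind of array through a member object is rejected.

    import com.amd.aparapi.Kernel;

    public class FlattenedKernel extends Kernel {
        // OK: a primitive array held directly by the Kernel subclass.
        final double[] neuronOutputs = new double[1024];

        // A nested container, as in the simulator's collection classes.
        static class Holder {
            double[] values = new double[1024];
        }
        final Holder holder = new Holder();

        @Override
        public void run() {
            int i = getGlobalId();
            neuronOutputs[i] = i;     // translates to OpenCL
            // holder.values[i] = i;  // rejected: "Using java objects inside
            //                        //  kernels is not supported"
        }
    }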
>I can't pull the primitive arrays out of the neuron and synapse collection
objects and into the Simulation object as these collections extend a base
collection class and add their own primitive arrays depending on the neuron or
synapse model being used. Below are some snippets of relevant code.
Ahh.. so the collections in your example represent an example of a neuron
implementation that an end user should be able to replace or override.
So the problem here is that the required OpenCL code to access these primitives
would indeed be different for each type of neuron/synapse that you allow. The
Kernel's types are unbound; only when we know the exact neuron/synapse
configuration can we create the OpenCL for the kernel.
I am not sure that multiple entrypoints will help you. In a way you want to be
able to configure the OpenCL creation by mixing in various flavors of
synapse/neuron....
I need to think about this some more....
Oliver Coleman oliver.coleman@gmail.com via googlegroups.com
17 Aug to aparapi-discuss
Hi Gary, thanks so much for the speedy response. I think I have more of a feel
for why this won't work as I have tried to use it.
I've been thinking about this some more, and am pretty sure that there
shouldn't be any great technical hurdles to making this work (as you indicate,
since the parallel primitives just happen to be held in nested objects), given
some level of restrictions on the nested objects. I think the levels of
restriction, starting with the most restrictive and so perhaps the easiest to
handle in Aparapi first, are something like:
* Create a sub-class of Simulation which references specific sub-classes of
NeuronCollection and SynapseCollection (and perhaps make these Collection
sub-classes final); in this way it is guaranteed that the primitive arrays are
fixed for the sub-class of Simulation; or
* Specify the specific sub-classes of neuron and synapse collections via
generics in the Simulation class (again perhaps making these Collection
sub-classes final); in this way it is guaranteed that, for a particular
instance of Simulation, the primitive arrays are fixed.
I suppose it depends on whether the primitive arrays (or arrays of objects
containing primitive fields) need to be bound for the kernel class or for a
kernel instance.
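A sketch of the second option, with the concrete collection types pinned down per Simulation instance via generics (this only illustrates the binding; it is not something Aparapi supports today, and FixedSynapseCollection is an invented name):

    public class Simulation<N extends NeuronCollection, S extends SynapseCollection> extends Kernel {
        protected final N neurons;
        protected final S synapses;

        public Simulation(N neurons, S synapses) {
            this.neurons = neurons;
            this.synapses = synapses;
        }

        @Override
        public void run() {
            // With N and S fixed (e.g. final classes), the primitive arrays
            // reachable from here are known when the kernel is translated.
            neurons.step(getGlobalId());
        }
    }

    // Usage: the concrete models are chosen once, when the simulation is built.
    // Simulation<LinearNeuronCollection, FixedSynapseCollection> sim =
    //     new Simulation<LinearNeuronCollection, FixedSynapseCollection>(
    //         new LinearNeuronCollection(), new FixedSynapseCollection());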
Cheers,
Oliver
Original comment by oliver.c...@gmail.com
on 3 Oct 2012 at 1:40
Bump... my project is basically stalled on this issue (and/or the multiple
entry point Kernel issue, see above). Is this being actively worked on? If not,
are there any definite plans to work on it?
Cheers,
Oliver
Original comment by oliver.c...@gmail.com
on 8 Feb 2013 at 12:09
[deleted comment]
Here is a self-contained example of a typical operation that incurs unnecessary
transfers; see the comments indicating where the "problem" is:
import java.util.Random;

import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

/**
 * Typical gaussian convolution on an image
 */
public class GaussianKernel extends Kernel {

    /**
     * Runs a http://en.wikipedia.org/wiki/Difference_of_Gaussians
     * @param args
     */
    public static void main(String args[]) {
        // create random noise input
        int w = 50;
        int h = 50;
        float sourceImg[] = new float[w*h];
        Random rnd = new Random();
        for (int i = 0; i < sourceImg.length; i++) {
            sourceImg[i] = rnd.nextInt(256);
        }

        // setup gaussian gpu kernels
        GaussianKernel t1 = new GaussianKernel(100, 100, 100);
        t1.setSource(sourceImg, w, h);
        GaussianKernel t2 = new GaussianKernel(100, 100, 100);
        t2.setSource(sourceImg, w, h);

        // do 2 filterings
        float[] result1 = t1.filter(5);
        float[] result2 = t2.filter(10);

        // subtract the two
        SubtractKernel sub = new SubtractKernel(result1.length);
        // PROBLEM: result1 and result2 had to transfer from GPU,
        // will now be transfered back (to be subtracted), then the
        // final results come back from GPU again (4 extra transfers)
        // Is there any elegant way to bypass those transfers?
        float[] finalResult = sub.subtract(result1, result2);

        // print out difference of gaussians
        for (int i = 0; i < finalResult.length; i++) {
            System.out.println(result1[i] +","+result2[i] +","+finalResult[i]);
        }
    }

    final int maxWidth;
    final int maxHeight;
    final float[] width;
    final float[] height;
    final float[] radius;
    final float[] pass;
    final float[] gaussianKernel;
    final float[] midPixels;
    final float[] img;
    final float[] result;

    public GaussianKernel(int maxKernelSize, int maxW, int maxH) {
        this.maxWidth = maxW;
        this.maxHeight = maxH;
        this.gaussianKernel = new float[maxKernelSize];
        this.img = new float[maxW*maxH];
        this.midPixels = new float[maxW*maxH];
        this.result = new float[maxW*maxH];
        this.width = new float[1];
        this.height = new float[1];
        this.radius = new float[1];
        this.pass = new float[1];
    }

    public void setSource(float[] indata, int w, int h) {
        this.width[0] = w;
        this.height[0] = h;
        put(this.width);
        put(this.height);
        for (int i = 0; i < h; i++) {
            System.arraycopy(indata, i*w, img, i*this.maxWidth, w);
        }
        put(img);
    }

    public float[] filter(int rad) {
        Range imgRange = Range.create2D(maxWidth, maxHeight);
        radius[0] = rad;
        put(radius);
        float[] inKernel = createGaussianKernel(rad);
        System.arraycopy(inKernel, 0, this.gaussianKernel, 0, inKernel.length);
        put(gaussianKernel);
        // do horizontal filter
        pass[0] = 0;
        put(pass);
        execute(imgRange);
        // do vertical filter
        pass[0] = 1;
        put(pass);
        execute(imgRange);
        // get the results
        get(result);
        return result;
    }

    public void run() {
        int x = getGlobalId(0);
        int y = getGlobalId(1);
        int mxW = getGlobalSize(0);
        float v = 0;
        int rad = (int)radius[0];
        int w = (int)width[0];
        int h = (int)height[0];
        float blurFactor = 0f;
        int off = 0;
        float sample = 0f;
        if (x<w && y<h) {
            // horizontal pass
            if (pass[0]<0.5) {
                for (int i = -rad; i <= rad; i++) {
                    int subOffset = x + i;
                    if (subOffset < 0)
                        subOffset = 0;
                    else if (subOffset >= w)
                        subOffset = w-1;
                    sample = img[y*mxW + subOffset];
                    off = i;
                    if (off<0)
                        off = -off;
                    blurFactor = gaussianKernel[off];
                    v += blurFactor * sample;
                }
                midPixels[y*mxW+x] = v;
            }
            // vertical pass
            else {
                for (int i = -rad; i <= rad; i++) {
                    int subOffset = y + i;
                    if (subOffset < 0)
                        subOffset = 0;
                    else if (subOffset >= h)
                        subOffset = h-1;
                    sample = midPixels[subOffset*mxW + x];
                    off = i;
                    if (off<0)
                        off = -off;
                    blurFactor = gaussianKernel[off];
                    v += blurFactor * sample;
                }
                result[y*mxW+x] = v*0.5f;
            }
        }
        else
            result[y*mxW+x] = 0;
    }

    public float[] createGaussianKernel(int radius) {
        if (radius < 1) {
            throw new IllegalArgumentException("Radius must be >= 1");
        }
        float[] data = new float[radius + 1];
        float sigma = radius / 3.0f;
        float twoSigmaSquare = 2.0f * sigma * sigma;
        float sigmaRoot = (float) Math.sqrt(twoSigmaSquare * Math.PI);
        float total = 0.0f;
        for (int i = 0; i <= radius; i++) {
            float distance = i * i;
            data[i] = (float) Math.exp(-distance / twoSigmaSquare) / sigmaRoot;
            total += data[i];
        }
        for (int i = 0; i < data.length; i++) {
            data[i] /= total;
        }
        return data;
    }
}

/**
 * Simple kernel, subtracts the two values
 */
class SubtractKernel extends Kernel {
    final float[] left;
    final float[] right;
    final float[] result;

    public SubtractKernel(int size) {
        left = new float[size];
        right = new float[size];
        result = new float[size];
    }

    public float[] subtract(float[] a, float[] b) {
        assert(a.length==b.length);
        Range imgRange = Range.create(a.length);
        System.arraycopy(a, 0, left, 0, a.length);
        System.arraycopy(b, 0, right, 0, b.length);
        put(left);
        put(right);
        execute(imgRange);
        get(result);
        return result;
    }

    @Override
    public void run() {
        int x = getGlobalId(0);
        result[x] = right[x] - left[x];
    }
}
Original comment by kris.woo...@gmail.com
on 8 Feb 2013 at 1:26
Any progress on this? Please...?
Original comment by oliver.c...@gmail.com
on 29 Apr 2014 at 5:26
I'm also curious about any progress. It's a showstopper for us adopting
Aparapi for any real production / larger-scale use.
Original comment by kris.woo...@gmail.com
on 29 Apr 2014 at 5:39
I have to admit that I have not been working on this. For this to work we would
need to break the relationship between Kernel and buffers. At present the
lifetime of a buffer is tied to a kernel. For this to be implemented we would
have to expose some cross-kernel layer (akin to OpenCL's Context).
So instead of:
    int[] buf //= ....
    kernel1.put(buf);
    kernel1.execute(range);
    kernel1.get(buf);
    kernel1.dispose(); // cleanup buf mapping here
we would need to expose the relationship between kernels, maybe by having a
context:
    Context c = new Context();
    int[] buf //= ....
    c.put(buf);
    kernel1.execute(c, range);
    kernel2.execute(c, range);
    c.get(buf);
    kernel1.dispose();
    kernel2.dispose();
    c.dispose(); // really cleanup the buffers
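Applied to the difference-of-Gaussians example earlier in this thread, that would mean something like the following (still entirely hypothetical: Context, the execute(Context, Range) overload, and the variable names are not real API):

    Context c = new Context();
    c.put(sourceImg);                         // uploaded once, shared by both Gaussian kernels
    gaussian1.execute(c, imgRange);           // leaves its blurred output on the device
    gaussian2.execute(c, imgRange);
    subtract.execute(c, Range.create(w * h)); // consumes both outputs without a host round trip
    c.get(difference);                        // only the final image comes back to the host
    c.dispose();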
Original comment by frost.g...@gmail.com
on 29 Apr 2014 at 5:18
Original issue reported on code.google.com by
egidio.d...@gmail.com
on 14 Jul 2012 at 10:36