rzel / aparapi

Automatically exported from code.google.com/p/aparapi

Exchange variables between kernels without get&put on host memory #56

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. I create two classes extending Kernel.
2. In each constructor I pass the same array. Then with setExplicit(true) and 
put() I ask to copy the memory from host to GPU.
3. I execute the two kernels.

What is the expected output? What do you see instead?
They work on independent copies of the data even though, before the execution, the 
variables on the CPU point to the same memory location.
I don't know if there is a way to exchange variables between kernels without 
going through host memory. This example works with setExplicit(false) and if I 
use only one kernel multiple times (similar to your examples in the wiki).
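
A minimal sketch of the setup (class names and kernel bodies are illustrative, not my real code):

import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

class ScaleKernel extends Kernel {
    final float[] data;
    ScaleKernel(float[] data) { this.data = data; }
    @Override public void run() { data[getGlobalId()] *= 2f; }
}

class OffsetKernel extends Kernel {
    final float[] data;
    OffsetKernel(float[] data) { this.data = data; }
    @Override public void run() { data[getGlobalId()] += 1f; }
}

public class SharedArrayDemo {
    public static void main(String[] args) {
        float[] shared = new float[1024];
        ScaleKernel k1 = new ScaleKernel(shared);
        OffsetKernel k2 = new OffsetKernel(shared);
        k1.setExplicit(true); k1.put(shared);
        k2.setExplicit(true); k2.put(shared);
        k1.execute(Range.create(shared.length));
        k2.execute(Range.create(shared.length)); // works on k2's own device copy; it does not see k1's writes
        k1.get(shared); // only an explicit get()/put() round trip through the host shares the data
    }
}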

What version of the product are you using? On what operating system?
The version used is the stable one found on "downloads".

I hope this is useful. I think Aparapi is fantastic for speeding up my work!
Thank you!

Original issue reported on code.google.com by egidio.d...@gmail.com on 14 Jul 2012 at 10:36

GoogleCodeExporter commented 8 years ago
At present there are no mechanisms for avoiding buffer transfers between 
Kernels.  Actually, even multiple instances of the same Kernel will cause a 
transfer.  The 'cached' information (regarding what has been transferred to the 
device) is kept on a per-kernel-instance basis. 

Actually this can also be tricky in OpenCL itself, unless both Kernels are 
in the same program file (and share the same context, I think).

Maybe you are being forced to create two different Kernels because we don't 
currently offer a mechanism for having multiple entry points.  Is this the case?

If, for example, we allowed 

Kernel k = new Kernel(){
  public void run1(){ }
  public void run2(){ }
};

And your algorithms were expressed using run1() and run2(), would this work for 
you?  At present this does not work, but something like this has been proposed 
and is possible. 
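
For illustration, dispatch might then look something like this (purely hypothetical; no execute overload taking an entry-point name, nor any other way to select an entry point, exists in the current API):

k.execute("run1", range1);  // hypothetical: run entry point run1 over range1
k.execute("run2", range2);  // hypothetical: run entry point run2 over range2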

Is this something that would help you if it existed?

Gary

Original comment by frost.g...@gmail.com on 14 Jul 2012 at 9:49

GoogleCodeExporter commented 8 years ago
Hello Gary,
Thank you for the answer! In pure OpenCL I can transfer a buffer to device 
memory and then use it as an argument to two (or more) kernels (not very 
tricky). 
My algorithm needs several kernels that all work on the same arrays without 
involving the CPU. These arrays may be very large, so I have to avoid 
unnecessary transfers. If multiple entry points were the only trick for having 
common variables, that would be a useful help, but I think Aparapi needs more 
control over memory transfers, because in many algorithms memory is the key.
I hope I helped you,
Good work,
Egidio

Original comment by egidio.d...@gmail.com on 14 Jul 2012 at 10:04

GoogleCodeExporter commented 8 years ago
Is it possible for you to attach example code demonstrating this capability?

Original comment by ryan.lam...@gmail.com on 2 Aug 2012 at 3:47

GoogleCodeExporter commented 8 years ago
I second the motion for being able to share buffers between multiple kernels! 
Somehow I got the impression that this was already possible with Aparapi, and 
I've started implementing under that assumption... gah. :)

I don't think multiple entry points for the same kernel would be workable for 
me.

Finally, the initial issue comment notes that the arrays being put() to each 
kernel actually reference the same array. Following the "write once, run 
anywhere" mantra, the same code should behave the same way regardless of 
execution platform, so perhaps when "multiple" arrays that are really 
references to a single array are put(), that array should be shared between all 
the kernels to which it is put()?

Of course, explicit memory management combined with multiple kernels throws a 
spanner in the works of "write once, run anywhere": if one kernel ends up 
executing in JTP or SEQ mode and another in GPU mode while explicit memory 
management is being used then sharing the arrays/buffers between these kernels 
isn't going to work without transfers (which we've explicitly said should be 
handled explicitly...). Perhaps in this case transfers could be forced, or the 
GPU kernel forced to run in JTP mode, or an exception thrown and execution 
aborted? None of these seem like great options, but perhaps they would be 
better than the program not working as expected/intended?

Original comment by oliver.c...@gmail.com on 14 Aug 2012 at 1:16

GoogleCodeExporter commented 8 years ago
This is pretty important for pipeline-based approaches like ours.  We pass data 
through stages frequently, and shuttling it back and forth through the host is 
a major bottleneck (for us, anyway).  I have been using GLSL, CUDA, etc. for 
years and it's super easy over there. 

Could we have a call, something like Kernel.linkData(kernel1.outputFloats, 
kernel2.inputFloats), to be used similarly to put()?

Original comment by kris.woo...@gmail.com on 2 Oct 2012 at 10:51

GoogleCodeExporter commented 8 years ago
Any example code or test cases we can use in development to understand and 
address this issue?

Original comment by ryan.lam...@gmail.com on 3 Oct 2012 at 1:08

GoogleCodeExporter commented 8 years ago
For a real-world project you could look at https://github.com/OliverColeman/bain
This is a neural network simulator framework where the neurons and synapses are 
encapsulated in their own collections that implement the Kernel interface. The 
idea is that people can easily create and plugin different models of neurons or 
synapses by extending the neuron and synapse collection super classes (it's 
designed to support biophysical models or models somewhere between biophysical 
and classical computer science models). Thus there is a need to share neuron 
output and synapse output (and perhaps other data) between these two kernels.

In theory the framework could be modified to use a multiple entry point kernel, 
however Aparapi doesn't quite support what would be required. I had a 
discussion with Gary about it a while ago but just realised I never heard back 
from him. The discussion, reproduced below, also contains more details about 
the simulator:

Oliver Coleman wrote:

For the project I'm working on I either need functionality for shared buffers 
between kernels, or kernels with multiple entry points (I realised that 
multiple entry points could perhaps work for my case when I realised that 
different entry points could be run with different execute ranges: my project 
is a neural network simulator and needs to execute separate computations for 
the neurons and synapses, and there are typically many more synapses than 
neurons). I've attempted to use the method to emulate multiple entry points 
described at 
https://code.google.com/p/aparapi/wiki/EmulatingMultipleEntrypointsUsingCurrentAPI, 
but so far have had no success, and am wondering if there's any hope for 
the approach I'm trying; I hope someone can offer some insight!

All neurons and synapses are contained in their own respective "collection" 
objects, which consist of single dimension arrays of primitives for the state 
variables (containing an element for each neuron or synapse in the collection). 
Initially I had it set up so that a collection extended Kernel, which worked 
fine, except that without shared buffers it required transferring some buffers 
back and forth for every simulation step. The neuron and synapse collections 
are contained in a Simulation object, which for every simulation step simulates 
the neurons and then the synapses by calling their respective step() methods 
(which initially would put() the relevant buffers, call execute() and then 
get() the relevant buffers).

I've now modified it so that the Simulation class extends Kernel, and the 
collections provide a step(int index) method, which replaces the original run() 
method in the collections. In the Simulation run() method I try to call either 
neurons.step(getGlobalId()) or synapses.step(getGlobalId()), but get the error 
"Using java objects inside kernels is not supported". Is this because the 
primitive arrays being accessed in step(int index) are inside the collection 
objects?

I can't pull the primitive arrays out of the neuron and synapse collection 
objects and into the Simulation object as these collections extend a base 
collection class and add their own primitive arrays depending on the neuron or 
synapse model being used. Below are some snippets of relevant code.

public class Simulation extends Kernel {
    protected NeuronCollection neurons;
    protected SynapseCollection synapses;
    protected ComponentCollection kernelEntryPoint;
    // Not shown in the original snippet: the execute ranges and step counter used
    // below are assumed to be fields initialised elsewhere in the class.
    protected Range executeRangeNeurons;
    protected Range executeRangeSynapses;
    protected long step;

    public synchronized void step() {
        neurons.preStep();
        synapses.preStep();

        kernelEntryPoint = neurons;
        execute(executeRangeNeurons);

        kernelEntryPoint = synapses;
        execute(executeRangeSynapses);

        neurons.postStep();
        synapses.postStep();

        step++;
    }

    @Override
    public void run() {
        kernelEntryPoint.step(this.getGlobalId());
    }
}

public abstract class NeuronCollection extends ConfigurableComponentCollection {
    protected double[] neuronOutputs;
    protected boolean[] neuronSpikings;
    protected double[] neuronInputs;

    @Override
    public void preStep() {
        super.preStep();
        if (inputsModified) {
            simulation.put(neuronInputs);
        }
    }

    @Override
    public void step(int neuronID) {
        neuronInputs[neuronID] = 0;
        neuronSpikings[neuronID] = neuronOutputs[neuronID] >= 1;
    }
}

public class LinearNeuronCollection extends NeuronCollection {
    @Override
    public void step(int neuronID) {
        if (neuronID >= size)
            return;
        neuronOutputs[neuronID] = neuronInputs[neuronID];
        super.step(neuronID);
    }
}

gfrost frost.gary@gmail.com via googlegroups.com  17 Aug to aparapi-discuss 

Oliver thanks for reading the proposal! ;)

A few notes inline below. 

>On Thursday, August 16, 2012 12:47:34 AM UTC-5, Oliver Coleman wrote:
>For the project I'm working on I either need functionality for shared buffers 
between kernels, or kernels with multiple entry points (I realised that 
multiple entry points could perhaps work for my case when I realised that 
different entry points could be run with different execute ranges: my project 
is a neural network simulator and needs to execute separate computations for 
the neurons and synapses, and there are typically many more synapses than 
neurons). I've attempted to use the method to emulate multiple entry points 
described at 
https://code.google.com/p/aparapi/wiki/EmulatingMultipleEntrypointsUsingCurrentAPI, 
but so far have had no success, and am wondering if there's any hope for 
the approach I'm trying; I hope someone can offer some insight!

>Two of the main reasons for supporting multiple entrypoints are:
>1) so that different entrypoints could be executed over different ranges. 
>2) so that different entrypoints can operate on different sets of buffers.

>All neurons and synapses are contained in their own respective "collection" 
objects, which consist of single dimension arrays of primitives for the state 
variables (containing an element for each neuron or synapse in the collection). 
Initially I had it set up so that a collection extended Kernel, which worked 
fine, except that without shared buffers it required transferring some buffers 
back and forth for every simulation step. The neuron and synapse collections 
are contained in a Simulation object, which for every simulation step simulates 
the neurons and then the synapses by calling their respective step() methods 
(which initially would put() the relevant buffers, call execute() and then 
get() the relevant buffers).

>I've now modified it so that the Simulation class extends Kernel, and the 
collections provide a step(int index) method, which replaces the original run() 
method in the collections. In the Simulation run() method I try to call either 
neurons.step(getGlobalId()) or synapses.step(getGlobalId()), but get the error 
"Using java objects inside kernels is not supported". Is this because the 
primitive arrays being accessed in step(int index) are inside the collection 
objects?

Indeed the collections in this case are treated as 'other objects' and Aparapi 
is refusing to look inside these objects.  In this case BTW it would be 
possible to make your code work (with Aparapi changes) because you really do 
have parallel primitive arrays, they just happen to be held in nested 
containers. 

>I can't pull the primitive arrays out of the neuron and synapse collection 
objects and into the Simulation object as these collections extend a base 
collection class and add their own primitive arrays depending on the neuron or 
synapse model being used. Below are some snippets of relevant code.

Ah, so the collections in your example represent one example of a neuron 
implementation that an end user should be able to replace or override. 

So the problem here is that the required OpenCL code to access these primitives 
would indeed be different for each type of neuron/synapse that you allow.  The 
Kernel's types are unbound; only when we know the exact neuron/synapse 
configuration can we create the OpenCL for the kernel.

I am not sure that multiple-entrypoints will help you.  In a way you want to be 
able to configure the OpenCL creation by mixing in various flavors of 
synapse/neuron....

I need to think about this some more.... 

Oliver Coleman oliver.coleman@gmail.com via googlegroups.com 
17 Aug to aparapi-discuss 
Hi Gary, thanks so much for the speedy response. I think I now have more of a 
feel for why this won't work the way I tried to use it.

I've been thinking about this some more, and am pretty sure that there 
shouldn't be any great technical hurdles to making this work (as you indicate, 
since the parallel primitives just happen to be held in nested objects), given 
some level of restrictions on the nested objects. I think the levels, from most 
restrictive (and so perhaps easiest to handle in Aparapi first), are something 
like:

* Create a sub-class of Simulation which references specific sub-classes of 
NeuronCollection and SynapseCollection (perhaps also making these Collection 
sub-classes final); this way it is guaranteed that the primitive arrays are 
fixed for that sub-class of Simulation (a rough sketch follows below); or

* Specify the specific sub-classes of neuron and synapse collections via 
generics in the Simulation class (again perhaps making these Collection 
sub-classes final); this way it is guaranteed that for a particular instance of 
Simulation the primitive arrays are fixed.

I suppose it depends on whether the primitive arrays (or arrays of objects 
containing primitive fields) need to be bound for the kernel class or for a 
kernel instance.
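
A rough sketch of the first option (class names are illustrative; FixedSynapseCollection in particular is hypothetical):

// A final Kernel sub-class bound to concrete, final collection classes, so the
// set of primitive arrays is fixed at compile time for this class and could, in
// principle, be known when the OpenCL source is generated.
public final class LinearSimulation extends Simulation {
    public LinearSimulation(LinearNeuronCollection neurons, FixedSynapseCollection synapses) {
        this.neurons = neurons;
        this.synapses = synapses;
    }
}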

Cheers,
Oliver

Original comment by oliver.c...@gmail.com on 3 Oct 2012 at 1:40

GoogleCodeExporter commented 8 years ago
Bump... my project is basically stalled on this issue (and/or the multiple 
entry point Kernel issue, see above). Is this being actively worked on? If not, 
are there any definite plans to work on it?
Cheers,
Oliver

Original comment by oliver.c...@gmail.com on 8 Feb 2013 at 12:09

GoogleCodeExporter commented 8 years ago
[deleted comment]
GoogleCodeExporter commented 8 years ago
Here is a self-contained example of a typical operation that incurs unnecessary 
transfers; see the comments indicating where the "problem" is:

import java.util.Random;
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

/**
 * Typical gaussian convolution on an image
 */
public class GaussianKernel extends Kernel {
    /**
     * Runs a http://en.wikipedia.org/wiki/Difference_of_Gaussians
     * @param args
     */
    public static void main(String args[]){
        // create random noise input
        int w = 50;
        int h = 50;
        float sourceImg[] = new float[w*h];
        Random rnd = new Random();
        for (int i = 0; i < sourceImg.length; i++) {
            sourceImg[i] = rnd.nextInt(256);
        }

        // setup gaussian gpu kernels
        GaussianKernel t1 = new GaussianKernel(100, 100, 100);
        t1.setSource(sourceImg, w, h);
        GaussianKernel t2 = new GaussianKernel(100, 100, 100);
        t2.setSource(sourceImg, w, h);

        // do 2 filterings
        float[] result1 = t1.filter(5);
        float[] result2 = t2.filter(10);

        // subtract the two
        SubtractKernel sub = new SubtractKernel(result1.length);

        // PROBLEM: result1 and result2 had to transfer from GPU, 
        // will now be transfered back (to be subtracted), then the
        // final results come back from GPU again (4 extra transfers)
        // Is there any elegant way to bypass those transfers?
        float[] finalResult = sub.subtract(result1, result2);

        // print out difference of gaussians
        for (int i = 0; i < finalResult.length; i++) {
            System.out.println(result1[i] +","+result2[i] +","+finalResult[i]);
        }
    }
    final int maxWidth;
    final int maxHeight;
    final float[] width;
    final float[] height;
    final float[] radius;
    final float[] pass;

    final float[] gaussianKernel;
    final float[] midPixels;
    final float[] img;
    final float[] result;

    public GaussianKernel(int maxKernelSize, int maxW, int maxH) {
        this.maxWidth = maxW;
        this.maxHeight = maxH;
        this.gaussianKernel = new float[maxKernelSize];
        this.img = new float[maxW*maxH];
        this.midPixels = new float[maxW*maxH];
        this.result = new float[maxW*maxH];
        this.width = new float[1];
        this.height = new float[1];
        this.radius = new float[1];
        this.pass = new float[1];
        // assumed: explicit transfer mode is intended here, since the put()/get()
        // calls below manage host<->device transfers manually
        setExplicit(true);
    }
    public void setSource(float[] indata, int w, int h) {
        this.width[0] = w;
        this.height[0] = h;
        put(this.width);
        put(this.height);
        for (int i = 0; i < h; i++) {
            System.arraycopy(indata, i*w, img, i*this.maxWidth, w);
        }
        put(img);
    }

    public float[] filter(int rad) {
        Range imgRange = Range.create2D(maxWidth, maxHeight);

        radius[0] = rad;
        put(radius);

        float[] inKernel = createGaussianKernel(rad);
        System.arraycopy(inKernel, 0, this.gaussianKernel, 0, inKernel.length);
        put(gaussianKernel);

        // do horizontal filter
        pass[0] = 0;
        put(pass);
        execute(imgRange);

        // do vertical filter
        pass[0] = 1;
        put(pass);
        execute(imgRange);

        // get the results
        get(result);

        return result;
    }

    public void run() {

        int x = getGlobalId(0);
        int y = getGlobalId(1);
        int mxW = getGlobalSize(0);

        float v = 0;
        int rad = (int)radius[0];
        int w = (int)width[0];
        int h = (int)height[0];
        float blurFactor = 0f;
        int off = 0;
        float sample = 0f;
        if (x<w && y<h) {
            // horizontal pass
            if (pass[0]<0.5) {
                for (int i = -rad; i <= rad; i++) {
                    int subOffset = x + i;
                    if (subOffset < 0)
                        subOffset = 0;
                    else if (subOffset >= w)
                        subOffset = w-1;

                    sample = img[y*mxW + subOffset];
                    off = i;
                    if (off<0)
                        off = -off;
                    blurFactor = gaussianKernel[off];
                    v += blurFactor * sample;
                }
                midPixels[y*mxW+x] = v;
            }
            // vertical pass
            else {
                for (int i = -rad; i <= rad; i++) {
                    int subOffset = y + i;
                    if (subOffset < 0)
                        subOffset = 0;
                    else if (subOffset >= h)
                        subOffset = h-1;

                    sample = midPixels[subOffset*mxW + x];
                    off = i;
                    if (off<0)
                        off = -off;
                    blurFactor = gaussianKernel[off];
                    v += blurFactor * sample;
                }
                result[y*mxW+x] = v*0.5f;
            }
        }
        else
            result[y*mxW+x] = 0;
    }

    public float[] createGaussianKernel(int radius) {
        if (radius < 1) {
            throw new IllegalArgumentException("Radius must be >= 1");
        }

        float[] data = new float[radius + 1];

        float sigma = radius / 3.0f;
        float twoSigmaSquare = 2.0f * sigma * sigma;
        float sigmaRoot = (float) Math.sqrt(twoSigmaSquare * Math.PI);
        float total = 0.0f;

        for (int i = 0; i <= radius; i++) {
            float distance = i * i;
            data[i] = (float) Math.exp(-distance / twoSigmaSquare) / sigmaRoot;
            total += data[i];
        }

        for (int i = 0; i < data.length; i++) {
            data[i] /= total;
        }

        return data;
    }
}
/**
 * Simple kernel, subtracts the two values
 */
class SubtractKernel extends Kernel {
    final float[] left;
    final float[] right;
    final float[] result;
    public SubtractKernel(int size) {
        left = new float[size];
        right = new float[size];
        result = new float[size];
        // assumed: explicit transfer mode is intended, matching the put()/get() calls below
        setExplicit(true);
    }
    public float[] subtract(float[] a, float[] b) {
        assert(a.length==b.length);

        Range imgRange = Range.create(a.length);

        System.arraycopy(a, 0, left, 0, a.length);
        System.arraycopy(b, 0, right, 0, b.length);

        put(left);
        put(right);

        execute(imgRange);

        get(result);

        return result;
    }
    @Override
    public void run() {
        int x = getGlobalId(0);
        result[x] = right[x] - left[x]; 
    }
}

Original comment by kris.woo...@gmail.com on 8 Feb 2013 at 1:26

GoogleCodeExporter commented 8 years ago
Any progress on this? Please...?

Original comment by oliver.c...@gmail.com on 29 Apr 2014 at 5:26

GoogleCodeExporter commented 8 years ago
I'm also curious about any progress.  It's a showstopper for us adopting 
Aparapi for any real production / larger-scale use.

Original comment by kris.woo...@gmail.com on 29 Apr 2014 at 5:39

GoogleCodeExporter commented 8 years ago
I have to admit that I have not been working on this. For this to work we would 
need to break the relationship between Kernel and buffers.  At present the 
lifetime of a buffer is tied to a kernel.  For this to be implemented we would 
have to expose some cross-kernel layer (akin to OpenCL's context). 

So instead of 

int[] buf; // = ....
kernel1.put(buf);
kernel1.execute(range);
kernel1.get(buf);
kernel1.dispose(); // cleanup buf mapping here

We would need to expose the relationship between kernels, maybe by having a 
context:

Context c = new Context();
int[] buf; // = ....
c.put(buf);
kernel1.execute(c,range);
kernel2.execute(c,range);
c.get(buf);
kernel1.dispose();
kernel2.dispose();
c.dispose(); // really cleanup the buffers
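
Applied to the earlier difference-of-Gaussians example, this (still purely hypothetical) API might read something like the following, where the kernel and buffer names loosely follow that example:

Context c = new Context();
c.put(sourceImg);                  // the source image goes to the device once
gauss1.execute(c, imgRange);       // leaves its result in a device buffer owned by c
gauss2.execute(c, imgRange);       // likewise, no round trip through the host
subtract.execute(c, imgRange);     // consumes both results directly on the device
c.get(finalResult);                // only the final result is transferred back
gauss1.dispose();
gauss2.dispose();
subtract.dispose();
c.dispose();                       // really cleanup the buffers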

Original comment by frost.g...@gmail.com on 29 Apr 2014 at 5:18