nativelibs4java / nativelibs4java

Repository for all NativeLibs4Java projects : JavaCL, ScalaCL, BridJ, JNAerator...
http://code.google.com/p/nativelibs4java/
291 stars 83 forks source link

ReductionUtils.Reductor returns wrong result when input data size is greater than 4096 #28

Closed ochafik closed 13 years ago

ochafik commented 13 years ago

What steps will reproduce the problem? 1.Create array which size is 4097 (type of array doesn't affect the bug, for example int[] or float[])

2.Initialize it with 1

3.Make necessary JavaCL Initialization

  1. ReductionUtils.Reductor reductor = ReductionUtils.createReductor(context, ReductionUtils.Operation.Add, ReductionUtils.Type.Float, 1);
  2. FloatBuffer result = reductor.reduce(queue, clBufferInput, 4097, 64);

What is the expected output? What do you see instead? ->Expected output = 4097.0 || Real output = 8192.0 ->Also if input data size is very big, like 82369, than it throws: "CLException$InvalidValue: InvalidValue"

What version of the product are you using? On what operating system?

->I'm using "javacl-1.0-beta-4-shaded.jar" on Windows 7, Nvidia Quadro FX 580

I attached .java which was used for tests. Changing MaxReductionSize doesn't affect results. If input data size is 4096 or lower it works fine.

Google Code Info: Issue #: 26 Author: roman.ry...@gmail.com Created On: 2010-06-04T01:35:07.000Z Closed On: 2010-12-01T12:00:02.000Z

ochafik commented 13 years ago

Well I'm still investigating, but it might be device memory problems, because I have similar problems with new kernels and I forgot to do manual release of used objects while debugging.

Now I've released some objects, but problem persists. Maybe it's because I've run many kernels without releases and their code was deleted.

Is there some way to clear device memory completely?

Google Code Info: Author: roman.ry...@gmail.com Created On: 2010-06-04T12:22:04.000Z

ochafik commented 13 years ago

Hi Roman,

Interesting issue...

Maybe writing zeroes on a huge CLByteBuffer allocated on that device with a size close to the announced device memory size (in a big try block) might do the trick, but I rather suspect a bug in the reduction algorithm (maybe not checking the upper execution bound correctly... I think this must be done in the source code as well as in the enqueueNDRange command).

Cheers

Google Code Info: Author: olivier.chafik Created On: 2010-06-04T14:34:52.000Z

ochafik commented 13 years ago

Please try to reproduce the bug, as it may be only my device causing this problem.

Thanks

Google Code Info: Author: roman.ry...@gmail.com Created On: 2010-06-06T15:02:48.000Z

ochafik commented 13 years ago

Hi Roman,

I couldn't reproduce the bug with the following code on my Mac (snow leopard) with the current SVN version of JavaCL :

{{{ @Test public void testIssue26() { try { float[] array = new float[4097]; for (int i = 0; i < array.length; i++) array[i] = 1;

        CLFloatBuffer clBufferInput = context.createFloatBuffer(CLMem.Usage.Input, FloatBuffer.wrap(array), true);

        ReductionUtils.Reductor<FloatBuffer> reductor = ReductionUtils.createReductor(
            context, 
            ReductionUtils.Operation.Add, 
            ReductionUtils.Type.Float, 
            1
        );
        FloatBuffer result = reductor.reduce(queue, clBufferInput, 4097, 64);
        float sum = result.get(0);
        float expected = 4097;
        System.err.println("[Test of issue 26] Expected " + expected + ", got " + sum);
        assertEquals(expected, sum, 0);
    } catch (Exception ex) {
        ex.printStackTrace();
        assertTrue(ex.toString(), false);
    }
}

}}}

(this test will be automatically run by "mvn test" since revision #995)

Could you please try this code on your setup ? And if it still fails, can you try it with the latest SVN version of JavaCL ?

Thanks,

Cheers

Olivier

Google Code Info: Author: olivier.chafik Created On: 2010-06-09T21:16:56.000Z

ochafik commented 13 years ago

Hi. I was using JavaCL for an academic project and I didn't have time to "play" with this bug. Now I don't have any OpenCL capable device, so if no one had same problem you can close it...

Sorry for wasting your time.

Google Code Info: Author: roman.ry...@gmail.com Created On: 2010-08-28T17:19:15.000Z

ochafik commented 13 years ago

Hi Roman,

No worries, thanks for the update :-) Cheers

Olivier

Google Code Info: Author: olivier.chafik Created On: 2010-08-28T17:57:20.000Z

ochafik commented 13 years ago

Hello Olivier,

I have suddenly stumbled upon a problem which might be the same issue Roman encountered.

I have adapted your reduction code so that it can perform two operations on two different buffers at the same time. My changes are very straightforward, you'll have no problem recognizing anything in my test case. This is why I do kind of assume that the bug which causes my test to fail is actually a bug in the ReductionUtils class.

I have two arrays, A and B, length 4097, and the attached code is supposed to compute the sum of each of them. Upon inspecting the result and playing with the input numbers, it seems that the sum for B doesn't contain the numbers B[0] to B[63], but it does contain A[4096]! (The sum for A is always correct.)

I don't really understand the implementation, but I am fairly sure that the addition of the second array should work as expected (and I will be embarrassed if I'm mistaken).

A=[1, 1, 1, ..., 1, 3.1415] B=[10000, 10000, ..., 10000 (index 63), 0, 0, ..., 0]

Correct sumA=4099.141602, sumB=640000.000000 Reduction result sumA=4099.141602, sumB=3.141500

Csaba

Google Code Info: Author: kazocs...@gmail.com Created On: 2010-11-29T15:32:43.000Z

ochafik commented 13 years ago

I believe I have found a problem with the reduction code.

The temporary buffers created on line 213 of ReductionUtils are too small. The kernels write the positions (globalId/blockLength) = (globalId/maxReductionSize), and globalId ranges from 0 to (inputLength-1). So the size of the buffers should be (inputLength-1)/maxReductionSize, but currently they are only maxReductionSize long. (I ignore the valueChannels value, don't really understand what it does, I assume it is 1.)

Having understood how this little algorithm works in order to find the issue, I have come up with something I don't understand. The reduction kernel operates on a specific 'block', but this block=globalId/blockLength, which means that series of 'blockLength' work items will operate on the same input and write to the same output slot. So it would seem to me that the global work size could be cut down.

Google Code Info: Author: kazocs...@gmail.com Created On: 2010-12-01T10:45:22.000Z

ochafik commented 13 years ago

Hi Csaba,

Thanks a lot for your feedback and work on this issue. The channels are meant for separate computations of packed values components ((float, float), for instance) : each channel gets reduced separately (trying to exploit OpenCL's float2, float4... built-in arithmetics).

You're totally right about the block issue, I'm now completely ashamed of my code ! I've committed what I believe to be a fix (tests still run fine, at least) in revision 1408.

Please let me know if you think it still does not work the way it ought to :-) Thanks again,

Cheers

Olivier

Google Code Info: Author: olivier.chafik Created On: 2010-12-01T12:00:02.000Z

ochafik commented 13 years ago

I think that it works correctly now, great job, Olivier. :)

Although I'm not quite sure why you round up the global work size from blocksInCurrentDepth to the next power of two, and then filter out the excess work with an 'if' condition in the kernel. I don't see how it would improve performance.

Csaba

Google Code Info: Author: kazocs...@gmail.com Created On: 2010-12-01T13:56:54.000Z

ochafik commented 13 years ago

Hum, I'm not that sure anymore about the rounding... I do have issues (nvidia blowing up) if the global work size is 1 though, so I've left the tests but I've removed the power of two thingie : http://code.google.com/p/nativelibs4java/source/detail?r=1411

Thanks !

Olivier

Google Code Info: Author: olivier.chafik Created On: 2010-12-01T14:58:38.000Z