Running sample application fails in clEnqueueNDRangeKernel

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Compile and run the squares sample application on Mac OSX 10.7

What is the expected output? What do you see instead?

I expected it to be executed on the GPU, instead it was executed on the CPU 
after an error in the OpenCL execution. See below for the full error.

What version of the product are you using? On what operating system?

Using the Aparapi download from 2012-05-06 on OSX 10.7.4 with Java version 
1.6.31.
I'm not sure if this is part of the problem but the machine I'm using is a 
MacBook with an AMD Radeon HD 6750M videocard and a built-in Intel HD Graphics 
3000 as well.

Please provide any additional information below.

Here's the output before aparapi reverted to Java:

platform name    0 Apple
platform version 0 OpenCL 1.1 (Apr  9 2012 19:41:45)
platform Apple supports requested device type
device[0xffffffff]: Type: CPU 
in setArgs arg 0 val$squares type 00001684
in setArgs arg 0 val$squares is *not* local
in setArgs arg 1 val$values type 00001284
in setArgs arg 1 val$values is *not* local
got type for val$squares: 00001684
testing for Resync javaArray val$squares: old=0x0, new=0x7fc741c2b038
Resync javaArray for val$squares: 0x7fc741c2b038  0x0
NewWeakGlobalRef for val$squares, set to 0x7fc741c27a10
updateNonPrimitiveReferences, args[0].sizeInBytes=2048
got type for val$values: 00001284
testing for Resync javaArray val$values: old=0x0, new=0x7fc741c2b040
Resync javaArray for val$values: 0x7fc741c2b040  0x0
NewWeakGlobalRef for val$values, set to 0x7fc741c27a18
updateNonPrimitiveReferences, args[1].sizeInBytes=2048
back from updateNonPrimitiveReferences
got type for arg 0, val$squares, type=00001684
runKernel: arrayOrBuf ref 0x7fc741c27a10, oldAddr=0x0, newAddr=0x7f312ce08, 
ref.mem=0x0, isArray=1
at memory addr 0x7f312ce08, contents: 00 00 00 00 00 00 00 00 
val$squares 0 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE, 
size=00000800 bytes, address=f312ce08, &status)
 writing buffer 0 val$squares
got type for arg 1, val$values, type=00001284
runKernel: arrayOrBuf ref 0x7fc741c27a18, oldAddr=0x0, newAddr=0x7f312c5f8, 
ref.mem=0x0, isArray=1
at memory addr 0x7f312c5f8, contents: 00 00 00 00 00 00 80 3f 
val$values 1 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, 
size=00000800 bytes, address=f312c5f8, &status)
 writing buffer 1 val$values
!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
after clEnqueueNDRangeKernel, globalSize_0=512 localSize_0=256
28-mei-2012 14:40:54 com.amd.aparapi.KernelRunner executeOpenCL
WARNING: ### CL exec seems to have failed. Trying to revert to Java ###

Original issue reported on code.google.com by misja.a...@gmail.com on 28 May 2012 at 1:06

GoogleCodeExporter commented 8 years ago

Can you execute clinfo on this machine? I think this might be Apple's driver 
not liking a workgroup/local size of 256.

If you execute clinfo it should tell you the maximum group size. I think some 
Mac OSX machines report 1024, but the actual limit is 128. 

To test this, instead of using 

kernel.execute(yourSize);

create a range using a fixed local size:

Range range = Range.create(yourSize, 128);
kernel.execute(range);

This fixes the local size (to 128) rather than using the default. 
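
For reference, OpenCL 1.x rejects a launch with "invalid work group size" when the local size does not divide the global size evenly, or when it exceeds what the device or kernel allows. Below is a minimal sketch of a helper that picks a local size under a given cap. The class and method names are hypothetical and not part of Aparapi, and the cap of 128 is the value suggested above, not something queried from the driver.

```java
public class LocalSizePicker {
    /**
     * Returns the largest local size that is <= maxLocalSize and divides
     * globalSize evenly. Hypothetical helper, not an Aparapi API.
     */
    static int pickLocalSize(int globalSize, int maxLocalSize) {
        if (globalSize <= 0 || maxLocalSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        // Walk down from the cap until a divisor of globalSize is found.
        for (int local = Math.min(globalSize, maxLocalSize); local > 1; local--) {
            if (globalSize % local == 0) {
                return local;
            }
        }
        return 1; // 1 always divides globalSize
    }

    public static void main(String[] args) {
        System.out.println(pickLocalSize(512, 128));
        System.out.println(pickLocalSize(1000, 128));
    }
}
```

The result could then be passed as the second argument of Range.create(yourSize, localSize).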

Gary

Original comment by frost.g...@gmail.com on 28 May 2012 at 7:12

GoogleCodeExporter commented 8 years ago

Setting the local size to 128 did fix the issue, thx!

About executing clinfo: Is there a way to do this using aparapi?

Original comment by misja.a...@gmail.com on 29 May 2012 at 9:47

GoogleCodeExporter commented 8 years ago

Glad to hear that setting the localsize manually fixed this. This is obviously 
more of a 'workaround' than a fix, but hopefully it will move you forward.

Regarding executing clinfo from Aparapi.

You cannot currently do this from any of the binary
downloads.  If you are building from the trunk, the new Device class
offers enough information to extract the actual global
and local sizes.

So from trunk code you could use something like

Device device = Device.bestGPU();
Range range = device.createRange(yoursize);

This creates a range suitable for a specific device, one that 'honors' the
limits imposed by the device and OpenCL runtime.

However this is still all relatively new and not quite ready for primetime.

I will keep this open until we create a new binary distribution.

Gary

Original comment by frost.g...@gmail.com on 29 May 2012 at 1:04

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

I built aparapi from the trunk this time, and now I don't even need to set any 
range or local size anymore, because the squares sample app runs without any 
error!
Your trunk code is neatly organized by the way, I could even build it without 
any problems on my Mac.

I checked the max. workgroupsize that is reported by the Device class for my 
gpu, and it is 1024.
(Max. dimensions is 3 and max. workitems is 1024 for every dimension.)
And indeed I can even run the squares demo with a range size of 1024.

Could it be that something else was wrong with the binary distribution, causing 
the error? Maybe it was mixing up my 2 videocards somehow?

Original comment by misja.a...@gmail.com on 30 May 2012 at 7:07

GoogleCodeExporter commented 8 years ago

Hmm... I wish I could claim credit for fixing this... :)

It might be that we are using the device infrastructure underneath.
Did you by any chance update your OpenCL driver?

So the trunk code is different.  It now uses Device.firstGPU() under
the hood (unless the Range was created via another device).

What devices do you have?  Two identical cards?

You may be able to help me test something ;)

If you have a build from the trunk you can do this.

Device device = Device.firstGPU();
Range range = device.createRange(1024);
kernel.execute(range);

Which is what Aparapi is now trying to do by default. Note that by
creating a Range via the device, the Range  is bound to that device
and is 'guaranteed' (well highly likely ;) ) to have compatible
groupsizes.

You can also select the best GPU (which might be the first!):
Device device = Device.best();
Range range = device.createRange(1024);
kernel.execute(range);

You can also print the device info:

if (device instanceof OpenCLDevice){
    System.out.println("vendor ="+
      ((OpenCLDevice)device).getPlatform().getVendor());
}

And you can create your own filter (say, for picking the first AMD device
;) ) using the DeviceComparitor interface.

Here is the code for 'best'

   public static Device best() {
      return (OpenCLDevice.select(new DeviceComparitor(){
         @Override public OpenCLDevice select(OpenCLDevice _deviceLhs, OpenCLDevice _deviceRhs) {
            // Different types: prefer whichever one is a GPU.
            if (_deviceLhs.getType() != _deviceRhs.getType()) {
               if (_deviceLhs.getType() == TYPE.GPU) {
                  return (_deviceLhs);
               } else {
                  return (_deviceRhs);
               }
            }
            // Same type: prefer the one with more compute units.
            if (_deviceLhs.getMaxComputeUnits() > _deviceRhs.getMaxComputeUnits()) {
               return (_deviceLhs);
            } else {
               return (_deviceRhs);
            }
         }
      }));
   }
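
To make the comparator idea concrete, here is a self-contained sketch of pairwise selection over a list of devices. Everything in it is a stand-in: Dev, Chooser and the device list are hypothetical, the compute-unit counts are made up for illustration, and the way the real OpenCLDevice.select() walks the devices is an assumption.

```java
import java.util.Arrays;
import java.util.List;

public class PairwiseSelect {
    enum Type { CPU, GPU }

    /** Hypothetical stand-in for OpenCLDevice: a name, a type and a compute-unit count. */
    static final class Dev {
        final String name;
        final Type type;
        final int computeUnits;

        Dev(String name, Type type, int computeUnits) {
            this.name = name;
            this.type = type;
            this.computeUnits = computeUnits;
        }
    }

    /** Same shape as the comparator above: given two candidates, return the preferred one. */
    interface Chooser {
        Dev select(Dev lhs, Dev rhs);
    }

    /** Folds the chooser over the list, keeping the current winner (null if the list is empty). */
    static Dev select(List<Dev> devices, Chooser chooser) {
        Dev best = null;
        for (Dev d : devices) {
            best = (best == null) ? d : chooser.select(best, d);
        }
        return best;
    }

    /** Mirrors the 'best' rule: prefer a GPU, then the higher compute-unit count. */
    static final Chooser BEST = new Chooser() {
        @Override public Dev select(Dev lhs, Dev rhs) {
            if (lhs.type != rhs.type) {
                return lhs.type == Type.GPU ? lhs : rhs;
            }
            return lhs.computeUnits > rhs.computeUnits ? lhs : rhs;
        }
    };

    public static void main(String[] args) {
        // Illustrative values only, not measurements from any real machine.
        List<Dev> devices = Arrays.asList(
            new Dev("cpu", Type.CPU, 8),
            new Dev("gpuA", Type.GPU, 4),
            new Dev("gpuB", Type.GPU, 6));
        System.out.println(select(devices, BEST).name);
    }
}
```

A filter for a particular vendor would follow the same pattern, with the chooser comparing a vendor field instead of the type.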

Can you try creating a range via a device (as shown above) and
validate that it is working?

Gary

Original comment by frost.g...@gmail.com on 30 May 2012 at 10:02

GoogleCodeExporter commented 8 years ago

No, my two cards are different: one is a CPU-integrated Intel card and the other 
an AMD Radeon.

I tried creating the range the way you described. This time I got an error 
again when executing the squares application:

!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
after clEnqueueNDRangeKernel, globalSize[0] = 1024, localSize[0] = 1024
31-mei-2012 20:23:39 com.amd.aparapi.KernelRunner executeOpenCL
WARNING: ### CL exec seems to have failed. Trying to revert to Java ###
!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
after clEnqueueNDRangeKernel, globalSize[0] = 1024, localSize[0] = 1024
31-mei-2012 20:23:39 com.amd.aparapi.KernelRunner executeOpenCL
WARNING: ### CL exec seems to have failed. Trying to revert to Java ###
!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
after clEnqueueNDRangeKernel, globalSize[0] = 1024, localSize[0] = 1024

Same story when I used a range size of 512. With range size 128 it executed 
without errors, just like with the binary distribution :)
I tried it with both Device.best() and Device.firstGPU(). I also printed the 
vendor and name of the device. Strangely enough, they both give the same output: 
'vendor = Apple'
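
The trial and error above (1024 and 512 fail, 128 works) can be written as a halving probe. This is only a sketch: firstAccepted and the predicate are hypothetical names, and the predicate here is a fake that accepts sizes up to 128 to mirror what I saw. Wiring it to a real kernel launch is left open, since Aparapi falls back to Java on failure rather than reporting the error to the caller.

```java
import java.util.function.IntPredicate;

public class LocalSizeProbe {
    /**
     * Tries local sizes start, start/2, start/4, ... and returns the first
     * one the launcher accepts, or -1 if none is accepted.
     */
    static int firstAccepted(int start, IntPredicate launchSucceeds) {
        for (int local = start; local >= 1; local /= 2) {
            if (launchSucceeds.test(local)) {
                return local;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Fake launcher: accepts only local sizes <= 128 (assumption for illustration).
        IntPredicate fakeLaunch = local -> local <= 128;
        System.out.println(firstAccepted(1024, fakeLaunch));
    }
}
```

With a power-of-two global size, every halved value still divides it evenly, so only the upper limit is being probed.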

Original comment by misja.a...@gmail.com on 31 May 2012 at 6:34

GoogleCodeExporter commented 8 years ago

Thanks for the help debugging. Clearly I have some work to do.

Apple is your OpenCL vendor (i.e. they supply the OpenCL runtime).

Gary

Original comment by frost.g...@gmail.com on 31 May 2012 at 7:34

GoogleCodeExporter commented 8 years ago

I believe this issue is related to 
http://code.google.com/p/aparapi/issues/detail?id=86

Original comment by ryan.lam...@gmail.com on 15 Dec 2012 at 12:15

Running sample application fails in clEnqueueNDRangeKernel #52