Provide work-around for OS X OpenCL bug

GoogleCodeExporter commented 9 years ago

There appears to be a bug in Apple's OpenCL implementation that refuses to 
correctly recognize valid work group sizes.

For example:

!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
after clEnqueueNDRangeKernel, globalSize[0] = 128, localSize[0] = 16
after clEnqueueNDRangeKernel, globalSize[1] = 128, localSize[1] = 32
Dec 14, 2012 3:45:26 PM com.amd.aparapi.KernelRunner executeOpenCL
WARNING: ### CL exec seems to have failed. Trying to revert to Java ###

The suggested fix, as found on Google, is to add "-w 64" to the OpenCL compiler 
options. One suggested reason I found for this error is because Apple's OpenCL 
implementation multiplies any -w by 4 regardless of whether or not the 
resulting value is valid.

I believe any option we may choose to do this would introduce some form of 
platform and version dependency, but I'm not sure how we'll solve this without 
that.

Original issue reported on code.google.com by ryan.lam...@gmail.com on 15 Dec 2012 at 12:10

GoogleCodeExporter commented 9 years ago

Just some observations...this may actually be a bug in Aparapi as well?

If I change to trunk code to the following:

// -----------
// fix for Mac OSX CPU driver (and possibly others) which fail to give correct 
maximum work group info
// while using clGetDeviceInfo
// see: http://www.openwall.com/lists/john-dev/2012/04/10/4

    size_t local = 16;

// status = clGetKernelWorkGroupInfo(jniContext->kernel, 
(cl_device_id)jniContext->deviceId, CL_KERNEL_WORK_GROUP_SIZE, sizeof(local), 
&local, NULL);

if (status != CL_SUCCESS) {
    PRINT_CL_ERR(status, "clGetKernelWorkGroupInfo()");
} else {
    range.localDims[0] = range.localDims[0] > local ? local : range.localDims[0];
}
// ------ end fix

range.globalDims[0] = 64;

Which overrides the first global dimension passed to OpenCL by Aparapi I 
receive the following output:

!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
after clEnqueueNDRangeKernel, globalSize[0] = 64, localSize[0] = 16
after clEnqueueNDRangeKernel, globalSize[1] = 128, localSize[1] = 32
Dec 14, 2012 5:29:39 PM com.amd.aparapi.KernelRunner executeOpenCL
WARNING: ### CL exec seems to have failed. Trying to revert to Java ###

What that tells me is that Aparapi is possibly incorrectly checking for an 
setting the global and local sizes for individual dimensionals (for 
multi-dimensional kernels)

Original comment by ryan.lam...@gmail.com on 15 Dec 2012 at 1:48

GoogleCodeExporter commented 9 years ago

Sorry for all of the typos :(

In summary, it appears that the Aparapi C++ code is only getting or setting 
localDims[0] or globalDims[0] even for multi-dimensional kernels, a little 
farther down from where I modified the code above.

I wonder if we should investigate where all localDims and globalDims are 
getting set, make sure all three dimensionals are being set correctly and then 
decide if we should have an "if platform = OS X and OS version < 10.8" then set 
all globalDims and localDims appropriately.

Original comment by ryan.lam...@gmail.com on 15 Dec 2012 at 1:55

GoogleCodeExporter commented 9 years ago

Thanks for the link Ryan.  What output does cltest give for you (cd 
com.amd.aparapi.jni; ant cltest; ./cltest_x86_64)  

Here is mine (MacBookPro)

Device 1{
         CL_DEVICE_TYPE..................... GPU (0x0) 
         CL_DEVICE_MAX_COMPUTE_UNITS........ 2
         CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS. 3
             dim[0] = 1024
             dim[1] = 1024
             dim[2] = 64
         CL_DEVICE_MAX_WORK_GROUP_SIZE...... 1024
         CL_DEVICE_MAX_MEM_ALLOC_SIZE....... 268435456
         CL_DEVICE_GLOBAL_MEM_SIZE.......... 1073741824
         CL_DEVICE_LOCAL_MEM_SIZE........... 49152
         CL_DEVICE_PROFILE.................. FULL_PROFILE
         CL_DEVICE_VERSION.................. OpenCL 1.1 
         CL_DRIVER_VERSION.................. CLH 1.0
         CL_DEVICE_OPENCL_C_VERSION......... OpenCL C 1.1 
         CL_DEVICE_NAME..................... GeForce GT 650M
         CL_DEVICE_EXTENSIONS............... cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_APPLE_fp64_basic_ops 
      }

The link (thanks) implies that the result of  CL_DEVICE_MAX_WORK_GROUP_SIZE for 
each dimension cannot be trusted. 

So previously (if you recall) we tried to calculate this from the Java side, 
now (if I understand it correct) we actually query the device - which may lie?. 
 Is this the hypothesis? Is it worth 'backing out' the patch for querying the 
device. 

Sorry still a little confused. 

gary

Original comment by frost.g...@gmail.com on 15 Dec 2012 at 2:12

GoogleCodeExporter commented 9 years ago

I'm currently working on trying to figure this out as well, although I have to 
call it quits for the night soon. Sorry if my code snippets above are confusing.

It does appear that the OpenCL runtime is returning potentially valid results 
for clGetKernelWorkGroupInfo (valid looking power of 2), but using those 
results directly when calling clEnqueueNDRangeKernel is failing on Apple only.

Original comment by ryan.lam...@gmail.com on 15 Dec 2012 at 2:34

GoogleCodeExporter commented 9 years ago

evice 1{
         CL_DEVICE_TYPE..................... GPU (0x0) 
         CL_DEVICE_MAX_COMPUTE_UNITS........ 2
         CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS. 3
             dim[0] = 512
             dim[1] = 512
             dim[2] = 64
         CL_DEVICE_MAX_WORK_GROUP_SIZE...... 512
         CL_DEVICE_MAX_MEM_ALLOC_SIZE....... 134217728
         CL_DEVICE_GLOBAL_MEM_SIZE.......... 268435456
         CL_DEVICE_LOCAL_MEM_SIZE........... 16384
         CL_DEVICE_PROFILE.................. FULL_PROFILE
         CL_DEVICE_VERSION.................. OpenCL 1.0 
         CL_DRIVER_VERSION.................. CLH 1.0
         CL_DEVICE_OPENCL_C_VERSION......... OpenCL C 1.0 
         CL_DEVICE_NAME..................... GeForce 9400M
         CL_DEVICE_EXTENSIONS............... cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics 
      }
   }
}

Original comment by ryan.lam...@gmail.com on 15 Dec 2012 at 2:37

GoogleCodeExporter commented 9 years ago

Weird...I'm using OS X 10.7.5...OpenCL 1.0?

Original comment by ryan.lam...@gmail.com on 15 Dec 2012 at 2:43

GoogleCodeExporter commented 9 years ago

If I do something like this for grins:

size_t max_workgroup_size;

status = clGetKernelWorkGroupInfo(jniContext->kernel, 
(cl_device_id)jniContext->deviceId, CL_KERNEL_WORK_GROUP_SIZE, 
sizeof(max_workgroup_size), &max_workgroup_size, NULL);

fprintf(stderr, "max_workgroup_size: %d \n", max_workgroup_size);
fprintf(stderr, "Before range.localDims:\n %d %d %d \n", 
range.localDims[0],range.localDims[1],range.localDims[2]);

if (status != CL_SUCCESS) {
    PRINT_CL_ERR(status, "clGetKernelWorkGroupInfo()");
} else {
    range.localDims[0] = 16;
    range.localDims[1] = 16;
    range.localDims[2] = 16;
}
// ------ end fix

fprintf(stderr, "After range.localDims:\n %d %d %d \n", 
range.localDims[0],range.localDims[1],range.localDims[2]);

The OS X test will complete execution, but will return invalid results. Just 
for informational purposes, the value of max_workgroup_size is 256.

Original comment by ryan.lam...@gmail.com on 15 Dec 2012 at 2:51

GoogleCodeExporter commented 9 years ago

So the "Apple incorrectly multiplies group size by 4 behind the scenes" appears 
to be correct.

I created a test which required the following:

range.localDims:
16 32 100

I had to modify the source code to do the following:

} else {
    range.localDims[0] = 4;
    range.localDims[1] = 8;
    range.localDims[2] = 100;
}

The range outputs as the following:

range.localDims:
4 8 100

Which then proceeds to execute correctly as 16, 32, 100. That's annoying.

Original comment by ryan.lam...@gmail.com on 15 Dec 2012 at 3:01

GoogleCodeExporter commented 9 years ago

My guess is that 256 is the correct max workgroup size. But Apple probably has 
a simple typo somewhere and is doing the following:

16*4*4 = 256
32*4*4 = 512 (oops!)

But the following works:

(16/4)*4*4 = 128
(32/4)*4*4 = 256

Which is all I did above to eliminate the incorrect multiply.

Original comment by ryan.lam...@gmail.com on 15 Dec 2012 at 3:04

GoogleCodeExporter commented 9 years ago

Sorry, it's late...256 would be the correct clGetKernelWorkGroupInfo whereas 
512 is the device maximum (which should apparently be ignored)

Original comment by ryan.lam...@gmail.com on 15 Dec 2012 at 3:08

GoogleCodeExporter commented 9 years ago

Except if I divide 100 by 4 to set range.localDims[2] = 25 then I get incorrect 
results again....

Original comment by ryan.lam...@gmail.com on 15 Dec 2012 at 3:16

jordan30001 / aparapi

Provide work-around for OS X OpenCL bug #86