Open GoogleCodeExporter opened 9 years ago
Just some observations...this may actually be a bug in Aparapi as well?
If I change the trunk code to the following:
// -----------
// fix for Mac OSX CPU driver (and possibly others) which fail to give correct maximum work group info
// while using clGetDeviceInfo
// see: http://www.openwall.com/lists/john-dev/2012/04/10/4
size_t local = 16;
// status = clGetKernelWorkGroupInfo(jniContext->kernel, (cl_device_id)jniContext->deviceId, CL_KERNEL_WORK_GROUP_SIZE, sizeof(local), &local, NULL);
if (status != CL_SUCCESS) {
PRINT_CL_ERR(status, "clGetKernelWorkGroupInfo()");
} else {
range.localDims[0] = range.localDims[0] > local ? local : range.localDims[0];
}
// ------ end fix
range.globalDims[0] = 64;
which overrides the first global dimension passed to OpenCL by Aparapi, then I
receive the following output:
!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
after clEnqueueNDRangeKernel, globalSize[0] = 64, localSize[0] = 16
after clEnqueueNDRangeKernel, globalSize[1] = 128, localSize[1] = 32
Dec 14, 2012 5:29:39 PM com.amd.aparapi.KernelRunner executeOpenCL
WARNING: ### CL exec seems to have failed. Trying to revert to Java ###
What that tells me is that Aparapi is possibly checking for and setting the
global and local sizes incorrectly for individual dimensions (for
multi-dimensional kernels).
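To make the per-dimension checks concrete, here is a rough sketch of the rules I understand clEnqueueNDRangeKernel to enforce under OpenCL 1.x: each local size must fit the device's per-dimension limit, each global size must be a multiple of the local size, and the product of the local sizes must not exceed the work-group limit. The helper name `range_is_valid` and its parameters are hypothetical and are not part of Aparapi.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not Aparapi code. Returns 1 if the range passes the
   OpenCL 1.x work-group checks described above, 0 otherwise. */
int range_is_valid(int dims, const size_t *global, const size_t *local,
                   const size_t *maxItemSizes, size_t maxGroupSize) {
    assert(dims >= 1 && dims <= 3);
    size_t groupSize = 1;
    for (int i = 0; i < dims; i++) {
        /* each local size must be non-zero and within the per-dimension limit */
        if (local[i] == 0 || local[i] > maxItemSizes[i]) return 0;
        /* each global size must be evenly divisible by its local size */
        if (global[i] % local[i] != 0) return 0;
        /* the total work-group size is the product over all dimensions */
        groupSize *= local[i];
        if (groupSize > maxGroupSize) return 0;
    }
    return 1;
}
```

With the sizes from the output above (global 64 x 128, local 16 x 32), the work-group holds 16 * 32 = 512 work-items, so this check would reject the range for any limit below 512. Whether that is what happens on OS X here is only a guess.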
Original comment by ryan.lam...@gmail.com
on 15 Dec 2012 at 1:48
Sorry for all of the typos :(
In summary, it appears that the Aparapi C++ code is only getting or setting
localDims[0] or globalDims[0] even for multi-dimensional kernels, a little
farther down from where I modified the code above.
I wonder if we should investigate where all localDims and globalDims are
getting set, make sure all three dimensions are being set correctly, and then
decide if we should have an "if platform = OS X and OS version < 10.8" check that
sets all globalDims and localDims appropriately.
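As a sketch of what "setting all dimensions correctly" could look like, the following shrinks every local dimension (not just the first) so that it divides its global size and the overall product stays within a given work-group limit. The name `fit_local_dims` and the budget-splitting strategy are my own assumptions for illustration, not what Aparapi does today.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not Aparapi code. Adjusts local[] in place so that
   local[i] divides global[i] and the product of local[] <= maxGroupSize. */
void fit_local_dims(int dims, const size_t *global, size_t *local,
                    size_t maxGroupSize) {
    assert(dims >= 1 && dims <= 3);
    assert(maxGroupSize >= 1);
    size_t budget = maxGroupSize; /* work-items still available to later dims */
    for (int i = 0; i < dims; i++) {
        assert(global[i] >= 1);
        /* clamp the requested local size to the remaining budget */
        size_t l = local[i] < budget ? local[i] : budget;
        if (l < 1) l = 1;
        /* step down until it divides the global size; stops at 1 at the latest */
        while (global[i] % l != 0) l--;
        local[i] = l;
        budget /= l;
    }
}
```

For a global size of 64 x 128, a requested local size of 16 x 32 and a limit of 256, this yields 16 x 16. It favours the earlier dimensions, which may or may not be the right trade-off for a real kernel.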
Original comment by ryan.lam...@gmail.com
on 15 Dec 2012 at 1:55
Thanks for the link, Ryan. What output does cltest give for you (cd
com.amd.aparapi.jni; ant cltest; ./cltest_x86_64)?
Here is mine (MacBookPro):
Device 1{
CL_DEVICE_TYPE..................... GPU (0x0)
CL_DEVICE_MAX_COMPUTE_UNITS........ 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS. 3
dim[0] = 1024
dim[1] = 1024
dim[2] = 64
CL_DEVICE_MAX_WORK_GROUP_SIZE...... 1024
CL_DEVICE_MAX_MEM_ALLOC_SIZE....... 268435456
CL_DEVICE_GLOBAL_MEM_SIZE.......... 1073741824
CL_DEVICE_LOCAL_MEM_SIZE........... 49152
CL_DEVICE_PROFILE.................. FULL_PROFILE
CL_DEVICE_VERSION.................. OpenCL 1.1
CL_DRIVER_VERSION.................. CLH 1.0
CL_DEVICE_OPENCL_C_VERSION......... OpenCL C 1.1
CL_DEVICE_NAME..................... GeForce GT 650M
CL_DEVICE_EXTENSIONS............... cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_APPLE_fp64_basic_ops
}
The link (thanks) implies that the result of CL_DEVICE_MAX_WORK_GROUP_SIZE for
each dimension cannot be trusted.
So previously (if you recall) we tried to calculate this from the Java side;
now (if I understand it correctly) we actually query the device, which may lie.
Is this the hypothesis? Is it worth 'backing out' the patch for querying the
device?
Sorry still a little confused.
gary
Original comment by frost.g...@gmail.com
on 15 Dec 2012 at 2:12
I'm currently working on trying to figure this out as well, although I have to
call it quits for the night soon. Sorry if my code snippets above are confusing.
It does appear that the OpenCL runtime is returning potentially valid results
for clGetKernelWorkGroupInfo (valid looking power of 2), but using those
results directly when calling clEnqueueNDRangeKernel is failing on Apple only.
Original comment by ryan.lam...@gmail.com
on 15 Dec 2012 at 2:34
Device 1{
CL_DEVICE_TYPE..................... GPU (0x0)
CL_DEVICE_MAX_COMPUTE_UNITS........ 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS. 3
dim[0] = 512
dim[1] = 512
dim[2] = 64
CL_DEVICE_MAX_WORK_GROUP_SIZE...... 512
CL_DEVICE_MAX_MEM_ALLOC_SIZE....... 134217728
CL_DEVICE_GLOBAL_MEM_SIZE.......... 268435456
CL_DEVICE_LOCAL_MEM_SIZE........... 16384
CL_DEVICE_PROFILE.................. FULL_PROFILE
CL_DEVICE_VERSION.................. OpenCL 1.0
CL_DRIVER_VERSION.................. CLH 1.0
CL_DEVICE_OPENCL_C_VERSION......... OpenCL C 1.0
CL_DEVICE_NAME..................... GeForce 9400M
CL_DEVICE_EXTENSIONS............... cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics
}
}
}
Original comment by ryan.lam...@gmail.com
on 15 Dec 2012 at 2:37
Weird...I'm using OS X 10.7.5...OpenCL 1.0?
Original comment by ryan.lam...@gmail.com
on 15 Dec 2012 at 2:43
If I do something like this for grins:
size_t max_workgroup_size;
status = clGetKernelWorkGroupInfo(jniContext->kernel,
(cl_device_id)jniContext->deviceId, CL_KERNEL_WORK_GROUP_SIZE,
sizeof(max_workgroup_size), &max_workgroup_size, NULL);
fprintf(stderr, "max_workgroup_size: %zu \n", max_workgroup_size); // %zu matches size_t
fprintf(stderr, "Before range.localDims:\n %d %d %d \n",
range.localDims[0],range.localDims[1],range.localDims[2]);
if (status != CL_SUCCESS) {
PRINT_CL_ERR(status, "clGetKernelWorkGroupInfo()");
} else {
range.localDims[0] = 16;
range.localDims[1] = 16;
range.localDims[2] = 16;
}
// ------ end fix
fprintf(stderr, "After range.localDims:\n %d %d %d \n",
range.localDims[0],range.localDims[1],range.localDims[2]);
The OS X test will complete execution, but will return invalid results. Just
for informational purposes, the value of max_workgroup_size is 256.
Original comment by ryan.lam...@gmail.com
on 15 Dec 2012 at 2:51
So the "Apple incorrectly multiplies group size by 4 behind the scenes" theory
appears to be correct.
I created a test which required the following:
range.localDims:
16 32 100
I had to modify the source code to do the following:
} else {
range.localDims[0] = 4;
range.localDims[1] = 8;
range.localDims[2] = 100;
}
The range outputs as the following:
range.localDims:
4 8 100
Which then proceeds to execute correctly as 16, 32, 100. That's annoying.
Original comment by ryan.lam...@gmail.com
on 15 Dec 2012 at 3:01
My guess is that 256 is the correct max workgroup size. But Apple probably has
a simple typo somewhere and is doing the following:
16*4*4 = 256
32*4*4 = 512 (oops!)
But the following works:
(16/4)*4*4 = 64
(32/4)*4*4 = 128
Which is all I did above to eliminate the incorrect multiply.
Original comment by ryan.lam...@gmail.com
on 15 Dec 2012 at 3:04
Sorry, it's late...256 would be the correct clGetKernelWorkGroupInfo value, whereas
512 is the device maximum (which should apparently be ignored).
Original comment by ryan.lam...@gmail.com
on 15 Dec 2012 at 3:08
Except if I divide 100 by 4 to set range.localDims[2] = 25 then I get incorrect
results again....
Original comment by ryan.lam...@gmail.com
on 15 Dec 2012 at 3:16
Original issue reported on code.google.com by
ryan.lam...@gmail.com
on 15 Dec 2012 at 12:10