naibaf7 / caffe

Caffe: a fast open framework for deep learning. With OpenCL and CUDA support.
http://caffe.berkeleyvision.org/

clEnqueueNDRangeKernel #779 breaks getting profiling info #26

Closed psyhtest closed 8 years ago

psyhtest commented 8 years ago

I repeatedly see a peculiar issue when benchmarking AlexNet on Chromebook 2.

According to my profiling tool, Caffe enqueues 19598 kernel instances. Enqueue number 779 looks as follows (ViennaCL's debug output mixed with my tool's output):

ViennaCL: Queue handle 0x5b1840
ViennaCL: Starting 2D/3D-kernel '_prod_TT'...
ViennaCL: Global work size: '48, 184, 0'...
ViennaCL: Local work size: '8, 8, 0'...
[dv/dt] clEnqueueNDRangeKernel
[dv/dt] clEnqueueNDRangeKernel name _prod_TT
[dv/dt] clEnqueueNDRangeKernel queue 0x5b1840
[dv/dt] clEnqueueNDRangeKernel kernel 0x1a10500
[dv/dt] clEnqueueNDRangeKernel offset 0 0 0
[dv/dt] clEnqueueNDRangeKernel gws 48 184 1
[dv/dt] clEnqueueNDRangeKernel lws 8 8 1
[dv/dt] clEnqueueNDRangeKernel event_wait_list
[dv/dt] clEnqueueNDRangeKernel event 0xbea4bf90
[dv/dt] clEnqueueNDRangeKernel start 2016-04-06T08:29:32.356747
[dv/dt] clEnqueueNDRangeKernel output profiling info error: -5
[dv/dt] clEnqueueNDRangeKernel profiling 0 132047486864 9348329517 13737315388583051264
[dv/dt] clEnqueueNDRangeKernel end 2016-04-06T08:29:32.358302
[dv/dt] clEnqueueNDRangeKernel errcode 0
ViennaCL: Kernel _prod_TT finished with status -59!

You can see that the first number after [dv/dt] clEnqueueNDRangeKernel profiling, the "queued" profiling info timestamp, is 0 here. What's interesting is that just before this enqueue the profiling output looks perfectly normal: for all earlier enqueues both my tool and ViennaCL return 0 (CL_SUCCESS); for this enqueue and all enqueues thereafter, my tool returns -5 and ViennaCL returns -59. (My tool inserts clWaitForEvents() and clGetEventProfilingInfo() after each enqueue; ViennaCL similarly inserts clFinish() and clGetEventInfo().)
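For reference, here is a minimal sketch (not the actual tool) of the per-enqueue instrumentation pattern described above; it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE and that `queue`, `kernel`, `gws` and `lws` are already set up:

```cpp
// Sketch of the per-enqueue instrumentation described above (not the real tool).
#include <CL/cl.h>
#include <cstdio>

void enqueue_and_profile(cl_command_queue queue, cl_kernel kernel,
                         const size_t gws[2], const size_t lws[2]) {
  cl_event evt = nullptr;
  cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, nullptr,
                                      gws, lws, 0, nullptr, &evt);
  std::printf("clEnqueueNDRangeKernel errcode %d\n", err);   // 0 == CL_SUCCESS

  err = clWaitForEvents(1, &evt);                            // inserted after each enqueue
  std::printf("clWaitForEvents errcode %d\n", err);

  cl_ulong t_queued = 0, t_submit = 0, t_start = 0, t_end = 0;
  cl_int perr = clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED,
                                        sizeof(t_queued), &t_queued, nullptr);
  clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_SUBMIT,
                          sizeof(t_submit), &t_submit, nullptr);
  clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                          sizeof(t_start), &t_start, nullptr);
  clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                          sizeof(t_end), &t_end, nullptr);
  // In the failing case above, perr is -5 (CL_OUT_OF_RESOURCES) and the
  // "queued" timestamp is reported as 0.
  std::printf("profiling err %d: %llu %llu %llu %llu\n", perr,
              (unsigned long long)t_queued, (unsigned long long)t_submit,
              (unsigned long long)t_start, (unsigned long long)t_end);
  clReleaseEvent(evt);
}
```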

This enqueue is not the first one for which the _prod_TT kernel is called. However, it is the first one called with these work sizes:

ViennaCL: Global work size: '48, 184, 0'...
ViennaCL: Local work size: '8, 8, 0'...

(followed by 255 more instances of _prod_TT with exactly the same work sizes).

Am I right in thinking that this kernel (matrix-matrix multiplication where both matrices are transposed?) comes from ViennaCL? If so, it may be more appropriate to ask this question there. But I just want to double-check that it is not something already known or obvious. Many thanks!

naibaf7 commented 8 years ago

@psyhtest I have not seen this issue before, though it looks quite interesting. I suppose you use Caffe with clBLAS disabled? In that case, it is indeed a ViennaCL GEMM kernel. The error (-5) corresponds to CL_OUT_OF_RESOURCES; I'm not quite sure what causes it, though. I would assume the work group sizes are too big for the GPU.
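One quick way to test that hypothesis would be to query the kernel and device limits before the enqueue; a minimal sketch, assuming `kernel` is the _prod_TT kernel and `device` is the target GPU:

```cpp
// Sketch: check whether an 8x8 local work size is actually legal for this
// kernel on this device. Assumes `kernel` and `device` are already set up.
#include <CL/cl.h>
#include <cstdio>

void check_limits(cl_kernel kernel, cl_device_id device) {
  size_t dev_max = 0, krn_max = 0;
  clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                  sizeof(dev_max), &dev_max, nullptr);
  clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                           sizeof(krn_max), &krn_max, nullptr);
  cl_ulong local_mem = 0;
  clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                           sizeof(local_mem), &local_mem, nullptr);
  std::printf("device max WG size: %zu, kernel max WG size: %zu, "
              "kernel local mem: %llu bytes\n",
              dev_max, krn_max, (unsigned long long)local_mem);
  // An 8x8 local work size needs krn_max >= 64 plus enough local memory.
}
```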

I am currently working on a convolution implementation that doesn't need a convolution buffer and can be configured to use very few resources. But I am going to ISBI next week, so that project won't be released for another few weeks.

Until then, I suggest running some of the ViennaCL GEMM examples to see if there are any obvious limits at which the device gets exhausted. It might also be the lazily allocated convolution buffer: https://sourceforge.net/p/viennacl/mailman/message/27016751/
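For a standalone test, something along these lines should dispatch the same transposed-transposed GEMM (_prod_TT) path in ViennaCL; the matrix sizes below are placeholders, not the ones from the failing enqueue:

```cpp
// Minimal standalone ViennaCL GEMM with both operands transposed.
// Matrix sizes are placeholders, not the ones from the failing run.
#include <viennacl/matrix.hpp>
#include <viennacl/linalg/prod.hpp>

int main() {
  const std::size_t M = 384, N = 1472, K = 256;   // assumed sizes
  viennacl::matrix<float> A(K, M);                // used as A^T (M x K)
  viennacl::matrix<float> B(N, K);                // used as B^T (K x N)
  viennacl::matrix<float> C(M, N);

  C = viennacl::linalg::prod(viennacl::trans(A), viennacl::trans(B));
  viennacl::backend::finish();                    // force the kernel to run
  return 0;
}
```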

psyhtest commented 8 years ago

@naibaf7 I indeed suspect running out of memory. On the Samsung Chromebook where I run my experiments, I could see free memory going from 1.6 GB down to 50 MB. What's weird is that the failure always happens at this magic enqueue 779; I would expect an out-of-memory condition to behave more randomly. I'll try running this ViennaCL kernel standalone with exactly the same parameters when I have time.
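As a quick sanity check of the memory hypothesis, the device limits can be compared against the buffer sizes Caffe needs at this batch size; a minimal sketch, assuming `device` is the GPU in question:

```cpp
// Sketch: query how much global memory the device exposes and the largest
// single allocation it allows. Assumes `device` is already obtained.
#include <CL/cl.h>
#include <cstdio>

void print_mem_limits(cl_device_id device) {
  cl_ulong global_mem = 0, max_alloc = 0;
  clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                  sizeof(global_mem), &global_mem, nullptr);
  clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                  sizeof(max_alloc), &max_alloc, nullptr);
  std::printf("global mem: %llu MB, max single alloc: %llu MB\n",
              (unsigned long long)(global_mem >> 20),
              (unsigned long long)(max_alloc >> 20));
}
```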

P.S. If you are curious what I'm up to, have a look at the abstract of my technical talk at IWOCL next week: http://bit.ly/ck-cnn. You will see I refer to and acknowledge your work there ;).

naibaf7 commented 8 years ago

@psyhtest Well, to me it looks like enqueue 779 comes right after a lazy memory allocation. It also seems the OpenCL implementation itself allocates lazily, meaning the OpenCL buffers only get allocated when the first compute kernel actually tries to access them. That's probably why it fails at exactly that point every time.
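A small illustration of why lazy allocation can push the failure to a later enqueue (driver-dependent behaviour, sketch only): clCreateBuffer may succeed without backing memory, and the out-of-resources error only surfaces when something first touches the buffer:

```cpp
// Illustration of lazy allocation (driver-dependent, not a guaranteed contract):
// clCreateBuffer can succeed even when the device is short on memory, and the
// failure only shows up when the buffer is first touched.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

void lazy_alloc_demo(cl_context ctx, cl_command_queue queue, size_t bytes) {
  cl_int err = CL_SUCCESS;
  cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, nullptr, &err);
  std::printf("clCreateBuffer: %d\n", err);       // often CL_SUCCESS even if too big
  if (!buf) return;

  std::vector<char> host(bytes, 0);
  err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes,
                             host.data(), 0, nullptr, nullptr);
  std::printf("first touch (write): %d\n", err);  // this is where -5/-4 may appear
  clReleaseMemObject(buf);
}
```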

Good work, very nice paper. I like it. Let me know if you want to cooperate on my upcoming OpenCL-libdnn project, which could reduce memory consumption significantly. It also works with a built-in GEMM, meaning it should be easier to optimize and autotune than clBLAS.

psyhtest commented 8 years ago

@naibaf7

I believe this issue was indeed due to insufficient memory when running with a batch size of 128, as I don't observe it when running with batch sizes of up to 32. Closing.