CL_OUT_OF_RESOURCES on GC2000+

Issue summary

Hi,

I'm trying to get caffe to work on an i.MX6QP+, which has a GC2000+ GPU (Full Profile). I modified the source code in ocl_device_program.cpp to remove the "vec_type_hint" attribute, as the driver seems to fail compilation if it is present in the kernel:

  case KERNEL_HINT_VEC_TYPE:
    /*ss << "__attribute__((vec_type_hint(" << std::get<1>(hints[i])
       << ")))" << std::endl;*/
    break;
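
Rather than commenting the hint out for every device, one option might be to skip it only for drivers that reject it. The following is a minimal sketch under stated assumptions: `DriverRejectsVecTypeHint` and `VecTypeHintAttribute` are hypothetical helpers that do not exist in caffe, and the vendor check is keyed on the "Vivante Corporation" platform vendor string from clinfo. Whether other Vivante driver versions accept the attribute is untested.

```cpp
#include <sstream>
#include <string>

// Hypothetical check: assume the Vivante platform cannot parse
// vec_type_hint, based only on the behavior described in this issue.
bool DriverRejectsVecTypeHint(const std::string& platform_vendor) {
  return platform_vendor.find("Vivante") != std::string::npos;
}

// Hypothetical helper: returns the attribute line for the kernel source,
// or an empty string when the hint is skipped for this vendor.
std::string VecTypeHintAttribute(const std::string& vec_type,
                                 const std::string& platform_vendor) {
  if (DriverRejectsVecTypeHint(platform_vendor)) {
    return std::string();
  }
  std::ostringstream ss;
  ss << "__attribute__((vec_type_hint(" << vec_type << ")))" << std::endl;
  return ss.str();
}
```

With this shape, the `KERNEL_HINT_VEC_TYPE` case could append the returned string to `ss`, so other devices would keep the hint.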

After that the kernel compiles, but I get a failure when trying to execute:

CL_OUT_OF_RESOURCES

while trying to "Perform forward"

of this model:

https://github.com/xingwangsfu/caffe-yolo/blob/master/prototxt/yolo_tiny_deploy.prototxt

Steps to reproduce

I'm using this board: https://www.amazon.com/Code-Modules-Inc-PixieBoard-Computing/dp/B07DQBPNZT/ref=sr_1_1?ie=UTF8&qid=1529860144&sr=8-1&keywords=code+and+modules+pixiepro

With kernel 4.9.109

I get this out of clinfo

clinfo: /usr/lib/libOpenCL.so.1: no version information available (required by clinfo)
Number of platforms                                 1
  Platform Name                                     Vivante OpenCL Platform
  Platform Vendor                                   Vivante Corporation
  Platform Version                                  OpenCL 1.2 V6.2.4.p1.150331
  Platform Profile                                  FULL_PROFILE
  Platform Extensions                               cl_khr_icd
  Platform Extensions function suffix               viv

  Platform Name                                     Vivante OpenCL Platform
Number of devices                                   1
  Device Name                                       Vivante OpenCL Device GC2000+.5450.0000
  Device Vendor                                     Vivante Corporation
  Device Vendor ID                                  0x564956
  Device Version                                    OpenCL 1.2
  Driver Version                                    OpenCL 1.2 V6.2.4.p1.150331
  Device OpenCL C Version                           OpenCL C 1.2
  Device Type                                       GPU
  Device Profile                                    FULL_PROFILE
  Device Available                                  Yes
  Compiler Available                                Yes
  Linker Available                                  Yes
  Max compute units                                 4
  Max clock frequency                               500MHz
  Device Partition                                  (core)
    Max number of sub-devices                       0
    Supported partition types                       None
    Supported affinity domains                      (n/a)
  Max work item dimensions                          3
  Max work item sizes                               1024x1024x1024
  Max work group size                               1024
=== CL_PROGRAM_BUILD_LOG ===
(6:0) : error : syntax error at 'kernel'
  Preferred work group size multiple                <getWGsizes:1200: create kernel : error -45>
  Preferred / native vector sizes
    char                                            4 / 4
    short                                           4 / 4
    int                                             4 / 4
    long                                            4 / 4
    half                                            0 / 0 (n/a)
    float                                           4 / 4
    double                                          0 / 0 (n/a)
  Half-precision Floating-point support             (n/a)
  Single-precision Floating-point support           (core)
    Denormals                                       No
    Infinity and NANs                               Yes
    Round to nearest                                Yes
    Round to zero                                   Yes
    Round to infinity                               No
    IEEE754-2008 fused multiply-add                 No
    Support is emulated in software                 No
    Correctly-rounded divide and sqrt operations    No
  Double-precision Floating-point support           (n/a)
  Address bits                                      32, Little-Endian
  Global memory size                                268435456 (256MiB)
  Error Correction support                          Yes
  Max memory allocation                             134217728 (128MiB)
  Unified memory for Host and Device                Yes
  Minimum alignment for any data type               128 bytes
  Alignment of base address                         1024 bits (128 bytes)
  Global Memory cache type                          Read/Write
  Global Memory cache size                          8192 (8KiB)
  Global Memory cache line size                     64 bytes
  Image support                                     Yes
    Max number of samplers per kernel               16
    Max size for 1D images from buffer              65536 pixels
    Max 1D or 2D image array size                   8192 images
    Max 2D image size                               8192x8192 pixels
    Max 3D image size                               8192x8192x8192 pixels
    Max number of read image args                   128
    Max number of write image args                  8
  Local memory type                                 Global
  Local memory size                                 32768 (32KiB)
  Max number of constant args                       9
  Max constant buffer size                          65536 (64KiB)
  Max size of kernel argument                       1024
  Queue properties
    Out-of-order execution                          Yes
    Profiling                                       Yes
  Prefer user sync for interop                      Yes
  Profiling timer resolution                        1000ns
  Execution capabilities
    Run OpenCL kernels                              Yes
    Run native kernels                              No
  printf() buffer size                              1048576 (1024KiB)
  Built-in kernels                                  (n/a)
  Device Extensions                                 cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_gl_sharing
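
For reference when reading the numeric codes in this output: the "error -45" next to "Preferred work group size multiple" corresponds to CL_INVALID_PROGRAM_EXECUTABLE, which would be consistent with clinfo's own probe kernel failing to build (the "syntax error at 'kernel'" line), and CL_OUT_OF_RESOURCES is -5. A small lookup sketch, assuming only the values defined in the Khronos cl.h header; `ClErrorName` is a hypothetical helper, not part of caffe or OpenCL.

```cpp
#include <string>

// Hypothetical helper covering a subset of OpenCL status codes.
// The numeric values mirror the definitions in the Khronos cl.h header.
std::string ClErrorName(int code) {
  switch (code) {
    case 0:   return "CL_SUCCESS";
    case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    case -45: return "CL_INVALID_PROGRAM_EXECUTABLE";
    case -54: return "CL_INVALID_WORK_GROUP_SIZE";
    default:  return "unknown (" + std::to_string(code) + ")";
  }
}
```

Logging the name alongside the failing call (build, kernel creation, enqueue) may help narrow down which stage reports the error.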

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)              No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)               No platform
  clCreateContext(NULL, ...) [default]                        No platform
  clCreateContext(NULL, ...) [other]                          Success [viv]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)       Success (1)
    Platform Name                                             Vivante OpenCL Platform
    Device Name                                               Vivante OpenCL Device GC2000+.5450.0000
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)           No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)           Success (1)
    Platform Name                                             Vivante OpenCL Platform
    Device Name                                               Vivante OpenCL Device GC2000+.5450.0000
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)   No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)        No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)           Success (1)
    Platform Name                                             Vivante OpenCL Platform
    Device Name                                               Vivante OpenCL Device GC2000+.5450.0000

The GPU has 768MiB of RAM assigned. I can assign more, but this doesn't seem to have any effect, and while monitoring memory usage it never gets close to OOM.
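
One thing the clinfo output suggests checking: regardless of how much RAM is assigned, the driver reports a global memory size of 256MiB and a maximum single allocation of 128MiB (134217728 bytes), so an oversized buffer could fail long before the system is near OOM. This may or may not be related to the CL_OUT_OF_RESOURCES error. A minimal sketch that compares a float32 blob size against that limit; the helper names are hypothetical, and any shape passed in has to come from the actual network rather than from this example.

```cpp
#include <cstdint>

// Value reported by clinfo above as "Max memory allocation" (128MiB).
constexpr std::uint64_t kMaxMemAllocBytes = 134217728ULL;

// Byte size of a dense float32 blob with shape N x C x H x W.
std::uint64_t BlobBytes(std::uint64_t n, std::uint64_t c,
                        std::uint64_t h, std::uint64_t w) {
  return n * c * h * w * sizeof(float);
}

// True when a buffer of this size fits in one device allocation.
bool FitsSingleAllocation(std::uint64_t bytes) {
  return bytes <= kMaxMemAllocBytes;
}
```

For example, an illustrative 1 x 16 x 448 x 448 blob is about 12MiB and fits, while a 1 x 1024 x 256 x 256 blob is 256MiB and does not.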

I'm running this command:

caffe time --model models/yolo/yolo_tiny_deploy.prototxt -gpu 0 |& tee error.log

Tried solutions

I tried setting different max_work_item sizes in ocl_device, hardcoded to 256 in each dimension.
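
A per-dimension cap is only one of the constraints: the product of the local sizes also has to stay within the work-group limit. The device reports 1024 above, but the limit for a specific kernel (what `clGetKernelWorkGroupInfo` returns for `CL_KERNEL_WORK_GROUP_SIZE`) can be lower, and exceeding what the kernel's resources allow is one possible source of CL_OUT_OF_RESOURCES. A minimal sketch of clamping a 3D local size to such a limit; `ClampLocalSize` is a hypothetical helper, not caffe code, and the halving strategy is just one simple choice.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: shrink a 3D local work size until the product of
// its dimensions is at most max_group_size, by repeatedly halving
// (rounding up) the largest dimension.
std::array<std::size_t, 3> ClampLocalSize(std::array<std::size_t, 3> local,
                                          std::size_t max_group_size) {
  assert(max_group_size >= 1);
  // Treat a zero dimension as 1 so the product is well defined.
  for (auto& d : local) {
    if (d == 0) d = 1;
  }
  while (local[0] * local[1] * local[2] > max_group_size) {
    // Find the largest dimension (the first one wins on ties).
    std::size_t idx = 0;
    for (std::size_t i = 1; i < 3; ++i) {
      if (local[i] > local[idx]) idx = i;
    }
    local[idx] = (local[idx] + 1) / 2;
  }
  return local;
}
```

Note that in OpenCL 1.2 the global work size must also be evenly divisible by the local work size in each dimension, so the global size may need padding after clamping.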

System configuration

OS: Linux 4.9.109
System: PixiePro+
RAM: 4GiB

Thanks!

naibaf7 / caffe