clBuildProgram segfaults when building libDNN kernels on Snapdragon 835

psyhtest commented 7 years ago

I've encountered segfaults in Caffe with libDNN on a Snapdragon 835-powered smartphone.

Caffe successfully builds the main Greentea kernels:

ViennaCL: Adding new queue for device 0x7f9bdaaac8 to context 0x7f9a648280
ViennaCL: Context no. 0 initialized with 1 devices
ViennaCL: Device id: 0x7f9bdaaac8
I0706 08:49:19.158033 13253 device.cpp:62] CL_DEVICE_HOST_UNIFIED_MEMORY: 1
ViennaCL: Adding program 'kernel_program' with source to context 0x7f9a648280
ViennaCL: clCreateProgramWithSource
ViennaCL: source_text (100 out of 337011 bytes):
#define ENABLE_DOUBLE_SUPPORT
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#define _
ViennaCL: clCreateProgramWithSource returned 0
ViennaCL: clBuildProgram options:
ViennaCL: clBuildProgram returned 0
ViennaCL: clBuildProgram err==CL_SUCCESS
ViennaCL: clCreateKernelsInProgram

After the net is initialized, however, Caffe attempts to build the libDNN kernels and segfaults in clBuildProgram:

I0706 08:49:20.207176 13253 layer_factory.cpp:67] Creating layer input
I0706 08:49:20.207232 13253 net.cpp:96] Creating Layer input
I0706 08:49:20.207242 13253 net.cpp:413] input -> data
I0706 08:49:20.207264 13253 net.cpp:134] Setting up input
I0706 08:49:20.207273 13253 net.cpp:142] Top shape: 1 3 227 227 (154587)
I0706 08:49:20.207284 13253 layer_factory.cpp:67] Creating layer conv1
I0706 08:49:20.207298 13253 net.cpp:96] Creating Layer conv1
I0706 08:49:20.207304 13253 net.cpp:444] conv1 <- data
I0706 08:49:20.207311 13253 net.cpp:413] conv1 -> conv1
I0706 08:49:20.207458 13253 libdnn_conv.cpp:21] LibDNNConv<Dtype>::LibDNNConv::1
I0706 08:49:20.207610 13253 libdnn_conv.cpp:1622] LibDNNConv<Dtype>::GenerateKernels::1
I0706 08:49:20.207998 13253 libdnn_conv.cpp:1635] LibDNNConv<Dtype>::GenerateKernels::2
I0706 08:49:20.208006 13253 libdnn.cpp:218] LibDNN<Dtype>::CompileKernels::1
I0706 08:49:20.208012 13253 libdnn.cpp:229] LibDNN<Dtype>::CompileKernels::2
I0706 08:49:20.208189 13253 libdnn.cpp:236] LibDNN<Dtype>::CompileKernels::3
I0706 08:49:20.208197 13253 libdnn.cpp:241] LibDNN<Dtype>::CompileKernels::4
I0706 08:49:20.208202 13253 libdnn.cpp:258] LibDNN<Dtype>::CompileKernelsOpenCL::1
I0706 08:49:20.208209 13253 libdnn.cpp:272] LibDNN<Dtype>::CompileKernelsOpenCL::2
ViennaCL: Adding program 'kernel_program' with source to context 0x7f9a648280
ViennaCL: clCreateProgramWithSource
ViennaCL: source_text (100 out of 23990 bytes):
#if defined(cl_khr_int32_base_atomics)
#pragma OPENCL EXTENSION cl_khr_int32_base_atomics : enable
#
ViennaCL: clCreateProgramWithSource returned 0
ViennaCL: clBuildProgram options: -cl-fast-relaxed-math -cl-mad-enable -cl-single-precision-constant
Segmentation fault

psyhtest commented 7 years ago

This can be reproduced as follows.

0. Install CK-Caffe.

1. Install the ViennaCL master package:


$ ck install package:lib-viennacl-master-src --target_os=android21-arm64
...
Environment entry added (c3153bc343548550)!

Recording CK configuration to /home/anton/CK_TOOLS/lib-viennacl-src-master-android21-arm64/ck-install.json ...

Installation path: /home/anton/CK_TOOLS/lib-viennacl-src-master-android21-arm64

Installation time: 3.81807088852 sec.

$ ck load env:c3153bc343548550 | grep path_include
"path_include": "/home/anton/CK_TOOLS/lib-viennacl-src-master-android21-arm64/src",


2.  Apply the [ViennaCL patch](https://github.com/naibaf7/caffe/files/1127763/issue69.viennacl.patch.txt) to the sources to enable more verbose debug output:

$ patch \
    -d /home/anton/CK_TOOLS/lib-viennacl-src-master-android21-arm64/src \
    -p1 < ~/Downloads/issue69.viennacl.patch.txt
patching file viennacl/ocl/context.hpp


3. Apply the [ViennaCL meta patch](https://github.com/naibaf7/caffe/files/1127407/issue69.viennacl-meta.patch.txt) to the metadata to disable ViennaCL kernel caching:

$ ck find env:c3153bc343548550
/home/anton/CK_REPOS/local/env/c3153bc343548550
$ patch \
    /home/anton/CK_REPOS/local/env/c3153bc343548550/.cm/meta.json \
    ~/Downloads/issue69.viennacl-meta.patch.txt
patching file /home/anton/CK_REPOS/local/env/c3153bc343548550/.cm/meta.json


4. Install the Caffe with libDNN+ViennaCL package (using the ViennaCL environment installed in steps 1-3):

$ ck install package:lib-caffe-bvlc-opencl-libdnn-viennacl-universal --target_os=android21-arm64
...
Environment entry added (69031a24319b37f1)!

Recording CK configuration to /home/anton/CK_TOOLS/lib-caffe-bvlc-opencl-libdnn-viennacl-master-android-ndk-4.9.x-android21-arm64/ck-install.json ...

Installation path: /home/anton/CK_TOOLS/lib-caffe-bvlc-opencl-libdnn-viennacl-master-android-ndk-4.9.x-android21-arm64

Installation time: 198.47616601 sec.

(**NB:** To save time, you can interrupt the installation straight after the cloning.)

5.  Apply the [Greentea patch](https://github.com/naibaf7/caffe/files/1127448/issue69.greentea.patch.txt):

$ patch \
    -d /home/anton/CK_TOOLS/lib-caffe-bvlc-opencl-libdnn-viennacl-master-android-ndk-4.9.x-android21-arm64/src \
    -p1 < ~/Downloads/issue69.greentea.patch.txt
patching file src/caffe/greentea/libdnn.cpp
patching file src/caffe/greentea/libdnn_conv.cpp


6. Rebuild with ViennaCL debug output enabled (answer `y` several times when prompted):

$ ck install package:lib-caffe-bvlc-opencl-libdnn-viennacl-universal \
    --target_os=android21-arm64 --rebuild --env.CK_VIENNACL_DEBUG=ON
...
Environment entry updated (69031a24319b37f1)!

Recording CK configuration to /home/anton/CK_TOOLS/lib-caffe-bvlc-opencl-libdnn-viennacl-master-android-ndk-4.9.x-android21-arm64/ck-install.json ...

Installation path: /home/anton/CK_TOOLS/lib-caffe-bvlc-opencl-libdnn-viennacl-master-android-ndk-4.9.x-android21-arm64

Installation time: 230.220377922 sec.

$ ck show env --tags=lib,caffe,vlibdnn
Env UID:         Target OS:      Bits: Name:                                         Version:       Tags:

69031a24319b37f1 android21-arm64    64 BVLC Caffe framework (opencl,libdnn,viennacl) master-73221fd 64bits,bvlc,caffe,host-os-linux-64,lib,target-os-android21-arm64,v0,v0.0,vlibdnn,vmaster,vopencl
$ ck find env:69031a24319b37f1
/home/anton/CK_REPOS/local/env/69031a24319b37f1


7. Compile the `caffe-time-opencl` program (using the Caffe environment installed in steps 4-6):

$ ck compile program:caffe-time-opencl --target_os=android21-arm64
...
Compilation time: 4.500 sec.; Object size: 1051896; MD5: ca181afe324feb7efe04ad9dcc394961


8. Install the SqueezeNet 1.1 model to reproduce the failure as per the [log](https://github.com/naibaf7/caffe/files/1127408/issue69.stdout.log.txt):

$ ck install package:caffemodel-deepscale-squeezenet-1.1
$ ck show env --tags=caffemodel,squeezenet,v1.1
Env UID:         Target OS: Bits: Name:                                                      Version: Tags:

933792a5a18249eb linux-64      64 Caffe model (net and weights) (deepscale, squeezenet, 1.1) 1.1      64bits,bvlc,caffe,caffemodel,deepscale,host-os-linux-64,net,squeezenet,target-os-linux-64,v1,v1.1,weights


9. Run the `caffe-time-opencl` program (with the device connected via `adb`, selecting the model installed in step 8 if prompted):

$ ck run program:caffe-time-opencl --target_os=android21-arm64 --cmd_key=default \
    --env.CK_CAFFE_BATCH_SIZE=1 --env.CK_CAFFE_SKIP_BACKWARD


10. See the log:

$ ck find program:caffe-time-opencl
/home/anton/CK_REPOS/ck-caffe/program/caffe-time-opencl
$ cat /home/anton/CK_REPOS/ck-caffe/program/caffe-time-opencl/tmp/stdout.log

naibaf7 commented 7 years ago

OK. Now it seems the same context is used for the LibDNN kernels, right? This raises the question of what the segfault trace contains now.

psyhtest commented 7 years ago

The two different contexts for the main kernels and the libDNN kernels (which I reported to you via Skype) must have been a figment of my imagination, sorry. I now see that the contexts are always the same.

But the driver always segfaults in the same place. This is weird because, by disabling ViennaCL caching, I effectively ensured that the driver is fed the libDNN program as source via clCreateProgramWithSource() and is then immediately asked to build the resulting program via clBuildProgram().

I think I should report this issue to Qualcomm. Even if the libDNN program is ill-formed (unlikely due to clCreateProgramWithSource() being happy with it, as well as many other implementations I tested it with), the driver must give an error message, not segfault.

For reference, here's the driver info:

Platform ID: 0
Device ID: 0
Device: QUALCOMM Adreno(TM)
Vendor: QUALCOMM
Hardware (device) version: OpenCL 2.0 Adreno(TM) 540
Software (driver) version: OpenCL 2.0 QUALCOMM build: commit #dd296bd changeid #I7547f23799 Date: 03/29/17 Wed Local Branch:  Remote Branch: refs/tags/AU_LINUX_ANDROID_LA.UM.5.7.C1.07.00.00.278.066 Compiler E031.32.00.01
OpenCL C version: OpenCL C 2.0 Adreno(TM) 540
Address bits: 64
Parallel compute units: 4
Work-item dimensions: 3
- max work-item size #0: 1024
- max work-item size #1: 1024
- max work-item size #2: 1024

naibaf7 commented 7 years ago

Thanks. Yes, I actually noticed a similar problem with the AMD drivers on Windows, where the driver would segfault if the #pragma unroll at one point did not have an even number in it.

This is why there's this quirky line in it:

// Num tiles needs to be next higher even integer
// (due to some quirky bug in AMD OpenCL 2.0 on Windows)
LibDNN<Dtype>::add_def(ss, "v_num_tiles", "(((K - 1)/(TSK*2) + 1)*2)");

Maybe removing the #pragmas or other compiler hints from the source code will allow compilation in your case as well.
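To illustrate what the v_num_tiles expression above computes, here is a minimal host-side sketch. The function names num_tiles, num_tiles_even and even_variant_is_safe are made up for this illustration and are not part of libDNN; only the arithmetic is taken from the add_def line.

```cpp
#include <cassert>

// Plain ceiling division: the number of TSK-sized tiles needed to cover K.
constexpr int num_tiles(int K, int TSK) {
  return (K - 1) / TSK + 1;
}

// Mirrors the "v_num_tiles" expression (((K - 1)/(TSK*2) + 1)*2):
// count pairs of tiles with ceiling division, then double the result.
constexpr int num_tiles_even(int K, int TSK) {
  return ((K - 1) / (TSK * 2) + 1) * 2;
}

// Checks that the even variant is even, never under-counts, and adds at
// most one extra tile compared to the plain ceiling division.
inline bool even_variant_is_safe(int K, int TSK) {
  assert(K > 0 && TSK > 0);
  const int plain = num_tiles(K, TSK);
  const int even = num_tiles_even(K, TSK);
  return even % 2 == 0 && even >= plain && even - plain <= 1;
}
```

For example, with K = 17 and TSK = 8 the plain count is 3 while the even variant gives 4, so the unrolled loop runs over one extra tile.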

psyhtest commented 7 years ago

I will try with:

$ cd src/caffe/greentea
$ sed -i 's/^.*ss\ <<\ \"#pragma\ unroll.*/\/\/\ #pragma\ unroll/g' libdnn_conv.cpp
$ sed -i 's/^.*ss\ <<\ \"#pragma\ unroll.*/\/\/\ #pragma\ unroll/g' libdnn_deconv.cpp

changing the code, for example, as follows:

@@ -806,7 +809,7 @@ std::string LibDNNConv<Dtype>::generate_accreg_init(
   } else {
     // Zero init
     if (dterm) {
-      ss << "#pragma unroll" << std::endl;
+// #pragma unroll
       ss << "for (int_tp wm=0; wm<WPTM/VWM; ++wm) {" << std::endl;
       if (unroll) {
         for (int i = 0; i < vwm; ++i) {

NB: One additional manual fix is required in libdnn_deconv.cpp (line 1273):

// #pragma unroll
//     << this->wg_tuner_->template get_param<int>("TSK_UNROLL") << std::endl;

psyhtest commented 7 years ago

I've got a build failure with the above change:

ViennaCL: clCreateProgramWithSource returned 0
ViennaCL: clBuildProgram options:
ViennaCL: clBuildProgram returned -11
ViennaCL: clBuildProgram failed
Build Status = -2 ( Err = -11 )
Log: BC-src-code:441:47: error: use of undeclared identifier 'Creg'
 Cptr[globalRow * N + globalCol] = ((Dtype*)(&(Creg[wm][wn/VWN])))[wn%VWN] + v_bmul * biasval;
                                               ^
BC-src-code:446:1: error: extraneous closing brace ('}')
 }
 ^
2 diagnostic(s) generated.
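For readers decoding the numbers in this log, a small sketch follows. The numeric values match the definitions in the Khronos CL/cl.h header, but the k-prefixed constant names and the two describe_* helpers are hypothetical and exist only so the snippet builds without an OpenCL SDK.

```cpp
#include <string>

// Values as defined in CL/cl.h; the k-prefixed names are local stand-ins.
constexpr int kClSuccess = 0;                // CL_SUCCESS
constexpr int kClBuildProgramFailure = -11;  // CL_BUILD_PROGRAM_FAILURE
constexpr int kClBuildSuccess = 0;           // CL_BUILD_SUCCESS
constexpr int kClBuildNone = -1;             // CL_BUILD_NONE
constexpr int kClBuildError = -2;            // CL_BUILD_ERROR
constexpr int kClBuildInProgress = -3;       // CL_BUILD_IN_PROGRESS

// Names the return code of clBuildProgram (only two codes are covered).
std::string describe_build_error(int err) {
  if (err == kClSuccess) return "CL_SUCCESS";
  if (err == kClBuildProgramFailure) return "CL_BUILD_PROGRAM_FAILURE";
  return "other error (" + std::to_string(err) + ")";
}

// Names a cl_build_status value as reported for CL_PROGRAM_BUILD_STATUS.
std::string describe_build_status(int status) {
  switch (status) {
    case kClBuildSuccess: return "CL_BUILD_SUCCESS";
    case kClBuildNone: return "CL_BUILD_NONE";
    case kClBuildError: return "CL_BUILD_ERROR";
    case kClBuildInProgress: return "CL_BUILD_IN_PROGRESS";
    default: return "unknown status (" + std::to_string(status) + ")";
  }
}
```

So "Build Status = -2 ( Err = -11 )" reads as CL_BUILD_ERROR with CL_BUILD_PROGRAM_FAILURE, i.e. the compiler rejected the source with diagnostics instead of crashing.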

This is the offending code in context with hopefully correct line numbers:

 272 void conv_forward(
 ...
 435 for (int_tp wm=0; wm<WPTM; ++wm) {
 436 int_tp globalRow = offM + tidm + wm * RTSM;
 437 Dtype biasval = Dptr[globalRow];
 438 for (int_tp wn=0; wn<WPTN; ++wn) {
 439 int_tp globalCol = offN + tidn + wn * RTSN;
 440 if (globalRow < M && globalCol < N) {
 441 Cptr[globalRow * N + globalCol] = ((Dtype*)(&(Creg[wm][wn/VWN])))[wn%VWN] + v_bmul * biasval;
 442 }
 443 }
 444 }
 445 }
 446 }

naibaf7 commented 7 years ago

Hm, just removing the pragmas can't cause that. There must have been a case where more was done than just commenting out the pragma, i.e. the stringstream is now missing a bracket or similar.

psyhtest commented 7 years ago

Yeah, shouldn't have done but still. Please have a look at the new patch and log.

psyhtest commented 7 years ago

Hmm, I've made a deliberate typo in the source to print the original program, and, indeed, apart from the removed #pragma unroll, the modified program has these additional lines:

> for (int_tp wn=0; wn<WPTN/VWN; ++wn) {
> for (int_tp wm=0; wm<WPTM/VWM; ++wm) {
> Asub[(tidn+wn*RTSN)*VWN + 0][(tidm + wn*RTSN)*VWM + 0] = VEC_4_0(Creg[wn + 0][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 0][(tidm + wn*RTSN)*VWM + 1] = VEC_4_1(Creg[wn + 0][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 0][(tidm + wn*RTSN)*VWM + 2] = VEC_4_2(Creg[wn + 0][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 0][(tidm + wn*RTSN)*VWM + 3] = VEC_4_3(Creg[wn + 0][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 1][(tidm + wn*RTSN)*VWM + 0] = VEC_4_0(Creg[wn + 1][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 1][(tidm + wn*RTSN)*VWM + 1] = VEC_4_1(Creg[wn + 1][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 1][(tidm + wn*RTSN)*VWM + 2] = VEC_4_2(Creg[wn + 1][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 1][(tidm + wn*RTSN)*VWM + 3] = VEC_4_3(Creg[wn + 1][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 2][(tidm + wn*RTSN)*VWM + 0] = VEC_4_0(Creg[wn + 2][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 2][(tidm + wn*RTSN)*VWM + 1] = VEC_4_1(Creg[wn + 2][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 2][(tidm + wn*RTSN)*VWM + 2] = VEC_4_2(Creg[wn + 2][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 2][(tidm + wn*RTSN)*VWM + 3] = VEC_4_3(Creg[wn + 2][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 3][(tidm + wn*RTSN)*VWM + 0] = VEC_4_0(Creg[wn + 3][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 3][(tidm + wn*RTSN)*VWM + 1] = VEC_4_1(Creg[wn + 3][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 3][(tidm + wn*RTSN)*VWM + 2] = VEC_4_2(Creg[wn + 3][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 3][(tidm + wn*RTSN)*VWM + 3] = VEC_4_3(Creg[wn + 3][wm]);
> }
> }
> }
> barrier(CLK_LOCAL_MEM_FENCE);
> {
> Dtype4 Creg;
> for (int_tp lc = 0; lc < ((TSM*TSN-1)/(RTSM*RTSN))/VWM+1; ++lc) {
> int_tp tid = tidm * RTSN + tidn;
> int_tp id = lc * RTSN * RTSM + tid;
> int_tp row = (id / TSN) * VWM;
> int_tp col = id % TSN;
> int_tp globalRow = offM + row;
> int_tp globalCol = offN + col;
> VEC_4_0(Creg) = Asub[col][row + 0];
> if ((globalRow +0) < M && globalCol < N) {
> Cptr[(globalRow +0) * N + globalCol] = VEC_4_0(Creg) + Dptr[globalRow +0];
> }
> VEC_4_1(Creg) = Asub[col][row + 1];
> if ((globalRow +1) < M && globalCol < N) {
> Cptr[(globalRow +1) * N + globalCol] = VEC_4_1(Creg) + Dptr[globalRow +1];
> }
> VEC_4_2(Creg) = Asub[col][row + 2];
> if ((globalRow +2) < M && globalCol < N) {
> Cptr[(globalRow +2) * N + globalCol] = VEC_4_2(Creg) + Dptr[globalRow +2];
> }
> VEC_4_3(Creg) = Asub[col][row + 3];
> if ((globalRow +3) < M && globalCol < N) {
> Cptr[(globalRow +3) * N + globalCol] = VEC_4_3(Creg) + Dptr[globalRow +3];
> }
> }
> }

It seems that the first loop nest is doubly nested but is terminated with three braces. This is where the brace imbalance may come from. But I'm at a loss as to why removing an unroll pragma has the effect of introducing additional code.
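A brace imbalance like this can be spotted mechanically before the source ever reaches the driver. The following is only a sketch: BraceReport and check_braces are hypothetical helpers, not libDNN or ViennaCL functions, and braces inside comments or string literals are deliberately not handled.

```cpp
#include <string>

// Summary of a single pass over a piece of generated kernel source.
struct BraceReport {
  int min_depth;    // lowest nesting depth seen; negative means a stray '}'
  int final_depth;  // depth after the last character; 0 means balanced
};

// Counts '{' as +1 and '}' as -1. Braces inside comments or string literals
// are not special-cased, which is a deliberate simplification.
BraceReport check_braces(const std::string& src) {
  BraceReport report{0, 0};
  int depth = 0;
  for (char c : src) {
    if (c == '{') ++depth;
    if (c == '}') --depth;
    if (depth < report.min_depth) report.min_depth = depth;
  }
  report.final_depth = depth;
  return report;
}
```

Run on a fragment shaped like the first loop nest above (two opening braces, three closing ones), it reports a minimum depth of -1. Inside a complete kernel, that third brace closes a scope opened earlier, so the check is meaningful mainly on whole programs or on fragments expected to be self-contained.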

naibaf7 commented 7 years ago

It seems that you accidentally uncommented the lines 1050 to 1072? That would be my guess. It's this part:

  // Store the final results in C
  /*ss << "#pragma unroll 1" << std::endl;
  ss << "for (int_tp wn=0; wn<WPTN/VWN; ++wn) {" << std::endl;
  ss << "#pragma unroll" << std::endl;
  ss << "for (int_tp wm=0; wm<WPTM/VWM; ++wm) {" << std::endl;
  for (int j = 0; j < vwn; ++j) {
    for (int i = 0; i < vwm; ++i) {
      ss << "Asub[(tidn+wn*RTSN)*VWN + " << j << "][(tidm + wn*RTSN)*VWM + " << i << "] = VEC_" << vwm << "_" << i << "(Creg[wn + " << j << "][wm]);" << std::endl;
    }
  }
  ss << "}" << std::endl;
  ss << "}" << std::endl;
  ss << "}" << std::endl;  // Scoping for C registers

  ss << "barrier(CLK_LOCAL_MEM_FENCE);" << std::endl;

  // Store the final results in C
  ss << "{" << std::endl; // Scoping for storing C
  ss << "Dtype" << vwm << " Creg;" << std::endl;
  ss << "#pragma unroll 1" << std::endl;
  ss << "for (int_tp lc = 0; lc < ((TSM*TSN-1)/(RTSM*RTSN))/VWM+1; ++lc) {" << std::endl;
  ss << "int_tp tid = tidm * RTSN + tidn;" << std::endl;
  ss << "int_tp id = lc * RTSN * RTSM + tid;" << std::endl;
  ss << "int_tp row = (id / TSN) * VWM;" << std::endl;
  ss << "int_tp col = id % TSN;" << std::endl;
  ss << "int_tp globalRow = offM + row;" << std::endl;
  ss << "int_tp globalCol = offN + col;" << std::endl;
  for (int i = 0; i < vwm; ++i) {
    ss << "VEC_" << vwm << "_" << i << "(Creg) = Asub[col][row + " << i << "];" << std::endl;
    ss << "if ((globalRow +" << i << ") < M && globalCol < N) {" << std::endl;
    if (bias_term_) {
      ss << "Cptr[(globalRow +" << i << ") * N + globalCol] = VEC_" << vwm << "_" << i << "(Creg) + Dptr[globalRow +" << i << "];" << std::endl;
    } else {
      ss << "Cptr[(globalRow +" << i << ") * N + globalCol] = VEC_" << vwm << "_" << i << "(Creg);" << std::endl;
    }
    ss << "}" << std::endl;
  }
  ss << "}" << std::endl;
  ss << "}" << std::endl; // Scoping for storing C*/

psyhtest commented 7 years ago

Ha, that would explain it, thanks! My regex wasn't good enough. I'll try again.

psyhtest commented 7 years ago

With the new regex only replacing #pragma printers starting with whitespace (Greentea patch):

$ sed -i  's/^\s*ss\ <<\ \"#pragma\ unroll.*/\/\/\ #pragma\ unroll/g' libdnn_conv.cpp
$ sed -i  's/^\s*ss\ <<\ \"#pragma\ unroll.*/\/\/\ #pragma\ unroll/g' libdnn_deconv.cpp

I produced a program without #pragmas. Unfortunately, the driver still segfaults.
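The difference between the first and the second sed expression can be reproduced with std::regex. This is a sketch under assumptions: replace_loose and replace_strict are illustrative names, the patterns are ECMAScript translations of the sed patterns, and each call handles a single source line.

```cpp
#include <regex>
#include <string>

// First attempt: anything may precede `ss << "#pragma unroll`, so a leading
// comment opener such as `/*` is swallowed together with the rest of the line.
std::string replace_loose(const std::string& line) {
  static const std::regex re(R"re(^.*ss << "#pragma unroll.*)re");
  return std::regex_replace(line, re, "// #pragma unroll");
}

// Second attempt: only whitespace may precede `ss`, so lines that begin
// with `/*ss << ...` are left untouched and the block comment stays intact.
std::string replace_strict(const std::string& line) {
  static const std::regex re(R"re(^\s*ss << "#pragma unroll.*)re");
  return std::regex_replace(line, re, "// #pragma unroll");
}
```

Applied to the line that opens the commented-out region (`/*ss << "#pragma unroll 1" << std::endl;`), the loose pattern rewrites the whole line and thereby drops the `/*`, which is what re-enabled the dead code; the strict pattern returns that line unchanged.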

I've sent the original program to the top OpenCL guys at Qualcomm. Will report back any workarounds they suggest.

naibaf7 commented 7 years ago

Ok thanks a lot. I currently have no other suggestion as to what could go wrong.

psyhtest commented 7 years ago

I've got a reply from Qualcomm which they allowed me to share here:

To work around the issue, please change arrays of float4 vectors to arrays of scalar float, whenever the array is used within a loop.

For example, in line 285 of issue69.libdnn.cl, you have:
Dtype4 Creg[WPTM][WPTN/VWN];
This can be changed to:
Dtype Creg[WPTM][WPTN/VWN * 4];
And subsequent use of Creg will need to be modified to reflect the change from Dtype4 to Dtype.

The issue occurs when such vector array is used in a loop, so if you have an array of vector that's not used in a loop, the issue will not happen. This can happen whether the array is declared as private memory within the kernel or whether it is passed in as kernel argument.
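How the accesses to Creg would change under this workaround can be sketched on the host. Everything below is an assumption made for illustration: the tile sizes WPTM = 4, WPTN = 8 and VWN = 4 are made-up values (libDNN derives them from its tuner), Dtype4 is a plain struct standing in for OpenCL's float4, and at_vector, at_scalar and layouts_agree are hypothetical helpers.

```cpp
// Hypothetical tile sizes, chosen only for this example.
constexpr int WPTM = 4;
constexpr int WPTN = 8;
constexpr int VWN = 4;

// Host-side stand-in for the OpenCL float4 vector type.
struct Dtype4 { float s[4]; };

// Original layout, Dtype4 Creg[WPTM][WPTN/VWN]:
// element wn lives in lane wn % VWN of vector wn / VWN.
inline float& at_vector(Dtype4 (&creg)[WPTM][WPTN / VWN], int wm, int wn) {
  return creg[wm][wn / VWN].s[wn % VWN];
}

// Workaround layout, Dtype Creg[WPTM][WPTN/VWN * 4]:
// the four lanes of each vector become four consecutive scalars.
inline float& at_scalar(float (&creg)[WPTM][WPTN / VWN * 4], int wm, int wn) {
  return creg[wm][(wn / VWN) * 4 + wn % VWN];
}

// Writes the same values through both layouts and checks that every
// element can be read back identically from either one.
bool layouts_agree() {
  Dtype4 vec[WPTM][WPTN / VWN] = {};
  float scl[WPTM][WPTN / VWN * 4] = {};
  for (int wm = 0; wm < WPTM; ++wm) {
    for (int wn = 0; wn < WPTN; ++wn) {
      const float value = static_cast<float>(wm * WPTN + wn);
      at_vector(vec, wm, wn) = value;
      at_scalar(scl, wm, wn) = value;
    }
  }
  for (int wm = 0; wm < WPTM; ++wm) {
    for (int wn = 0; wn < WPTN; ++wn) {
      if (at_vector(vec, wm, wn) != at_scalar(scl, wm, wn)) return false;
      // With VWN == 4 the scalar index collapses to plain wn.
      if (scl[wm][wn] != at_vector(vec, wm, wn)) return false;
    }
  }
  return true;
}
```

With a vector width of 4, the expression ((Dtype*)(&(Creg[wm][wn/VWN])))[wn%VWN] from the kernel would reduce to Creg[wm][wn] in the scalar layout, since (wn/4)*4 + wn%4 equals wn.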

psyhtest commented 7 years ago

@naibaf7 I don't suppose you want to merge the above workaround into libDNN? I could probably try it on a separate branch if you could suggest where to put these changes.

naibaf7 commented 7 years ago

@psyhtest Thanks, this is great to know. No, I don't want to hard-code that, but there are actually tuning parameters that change the vectorization data type, so if I remember my own code correctly, setting the LibDNN internal tuning parameters appropriately should allow it to compile.

I'll also consider testing vector data access as a part of the pre-tuning phase of LibDNN then...
