naibaf7 / caffe

Caffe: a fast open framework for deep learning. With OpenCL and CUDA support.
http://caffe.berkeleyvision.org/

need some help in caffe opencl kernel of im2col.cl #46

Closed zazd closed 7 years ago

zazd commented 7 years ago

I want to modify the OpenCL im2col kernel to try some ideas of my own, such as transposition. When I read im2col.cl in caffe-opencl, I am unsure how it works, and my changes produce wrong output. I understand the CPU im2col and have some understanding of OpenCL. Can you tell me what this line means: for (int_tp index = get_global_id(0); index < n; index += get_global_size(0))? Also, you use ViennaCL for OpenCL support, but I cannot find where global_work_size and local_work_size are set. Either they are not set, or they are set somewhere I have overlooked.

Thank you!

naibaf7 commented 7 years ago

ViennaCL/OpenCL picks the work sizes itself, and lowers the values if the kernel cannot launch at a certain size. The kernel is written in a way that works with any local or global size.

The loop you mention iterates over all column data elements (int_tp num_kernels = channels * height_col * width_col;). How many iterations each GPU thread performs depends on the global work size that OpenCL picks.
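To make the strided loop concrete, here is a minimal CPU sketch in C++ that imitates what each work item does. The names indices_for_work_item and coverage are hypothetical and do not appear in Caffe or ViennaCL; global_id and global_size stand in for get_global_id(0) and get_global_size(0). It only illustrates the indexing pattern, not the im2col arithmetic itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for one OpenCL work item running
//   for (index = get_global_id(0); index < n; index += get_global_size(0))
// global_id and global_size play the roles of get_global_id(0) and
// get_global_size(0).
std::vector<std::size_t> indices_for_work_item(std::size_t global_id,
                                               std::size_t global_size,
                                               std::size_t n) {
  assert(global_size > 0 && global_id < global_size);
  std::vector<std::size_t> handled;
  for (std::size_t index = global_id; index < n; index += global_size) {
    handled.push_back(index);
  }
  return handled;
}

// Counts how often each of the n elements is visited once every work item
// has run its loop. Each entry should end up as exactly 1.
std::vector<int> coverage(std::size_t global_size, std::size_t n) {
  std::vector<int> hits(n, 0);
  for (std::size_t gid = 0; gid < global_size; ++gid) {
    for (std::size_t index : indices_for_work_item(gid, global_size, n)) {
      ++hits[index];
    }
  }
  return hits;
}
```

With n = 12 and a global size of 5, work item 1 handles indices 1, 6 and 11. With a global size of 1 the loop degenerates to the plain sequential loop over 0 to n - 1, and with a global size larger than n the surplus work items simply skip the loop body. In every case each index is visited exactly once, which is why the kernel does not depend on a particular work size.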

zazd commented 7 years ago

I still cannot understand the loop. Can you tell me the possible values of get_global_id(0) and get_global_size(0), or show me the original loop this corresponds to?

zazd commented 7 years ago

For example, the kernel is 3 * 3 and the data is 4 * 4 {1, 2, 3, ..., 16}, with 3 channels (each channel is the same).

naibaf7 commented 7 years ago

@zazd get_global_id(0) ranges from 0 to get_global_size(0) - 1, depending on which thread it is. get_global_size(0) is the actual number of threads. This can be anything from 1 to 1024 (or even more), depending on the device.

This should also make it clear:

template<typename Dtype>
void greentea_im2col_gpu(viennacl::ocl::program *prog,
                         viennacl::ocl::context *ctx, const cl_mem data_im,
                         const int_tp data_offset, const int_tp channels,
                         const int_tp height, const int_tp width,
                         const int_tp kernel_h, const int_tp kernel_w,
                         const int_tp pad_h, const int_tp pad_w,
                         const int_tp stride_h, const int_tp stride_w,
                         const int_tp dilation_h, const int_tp dilation_w,
                         cl_mem data_col, const int_tp data_col_off) {
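  // Spatial size of the column buffer (the convolution output size).
  int_tp height_col = (height + 2 * pad_h - (dilation_h * (kernel_h - 1) + 1))
      / stride_h + 1;
  int_tp width_col = (width + 2 * pad_w - (dilation_w * (kernel_w - 1) + 1))
      / stride_w + 1;
  // Total number of indices the kernel loop has to cover (its loop bound).
  int_tp num_kernels = channels * height_col * width_col;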

  viennacl::ocl::kernel &kernel = prog->get_kernel(CL_KERNEL_SELECT("im2col"));

  viennacl::ocl::enqueue(
      kernel(num_kernels, WrapHandle(data_im, ctx), data_offset, height, width,
             kernel_h, kernel_w, pad_h, pad_w, stride_h, stride_w, dilation_h,
             dilation_w, height_col, width_col, WrapHandle(data_col, ctx),
             data_col_off),
      ctx->get_queue());
}
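
As a worked check of the size formulas in greentea_im2col_gpu, the sketch below recomputes them for the example from earlier in the thread (4 * 4 data, 3 * 3 kernel, 3 channels). The thread does not state padding, stride or dilation, so pad 0, stride 1 and dilation 1 are assumed here. The helpers col_extent and num_kernels are hypothetical, and plain int is used in place of int_tp to keep the example self-contained.

```cpp
#include <cassert>

// Hypothetical helper (not part of Caffe): output extent along one axis,
// using the same integer arithmetic as height_col / width_col in
// greentea_im2col_gpu.
int col_extent(int size, int kernel, int pad, int stride, int dilation) {
  assert(size > 0 && kernel > 0 && stride > 0 && dilation > 0 && pad >= 0);
  const int effective_kernel = dilation * (kernel - 1) + 1;
  return (size + 2 * pad - effective_kernel) / stride + 1;
}

// Number of loop indices the im2col kernel is asked to cover:
// one per (channel, output row, output column).
int num_kernels(int channels, int height_col, int width_col) {
  return channels * height_col * width_col;
}
```

Under those assumptions, col_extent(4, 3, 0, 1, 1) gives 2 for both height_col and width_col, so num_kernels(3, 2, 2) is 12. The kernel loop therefore covers indices 0 to 11, however many work items share them.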

zazd commented 7 years ago

I understand now, thank you. That is very kind of you.