libocca / occa

Portable and vendor neutral framework for parallel programming on heterogeneous platforms.
https://libocca.org
MIT License

Add an example with 2d/3d blocks and threads in OKL #223

Closed mesonepigreco closed 5 years ago

mesonepigreco commented 5 years ago

Dear developers, I would like to write a kernel that indexes blocks and threads as dim3 vectors (as is possible in CUDA or OpenCL), using multiple inner and outer loops. The CUDA syntax is something like:

int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
array2d[Ny*i + j]  = bla bla bla;

I saw an old example in these slides where this is used, but the OKL syntax there differs from the current one, so I wonder whether it has changed. https://www.caam.rice.edu/~mk51/presentations/SIAMPP2016_6.pdf

I could not find such an example among those provided; it would be great to add one that uses this feature. Can this be done in the current OKL? If so, how?

dmed256 commented 5 years ago

The syntax did change a bit, but the concepts should be the same. Here's a 2D example and its CUDA translation, generated with

./bin/occa translate --mode CUDA test.okl

@kernel void test(float *array2d) {
  for (int yBlock = 0; yBlock < Y; yBlock += 16; @outer) {
    for (int xBlock = 0; xBlock < X; xBlock += 16; @outer) {
      for (int j = yBlock; j < (yBlock + 16); ++j; @inner) {
        for (int i = xBlock; i < (xBlock + 16); ++i; @inner) {
          array2d[j*X + i] = "bla bla bla";
        }
      }
    }
  }
}


extern "C" __global__ void _occa_test_0(float *array2d) {
  {
    int yBlock = 0 + (16 * blockIdx.y);
    {
      int xBlock = 0 + (16 * blockIdx.x);
      {
        int j = yBlock + threadIdx.y;
        {
          int i = xBlock + threadIdx.x;
          array2d[j * X + i] = "bla bla bla";
        }
      }
    }
  }
}
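
To make the mapping above concrete, here is a minimal host-side sketch in plain C++ that walks the same four loops and checks that every element of the 2D array is written exactly once. The function name `every_element_visited_once` and the `tile` parameter are hypothetical and not part of OCCA; the sketch assumes, as the kernel above implicitly does, that X and Y are multiples of the tile size (16 in the example).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of OCCA: replays the OKL loop nest on the
// host and reports whether each array element is touched exactly once.
// Assumes X and Y are multiples of `tile`, like the kernel in the thread.
bool every_element_visited_once(int X, int Y, int tile) {
  assert(tile > 0 && X % tile == 0 && Y % tile == 0);
  std::vector<int> visits(static_cast<std::size_t>(X) * Y, 0);

  for (int yBlock = 0; yBlock < Y; yBlock += tile) {      // @outer -> blockIdx.y
    for (int xBlock = 0; xBlock < X; xBlock += tile) {    // @outer -> blockIdx.x
      for (int j = yBlock; j < yBlock + tile; ++j) {      // @inner -> threadIdx.y
        for (int i = xBlock; i < xBlock + tile; ++i) {    // @inner -> threadIdx.x
          // Ids the translated CUDA kernel would see for this iteration.
          const int blockIdxX = xBlock / tile, threadIdxX = i - xBlock;
          const int blockIdxY = yBlock / tile, threadIdxY = j - yBlock;
          // The familiar CUDA global index; holds by construction here.
          if (i != blockIdxX * tile + threadIdxX) return false;
          if (j != blockIdxY * tile + threadIdxY) return false;
          ++visits[static_cast<std::size_t>(j) * X + i];
        }
      }
    }
  }

  for (int v : visits) {
    if (v != 1) return false;
  }
  return true;
}
```

Calling `every_element_visited_once(32, 48, 16)` replays a 2 x 3 grid of 16 x 16 tiles; with a tile size equal to `blockDim`, the loop variables `i` and `j` play the role of `blockIdx * blockDim + threadIdx` from the question.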

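The @outer and @inner attributes take an optional argument that specifies which dimension the for-loop traverses (0, 1, and 2 correspond to x, y, and z). By default, the dimensions are assigned x, y, and z from the innermost to the outermost loop, separately for the @outer loops and the @inner loops.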
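
The default rule can be written down as a small lookup. This is only an illustrative sketch: `axis_for_index` and `default_axis` are hypothetical names, not OCCA functions, and the 0/1/2 to x/y/z correspondence is read off the translated output in this thread.

```cpp
#include <cassert>

// Hypothetical helper, not part of OCCA: the axis selected by an explicit
// @outer(n) or @inner(n) argument (0 -> x, 1 -> y, 2 -> z).
char axis_for_index(int n) {
  assert(n >= 0 && n < 3);
  return "xyz"[n];
}

// Hypothetical helper: the default axis for a loop without an argument.
// `depth` counts from the outermost loop of its kind (0 = outermost);
// `count` is how many loops of that kind (@outer or @inner) are nested.
char default_axis(int depth, int count) {
  assert(count >= 1 && count <= 3 && depth >= 0 && depth < count);
  return axis_for_index(count - 1 - depth);  // the innermost loop gets x
}
```

For the two-level example above, `default_axis(0, 2)` gives `'y'` for the yBlock loop and `default_axis(1, 2)` gives `'x'` for the xBlock loop, which matches `blockIdx.y` and `blockIdx.x` in the translated kernel.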

Here's an example that switches the x and y dimensions:

@kernel void test(float *array2d) {
  for (int yBlock = 0; yBlock < Y; yBlock += 16; @outer(0)) {
    for (int xBlock = 0; xBlock < X; xBlock += 16; @outer(1)) {
      for (int j = yBlock; j < (yBlock + 16); ++j; @inner(0)) {
        for (int i = xBlock; i < (xBlock + 16); ++i; @inner(1)) {
          array2d[j*X + i] = "bla bla bla";
        }
      }
    }
  }
}

extern "C" __global__ void _occa_test_0(float *array2d) {
  {
    int yBlock = 0 + (16 * blockIdx.x);
    {
      int xBlock = 0 + (16 * blockIdx.y);
      {
        int j = yBlock + threadIdx.x;
        {
          int i = xBlock + threadIdx.y;
          array2d[j * X + i] = "bla bla bla";
        }
      }
    }
  }
}
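
One practical consequence of the swap, sketched on the host in plain C++: with the default mapping, threads that are neighbors in `threadIdx.x` write to adjacent elements, while in the swapped version they are X elements apart. On CUDA hardware this usually matters for memory coalescing, so the choice of dimension is not purely cosmetic. The helpers `flat_index` and `neighbor_stride` are hypothetical and only mirror the index arithmetic of the two translated kernels, with the tile size fixed at 16 as in the examples.

```cpp
#include <cassert>

// Hypothetical helper, not part of OCCA: the flat array index written by a
// thread with the given ids, using 16x16 tiles as in the examples.
// swapped == false: default mapping (i from the x ids, j from the y ids).
// swapped == true : the @outer(0)/@inner(0) variant (j from the x ids).
int flat_index(int X, bool swapped,
               int blockIdxX, int blockIdxY, int threadIdxX, int threadIdxY) {
  assert(X > 0);
  const int tile = 16;
  const int fromX = tile * blockIdxX + threadIdxX;
  const int fromY = tile * blockIdxY + threadIdxY;
  const int i = swapped ? fromY : fromX;
  const int j = swapped ? fromX : fromY;
  return j * X + i;
}

// Distance in elements between two threads that differ by one in threadIdx.x.
int neighbor_stride(int X, bool swapped) {
  return flat_index(X, swapped, 0, 0, 1, 0) - flat_index(X, swapped, 0, 0, 0, 0);
}
```

With X = 64, `neighbor_stride(64, false)` is 1 and `neighbor_stride(64, true)` is 64.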

Although the documentation is a bit lacking, here are 2 places to check out

mesonepigreco commented 5 years ago

Thank you very much, this is exactly what I was looking for.

dmed256 commented 5 years ago

@mesonepigreco Awesome, glad it helped! Feel free to ask any questions, it'll help target what documentation to update :)