pyccel / pyccel-cuda

Cuda extension to pyccel
MIT License

Basic support for Custom Kernels. #28

Closed bauom closed 5 months ago

bauom commented 10 months ago

This issue aims to add support for creating custom kernels in the Numba style. Below is an example of a kernel definition; such a kernel would be called in the following format: increment_by_one[BN, TPB](args), where BN is the number of blocks to be dispatched on the GPU and TPB is the number of threads in each block. This can be implemented by checking, in the semantic stage, whether an IndexedElement wraps a FunctionCall, and replacing it in the AST with a KernelCall node. A KernelCall can be detected when an IndexedElement contains a FunctionCall whose function is decorated with the kernel decorator.
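The detection step described above could be sketched roughly as follows. The class names (IndexedElement, FunctionCall, KernelCall) follow the issue text, but the attributes and the helper function are hypothetical stand-ins, not Pyccel's actual AST API:

```python
# Hypothetical sketch of the semantic-stage check described above.
# The dataclasses below are illustrative stand-ins for Pyccel AST nodes.
from dataclasses import dataclass

@dataclass
class FunctionCall:
    name: str
    decorators: tuple = ()
    args: tuple = ()

@dataclass
class IndexedElement:
    base: object      # the expression being indexed, e.g. increment_by_one(...)
    indices: tuple    # e.g. (BN, TPB)

@dataclass
class KernelCall:
    func: FunctionCall
    num_blocks: object          # BN
    threads_per_block: object   # TPB

def maybe_kernel_call(node):
    """Replace an IndexedElement containing a @kernel-decorated
    FunctionCall with a KernelCall node; leave other nodes unchanged."""
    if (isinstance(node, IndexedElement)
            and isinstance(node.base, FunctionCall)
            and 'kernel' in node.base.decorators
            and len(node.indices) == 2):
        bn, tpb = node.indices
        return KernelCall(node.base, bn, tpb)
    return node
```

For example, maybe_kernel_call applied to an IndexedElement wrapping a kernel-decorated call would return a KernelCall carrying BN and TPB, while an ordinary indexed expression would pass through untouched.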

Numba code:

from numba import cuda
@cuda.jit
def increment_by_one(an_array):
    # Thread id in a 1D block
    tx = cuda.threadIdx.x
    # Block id in a 1D grid
    ty = cuda.blockIdx.x
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim.x
    # Compute flattened index inside the array
    pos = tx + ty * bw
    if pos < an_array.size:  # Check array boundaries
        an_array[pos] += 1

Pyccel code:

from pyccel.internals import cuda
from pyccel.decorators import kernel
@kernel
def increment_by_one(an_array):
    # Thread id in a 1D block
    tx = cuda.threadIdx(0)
    # Block id in a 1D grid
    ty = cuda.blockIdx(0)
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim(0)
    # Compute flattened index inside the array
    pos = tx + ty * bw
    if pos < an_array.size:  # Check array boundaries
        an_array[pos] += 1
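The semantics of the proposed launch syntax increment_by_one[BN, TPB](an_array) can be emulated in pure Python: every (block, thread) pair computes the same flattened index as the kernel body above. The launch helper below is only an illustration of the indexing scheme, not part of Pyccel:

```python
# Hypothetical pure-Python emulation of a 1D kernel launch.
# bn = number of blocks (BN), tpb = threads per block (TPB).
def launch(kernel_body, bn, tpb, an_array):
    for ty in range(bn):          # block id in a 1D grid
        for tx in range(tpb):     # thread id in a 1D block
            kernel_body(an_array, tx, ty, tpb)

def increment_body(an_array, tx, ty, bw):
    pos = tx + ty * bw            # flattened index inside the array
    if pos < len(an_array):       # check array boundaries
        an_array[pos] += 1
```

With bn * tpb at least as large as the array, each element is visited by exactly one (block, thread) pair and incremented once, which is the behaviour the kernel above relies on.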
EmilyBourne commented 10 months ago
from pyccel.decorators import kernel
@kernel
def increment_by_one(an_array):
    # Thread id in a 1D block
    tx = cuda.threadIdx(0)
    # Block id in a 1D grid
    ty = cuda.blockIdx(0)
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim(0)
    # Compute flattened index inside the array
    pos = tx + ty * bw
    if pos < an_array.size:  # Check array boundaries
        an_array[pos] += 1

Your code looks a little problematic to me. cuda is not defined anywhere. Did you want to treat it as a built-in?

I would have expected code such as:

from numba import cuda
from pyccel.decorators import kernel

@kernel
def increment_by_one(an_array):
    # Thread id in a 1D block
    tx = cuda.threadIdx.x
    # Block id in a 1D grid
    ty = cuda.blockIdx.x
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim.x
    # Compute flattened index inside the array
    pos = tx + ty * bw
    if pos < an_array.size:  # Check array boundaries
        an_array[pos] += 1

Would the latter run in pure Python at all ?
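One hedged sketch of how the Pyccel-style cuda.threadIdx(0) calls could be made to run in pure Python: a stub cuda object whose index queries all return the values of a single thread in a single block. This is purely illustrative, not Pyccel's actual API, and it shows the limitation the question points at, since without a launcher only one logical thread executes:

```python
# Hypothetical pure-Python fallback for the cuda index queries.
class _CudaStub:
    @staticmethod
    def threadIdx(dim):
        return 0      # only one thread in the stub
    @staticmethod
    def blockIdx(dim):
        return 0      # only one block in the stub
    @staticmethod
    def blockDim(dim):
        return 1      # block width of one

cuda = _CudaStub()

def increment_by_one(an_array):
    tx = cuda.threadIdx(0)
    ty = cuda.blockIdx(0)
    bw = cuda.blockDim(0)
    pos = tx + ty * bw
    if pos < len(an_array):
        an_array[pos] += 1
```

Calling this directly increments only element 0, so pure-Python execution of the full kernel would still need some launch mechanism to iterate over all (block, thread) pairs.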

bauom commented 10 months ago
from pyccel import cuda
from pyccel.decorators import kernel
@kernel
def increment_by_one(an_array):
    # Thread id in a 1D block
    tx = cuda.threadIdx(0)
    # Block id in a 1D grid
    ty = cuda.blockIdx(0)
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim(0)
    # Compute flattened index inside the array
    pos = tx + ty * bw
    if pos < an_array.size:  # Check array boundaries
        an_array[pos] += 1

Sorry, I missed a from pyccel import cuda. The aim in this first step is just to have a cuda Pyccel internal library that is not specific to any particular backend, so that we can later use it with all the libraries.

EmilyBourne commented 10 months ago

as the aim in this first step is just to have a cuda Pyccel internal library that is not specific to a library

In that case shouldn't it be from pyccel.internals import cuda to match what is done for the other internal libraries?

bauom commented 10 months ago

as the aim in this first step is just to have a cuda Pyccel internal library that is not specific to a library

In that case shouldn't it be from pyccel.internals import cuda to match what is done for the other internal libraries?

Yeah, that would be better 👍 I will change it.