lhnguyen102 / cuTAGI

CUDA implementation of Tractable Approximate Gaussian Inference
MIT License

Layer Input and PACK_SIZE issue with CUDA #92

Open miquelflorensa opened 1 month ago

miquelflorensa commented 1 month ago

@lhnguyen102 When using an input layer whose size is not a multiple of PACK_SIZE, the kernel crashes with a misaligned address. I have only experienced this issue while running on GPU.

I guess some modifications need to be made in the set_buffer_size function:

void Sequential::set_buffer_size()
/*
Set the shared buffer size to the largest number of hidden states across all
layers, rounded up to the nearest multiple of PACK_SIZE.
 */
{
    for (auto &layer : this->layers) {
        int max_size = layer->get_max_num_states();
        this->z_buffer_size = std::max(max_size, this->z_buffer_size);
    }

    // Round up to the nearest multiple of PACK_SIZE
    if (this->z_buffer_size % PACK_SIZE != 0) {
        this->z_buffer_size =
            ((this->z_buffer_size + PACK_SIZE - 1) / PACK_SIZE) * PACK_SIZE;
    }
}
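
For reference, here is a minimal, self-contained sketch (not cuTAGI code) of the round-up this function performs, assuming PACK_SIZE = 4 as mentioned below: 785 states are padded to 788, so the buffer size itself already ends up as a multiple of PACK_SIZE, which suggests the misaligned access may come from the kernel side rather than from this rounding.

#include <cassert>

constexpr int PACK_SIZE = 4;  // assumed value; see the discussion below

// Round size up to the nearest multiple of PACK_SIZE, mirroring the
// arithmetic in set_buffer_size above.
constexpr int round_up_to_pack(int size) {
    return ((size + PACK_SIZE - 1) / PACK_SIZE) * PACK_SIZE;
}

int main() {
    assert(round_up_to_pack(785) == 788);  // the failing input size reported below
    assert(round_up_to_pack(788) == 788);  // already a multiple of PACK_SIZE
    return 0;
}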

I will try to solve it by myself, but I wanted to leave the issue here.

lhnguyen102 commented 1 month ago

@miquelflorensa Thank you for pointing that out. I thought the code was able to handle such a case. Could you please share your model details? If you find a solution, please don't hesitate to create a PR to fix it.

miquelflorensa commented 1 month ago

@lhnguyen102 I have experienced the issue with an FNN with 785 inputs in CUDA (when I use 788 inputs it works fine). I still need to pin down exactly in which situations this error happens, or whether I did something wrong on my side. Once I find the cause I'll update this issue.

Specific Architecture:

FNN = Sequential(
    Linear(28 * 28 + 1, 6000),  # 785 inputs, not a multiple of PACK_SIZE
    ReLU(),
    Linear(6000, 28 * 28),      # 784 outputs
)

The crash happens for any batch size.

lhnguyen102 commented 1 month ago

@miquelflorensa I recently optimized the CUDA kernels' memory access pattern, where it is preferable to have a vector size that is a multiple of PACK_SIZE, such as 4. This allows accessing 4 elements at once, leading to faster performance. Normally, there is a dispatch mechanism that switches back to the non-optimized kernels for cases like yours. I'll double-check, because there might be a bug there.
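
For illustration, here is a minimal sketch of that dispatch idea (hypothetical kernels, not cuTAGI's actual ones): a float4-vectorized kernel is used only when the element count is a multiple of PACK_SIZE (assumed to be 4), and a scalar kernel handles all other cases. If the vectorized path is taken for a count such as 785, or for a pointer offset that is not a multiple of PACK_SIZE, the float4 loads land on addresses that are not 16-byte aligned, which produces exactly the misaligned-address error reported above.

#include <cuda_runtime.h>

constexpr int PACK_SIZE = 4;  // assumed pack size, matching the value above

// Scalar fallback: one float per thread, valid for any size.
__global__ void copy_scalar(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Vectorized kernel: one float4 (four floats) per thread. Requires the
// element count and any pointer offsets to be multiples of PACK_SIZE,
// otherwise the 16-byte loads are misaligned.
__global__ void copy_vec4(const float4 *in, float4 *out, int n_vec) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_vec) out[i] = in[i];
}

// Dispatch: take the vectorized path only when the size allows it.
void copy_dispatch(const float *d_in, float *d_out, int n) {
    constexpr int threads = 256;
    if (n % PACK_SIZE == 0) {
        int n_vec = n / PACK_SIZE;
        int blocks = (n_vec + threads - 1) / threads;
        copy_vec4<<<blocks, threads>>>(
            reinterpret_cast<const float4 *>(d_in),
            reinterpret_cast<float4 *>(d_out), n_vec);
    } else {
        int blocks = (n + threads - 1) / threads;
        copy_scalar<<<blocks, threads>>>(d_in, d_out, n);
    }
}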

miquelflorensa commented 1 month ago

@lhnguyen102 I'm still not sure why it crashes in my case or under what exact circumstances. I tested it with inputs like 13 for the Boston Housing dataset, and it worked fine, so it’s possible I’m missing something, but I haven’t identified it yet. It’s not urgent to find the source of the problem, but I'll update the thread once I do.