lhnguyen102 / cuTAGI

CUDA implementation of Tractable Approximate Gaussian Inference
MIT License
29 stars 9 forks source link

Kernel Optimization for Linear Layer #72

Closed lhnguyen102 closed 1 month ago

lhnguyen102 commented 1 month ago

Description

This PR optimize all forward and backward kernels for Linear layer for NVIDIA GPU by leveraging the shared memory and register files.

Changes Made

Related Issue(s)

close #71

Notes for Reviewers(s)

TAGI

python -m examples.mnist_bench tagi 

Pytorch

python -m examples.mnist_bench torch
jamesgoulet commented 1 month ago

@lhnguyen102 again, this is unreal πŸ€―πŸš€. I tested on our machine and this is 2:17 for TAGI vs. 2:26 for Torch. I will let @miquelflorensa Miquel test the ResNet architecture to confirm that his issue is fixed.