The goal is to optimize the forward and backward GPU kernels for Conv2d and BatchNorm2d by leveraging on-chip memory and sum reduction.
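As background, the per-channel statistics that the BatchNorm2d forward kernel computes are sum reductions over the N, H, and W axes. A minimal NumPy reference of that math (a sketch of what the kernel computes, not the CUDA implementation itself; names are illustrative):

```python
import numpy as np

def batchnorm2d_forward_ref(x, gamma, beta, eps=1e-5):
    """Reference for the per-channel reduction an optimized
    BatchNorm2d kernel performs on-chip: each channel's mean and
    variance are reductions over the N, H, W axes of x (N, C, H, W)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)   # one sum reduction per channel
    var = x.var(axis=(0, 2, 3), keepdims=True)     # second reduction (sum of squares)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```

On the GPU, these two reductions are the part that benefits from a shared-memory tree reduction rather than repeated global-memory traversals.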
Changes Made
Added optimized forward and backward kernels for Conv2d and BatchNorm2d
Added a ResNet-18 example on CIFAR-10 for both the C++ and Python APIs
Merged param_backward and state_backward into a single backward function; backward will be simplified further in the future
Added some CUDA error checks
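To illustrate the param_backward/state_backward merge, here is a hedged sketch for a toy affine op y = w * x + b; the function and variable names are hypothetical and the real signatures in this repo may differ:

```python
import numpy as np

def backward(grad_output, x, w):
    """Illustrative merged backward for y = w * x + b.

    Returns the parameter gradients (formerly param_backward) and the
    input/state gradient (formerly state_backward) from one entry point.
    """
    grad_w = float((grad_output * x).sum())  # parameter gradient for w
    grad_b = float(grad_output.sum())        # parameter gradient for b
    grad_x = grad_output * w                 # state (input) gradient
    return (grad_w, grad_b), grad_x
```

One call site now produces all gradients, which avoids recomputing shared intermediates across two separate backward passes.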
Note for Reviewer(s)
The pytorch package is required to run the ResNet-18 example because we leverage PyTorch's DataLoader to preprocess the CIFAR-10 images.
Run the following command to test ResNet-18 (~3 min/epoch on an RTX 3090 Ti):
python -m examples.resnet18_cifar10
Run the following command to see the speedup on the MNIST example (~5 s/epoch on an RTX 3090 Ti). Note that the CUDA device line must be uncommented to run on the GPU:
python -m examples.classification
Unfortunately, a couple more rounds of optimization will be needed before the ImageNet benchmark can run.