The goal is to optimize the forward and backward GPU kernels for Conv2d and BatchNorm2d by leveraging on-chip memory and sum reduction.
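As background, the per-channel statistics that the BatchNorm2d forward kernel computes are sum reductions over the N, H, and W axes. A minimal NumPy reference of that math (a sketch of what the kernel computes, not the CUDA implementation itself; names are illustrative):

```python
import numpy as np

def batchnorm2d_forward_ref(x, gamma, beta, eps=1e-5):
    """Reference for the per-channel reduction an optimized
    BatchNorm2d kernel performs on-chip: each channel's mean and
    variance are reductions over the N, H, W axes of x (N, C, H, W)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)   # one sum reduction per channel
    var = x.var(axis=(0, 2, 3), keepdims=True)     # second reduction (sum of squares)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```

On the GPU, these two reductions are the part that benefits from a shared-memory tree reduction rather than repeated global-memory traversals.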
Changes Made
Added optimized forward and backward kernels for Conv2d and BatchNorm2d
Added a ResNet-18 example on CIFAR-10 for both the C++ and Python APIs
Merged param_backward and state_backward into a single backward function; backward will be simplified further in the future
Added some CUDA error checks
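To illustrate the param_backward/state_backward merge, here is a hedged sketch for a toy affine op y = w * x + b; the function and variable names are hypothetical and the real signatures in this repo may differ:

```python
import numpy as np

def backward(grad_output, x, w):
    """Illustrative merged backward for y = w * x + b.

    Returns the parameter gradients (formerly param_backward) and the
    input/state gradient (formerly state_backward) from one entry point.
    """
    grad_w = float((grad_output * x).sum())  # parameter gradient for w
    grad_b = float(grad_output.sum())        # parameter gradient for b
    grad_x = grad_output * w                 # state (input) gradient
    return (grad_w, grad_b), grad_x
```

One call site now produces all gradients, which avoids recomputing shared intermediates across two separate backward passes.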
Note for Reviewer(s)
The pytorch package is required to run the ResNet-18 example because we leverage PyTorch's DataLoader to preprocess the CIFAR-10 images.
Run the following command to test ResNet-18 (~3 min/epoch on an RTX 3090 Ti):
python -m examples.resnet18_cifar10
Run the following command to see the speedup on the MNIST example (~5 s/epoch on an RTX 3090 Ti). Note that the CUDA device line must be uncommented to run on the GPU:
python -m examples.classification
Unfortunately, a couple more rounds of optimization will be needed before the ImageNet benchmark can run.