Kernel Optimization for Linear Layer

Description

This PR optimizes all forward and backward kernels of the Linear layer for NVIDIA GPUs by leveraging shared memory and register files.

Added new kernels for the forward and backward passes in the Linear layer
Fixed a memory leak across layers
Merged state_backward and param_backward into a single backward function (see base_layer.cpp)
Added a small-size benchmark that compares TAGI with PyTorch
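
For reviewers who want a mental model of the new kernels, here is a rough plain-Python sketch of the tiling pattern that shared-memory kernels typically rely on. It is not the code in this PR. It assumes the usual TAGI moment propagation for a product of independent Gaussian weights and activations, and every name in it (`TILE`, `linear_forward_naive`, `linear_forward_tiled`) is hypothetical. The staged input slice stands in for a shared-memory load and the local accumulators stand in for registers.

```python
# Illustrative sketch only; names are hypothetical, not cuTAGI identifiers.
TILE = 2  # assumed tile width; a real kernel would use something like 16 or 32


def linear_forward_naive(mu_w, var_w, mu_a, var_a, mu_b, var_b):
    """Reference version: walk the full input for every output node."""
    mu_z, var_z = [], []
    for i in range(len(mu_w)):
        m, v = mu_b[i], var_b[i]
        for k in range(len(mu_a)):
            m += mu_w[i][k] * mu_a[k]
            # Variance of a product of two independent Gaussians (assumed form).
            v += (var_w[i][k] * var_a[k]
                  + var_w[i][k] * mu_a[k] ** 2
                  + var_a[k] * mu_w[i][k] ** 2)
        mu_z.append(m)
        var_z.append(v)
    return mu_z, var_z


def linear_forward_tiled(mu_w, var_w, mu_a, var_a, mu_b, var_b, tile=TILE):
    """Tiled version: stage one slice of the input, reuse it for all outputs."""
    mu_z, var_z = list(mu_b), list(var_b)
    for k0 in range(0, len(mu_a), tile):
        # Stand-in for loading a tile of the input into shared memory once.
        mu_a_tile = mu_a[k0:k0 + tile]
        var_a_tile = var_a[k0:k0 + tile]
        for i in range(len(mu_w)):
            m_acc, v_acc = 0.0, 0.0  # stand-in for per-thread registers
            for j, (am, av) in enumerate(zip(mu_a_tile, var_a_tile)):
                wm, wv = mu_w[i][k0 + j], var_w[i][k0 + j]
                m_acc += wm * am
                v_acc += wv * av + wv * am * am + av * wm * wm
            mu_z[i] += m_acc
            var_z[i] += v_acc
    return mu_z, var_z


# Tiny example with 3 inputs and 2 outputs (ragged last tile on purpose).
mu_w = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
var_w = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
mu_a, var_a = [1.0, 2.0, 3.0], [0.01, 0.02, 0.03]
mu_b, var_b = [0.1, -0.1], [0.05, 0.05]

naive = linear_forward_naive(mu_w, var_w, mu_a, var_a, mu_b, var_b)
tiled = linear_forward_tiled(mu_w, var_w, mu_a, var_a, mu_b, var_b)
```

Both functions compute the same means and variances. The tiled one only changes the loop order so that each staged slice of the input is reused across all output nodes, which is where a GPU kernel saves global-memory traffic.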

close #71

7-10x faster on Linear layers with more than 4096 nodes, but only 1-3x faster below that threshold.
Please don't merge right away once you're done reviewing. I still want to remove all the temporary imports for the release.

Here are the commands to run the small-size benchmark for TAGI and PyTorch on a CUDA device:

TAGI

python -m examples.mnist_bench tagi

PyTorch

python -m examples.mnist_bench torch

Memory usage in cuTAGI is only 15-20% higher than in PyTorch, which is a really good result. I am surprised because, in theory, the TAGI approach stores 2x more memory for each variable, so this needs further investigation.
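
To make the 2x expectation concrete, here is a back-of-the-envelope sketch. It assumes float32 storage and that TAGI keeps a mean and a variance for every weight and bias. The helper `linear_param_bytes` is hypothetical and counts parameter storage only, so it does not explain the measured 15-20% figure.

```python
BYTES_PER_FLOAT = 4  # assumption: float32 storage


def linear_param_bytes(n_in, n_out, moments):
    """Bytes for the weights and biases of one Linear layer.

    moments=1 models a deterministic layer (one value per parameter),
    moments=2 models TAGI (mean and variance per parameter).
    """
    n_params = n_in * n_out + n_out
    return n_params * moments * BYTES_PER_FLOAT


deterministic_bytes = linear_param_bytes(4096, 4096, moments=1)
tagi_bytes = linear_param_bytes(4096, 4096, moments=2)
ratio = tagi_bytes / deterministic_bytes
```

For a 4096 x 4096 layer this gives about 64 MiB against about 128 MiB, a ratio of exactly 2 under these assumptions.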
The learning rate for the PyTorch model has not been optimized yet, so its accuracy is not great. On the other hand, TAGI seems to be less sensitive to \sigma_v, because it achieves better performance with a random guess of \sigma_v. Enjoy the speed 🚀