Experiment with writing a custom matmul() to speed up calculations on small matrices

This effort should be undertaken only if linking to the optimized library versions investigated in issue #66 does not reduce runtimes.

Most calls to matmul() in the computationally intensive kernels operate on small matrices. Such calls may not benefit from the compiler's implementation of matmul(), or from linking to BLAS or another library version, because per-call overhead can outweigh the arithmetic at these sizes.

[ ] Write simple & possibly naive inline matmul() using loops
[ ] Experiment with loop optimization strategies (unrolling, fusion, inversion, etc.) & index ordering
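
As a starting point for the two tasks above, here is a minimal sketch of a naive loop-based replacement, checked against the intrinsic. The names `small_matmul_sketch`, `matmul_small`, and `dp` are hypothetical and do not come from the zgoubi source. The sketch assumes double-precision real matrices with conforming shapes (it does not check that `size(a,2) == size(b,1)`), and the comparison tolerance is an arbitrary choice.

```fortran
module small_matmul_sketch
  implicit none
  ! Hypothetical kind parameter; the project may define its own.
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Naive triple loop for c = a*b.
  ! The loop order (j, k, i) puts the first array index innermost,
  ! which is the contiguous direction in Fortran's column-major storage.
  pure function matmul_small(a, b) result(c)
    real(dp), intent(in) :: a(:,:), b(:,:)
    real(dp) :: c(size(a,1), size(b,2))
    integer :: i, j, k
    c = 0.0_dp
    do j = 1, size(b,2)
      do k = 1, size(a,2)
        do i = 1, size(a,1)
          c(i,j) = c(i,j) + a(i,k)*b(k,j)
        end do
      end do
    end do
  end function matmul_small
end module small_matmul_sketch

program check_small_matmul
  use small_matmul_sketch
  implicit none
  real(dp) :: a(3,3), b(3,3)
  call random_number(a)
  call random_number(b)
  ! Confirm agreement with the intrinsic before timing anything.
  if (maxval(abs(matmul_small(a, b) - matmul(a, b))) > 1.0e-12_dp) then
    error stop "matmul_small disagrees with intrinsic matmul"
  end if
  print *, "matmul_small agrees with intrinsic matmul"
end program check_small_matmul
```

Swapping the order of the three `do` loops is one simple way to begin the index-ordering experiments. Whether any ordering actually beats the intrinsic for the matrix sizes used in the kernels would have to be measured.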

radiasoft / zgoubi, issue #67