mratsim / Arraymancer

A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, Cuda and OpenCL backends
https://mratsim.github.io/Arraymancer/
Apache License 2.0

Optimize integer BLAS (GEMM and GEMV) #25

Closed: mratsim closed this issue 6 years ago

mratsim commented 6 years ago

Follow-up on https://github.com/mratsim/Arraymancer/issues/6

Basic integer matrix multiplication and matrix-vector multiplication are implemented.
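For reference, a quick sketch of how this is exercised through the user-facing API (as I understand it, `*` dispatches to the integer GEMM/GEMV fallback for int tensors rather than to BLAS):

```nim
import arraymancer

let a = [[1, 2, 3],
         [4, 5, 6]].toTensor        # 2x3 Tensor[int]
let b = [[1, 0],
         [0, 1],
         [1, 1]].toTensor           # 3x2 Tensor[int]
let v = [1, 2, 3].toTensor          # rank-1 Tensor[int]

echo a * b   # integer matrix-matrix product (GEMM), 2x2 result
echo a * v   # integer matrix-vector product (GEMV), length-2 result
```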

Several optimizations can be implemented to speed up integer computation further.

Can be done

Unsure if possible

The D library Mir GLAS uses LLVM intrinsics to get this information.

Hard and not portable

Use AVX2 intrinsics. AVX2 operations support integers, but unfortunately it is hard to get the compiler to emit them automatically. An alternative is to code directly with intrinsics.

As Arraymancer's integer GEMM is based on the ulmBLAS design, intrinsics could be implemented by following the ulmBLAS course: http://apfel.mathematik.uni-ulm.de/~lehn/ulmBLAS/ (see the sketch below).
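For illustration only, a rough, hypothetical sketch of what wiring AVX2 integer intrinsics into Nim could look like. The `m256i` type and `mm256_*` declarations are hand-rolled imports of the C intrinsics from immintrin.h, and `madd8xInt32` is a made-up helper, not an Arraymancer kernel:

```nim
{.passC: "-mavx2".}

type m256i {.importc: "__m256i", header: "immintrin.h".} = object

proc mm256_loadu_si256(p: pointer): m256i {.importc: "_mm256_loadu_si256", header: "immintrin.h".}
proc mm256_storeu_si256(p: pointer, a: m256i) {.importc: "_mm256_storeu_si256", header: "immintrin.h".}
proc mm256_mullo_epi32(a, b: m256i): m256i {.importc: "_mm256_mullo_epi32", header: "immintrin.h".}
proc mm256_add_epi32(a, b: m256i): m256i {.importc: "_mm256_add_epi32", header: "immintrin.h".}

proc madd8xInt32(c: var openArray[int32], a, b: openArray[int32]) =
  ## c[i] += a[i] * b[i], 8 int32 lanes per iteration; roughly the inner
  ## operation of a GEMM micro-kernel (hypothetical helper).
  var i = 0
  while i + 8 <= c.len:
    let va = mm256_loadu_si256(unsafeAddr a[i])
    let vb = mm256_loadu_si256(unsafeAddr b[i])
    let vc = mm256_loadu_si256(unsafeAddr c[i])
    mm256_storeu_si256(addr c[i], mm256_add_epi32(vc, mm256_mullo_epi32(va, vb)))
    i += 8
  while i < c.len:       # scalar tail
    c[i] += a[i] * b[i]
    inc i
```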

Unsure if helpful

Using pointers instead of seq + offset. While it looks like less computation (no bounds checks, no recomputing of the position during iteration), using seq means the compiler can make many more assumptions about the data layout and optimize accesses (and GEMM is memory-bound, so recomputing the position is cheap). An experiment with safe pointers (unsafe when built for release) can be found in the pointer_GEMM branch.
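For comparison, a minimal sketch (with hypothetical helper names, not Arraymancer internals) of the two addressing schemes for walking a row-major matrix:

```nim
# seq + offset: bounds-checked in debug builds, and the compiler sees the
# whole container, so it can reason about the data layout.
proc sumSeq(data: seq[int], offset, rows, cols, rowStride: int): int =
  for i in 0 ..< rows:
    for j in 0 ..< cols:
      result += data[offset + i * rowStride + j]

# raw pointer: no bounds checks and no offset bookkeeping, but the compiler
# only sees an unchecked pointer into memory.
proc sumPtr(p: ptr UncheckedArray[int], rows, cols, rowStride: int): int =
  for i in 0 ..< rows:
    for j in 0 ..< cols:
      result += p[i * rowStride + j]
```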

mratsim commented 6 years ago

Regarding pointers: Nim defaults to -fno-strict-aliasing in its -d:release compile flags, which probably prevents some vectorization when GCC is unsure whether pointers are aliased or not.

See this and this for an explanation of aliasing.

Note: if the seq version used the restrict keyword when looping over the data, the no-strict-aliasing flag shouldn't have much impact on it.
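A possible experiment to measure this (not library code, and assuming flags passed via passC end up after Nim's default -fno-strict-aliasing on the GCC command line, so the later flag wins):

```nim
# Re-enable strict aliasing for this module and compare the generated
# assembly / timings of an inner loop against the default build.
{.passC: "-fstrict-aliasing".}

proc scaleAdd(dst: var seq[int32], src: seq[int32], alpha: int32) =
  ## Toy inner loop to inspect with `nim c -d:release` plus objdump/perf.
  ## Note: for same-typed arrays, strict aliasing alone does not tell the
  ## compiler that dst and src don't overlap; that is what restrict adds.
  for i in 0 ..< min(dst.len, src.len):
    dst[i] += alpha * src[i]
```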