mratsim / Arraymancer

A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, Cuda and OpenCL backends
https://mratsim.github.io/Arraymancer/
Apache License 2.0

Optimize integer BLAS (GEMM and GEMV) #25

Closed: mratsim closed this issue 6 years ago

mratsim commented 6 years ago

Follow-up on https://github.com/mratsim/Arraymancer/issues/6

Basic integer matrix multiplication and matrix-vector multiplication are implemented.
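For reference, a quick sketch of how this is exercised through the user-facing API (as I understand it, `*` dispatches to the integer GEMM/GEMV fallback for int tensors rather than to BLAS):

```nim
import arraymancer

let a = [[1, 2, 3],
         [4, 5, 6]].toTensor        # 2x3 Tensor[int]
let b = [[1, 0],
         [0, 1],
         [1, 1]].toTensor           # 3x2 Tensor[int]
let v = [1, 2, 3].toTensor          # rank-1 Tensor[int]

echo a * b   # integer matrix-matrix product (GEMM), 2x2 result
echo a * v   # integer matrix-vector product (GEMV), length-2 result
```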

Several optimizations can be implemented to speed up integer computation further.

Can be done

Unsure if possible

The D library Mir GLAS uses LLVM intrinsics to get this information.

Hard and not portable

Use AVX2 intrinsics. AVX2 operations support integers, but unfortunately it is hard to get the compiler to emit them automatically. An alternative is to code directly with intrinsics.

As Arraymancer's integer GEMM is based on the ulmBLAS design, intrinsics could be implemented by following the ulmBLAS course: http://apfel.mathematik.uni-ulm.de/~lehn/ulmBLAS/ (see the sketch below).
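For illustration only, a rough, hypothetical sketch of what wiring AVX2 integer intrinsics into Nim could look like. The `m256i` type and `mm256_*` declarations are hand-rolled imports of the C intrinsics from immintrin.h, and `madd8xInt32` is a made-up helper, not an Arraymancer kernel:

```nim
{.passC: "-mavx2".}

type m256i {.importc: "__m256i", header: "immintrin.h".} = object

proc mm256_loadu_si256(p: pointer): m256i {.importc: "_mm256_loadu_si256", header: "immintrin.h".}
proc mm256_storeu_si256(p: pointer, a: m256i) {.importc: "_mm256_storeu_si256", header: "immintrin.h".}
proc mm256_mullo_epi32(a, b: m256i): m256i {.importc: "_mm256_mullo_epi32", header: "immintrin.h".}
proc mm256_add_epi32(a, b: m256i): m256i {.importc: "_mm256_add_epi32", header: "immintrin.h".}

proc madd8xInt32(c: var openArray[int32], a, b: openArray[int32]) =
  ## c[i] += a[i] * b[i], 8 int32 lanes per iteration; roughly the inner
  ## operation of a GEMM micro-kernel (hypothetical helper).
  var i = 0
  while i + 8 <= c.len:
    let va = mm256_loadu_si256(unsafeAddr a[i])
    let vb = mm256_loadu_si256(unsafeAddr b[i])
    let vc = mm256_loadu_si256(unsafeAddr c[i])
    mm256_storeu_si256(addr c[i], mm256_add_epi32(vc, mm256_mullo_epi32(va, vb)))
    i += 8
  while i < c.len:       # scalar tail
    c[i] += a[i] * b[i]
    inc i
```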

Unsure if helpful

Using pointers instead of seq + offset. While it looks like less computation (no bounds checks, no recomputing of the position during iteration), using seq means the compiler can make many more assumptions about the data layout and optimize accesses (and GEMM is memory-bound, so recomputing the position is cheap). An experiment with safe pointers (unsafe when built for release) can be found in the pointer_GEMM branch.
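For comparison, a minimal sketch (with hypothetical helper names, not Arraymancer internals) of the two addressing schemes for walking a row-major matrix:

```nim
# seq + offset: bounds-checked in debug builds, and the compiler sees the
# whole container, so it can reason about the data layout.
proc sumSeq(data: seq[int], offset, rows, cols, rowStride: int): int =
  for i in 0 ..< rows:
    for j in 0 ..< cols:
      result += data[offset + i * rowStride + j]

# raw pointer: no bounds checks and no offset bookkeeping, but the compiler
# only sees an unchecked pointer into memory.
proc sumPtr(p: ptr UncheckedArray[int], rows, cols, rowStride: int): int =
  for i in 0 ..< rows:
    for j in 0 ..< cols:
      result += p[i * rowStride + j]
```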

mratsim commented 6 years ago

Regarding pointers: Nim defaults to -fno-strict-aliasing in its -d:release compile flags, which probably prevents some vectorization when GCC is unsure whether pointers are aliased or not.

See this and this for an explanation of aliasing.

Note: if the seq version used the restrict keyword when looping over the data, the no-strict-aliasing flag shouldn't have much impact on it.
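A possible experiment to measure this (not library code, and assuming flags passed via passC end up after Nim's default -fno-strict-aliasing on the GCC command line, so the later flag wins):

```nim
# Re-enable strict aliasing for this module and compare the generated
# assembly / timings of an inner loop against the default build.
{.passC: "-fstrict-aliasing".}

proc scaleAdd(dst: var seq[int32], src: seq[int32], alpha: int32) =
  ## Toy inner loop to inspect with `nim c -d:release` plus objdump/perf.
  ## Note: for same-typed arrays, strict aliasing alone does not tell the
  ## compiler that dst and src don't overlap; that is what restrict adds.
  for i in 0 ..< min(dst.len, src.len):
    dst[i] += alpha * src[i]
```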