larq / compute-engine

Highly optimized inference engine for Binarized Neural Networks
https://docs.larq.dev/compute-engine
Apache License 2.0

implement optimized BGEMM for ARM architecture #71

Closed: arashb closed this issue 4 years ago

arashb commented 5 years ago

The current reference GEMM implementation does not support multi-threading, cache optimization techniques, SIMD, or the many other techniques used in efficient GEMM implementations.
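For illustration, here is a minimal sketch of what such a naive bit-packed binary GEMM looks like: a plain triple loop with no tiling, SIMD, or threading. This is not the actual kernel in this repo; the layout (row-major LHS, column-major RHS, operands packed into 64-bit words along the depth dimension) and the use of the GCC/Clang `__builtin_popcountll` builtin are assumptions made for the sketch.

```cpp
#include <cstddef>
#include <cstdint>

// Naive reference BGEMM sketch (assumed layout: row-major LHS, column-major
// RHS, both bit-packed into 64-bit words along the depth dimension).
// Each +1/-1 value occupies one bit; the binary dot product of two 64-bit
// words is 64 - 2 * popcount(a ^ b). No tiling, no SIMD, no multi-threading.
void ReferenceBGemm(const std::uint64_t* lhs,   // m x k_words, row-major
                    const std::uint64_t* rhs,   // k_words x n, column-major
                    std::int32_t* dst,          // m x n, row-major
                    std::size_t m, std::size_t n, std::size_t k_words) {
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      std::int32_t mismatches = 0;
      for (std::size_t k = 0; k < k_words; ++k) {
        mismatches +=
            __builtin_popcountll(lhs[i * k_words + k] ^ rhs[j * k_words + k]);
      }
      // Convert the count of mismatching bits back to a +/-1 dot product.
      dst[i * n + j] =
          static_cast<std::int32_t>(64 * k_words) - 2 * mismatches;
    }
  }
}
```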

There are multiple projects that we can look at for inspiration and whose implementations we could extend for binary GEMM:

TODO:

arashb commented 5 years ago

TF Lite has just recently started to use Google's ruy for its GEMM implementation on ARM devices (see https://github.com/tensorflow/tensorflow/commit/8924e67e034909bea0343631b9f9024c5a6da5c4 , https://github.com/tensorflow/tensorflow/commit/0939d5414566eeedd6b97cc2a6b0fd6800ae047f and https://github.com/tensorflow/tensorflow/commit/d4a934bb64734763ffd4db31a22c1668be9aa1b9).

@Tombana these changes were made only in the Bazel build and not in the Makefile, so they are not picked up by our Makefile-based build; we either need to update our Makefile or switch to Bazel for TF Lite as well.

For all other cases (except iOS, which provides a fast BLAS implementation), TF Lite still relies on Eigen (float) and gemmlowp (int8).
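For context, this is roughly what calling into ruy looks like. The sketch follows ruy's current public example (`ruy::Matrix`, `ruy::MakeSimpleLayout`, `ruy::MulParams`, `ruy::Mul`); the API at the time of the TF Lite commits linked above may have differed slightly.

```cpp
#include "ruy/ruy.h"

// Rough sketch of a small float matmul via ruy, following ruy's public
// example. Names reflect ruy's current headers, not necessarily the
// version used in the commits referenced above.
void MulFloatWithRuy(ruy::Context* context) {
  const float lhs_data[] = {1, 2, 3, 4};
  const float rhs_data[] = {1, 2, 3, 4};
  float dst_data[4] = {0};

  ruy::Matrix<float> lhs;
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
  lhs.set_data(lhs_data);

  ruy::Matrix<float> rhs;
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
  rhs.set_data(rhs_data);

  ruy::Matrix<float> dst;
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
  dst.set_data(dst_data);

  ruy::MulParams<float, float> mul_params;
  ruy::Mul(lhs, rhs, mul_params, context, &dst);
}
```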

Tombana commented 5 years ago

I'll investigate whether we can use Bazel. We need to figure out: