larq / compute-engine

Highly optimized inference engine for Binarized Neural Networks
https://docs.larq.dev/compute-engine
Apache License 2.0

implement optimized BGEMM for ARM architecture #71

Closed: arashb closed this issue 4 years ago

arashb commented 5 years ago

The current reference GEMM implementation does not support multi-threading, cache optimization techniques, SIMD, or the many other techniques used in efficient GEMM implementations.
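For illustration, here is a minimal sketch of what such a naive bit-packed binary GEMM looks like: a plain triple loop with no tiling, SIMD, or threading. This is not the actual kernel in this repo; the layout (row-major LHS, column-major RHS, operands packed into 64-bit words along the depth dimension) and the use of the GCC/Clang `__builtin_popcountll` builtin are assumptions made for the sketch.

```cpp
#include <cstddef>
#include <cstdint>

// Naive reference BGEMM sketch (assumed layout: row-major LHS, column-major
// RHS, both bit-packed into 64-bit words along the depth dimension).
// Each +1/-1 value occupies one bit; the binary dot product of two 64-bit
// words is 64 - 2 * popcount(a ^ b). No tiling, no SIMD, no multi-threading.
void ReferenceBGemm(const std::uint64_t* lhs,   // m x k_words, row-major
                    const std::uint64_t* rhs,   // k_words x n, column-major
                    std::int32_t* dst,          // m x n, row-major
                    std::size_t m, std::size_t n, std::size_t k_words) {
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      std::int32_t mismatches = 0;
      for (std::size_t k = 0; k < k_words; ++k) {
        mismatches +=
            __builtin_popcountll(lhs[i * k_words + k] ^ rhs[j * k_words + k]);
      }
      // Convert the count of mismatching bits back to a +/-1 dot product.
      dst[i * n + j] =
          static_cast<std::int32_t>(64 * k_words) - 2 * mismatches;
    }
  }
}
```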

There are multiple projects that we can look at for inspiration and whose implementations we could extend for binary GEMM:

TODO:

arashb commented 5 years ago

TF Lite has just recently started to use Google's ruy for its GEMM implementation on ARM devices (see https://github.com/tensorflow/tensorflow/commit/8924e67e034909bea0343631b9f9024c5a6da5c4 , https://github.com/tensorflow/tensorflow/commit/0939d5414566eeedd6b97cc2a6b0fd6800ae047f and https://github.com/tensorflow/tensorflow/commit/d4a934bb64734763ffd4db31a22c1668be9aa1b9).

@Tombana these changes were made only in the Bazel build and not in the Makefile, so they are not picked up by our Makefile-based build; we either need to update our Makefile or switch to Bazel for TF Lite as well.

For all other cases (except iOS, which provides a fast BLAS implementation), TF Lite still relies on Eigen (float) and gemmlowp (int8).
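For context, this is roughly what calling into ruy looks like. The sketch follows ruy's current public example (`ruy::Matrix`, `ruy::MakeSimpleLayout`, `ruy::MulParams`, `ruy::Mul`); the API at the time of the TF Lite commits linked above may have differed slightly.

```cpp
#include "ruy/ruy.h"

// Rough sketch of a small float matmul via ruy, following ruy's public
// example. Names reflect ruy's current headers, not necessarily the
// version used in the commits referenced above.
void MulFloatWithRuy(ruy::Context* context) {
  const float lhs_data[] = {1, 2, 3, 4};
  const float rhs_data[] = {1, 2, 3, 4};
  float dst_data[4] = {0};

  ruy::Matrix<float> lhs;
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
  lhs.set_data(lhs_data);

  ruy::Matrix<float> rhs;
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
  rhs.set_data(rhs_data);

  ruy::Matrix<float> dst;
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
  dst.set_data(dst_data);

  ruy::MulParams<float, float> mul_params;
  ruy::Mul(lhs, rhs, mul_params, context, &dst);
}
```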

Tombana commented 5 years ago

I'll investigate whether we can use Bazel. We need to figure out: