Profiling shows that the specialization for RegBlockUint8<8, 8> is the hottest
one for our data. Specialize it for MatrixMap and make a local copy of the
destination so that the compiler can prove that the stores alias neither &data
nor &stride_, and therefore no longer reloads MatrixMap::stride_ in the loop.
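
A minimal sketch of the idea, with hypothetical names (not gemmlowp's actual
code): uint8 stores may legally alias any object, so writing through the
MatrixMap's members directly forces the compiler to reload data_ and stride_
after every store, while local copies provably cannot be clobbered by those
stores.

    #include <cstdint>

    // Hypothetical, simplified stand-in for gemmlowp's MatrixMap
    // (column-major; the real class is templated on map order).
    struct MatrixMapStandIn {
      std::uint8_t* data_;
      int stride_;
    };

    // Illustrative sketch: unpack an 8x8 block of uint8 values into a
    // sub-block of the destination matrix. If we wrote through dst->data_
    // directly, the compiler would have to assume each uint8 store might
    // clobber dst->data_ or dst->stride_ and reload both every iteration.
    void UnpackUint8Block8x8(const std::uint8_t src[64],
                             MatrixMapStandIn* dst, int row, int col) {
      // Local copies: the stores below provably cannot modify these
      // locals, so they stay in registers for the whole loop.
      std::uint8_t* const dst_ptr = dst->data_ + row + col * dst->stride_;
      const int dst_stride = dst->stride_;
      for (int c = 0; c < 8; c++) {
        for (int r = 0; r < 8; r++) {
          dst_ptr[r + c * dst_stride] = src[r + 8 * c];
        }
      }
    }

With dst_ptr and dst_stride held in locals, the inner loop compiles down to
plain stores and address arithmetic.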
This makes the small-model and GoogLeNet GEMMs about 1% faster.
Profiling shows that this mostly comes from "unpack to row-major" being faster:
Before:
    gemmlowp profile (1 threads, 9469 samples)
    94.31% gemmlowp::MultiThreadGemm
        94.30% gemmlowp::SingleThreadGemm
            73.97% compute
                62.39% optimized kernel
                11.57% other
            8.73% pack LHS
            6.51% unpack to column-major
            4.95% unpack to row-major
            0.13% pack RHS
            0.01% other
        0.01% other
    5.69% other (outside of any label)
After:
    93.89% gemmlowp::MultiThreadGemm
        93.89% gemmlowp::SingleThreadGemm
            74.80% compute
                61.82% optimized kernel
                12.98% other
            9.03% pack LHS
            6.24% unpack to column-major
            3.68% unpack to row-major
            0.12% pack RHS
            0.02% other
        0.00% other
    6.11% other (outside of any label)