google / gemmlowp

Low-precision matrix multiplication
Apache License 2.0

StoreFinalOutputImpl::Run is reloading MatrixMap::data_ and MatrixMap::stride_ in a loop #194

Closed legrosbuffle closed 4 years ago

legrosbuffle commented 4 years ago

StoreFinalOutputImpl::Run reloads MatrixMap::data_ and MatrixMap::stride_ on every iteration of its store loop.

Profiling shows that the specialization for RegBlockUint8<8, 8> is the hottest one for our data. Specialize it for MatrixMap and make a local copy of the destination pointer and stride, so that the compiler can prove that stores through the destination alias neither data_ nor stride_.

This makes small model & GoogLeNet GEMMs about 1% faster.

Profiling shows that this mostly comes from "unpack to row-major" being faster:

Before:

    gemmlowp profile (1 threads, 9469 samples)
    94.31% gemmlowp::MultiThreadGemm
        94.30% gemmlowp::SingleThreadGemm
            73.97% compute
                62.39% optimized kernel
                11.57% other
            8.73% pack LHS
            6.51% unpack to column-major
            4.95% unpack to row-major
            0.13% pack RHS
            0.01% other
        0.01% other
    5.69% other (outside of any label)

After:

    93.89% gemmlowp::MultiThreadGemm
        93.89% gemmlowp::SingleThreadGemm
            74.80% compute
                61.82% optimized kernel
                12.98% other
            9.03% pack LHS
            6.24% unpack to column-major
            3.68% unpack to row-major
            0.12% pack RHS
            0.02% other
        0.00% other
    6.11% other (outside of any label)