[ hgemm ] hgemm noTrans with kernel 8x16

nnstreamer / nntrainer

NNtrainer is a software framework for training neural network models on devices.

Apache License 2.0

This commit proposes an 8x16 kernel for half-precision GEMM (HGEMM). Note that this is not a fully optimized version of HGEMM, but it is still better than the previous implementation. The following is unit-test output for HGEMM with f16-f32 partial accumulation. Accuracy is fine, and latency improves over the previous implementation on all but the two smallest shapes.

Mean latency (TC = 20)

GEMM dimension	fp32 (cblas)	prev	8x8	8x16
4096 square	2087 ms	7172 ms	...	1964 ms
2048 square	260 ms	413 ms	...	250 ms
1024 square	34 ms	52 ms	...	30 ms
768 square	13 ms	18 ms	...	11 ms
256X1440X256	2869 µs	3807 µs	...	2544 µs
256X256X1440	2929 µs	3950 µs	...	2467 µs
8X1440X8	5 µs	5 µs	...	10 µs
8X8X1440	5 µs	4 µs	...	8 µs
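
For readers unfamiliar with the terms, here is a minimal scalar sketch of what an 8x16 micro-kernel with f16-f32 partial accumulation could look like. It is not the code from this PR, which targets hardware half-precision arithmetic. All names (`micro_kernel_8x16`, `hgemm_noTrans_sketch`, `MR`, `NR`, `KB`) and the chunk length of 4 are assumptions made for illustration. The low-precision type is a template parameter so the sketch compiles anywhere; on a target with a native half type it could be instantiated with that type instead of `float`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical tile sizes: an 8x16 block of C is computed per micro-kernel call.
constexpr std::size_t MR = 8;   // rows of C per tile
constexpr std::size_t NR = 16;  // columns of C per tile
constexpr std::size_t KB = 4;   // assumed K-chunk summed in low precision

// Computes one 8x16 tile of C = A * B (row-major, no transpose).
// Low is the low-precision type; float is only a portable stand-in.
template <typename Low>
void micro_kernel_8x16(const Low *A, const Low *B, float *C, std::size_t K,
                       std::size_t lda, std::size_t ldb, std::size_t ldc) {
  float acc[MR][NR] = {};  // full-precision accumulators
  for (std::size_t k0 = 0; k0 < K; k0 += KB) {
    Low partial[MR][NR] = {};  // short partial sums kept in low precision
    const std::size_t kend = std::min(k0 + KB, K);
    for (std::size_t k = k0; k < kend; ++k)
      for (std::size_t i = 0; i < MR; ++i)
        for (std::size_t j = 0; j < NR; ++j)
          partial[i][j] += A[i * lda + k] * B[k * ldb + j];
    // Fold each short partial sum into the fp32 accumulator.
    for (std::size_t i = 0; i < MR; ++i)
      for (std::size_t j = 0; j < NR; ++j)
        acc[i][j] += static_cast<float>(partial[i][j]);
  }
  for (std::size_t i = 0; i < MR; ++i)
    for (std::size_t j = 0; j < NR; ++j)
      C[i * ldc + j] = acc[i][j];
}

// Drives the micro-kernel over an M x N output; A is M x K, B is K x N.
// This sketch assumes M is a multiple of 8 and N a multiple of 16.
template <typename Low>
void hgemm_noTrans_sketch(const Low *A, const Low *B, float *C, std::size_t M,
                          std::size_t N, std::size_t K) {
  assert(M % MR == 0 && N % NR == 0);
  for (std::size_t i = 0; i < M; i += MR)
    for (std::size_t j = 0; j < N; j += NR)
      micro_kernel_8x16(A + i * K, B + j, C + i * N + j, K, K, N, N);
}
```

Keeping only short runs of products in low precision bounds the rounding error that builds up over a long K dimension, while most of the arithmetic still runs at half-precision speed. The 8x16 tile shape lets each loaded element of A and B be reused across many outputs. A real kernel would also need edge handling for sizes that are not multiples of the tile.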

[ hgemm ] hgemm noTrans with kernel 8x16 #2541