nnstreamer / nntrainer

NNTrainer is a software framework for training neural network models on devices.
Apache License 2.0

[ Tensor ] Accelerate fp16 matrix transpose with SIMD #2582

Closed: skykongkong8 closed this issue 1 month ago

skykongkong8 commented 1 month ago

Accelerate matrix transpose with SIMD

The matrix transpose function in the latest NNTrainer (14.05.24) is implemented with plain for-loops. While the current implementation works for the general (b,c,h,w) Tensor transpose, it is a rather naive approach for the (h,w) matrix transpose.
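
For reference, the element-by-element loop described above looks roughly like the sketch below. This is illustrative only, not NNTrainer's actual code; the function name and signature are hypothetical, and the real routine handles the full (b,c,h,w) layout.

```cpp
#include <cstddef>

// Illustrative naive (h, w) fp16 transpose: one element per iteration,
// no vectorization. Hypothetical name/signature, not NNTrainer's API.
void transpose_fp16_naive(size_t h, size_t w, const __fp16 *src, __fp16 *dst) {
  for (size_t i = 0; i < h; ++i)
    for (size_t j = 0; j < w; ++j)
      dst[j * h + i] = src[i * w + j]; // one strided store per element
}
```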

Nevertheless, NNTrainer relies on this implementation quite often.

Work to accelerate these paths with SIMD is currently in progress.
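
For illustration, a minimal NEON sketch of a 4x4-blocked fp16 transpose is shown below: each 4x4 tile is loaded with vector loads, shuffled in registers, and stored with vector stores, with scalar fallback for edge remainders. This is a sketch under stated assumptions, not NNTrainer's actual implementation; the names `transpose4x4_fp16` and `transpose_fp16_neon` are hypothetical.

```cpp
#include <arm_neon.h>
#include <cstddef>

// Transpose one 4x4 fp16 tile: 4 vector loads, in-register shuffles, 4 stores.
// (Hypothetical helper, not NNTrainer's API.)
static inline void transpose4x4_fp16(const __fp16 *src, size_t src_stride,
                                     __fp16 *dst, size_t dst_stride) {
  uint16x4_t r0 = vreinterpret_u16_f16(vld1_f16(src + 0 * src_stride));
  uint16x4_t r1 = vreinterpret_u16_f16(vld1_f16(src + 1 * src_stride));
  uint16x4_t r2 = vreinterpret_u16_f16(vld1_f16(src + 2 * src_stride));
  uint16x4_t r3 = vreinterpret_u16_f16(vld1_f16(src + 3 * src_stride));

  // Interleave 16-bit lanes of row pairs, then 32-bit pairs across the pairs.
  uint16x4x2_t t01 = vtrn_u16(r0, r1);
  uint16x4x2_t t23 = vtrn_u16(r2, r3);
  uint32x2x2_t c02 = vtrn_u32(vreinterpret_u32_u16(t01.val[0]),
                              vreinterpret_u32_u16(t23.val[0]));
  uint32x2x2_t c13 = vtrn_u32(vreinterpret_u32_u16(t01.val[1]),
                              vreinterpret_u32_u16(t23.val[1]));

  vst1_f16(dst + 0 * dst_stride, vreinterpret_f16_u32(c02.val[0]));
  vst1_f16(dst + 1 * dst_stride, vreinterpret_f16_u32(c13.val[0]));
  vst1_f16(dst + 2 * dst_stride, vreinterpret_f16_u32(c02.val[1]));
  vst1_f16(dst + 3 * dst_stride, vreinterpret_f16_u32(c13.val[1]));
}

// Transpose an h x w fp16 matrix: 4x4 NEON tiles plus scalar edge handling.
void transpose_fp16_neon(size_t h, size_t w, const __fp16 *src, __fp16 *dst) {
  size_t i = 0;
  for (; i + 4 <= h; i += 4) {
    size_t j = 0;
    for (; j + 4 <= w; j += 4)
      transpose4x4_fp16(src + i * w + j, w, dst + j * h + i, h);
    for (; j < w; ++j) // remaining columns of this row band
      for (size_t ii = i; ii < i + 4; ++ii)
        dst[j * h + ii] = src[ii * w + j];
  }
  for (; i < h; ++i) // remaining rows
    for (size_t j = 0; j < w; ++j)
      dst[j * h + i] = src[i * w + j];
}
```

The 4x4 pattern replaces sixteen scalar stores per tile with four vector stores; larger tiles or prefetching could improve throughput further, but even this blocking is enough to illustrate the technique.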

taos-ci commented 1 month ago

:octocat: cibot: Thank you for posting issue #2582. The person in charge will reply soon.

skykongkong8 commented 1 month ago

Update: 14.05.2024

| dim | prev (for-loop) latency | # elements |
| --- | --- | --- |
| 768x768 | 465 µs | 589,824 |
| 512x2048 | 1.31 ms | 1,048,576 |
| 1920x1560 | 1.78 ms | 2,995,200 |
| 1560x2048 | 5.02 ms | 3,194,880 |

skykongkong8 commented 1 month ago

Latency measurement: 16.05.2024

TC = 20, measured on a Galaxy S23 with frequently used matrix dimensions.

| dim | prev (for-loop) | NEON |
| --- | --- | --- |
| 768x768 | 400 µs | 121 µs |
| 1440x1440 | 2 ms | 0.44 ms |
| 1920x1560 | 4.3 ~ 1.6 ms | 1.8 ~ 0.8 ms |
| 1560x2048 | 4.18 ms | 0.618 ms |
| 512x2048 | 1.31 ms | 0.18 ms |