nnstreamer / nntrainer

NNTrainer is a software framework for training neural network models on devices.
Apache License 2.0

[ Tensor ] Accelerate fp16 matrix transpose with SIMD #2582

Closed: skykongkong8 closed this issue 1 month ago

skykongkong8 commented 1 month ago

Accelerate matrix transpose with SIMD

The matrix transpose function in the latest NNTrainer (14.05.24) is implemented with plain for-loops. While the current implementation works for the general (b,c,h,w) Tensor transpose, it is a rather naive approach for the (h,w) matrix transpose.
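
For reference, the element-by-element loop described above looks roughly like the sketch below. This is illustrative only, not NNTrainer's actual code; the function name and signature are hypothetical, and the real routine handles the full (b,c,h,w) layout.

```cpp
#include <cstddef>

// Illustrative naive (h, w) fp16 transpose: one element per iteration,
// no vectorization. Hypothetical name/signature, not NNTrainer's API.
void transpose_fp16_naive(size_t h, size_t w, const __fp16 *src, __fp16 *dst) {
  for (size_t i = 0; i < h; ++i)
    for (size_t j = 0; j < w; ++j)
      dst[j * h + i] = src[i * w + j]; // one strided store per element
}
```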

Nevertheless, NNTrainer relies on this implementation quite often.

Work to accelerate these paths with SIMD is currently in progress.
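
For illustration, a minimal NEON sketch of a 4x4-blocked fp16 transpose is shown below: each 4x4 tile is loaded with vector loads, shuffled in registers, and stored with vector stores, with scalar fallback for edge remainders. This is a sketch under stated assumptions, not NNTrainer's actual implementation; the names `transpose4x4_fp16` and `transpose_fp16_neon` are hypothetical.

```cpp
#include <arm_neon.h>
#include <cstddef>

// Transpose one 4x4 fp16 tile: 4 vector loads, in-register shuffles, 4 stores.
// (Hypothetical helper, not NNTrainer's API.)
static inline void transpose4x4_fp16(const __fp16 *src, size_t src_stride,
                                     __fp16 *dst, size_t dst_stride) {
  uint16x4_t r0 = vreinterpret_u16_f16(vld1_f16(src + 0 * src_stride));
  uint16x4_t r1 = vreinterpret_u16_f16(vld1_f16(src + 1 * src_stride));
  uint16x4_t r2 = vreinterpret_u16_f16(vld1_f16(src + 2 * src_stride));
  uint16x4_t r3 = vreinterpret_u16_f16(vld1_f16(src + 3 * src_stride));

  // Interleave 16-bit lanes of row pairs, then 32-bit pairs across the pairs.
  uint16x4x2_t t01 = vtrn_u16(r0, r1);
  uint16x4x2_t t23 = vtrn_u16(r2, r3);
  uint32x2x2_t c02 = vtrn_u32(vreinterpret_u32_u16(t01.val[0]),
                              vreinterpret_u32_u16(t23.val[0]));
  uint32x2x2_t c13 = vtrn_u32(vreinterpret_u32_u16(t01.val[1]),
                              vreinterpret_u32_u16(t23.val[1]));

  vst1_f16(dst + 0 * dst_stride, vreinterpret_f16_u32(c02.val[0]));
  vst1_f16(dst + 1 * dst_stride, vreinterpret_f16_u32(c13.val[0]));
  vst1_f16(dst + 2 * dst_stride, vreinterpret_f16_u32(c02.val[1]));
  vst1_f16(dst + 3 * dst_stride, vreinterpret_f16_u32(c13.val[1]));
}

// Transpose an h x w fp16 matrix: 4x4 NEON tiles plus scalar edge handling.
void transpose_fp16_neon(size_t h, size_t w, const __fp16 *src, __fp16 *dst) {
  size_t i = 0;
  for (; i + 4 <= h; i += 4) {
    size_t j = 0;
    for (; j + 4 <= w; j += 4)
      transpose4x4_fp16(src + i * w + j, w, dst + j * h + i, h);
    for (; j < w; ++j) // remaining columns of this row band
      for (size_t ii = i; ii < i + 4; ++ii)
        dst[j * h + ii] = src[ii * w + j];
  }
  for (; i < h; ++i) // remaining rows
    for (size_t j = 0; j < w; ++j)
      dst[j * h + i] = src[i * w + j];
}
```

The 4x4 pattern replaces sixteen scalar stores per tile with four vector stores; larger tiles or prefetching could improve throughput further, but even this blocking is enough to illustrate the technique.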

taos-ci commented 1 month ago

:octocat: cibot: Thank you for posting issue #2582. The person in charge will reply soon.

skykongkong8 commented 1 month ago

Update: 14.05.2024

| dim | prev (for-loop) latency | # elements |
| --- | --- | --- |
| 768x768 | 465 µs | 589,824 |
| 512x2048 | 1.31 ms | 1,048,576 |
| 1920x1560 | 1.78 ms | 2,995,200 |
| 1560x2048 | 5.02 ms | 3,194,880 |

skykongkong8 commented 1 month ago

Latency measurement: 16.05.2024

TC = 20, measured on a Galaxy S23 with frequently used matrix dimensions.

| dim | prev (for-loop) | NEON |
| --- | --- | --- |
| 768x768 | 400 µs | 121 µs |
| 1440x1440 | 2 ms | 0.44 ms |
| 1920x1560 | 4.3 ~ 1.6 ms | 1.8 ~ 0.8 ms |
| 1560x2048 | 4.18 ms | 0.618 ms |
| 512x2048 | 1.31 ms | 0.18 ms |