Accelerate matrix transpose with SIMD
Before SIMD (for-loops), tested on Galaxy S23
dim | latency | # elements |
---|---|---|
768x768 | 465 µs | 589,824 |
512x2048 | 1.31 ms | 1,048,576 |
1920x1560 | 1.78 ms | 2,995,200 |
1560x2048 | 5.02 ms | 3,194,880 |
After SIMD (NEON), TC = 20, tested on Galaxy S23 with frequently used matrix dimensions
dim | prev (for-loops) | NEON |
---|---|---|
768x768 | 400 µs | 121 µs |
1440x1440 | 2 ms | 0.44 ms |
1920x1560 | 4.3 ~ 1.6 ms | 1.8 ~ 0.8 ms |
1560x2048 | 4.18 ms | 0.618 ms |
512x2048 | 1.31 ms | 0.18 ms |
The matrix transpose function in the latest NNTrainer (as of 14.05.24) is implemented using for-loops. Although the current implementation is useful for the general (b,c,h,w)-tensor transpose, it is a rather naive implementation for the (h,w)-matrix transpose.
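For reference, the for-loop approach boils down to something like the following minimal sketch (the name `transpose_naive` and its signature are illustrative assumptions, not NNTrainer's actual API):

```cpp
#include <cstddef>

// Illustrative sketch, not NNTrainer's actual implementation.
// Naive row-major (h, w) -> (w, h) transpose with plain for-loops.
// Each write to dst jumps by h floats, so the access pattern is
// cache-unfriendly and the SIMD lanes stay unused.
void transpose_naive(const float *src, float *dst, size_t h, size_t w) {
  for (size_t i = 0; i < h; ++i)
    for (size_t j = 0; j < w; ++j)
      dst[j * h + i] = src[i * w + j];
}
```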
Nevertheless, NNTrainer relies on this implementation quite often, from `hgemm_noTrans` to all kinds of transposed GEMMs. Work is currently in progress to accelerate them with SIMD; the sketch below illustrates the direction.
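As a rough illustration only, here is a minimal NEON 4x4-block transpose, assuming AArch64 NEON intrinsics and dimensions divisible by 4 (`transpose_4x4_block` and `transpose_neon` are hypothetical names, not the actual WIP kernels):

```cpp
#include <arm_neon.h>
#include <cstddef>

// Hypothetical sketch; assumes AArch64 NEON and h, w divisible by 4.
// Transpose one 4x4 float block in registers: two vtrnq shuffles plus
// 64-bit half recombines replace 16 strided scalar loads/stores.
static inline void transpose_4x4_block(const float *src, float *dst,
                                       size_t w, size_t h) {
  float32x4_t r0 = vld1q_f32(src + 0 * w);
  float32x4_t r1 = vld1q_f32(src + 1 * w);
  float32x4_t r2 = vld1q_f32(src + 2 * w);
  float32x4_t r3 = vld1q_f32(src + 3 * w);

  // Interleave row pairs: t01.val[0] = {r0[0], r1[0], r0[2], r1[2]}, etc.
  float32x4x2_t t01 = vtrnq_f32(r0, r1);
  float32x4x2_t t23 = vtrnq_f32(r2, r3);

  // Combine 64-bit halves to finish the transpose: c0 holds column 0.
  float32x4_t c0 = vcombine_f32(vget_low_f32(t01.val[0]), vget_low_f32(t23.val[0]));
  float32x4_t c1 = vcombine_f32(vget_low_f32(t01.val[1]), vget_low_f32(t23.val[1]));
  float32x4_t c2 = vcombine_f32(vget_high_f32(t01.val[0]), vget_high_f32(t23.val[0]));
  float32x4_t c3 = vcombine_f32(vget_high_f32(t01.val[1]), vget_high_f32(t23.val[1]));

  vst1q_f32(dst + 0 * h, c0);
  vst1q_f32(dst + 1 * h, c1);
  vst1q_f32(dst + 2 * h, c2);
  vst1q_f32(dst + 3 * h, c3);
}

// Tile the full (h, w) matrix into 4x4 blocks; a real implementation
// would also handle remainder rows/columns when h or w % 4 != 0.
void transpose_neon(const float *src, float *dst, size_t h, size_t w) {
  for (size_t i = 0; i < h; i += 4)
    for (size_t j = 0; j < w; j += 4)
      transpose_4x4_block(src + i * w + j, dst + j * h + i, w, h);
}
```

Working in 4x4 tiles turns 16 strided scalar stores per block into 4 contiguous vector stores, which is the kind of gain reflected in the NEON column of the tables above.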