Open Triang3l opened 3 years ago
Could you point to the exact code that you have in mind?
Are you referring to this?
This implementation is essentially 7 years old. Are you interested in contributing an optimized version of this code?
Yes, and the vtbl4/vqtbl1q implementations for 32-bit and 64-bit swaps. I can try setting up the environment on my phone, write direct vrev versions, and possibly run some speed comparisons and tests over the weekend.
The Arm Neon versions of byte swaps (volk_*_byteswap.h) in VOLK use shifts/OR or lookup tables, somewhat similar to the x86 versions. However, Neon has a dedicated instruction for byte swaps, VREV, usable as `vrev16q_u8` for 8-in-16, `vrev32q_u8` for 8-in-32, and `vrev64q_u8` for 8-in-64. Are there performance/compatibility reasons for not using it, or was the instruction simply not known when the code was written?
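For illustration, here is a minimal sketch of what a direct VREV version of the 32-bit swap could look like. The function name `byteswap32_sketch` and its signature are hypothetical and not taken from VOLK. The Neon path is guarded by a preprocessor check, so the same file also builds on non-Arm targets, where only the scalar loop runs. The 16-bit and 64-bit cases would presumably follow the same pattern with `vrev16q_u8` and `vrev64q_u8`.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not an actual VOLK kernel): reverses the byte
 * order of each 32-bit word in place. */
static void byteswap32_sketch(uint32_t *data, size_t num_points)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process four 32-bit words (16 bytes) per iteration. */
    for (; i + 4 <= num_points; i += 4) {
        uint8x16_t v = vld1q_u8((const uint8_t *)(data + i));
        v = vrev32q_u8(v); /* reverse the bytes within each 32-bit element */
        vst1q_u8((uint8_t *)(data + i), v);
    }
#endif

    /* Scalar tail; also the full fallback when Neon is not available. */
    for (; i < num_points; ++i) {
        uint32_t x = data[i];
        data[i] = (x >> 24) | ((x >> 8) & 0x0000FF00u) |
                  ((x << 8) & 0x00FF0000u) | (x << 24);
    }
}
```

Whether this is actually faster than the existing shift/OR and table-lookup kernels would still need to be measured on real hardware.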