Closed rosbif closed 4 years ago
rosbif, Thanks a lot for your contribution. As for the vqtblx algorithm improvement - it looks legit, I will accept it for sure (just give me some time please). As for the vqtbl?q_ functions added - it is not an easy question. These functions belong to A64 not to the original ARM NEON set. And it means while they are useful I don't have any tests for them and even if I get them I need to specify somehow their A64 nature... Need to think it over. Thanks again.
Hi Victoria,
On 14/01/2020 at 16:52, Victoria wrote:
rosbif, Thanks a lot for your contribution. As for the vqtblx algorithm improvement - it looks legit, I will accept it for sure (just give me some time please). As for the vqtbl?q_ functions added - it is not an easy question. These functions belong to A64 not to the original ARM NEON set. And it means while they are useful I don't have any tests for them and even if I get them I need to specify somehow their A64 nature... Need to think it over. Thanks again.
I must admit that I am new to GitHub (which was probably apparent as I was a bit clumsy) and also new to NEON.
I originally wrote my code for SSE and AVX2. I was a bit bored over the holidays so I thought that it would be amusing to try AVX-512 and NEON versions. I found your excellent work which enabled me to test equivalent NEON instructions on my x86_64 hardware. The only missing instruction I needed was vqtbl1q_u8 (to replace _mm_shuffle_epi8) so I added it. Subsequently I added the others to complete the set.
I was amazed that with your superb NEON2SSE work I obtained nearly the same performance with simulated NEON as with native SSE.
Thank you for your great work.
Cheers, Chris
I am closing this because, looking at it again, I think it is buggy. Sorry :-(
Add vqtbl?q_?8 intrinsics. This fixes issue #33 "The vqtbl* intrinsics are missing". This replaces pull request #36.
Edit: I subsequently committed an improved algorithm for the vqtbl2q, vqtbl3q and vqtbl4q intrinsics which is faster, particularly with SSE4.
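For readers wondering why vqtbl1q_u8 is not a drop-in alias for _mm_shuffle_epi8: the two differ on out-of-range indices. vqtbl1q_u8 returns zero for any index of 16 or more, whereas _mm_shuffle_epi8 returns zero only when bit 7 of the index is set and otherwise uses the low four bits. The scalar sketch below models both behaviours in portable C and shows one well-known way to bridge the gap, a saturating add of 0x70 before the shuffle. The function names are hypothetical and chosen for illustration; this is not the code from this pull request or from NEON2SSE, whose actual implementation may differ.

```c
#include <stdint.h>

/* Scalar model of NEON vqtbl1q_u8: any index >= 16 yields 0. */
static void model_vqtbl1q_u8(const uint8_t table[16],
                             const uint8_t idx[16],
                             uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] < 16) ? table[idx[i]] : 0;
}

/* Scalar model of SSSE3 _mm_shuffle_epi8: 0 if bit 7 of the index
 * is set, otherwise the low 4 bits of the index select the byte. */
static void model_mm_shuffle_epi8(const uint8_t table[16],
                                  const uint8_t idx[16],
                                  uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0F];
}

/* Per-byte unsigned saturating add, as _mm_adds_epu8 does. */
static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Hypothetical emulation: adding 0x70 with saturation keeps the low
 * nibble of indices 0..15 (0x70..0x7F, bit 7 clear) and pushes every
 * index >= 16 to 0x80 or above (bit 7 set), so the shuffle zeroes it. */
static void emulated_vqtbl1q_u8(const uint8_t table[16],
                                const uint8_t idx[16],
                                uint8_t out[16])
{
    uint8_t adjusted[16];
    for (int i = 0; i < 16; i++)
        adjusted[i] = sat_add_u8(idx[i], 0x70);
    model_mm_shuffle_epi8(table, adjusted, out);
}
```

With real intrinsics the same idea would presumably be `_mm_shuffle_epi8(table, _mm_adds_epu8(idx, _mm_set1_epi8(0x70)))`, though that one-liner is an untested assumption here and needs SSSE3 enabled at compile time.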