scopedog / gf-nishida-16

A Simple and Fast 16bit Galois Field Arithmetic Library in C
BSD 3-Clause "New" or "Revised" License

Use latest SSE2NEON #3

Closed: jserv closed this issue 2 years ago

jserv commented 2 years ago

Commit add344fe added SSE2NEON for the Arm port. However, the bundled header file is outdated. Consider taking the latest SSE2NEON instead, which brings bug fixes and support for more SSE intrinsics.

scopedog commented 2 years ago

OK. Thanks. I'll replace it with a new one.

animetosho commented 2 years ago

I'd recommend considering using ARM intrinsics directly. It shouldn't be too hard, and if it helps, here's a reference with most of what you'll likely need.
It lets you take advantage of ARM-specific features like interleaved loads/stores (vld2q_u8/vst2q_u8), which save you from the pack/unpack routine.
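As a rough illustration of the interleaved loads/stores mentioned above, here is a minimal sketch. It is not code from gf-nishida-16: the helper names `split_planes` and `merge_planes` are made up for this example, and the byte-plane layout is an assumption. On builds with NEON, `vld2q_u8` reads 32 bytes and de-interleaves them into two 16-byte vectors (even-indexed bytes and odd-indexed bytes), and `vst2q_u8` does the reverse. A scalar loop handles the tail and doubles as a fallback on non-Arm targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: split n bytes laid out as (b0, b1, b0, b1, ...)
 * into two planes. For little-endian 16-bit words, lo[] receives the
 * low bytes and hi[] the high bytes. n must be even. */
static void split_planes(const uint8_t *src, uint8_t *lo, uint8_t *hi,
                         size_t n)
{
    size_t i = 0;
    assert(n % 2 == 0);
#if HAVE_NEON
    for (; i + 32 <= n; i += 32) {
        /* One instruction loads 32 bytes and de-interleaves them. */
        uint8x16x2_t v = vld2q_u8(src + i);
        vst1q_u8(lo + i / 2, v.val[0]); /* even-indexed bytes */
        vst1q_u8(hi + i / 2, v.val[1]); /* odd-indexed bytes  */
    }
#endif
    /* Scalar tail (and full fallback when NEON is unavailable). */
    for (; i + 2 <= n; i += 2) {
        lo[i / 2] = src[i];
        hi[i / 2] = src[i + 1];
    }
}

/* Hypothetical inverse: re-interleave two planes into n bytes at dst. */
static void merge_planes(const uint8_t *lo, const uint8_t *hi, uint8_t *dst,
                         size_t n)
{
    size_t i = 0;
    assert(n % 2 == 0);
#if HAVE_NEON
    for (; i + 32 <= n; i += 32) {
        uint8x16x2_t v;
        v.val[0] = vld1q_u8(lo + i / 2);
        v.val[1] = vld1q_u8(hi + i / 2);
        /* One instruction interleaves and stores 32 bytes. */
        vst2q_u8(dst + i, v);
    }
#endif
    for (; i + 2 <= n; i += 2) {
        dst[i] = lo[i / 2];
        dst[i + 1] = hi[i / 2];
    }
}
```

With SSE-style code, the same split typically needs explicit shuffle or pack/unpack steps after a plain load. Here the de-interleaving is part of the load itself.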

scopedog commented 2 years ago

Thank you. I'm so unfamiliar with NEON intrinsics that I chose the easy way, but I will check the reference and the interleaved loads/stores. I will also include your https://github.com/animetosho/ParPar/blob/master/fast-gf-multiplication.md as a reference in the revised technical paper. Thank you again.

scopedog commented 2 years ago

I have replaced the regular loads and stores with the interleaved ones. Yes, they are very fast, and it was really fun to see the benchmark results. Thank you. Now it's time to redo the benchmark. I should be able to finish it by the end of next week if nothing comes up.