jserv closed this 2 years ago
OK. Thanks. I'll replace it with a new one.
Would recommend considering just using ARM intrinsics - it shouldn't be too hard, and if it helps, here's a reference with most of what you'll likely need.
It lets you take advantage of ARM-specific stuff like interleaved loads/stores (vld2q_u8/vst2q_u8), which saves you from the pack/unpack routine.
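For anyone following along, here's a rough sketch of what the interleaved loads/stores buy you. It is not the code from this repo: the helper names `split_bytes` and `merge_bytes` are made up for illustration, and the assumed layout is 16-bit words stored low byte first. `vld2q_u8` reads 32 bytes and hands back the even-indexed bytes in `val[0]` and the odd-indexed bytes in `val[1]`; `vst2q_u8` does the reverse. A scalar loop handles the tail, and also stands in for the whole thing on non-NEON builds.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: split n_words 16-bit words (low byte first in src)
 * into a plane of low bytes and a plane of high bytes. */
static void split_bytes(const uint8_t *src, uint8_t *lo, uint8_t *hi,
                        size_t n_words)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* One vld2q_u8 loads 32 bytes and de-interleaves them:
     * val[0] = bytes 0, 2, 4, ...; val[1] = bytes 1, 3, 5, ... */
    for (; i + 16 <= n_words; i += 16) {
        uint8x16x2_t v = vld2q_u8(src + 2 * i);
        vst1q_u8(lo + i, v.val[0]);
        vst1q_u8(hi + i, v.val[1]);
    }
#endif
    /* Scalar tail (and full fallback when NEON is unavailable). */
    for (; i < n_words; i++) {
        lo[i] = src[2 * i];
        hi[i] = src[2 * i + 1];
    }
}

/* Hypothetical helper: inverse of split_bytes. */
static void merge_bytes(const uint8_t *lo, const uint8_t *hi, uint8_t *dst,
                        size_t n_words)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* vst2q_u8 interleaves the two vectors while storing 32 bytes. */
    for (; i + 16 <= n_words; i += 16) {
        uint8x16x2_t v;
        v.val[0] = vld1q_u8(lo + i);
        v.val[1] = vld1q_u8(hi + i);
        vst2q_u8(dst + 2 * i, v);
    }
#endif
    for (; i < n_words; i++) {
        dst[2 * i] = lo[i];
        dst[2 * i + 1] = hi[i];
    }
}
```

In a real kernel you would probably keep `v.val[0]` and `v.val[1]` in registers and work on them directly rather than storing the planes back to memory; the separate stores here are only to make the data movement visible.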
Thank you. I'm so unfamiliar with NEON intrinsics that I took the easy way. I will check the reference and the interleaved load/store code, though. I will also include your https://github.com/animetosho/ParPar/blob/master/fast-gf-multiplication.md as a reference in the revised technical paper. Thank you again.
I have replaced the regular loads and stores with the interleaved ones. Yes, they are so fast, and it was really fun to see the benchmark results. Thank you. Now it's time to redo the benchmark. I should be able to finish it by the end of next week if nothing comes up.
Commit add344fe included SSE2NEON for the Arm port. However, that copy of the header file is outdated; you might instead take the latest SSE2NEON, which brings bug fixes and more SSE intrinsics.