easyaspi314 closed 5 years ago
Hi, easyaspi314. Thanks for your input, it helps indeed. The only thing that has to be changed is the _mm_set1_epi64x call: unfortunately, it is not available in, for example, the VS compiler (32-bit builds). If you change it to another "set" intrinsic, the _epi32 one with the corresponding arguments, I will merge your commit with great pleasure. Thanks in advance!
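For illustration, here is a minimal sketch of the kind of substitution requested above. The helper name `set1_epi64_compat` is hypothetical and not taken from the PR. It builds the same two-lane 64-bit constant from 32-bit halves with `_mm_set_epi32`, so it does not rely on `_mm_set1_epi64x`.

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Hypothetical helper (not from the PR): broadcast a 64-bit value to both
 * 64-bit lanes using only _mm_set_epi32. */
static inline __m128i set1_epi64_compat(int64_t v)
{
    const int lo = (int)(uint32_t)v;                   /* bits 0..31  */
    const int hi = (int)(uint32_t)((uint64_t)v >> 32); /* bits 32..63 */
    /* _mm_set_epi32 lists its arguments from the highest lane down. */
    return _mm_set_epi32(hi, lo, hi, lo);
}
```

For a compile-time constant such as a low-32-bit mask, this reduces to `_mm_set_epi32(0, -1, 0, -1)`.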
I also added a version that uses _mm_blend_epi16 for SSE4.1, which only requires a pxor instead of a movdqa.
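The thread does not show the code, so the following is only a sketch. It assumes the operation in question is clearing the upper 32 bits of each 64-bit lane. The helper name `zero_high32_epi64` is hypothetical. With SSE4.1, blending against a zeroed register needs no mask constant; without it, the function falls back to an AND with a mask.

```c
#include <emmintrin.h> /* SSE2 */
#ifdef __SSE4_1__
#include <smmintrin.h> /* SSE4.1 */
#endif

/* Hypothetical helper: keep the low 32 bits of each 64-bit lane and
 * zero the high 32 bits. */
static inline __m128i zero_high32_epi64(__m128i a)
{
#ifdef __SSE4_1__
    /* 0xCC = 0b11001100: 16-bit words 2, 3, 6 and 7 are taken from the
     * zero vector, the rest from a. Zeroing a register is a pxor. */
    return _mm_blend_epi16(a, _mm_setzero_si128(), 0xCC);
#else
    /* SSE2 fallback: AND with a mask constant that has to be loaded. */
    const __m128i mask = _mm_set_epi32(0, -1, 0, -1);
    return _mm_and_si128(a, mask);
#endif
}
```

The `__SSE4_1__` macro is defined by GCC and Clang when SSE4.1 is enabled; compilers that do not define it take the SSE2 path in this sketch.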
Finally, thanks for this useful commit!
vpaddlq_uN can be implemented as so:
and the other unsigned pairwise adds are the same.
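The original snippet is not preserved here, so what follows is a minimal SSE2 sketch of the unsigned pairwise widening add, not necessarily the code from the PR. The `_sketch` function names are hypothetical. The idea is to isolate the even elements with a mask, shift the odd elements down, and add at the wider element size.

```c
#include <emmintrin.h> /* SSE2 */

/* Sketch of vpaddlq_u32: r[i] = (uint64)a[2i] + (uint64)a[2i+1]. */
static inline __m128i paddlq_u32_sketch(__m128i a)
{
    const __m128i mask = _mm_set_epi32(0, -1, 0, -1);
    __m128i even = _mm_and_si128(a, mask); /* a[0], a[2] zero-extended */
    __m128i odd  = _mm_srli_epi64(a, 32);  /* a[1], a[3] zero-extended */
    return _mm_add_epi64(even, odd);
}

/* Same pattern one size down, for vpaddlq_u16:
 * r[i] = (uint32)a[2i] + (uint32)a[2i+1]. */
static inline __m128i paddlq_u16_sketch(__m128i a)
{
    const __m128i mask = _mm_set1_epi32(0x0000FFFF);
    __m128i even = _mm_and_si128(a, mask); /* even 16-bit elements */
    __m128i odd  = _mm_srli_epi32(a, 16);  /* odd 16-bit elements  */
    return _mm_add_epi32(even, odd);
}
```

Because both operands are zero-extended before the add, the sum cannot overflow the wider lane.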
vpaddlq_s32 can be implemented like so:
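The code that followed is missing from this copy of the thread. One possible SSE2 approach, offered as an assumption rather than as the PR's implementation, is to flip the sign bit of every element so the values become biased unsigned numbers, do the unsigned pairwise add, and then subtract the combined bias of 2^32. The function name `paddlq_s32_sketch` is hypothetical.

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Sketch of vpaddlq_s32: r[i] = (int64)a[2i] + (int64)a[2i+1]. */
static inline __m128i paddlq_s32_sketch(__m128i a)
{
    const __m128i sign = _mm_set1_epi32(INT32_MIN);   /* 0x80000000 per element */
    const __m128i mask = _mm_set_epi32(0, -1, 0, -1); /* low 32 bits of each lane */
    const __m128i bias = _mm_set_epi32(1, 0, 1, 0);   /* 2^32 in each 64-bit lane */

    /* x ^ 0x80000000 equals x + 2^31 when read as an unsigned value. */
    __m128i u    = _mm_xor_si128(a, sign);
    __m128i even = _mm_and_si128(u, mask);
    __m128i odd  = _mm_srli_epi64(u, 32);
    /* (x0 + 2^31) + (x1 + 2^31) = x0 + x1 + 2^32 */
    __m128i sum  = _mm_add_epi64(even, odd);
    return _mm_sub_epi64(sum, bias);
}
```

The bias trick avoids a 64-bit arithmetic right shift, which SSE2 does not provide.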
And _mm_mullo_epi32 uses the same routine that GCC uses with vector extensions (Clang uses a similar method, but it uses pshufd which is slow on pre-Penryn chips):
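The routine itself is not preserved here. As a hedged sketch of an SSE2 fallback in the same spirit, the version below reaches the odd lanes with a 64-bit right shift (psrlq) rather than pshufd and multiplies with `_mm_mul_epu32` (pmuludq). The recombination step with and/shift/or is a choice made for this sketch and may differ from both GCC's output and the PR's code. The function name is hypothetical.

```c
#include <emmintrin.h> /* SSE2 */

/* Sketch of a 32-bit low multiply without SSE4.1:
 * r[i] = low 32 bits of a[i] * b[i]. */
static inline __m128i mullo_epi32_sse2_sketch(__m128i a, __m128i b)
{
    const __m128i mask = _mm_set_epi32(0, -1, 0, -1);

    /* pmuludq multiplies the low 32 bits of each 64-bit lane,
     * i.e. elements 0 and 2, giving two 64-bit products. */
    __m128i even = _mm_mul_epu32(a, b);
    /* Shift elements 1 and 3 down into the low halves, then multiply. */
    __m128i odd  = _mm_mul_epu32(_mm_srli_epi64(a, 32),
                                 _mm_srli_epi64(b, 32));

    even = _mm_and_si128(even, mask); /* keep low 32 bits of products 0, 2 */
    odd  = _mm_slli_epi64(odd, 32);   /* move low 32 bits of products 1, 3 up */
    return _mm_or_si128(even, odd);
}
```

The low 32 bits of a product are the same for signed and unsigned operands, so the unsigned multiply is sufficient here.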