peabody-korg closed this issue 7 years ago
The incorrect shift appears to have been fixed in a7f191d5. However, using _mm_movelh_ps() might still make for more efficient code.
Thanks, this improvement has been applied in 227cc0a8e79ceab4eda89126ca1e98b3ddc82c85.
The operation
r2 = move4_l<2>(r2);
shifts the wrong way and leaves r2 full of zeros, so the upper 2 lanes of the result end up filled with zeros.
Furthermore, it looks like merging the two intermediate vectors with _mm_movelh_ps() might be more efficient than shifting and OR-ing. The compiler's optimizer might produce that anyway, but _mm_movelh_ps() seems more straightforward.
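To make the comparison concrete, here is a minimal sketch using raw SSE2 intrinsics on an x86 target. It is not the library's actual code: the function names are hypothetical, and it assumes that each intermediate vector holds its useful data in lanes 0 and 1 with zeros in lanes 2 and 3, and that move4_l<2> amounts to a byte shift toward lane 0.

```cpp
#include <emmintrin.h>  // SSE2; also provides the SSE float intrinsics
#include <array>

// Hypothetical helper: copy a vector into an array for inspection.
inline std::array<float, 4> to_array(__m128 v) {
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(), v);
    return out;
}

// Assumed model of the bug: shifting b toward lane 0 by two lanes
// discards b0 and b1, so nothing useful is left to OR into a.
inline __m128 merge_wrong_shift(__m128 a, __m128 b) {
    __m128i shifted = _mm_srli_si128(_mm_castps_si128(b), 8);
    return _mm_or_ps(a, _mm_castsi128_ps(shifted));
}

// Shift-and-OR with the shift going the other way: b0 and b1 move
// into lanes 2 and 3. Only correct if lanes 2 and 3 of a are zero.
inline __m128 merge_shift_or(__m128 a, __m128 b) {
    __m128i shifted = _mm_slli_si128(_mm_castps_si128(b), 8);
    return _mm_or_ps(a, _mm_castsi128_ps(shifted));
}

// Single-instruction alternative: result is {a0, a1, b0, b1},
// regardless of what lanes 2 and 3 of either input contain.
inline __m128 merge_movelh(__m128 a, __m128 b) {
    return _mm_movelh_ps(a, b);
}
```

With a = {1, 2, 0, 0} and b = {3, 4, 0, 0}, the wrong-direction shift yields {1, 2, 0, 0}, while the other two functions both yield {1, 2, 3, 4}. The _mm_movelh_ps() version also drops the requirement that the upper lanes of a be zero.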