The intrinsic blend functions
__m256i _mm256_blend_epi16(__m256i a, __m256i b, const int imm8)
__m256i _mm256_blend_epi32(__m256i a, __m256i b, const int imm8)
have different performance characteristics. Among them, _mm256_blend_epi32() is the fastest, but its mask has to be encoded into a const int imm8 at compile time. If I understand correctly, that hinders its use in the blend implementation of the current libsimdpp (see also https://github.com/p12tic/libsimdpp/issues/56).
For masks that are already known at compile time, I think it would be good to represent them in a new way. For instance, the blend mask could be represented as a tuple from the boost::hana library.
The immediate mask for _mm256_blend_epi32() could then be computed at compile time. I made a proof-of-concept implementation of this at
https://github.com/eriksjolund/compile-time-simd-blend-mask
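
To illustrate the idea, here is a minimal sketch of how the imm8 value could be derived from a mask that lives in the type system. It is not the code from the repository above, and it uses a plain template parameter pack of bools instead of a boost::hana tuple to stay dependency-free. The names blend_mask, blend_imm8 and blend_epi32 are hypothetical and are not part of libsimdpp. The only fact it relies on is that bit i of imm8 selects 32-bit element i from b when set, and from a otherwise.

```cpp
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical compile-time mask type (not a libsimdpp type).
// Lanes[i] == true means "take 32-bit element i from b", false means "from a".
template <bool... Lanes>
struct blend_mask {
    static_assert(sizeof...(Lanes) == 8,
                  "_mm256_blend_epi32 uses one mask bit per 32-bit lane (8 lanes)");
};

// Hypothetical helper: packs the lane flags into the imm8 bit pattern,
// setting bit i when Lanes[i] is true. Usable in constant expressions.
template <bool... Lanes>
constexpr int blend_imm8(blend_mask<Lanes...>) {
    const bool lanes[] = {Lanes...};
    int imm = 0;
    for (unsigned i = 0; i < sizeof...(Lanes); ++i) {
        if (lanes[i]) {
            imm |= 1 << i;
        }
    }
    return imm;
}

#if defined(__AVX2__)
// Hypothetical wrapper: the mask travels as a type, so the immediate is a
// compile-time constant by the time it reaches the intrinsic.
template <bool... Lanes>
inline __m256i blend_epi32(__m256i a, __m256i b, blend_mask<Lanes...>) {
    constexpr int imm = blend_imm8(blend_mask<Lanes...>{});
    return _mm256_blend_epi32(a, b, imm);
}
#endif
```

With something like this, a call such as blend_epi32(a, b, blend_mask<true, false, true, false, true, false, true, false>{}) would pass 0x55 as the immediate, taking the even-numbered elements from b and the odd-numbered ones from a. A hana tuple of integral constants could presumably be folded into the same value in a similar way.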