Open velvia opened 4 years ago
Currently, shuffle1_dyn on u32x8 is not optimized: the fallback path is used, which results in a whole mess of extract intrinsics. It is not very fast.
Can we please add support for _mm256_permutevar8x32_epi32 and similar variants at the u32x8 (and f32x8, etc.) levels? It is a fairly large speedup.
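For illustration, here is a minimal sketch of what such a specialization might look like on plain arrays. It is not the library's implementation: the function names (shuffle_fallback, shuffle_avx2, shuffle_dyn_u32x8) are hypothetical, and masking each index to its low 3 bits in the fallback is an assumption made so that both paths agree (the AVX2 instruction itself only reads the low 3 bits of each index lane).

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference path (hypothetical name): out[i] = src[idx[i] & 7].
/// Masking to 3 bits is an assumption chosen to match the AVX2 instruction.
fn shuffle_fallback(src: [u32; 8], idx: [u32; 8]) -> [u32; 8] {
    let mut out = [0u32; 8];
    for i in 0..8 {
        out[i] = src[(idx[i] & 7) as usize];
    }
    out
}

/// AVX2 path (hypothetical name): a single cross-lane permute (vpermd).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn shuffle_avx2(src: [u32; 8], idx: [u32; 8]) -> [u32; 8] {
    // Unaligned loads, since plain arrays are not guaranteed 32-byte aligned.
    let a = _mm256_loadu_si256(src.as_ptr() as *const __m256i);
    let i = _mm256_loadu_si256(idx.as_ptr() as *const __m256i);
    // Each output lane k receives a[i[k] & 7].
    let r = _mm256_permutevar8x32_epi32(a, i);
    let mut out = [0u32; 8];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, r);
    out
}

/// Dispatcher (hypothetical name): use AVX2 when the CPU reports it,
/// otherwise fall back to the scalar loop.
pub fn shuffle_dyn_u32x8(src: [u32; 8], idx: [u32; 8]) -> [u32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe to call: the avx2 feature was just detected at runtime.
            return unsafe { shuffle_avx2(src, idx) };
        }
    }
    shuffle_fallback(src, idx)
}
```

Calling shuffle_dyn_u32x8 with indices [7, 6, 5, 4, 3, 2, 1, 0] reverses the eight lanes. A real implementation inside the library would presumably select the intrinsic at compile time via target features rather than through runtime detection.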
Thanks
Wondering about this as well (it's 30x slower than what it should be, without warning the user).
(should this be posted to stdsimd repo?)
Yes, all development has moved there.