Closed yb303 closed 3 years ago
This would require significant code changes, especially for C. For example, we currently have:
__m128i libdivide_s32_do_vector(__m128i, const struct libdivide_s32_t *)
__m128i libdivide_u32_do_vector(__m128i, const struct libdivide_u32_t *)
__m128i libdivide_s64_do_vector(__m128i, const struct libdivide_s64_t *)
__m128i libdivide_u64_do_vector(__m128i, const struct libdivide_u64_t *)
If we wanted to support SSE2, AVX2 and AVX512 at the same time, we would need to change the function names:
__m128i libdivide_s32_do_vector_sse2(__m128i, const struct libdivide_s32_t *)
__m128i libdivide_u32_do_vector_sse2(__m128i, const struct libdivide_u32_t *)
__m128i libdivide_s64_do_vector_sse2(__m128i, const struct libdivide_s64_t *)
__m128i libdivide_u64_do_vector_sse2(__m128i, const struct libdivide_u64_t *)
__m256i libdivide_s32_do_vector_avx2(__m256i, const struct libdivide_s32_t *)
__m256i libdivide_u32_do_vector_avx2(__m256i, const struct libdivide_u32_t *)
__m256i libdivide_s64_do_vector_avx2(__m256i, const struct libdivide_s64_t *)
__m256i libdivide_u64_do_vector_avx2(__m256i, const struct libdivide_u64_t *)
__m512i libdivide_s32_do_vector_avx512(__m512i, const struct libdivide_s32_t *)
__m512i libdivide_u32_do_vector_avx512(__m512i, const struct libdivide_u32_t *)
__m512i libdivide_s64_do_vector_avx512(__m512i, const struct libdivide_s64_t *)
__m512i libdivide_u64_do_vector_avx512(__m512i, const struct libdivide_u64_t *)
I consider this solution less elegant. The old SSE2 code was unmaintained for many years and, as far as I know, nobody used it. I ported the SSE2 code to AVX2 and AVX512 just a few days ago. Personally, I would like to wait and get more feedback from users on how they use the new vector code. If many users request this feature, I will consider implementing it.
I decided to fix this. Now vector functions are tagged with the width:
libdivide_s32_do_vec128
libdivide_s64_do_vec128
libdivide_u32_do_vec128
libdivide_u64_do_vec128
libdivide_s32_do_vec256
libdivide_s64_do_vec256
libdivide_u32_do_vec256
libdivide_u64_do_vec256
libdivide_s32_do_vec512
libdivide_s64_do_vec512
libdivide_u32_do_vec512
libdivide_u64_do_vec512
libdivide_s32_branchfree_do_vec128
libdivide_s64_branchfree_do_vec128
libdivide_u32_branchfree_do_vec128
libdivide_u64_branchfree_do_vec128
libdivide_s32_branchfree_do_vec256
libdivide_s64_branchfree_do_vec256
libdivide_u32_branchfree_do_vec256
libdivide_u64_branchfree_do_vec256
libdivide_s32_branchfree_do_vec512
libdivide_s64_branchfree_do_vec512
libdivide_u32_branchfree_do_vec512
libdivide_u64_branchfree_do_vec512
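To illustrate why a width tag in the name matters in C, which cannot overload a function on its vector type, here is a minimal portable sketch. It does not use libdivide or any SIMD intrinsics. Every `demo_` name (`demo_vec128`, `demo_vec256`, `demo_s32_t`, `demo_s32_gen`, `demo_s32_do_vec128`, `demo_s32_do_vec256`, `demo_mixed_widths`) is a hypothetical stand-in, and the lanes are divided with plain `/` rather than with libdivide's actual algorithm.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for __m128i / __m256i holding 32-bit lanes. */
typedef struct { int32_t lane[4]; } demo_vec128;
typedef struct { int32_t lane[8]; } demo_vec256;

/* Hypothetical divider; it only stores the divisor itself. */
struct demo_s32_t { int32_t d; };

static struct demo_s32_t demo_s32_gen(int32_t d) {
    struct demo_s32_t div = { d };
    assert(d != 0); /* division by zero is undefined */
    return div;
}

/* 128-bit variant: the width is part of the function name. */
static demo_vec128 demo_s32_do_vec128(demo_vec128 n, const struct demo_s32_t *div) {
    for (int i = 0; i < 4; i++) n.lane[i] /= div->d;
    return n;
}

/* 256-bit variant: same divider type, different name, so both can coexist. */
static demo_vec256 demo_s32_do_vec256(demo_vec256 n, const struct demo_s32_t *div) {
    for (int i = 0; i < 8; i++) n.lane[i] /= div->d;
    return n;
}

/* One divider used with both widths in the same translation unit. */
static int32_t demo_mixed_widths(void) {
    struct demo_s32_t by7 = demo_s32_gen(7);
    demo_vec128 a = {{70, -14, 6, 49}};
    demo_vec256 b = {{7, 14, 21, 28, 35, 42, 49, 56}};
    a = demo_s32_do_vec128(a, &by7);
    b = demo_s32_do_vec256(b, &by7);
    return a.lane[0] + b.lane[7]; /* 70/7 + 56/7 */
}
```

With a single untagged name, only one of the two variants could exist in a given build. With tagged names, code written for the 128-bit variant keeps compiling unchanged when wider variants become available.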
Even with AVX512 support, users may still want to use int128. I think that, in the current state, moving older code to a newer box would force one to widen their ints.