simd-everywhere / simde

Implementations of SIMD instruction sets for systems which don't natively support them.
https://simd-everywhere.github.io/blog/
MIT License

Improve performance of simde_mm512_add_epi32 #1126

AymenQ closed 8 months ago

AymenQ commented 8 months ago

Improve and simplify implementation of simde_mm512_add_epi32 as follows:

  1. Remove the explicit SVE implementation. For SVE vector lengths of 128 and 256 bits, this explicit vector-length-agnostic (VLA) SVE loop performs significantly worse than the Neon equivalent, which needs fewer instructions. Clang also rejects this sequence of SVE intrinsics as malformed, so it fails to compile altogether.

  2. Preferentially use GCC's vector extension if available, instead of repeated calls to simde_mm256_add_epi32. There are a couple of reasons for this:

    1. The added indirection results in worse code generation. See the code generation attached to the commit message for an example with GCC 13.

    2. GCC's vector extension is an easier optimization target for compilers: they can choose the instruction sequence according to their own internal cost and tuning models. See the snippets attached to the commit message for an example of improved code generation in a vector-length-specific (VLS) context.

This brings the implementation of simde_mm512_add_epi32 back in line with other similar AVX512 intrinsics, such as simde_mm512_sub_epi32 and simde_mm512_mul_ps.
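For readers who want to see the difference between the two strategies in item 2, here is a minimal sketch. It is not SIMDe's actual code: the `vec512` union and the three function names are hypothetical stand-ins, and strategy A only models the "two 256-bit adds" shape rather than calling `simde_mm256_add_epi32`. It assumes a compiler that supports the GCC vector extension (GCC or Clang).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* GCC/Clang vector extension types: 8 and 16 lanes of int32_t. */
typedef int32_t v8i32  __attribute__((vector_size(32)));
typedef int32_t v16i32 __attribute__((vector_size(64)));

/* Hypothetical stand-in for a portable 512-bit integer vector type. */
typedef union {
  v16i32  full;     /* whole 512-bit vector   */
  v8i32   half[2];  /* two 256-bit halves     */
  int32_t lane[16]; /* scalar view of 16 lanes */
} vec512;

/* Strategy A: add the two 256-bit halves separately. */
void add512_by_halves(const vec512 *a, const vec512 *b, vec512 *r) {
  for (int i = 0; i < 2; i++) {
    r->half[i] = a->half[i] + b->half[i];
  }
}

/* Strategy B: a single expression on the full-width vector, leaving
   the choice of instructions entirely to the compiler. */
void add512_full_width(const vec512 *a, const vec512 *b, vec512 *r) {
  r->full = a->full + b->full;
}

/* Returns 1 if both strategies produce identical lanes. */
int add512_strategies_agree(void) {
  vec512 a, b, r1, r2;
  assert(sizeof(vec512) == 64); /* layout assumption: 512 bits */
  for (int i = 0; i < 16; i++) {
    a.lane[i] = i;
    b.lane[i] = 100 * i;
  }
  add512_by_halves(&a, &b, &r1);
  add512_full_width(&a, &b, &r2);
  return memcmp(&r1, &r2, sizeof r1) == 0 && r1.lane[15] == 1515;
}
```

Both functions compute the same result. The difference shows up only in the generated assembly, which can be compared by compiling each function with the target flags of interest.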

Fixes #980.

mr-c commented 8 months ago

@AymenQ Thank you for your PR; I'm debugging the clang-16 issue over at https://github.com/simd-everywhere/simde/pull/1127