Improve performance of simde_mm512_add_epi32

Improve and simplify implementation of simde_mm512_add_epi32 as follows:

- Remove the explicit SVE implementation. For SVE vector lengths of VL={128, 256}, this explicit vector-length-agnostic (VLA) SVE loop performs significantly worse than the Neon equivalent, which can be executed using fewer instructions. Clang also rejects this sequence of SVE intrinsics as malformed, so it fails to compile altogether.
- Prefer GCC's vector extension when available, instead of repeated calls to simde_mm256_add_epi32. There are two reasons for this:
  1. The added indirection results in worse code generation. See the code generation attached to the commit message for an example with GCC 13.
  2. GCC's vector extension is an easier optimization target for compilers, allowing them to emit performant code according to their own internal cost and tuning models. See the snippets attached to the commit message for an example of improved code generation in a vector-length-specific (VLS) context.
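For readers unfamiliar with the vector-extension approach, here is a minimal standalone sketch of the idea. It is not the SIMDe source: the names `demo_m512i`, `v16u32`, `demo_add_epi32_vecext`, and `demo_add_epi32_scalar` are hypothetical and exist only for this illustration. It assumes a compiler that supports `__attribute__((vector_size))` (GCC or clang) and uses unsigned lanes so that the addition wraps instead of overflowing.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 512-bit integer register; not SIMDe's real type. */
typedef struct { int32_t lanes[16]; } demo_m512i;

/* Sixteen 32-bit lanes as one GCC/clang vector-extension type (64 bytes).
 * Unsigned lanes are used so that lane-wise addition wraps around. */
typedef uint32_t v16u32 __attribute__((vector_size(64)));

/* A single whole-vector expression: the compiler decides how to lower it
 * (for example, to several 128-bit adds or to wider instructions). */
static demo_m512i demo_add_epi32_vecext(demo_m512i a, demo_m512i b) {
  v16u32 va, vb, vr;
  demo_m512i r;
  memcpy(&va, &a, sizeof va);
  memcpy(&vb, &b, sizeof vb);
  vr = va + vb; /* element-wise add over all sixteen lanes */
  memcpy(&r, &vr, sizeof r);
  return r;
}

/* Scalar reference used only to cross-check the sketch above. */
static demo_m512i demo_add_epi32_scalar(demo_m512i a, demo_m512i b) {
  demo_m512i r;
  for (int i = 0; i < 16; i++) {
    r.lanes[i] = (int32_t)((uint32_t)a.lanes[i] + (uint32_t)b.lanes[i]);
  }
  return r;
}
```

Because the addition is expressed as one operation on the full 512-bit value rather than as two calls on 256-bit halves, the compiler sees the whole computation at once and can choose the lowering that suits the target.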

This brings the implementation of simde_mm512_add_epi32 back in line with other similar AVX512 intrinsics, such as simde_mm512_sub_epi32 and simde_mm512_mul_ps.

Fixes #980.

Improve performance of simde_mm512_add_epi32 #1126