Share code between implementations

One thing about SIMD instruction sets is that each intrinsic often has a direct equivalent on the other platforms. Basically everything has add, sub, shift right, etc.

Therefore, there could be a "common" folder containing the more common intrinsics. Then, we can just reuse this in the platform-specific intrinsic polyfills and avoid any copy-paste errors/missed optimizations.
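To make the idea concrete, here is a minimal sketch of two platform polyfills forwarding to one shared function. Everything in it is hypothetical: the `demo_` names, the struct standing in for a 128-bit vector, and the file names in the comments are only for illustration and are not existing SIMDe APIs.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for a 128-bit vector of four int32_t lanes. */
typedef struct { int32_t values[4]; } demo_int32x4_t;

/* common/add.h (hypothetical): the single shared implementation. */
static inline demo_int32x4_t
demo_common_add_s32x4(demo_int32x4_t a, demo_int32x4_t b) {
  demo_int32x4_t r;
  for (size_t i = 0; i < 4; i++) {
    /* Add as unsigned so lane overflow is not undefined behavior;
       the cast back wraps on mainstream compilers. */
    r.values[i] = (int32_t)((uint32_t)a.values[i] + (uint32_t)b.values[i]);
  }
  return r;
}

/* x86 side (hypothetical): a polyfill in the role of _mm_add_epi32. */
static inline demo_int32x4_t
demo_mm_add_epi32(demo_int32x4_t a, demo_int32x4_t b) {
  return demo_common_add_s32x4(a, b);
}

/* Arm side (hypothetical): a polyfill in the role of vaddq_s32. */
static inline demo_int32x4_t
demo_vaddq_s32(demo_int32x4_t a, demo_int32x4_t b) {
  return demo_common_add_s32x4(a, b);
}
```

Any native fast path or bug fix added to the common function then reaches both polyfills at once.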

I would just use an extension of NEON types since NEON has the strongest type system.

This also lets us elegantly handle differing native vector sizes by divide and conquer:

If a native vector size matches the intrinsic's vector size, use the native intrinsics.
If a native vector size is smaller than the intrinsic's vector size (or we are scalar-only), split in half and use the next size down. This already seems to be done in AVX.
If the minimum native vector size is larger than the intrinsic's vector size, widen, use the native intrinsic size, then narrow. This only seems to be necessary for 64-bit vectors. See #1025 for some research on this logic.

// Basic element specific scalar code
SIMDE_FUNCTION_ATTRIBUTES
int32_t simde_add_s32(int32_t a, int32_t b) {
  return a + b;
}
/* forward declare */
SIMDE_FUNCTION_ATTRIBUTES
simde_int32x4_t simde_add_s32x4(simde_int32x4_t a, simde_int32x4_t b);

SIMDE_FUNCTION_ATTRIBUTES
simde_int32x2_t simde_add_s32x2(simde_int32x2_t a, simde_int32x2_t b) {
  #if SIMDE_MIN_VECTOR_SIZE_GE(128)
    // smallest native vector is 128-bit: widen, add, narrow (see #1025)
    return simde_fast_narrow_s32x4(
      simde_add_s32x4(
        simde_fast_widen_s32x2(a),
        simde_fast_widen_s32x2(b)
      )
    );
  #else
    simde_int32x2_private a_ = simde_int32x2_to_private(a);
    simde_int32x2_private b_ = simde_int32x2_to_private(b);
    simde_int32x2_private r_;
    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
      r_.neon_i32 = vadd_s32(a_.neon_i32, b_.neon_i32);
    #else 
      SIMDE_VECTORIZE
      for (size_t i = 0; i < sizeof(a_.values) / sizeof(a_.values[0]); i++) {
        r_.values[i] = simde_add_s32(a_.values[i], b_.values[i]);
      }
    #endif 
    return simde_int32x2_from_private(r_);
  #endif
}

SIMDE_FUNCTION_ATTRIBUTES
simde_int32x4_t simde_add_s32x4(simde_int32x4_t a, simde_int32x4_t b) {
  simde_int32x4_private a_ = simde_int32x4_to_private(a);
  simde_int32x4_private b_ = simde_int32x4_to_private(b);
  simde_int32x4_private r_;

  #if SIMDE_MIN_VECTOR_SIZE_LT(128)
    // native vectors are narrower than 128 bits (or scalar-only): split in half
    r_.s32x2[0] = simde_add_s32x2(a_.s32x2[0], b_.s32x2[0]);
    r_.s32x2[1] = simde_add_s32x2(a_.s32x2[1], b_.s32x2[1]);
  #else
    // all the 128-bit vector stuff here
  #endif
  return simde_int32x4_from_private(r_);
}

// repeat for s32x8, s32x16
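For anyone who wants to try the widen/split logic outside of SIMDe, here is a self-contained sketch that assumes a target whose only native size is 128 bits. The `demo_` types and helpers are hypothetical stand-ins, and the plain loop stands in for whatever the real 128-bit intrinsic would be.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical vector types; a real implementation would wrap native types. */
typedef struct { int32_t values[2]; } demo_s32x2;
typedef struct { int32_t values[4]; } demo_s32x4;
typedef struct { demo_s32x4 half[2]; } demo_s32x8;

/* Matches the native size: stand-in for the real 128-bit add. */
static inline demo_s32x4 demo_add_s32x4(demo_s32x4 a, demo_s32x4 b) {
  demo_s32x4 r;
  for (size_t i = 0; i < 4; i++) {
    /* Unsigned add so lane overflow is not undefined behavior. */
    r.values[i] = (int32_t)((uint32_t)a.values[i] + (uint32_t)b.values[i]);
  }
  return r;
}

/* Put the two lanes in the low half; the upper lanes are zeroed here for
   determinism, a real fast widen could leave them unspecified. */
static inline demo_s32x4 demo_widen_s32x2(demo_s32x2 a) {
  demo_s32x4 r = { { a.values[0], a.values[1], 0, 0 } };
  return r;
}

/* Keep only the low two lanes. */
static inline demo_s32x2 demo_narrow_s32x4(demo_s32x4 a) {
  demo_s32x2 r = { { a.values[0], a.values[1] } };
  return r;
}

/* Smaller than native: widen, use the native size, then narrow. */
static inline demo_s32x2 demo_add_s32x2(demo_s32x2 a, demo_s32x2 b) {
  return demo_narrow_s32x4(
    demo_add_s32x4(demo_widen_s32x2(a), demo_widen_s32x2(b)));
}

/* Larger than native: split in half and use the next size down. */
static inline demo_s32x8 demo_add_s32x8(demo_s32x8 a, demo_s32x8 b) {
  demo_s32x8 r;
  r.half[0] = demo_add_s32x4(a.half[0], b.half[0]);
  r.half[1] = demo_add_s32x4(a.half[1], b.half[1]);
  return r;
}
```

Only `demo_add_s32x4` would ever need a per-platform body in this arrangement; the 64-bit and 256-bit versions are derived from it.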

I am mostly proposing this because removing MMX will need a massive rewrite anyways, so if any large changes are to be made it would be the best time to do it, and we might as well try to reap the benefits of widening 64-bit vectors on all 128-bit only platforms.

Obviously this can be a gradual change.

@easyaspi314 Hi, we're looking into implementing Helium/MVE (ARM v8.1-M M-profile Vector Extension), at least parts of it at first and eventually the whole extension. Your idea looks really helpful here, since a huge part of this extension overlaps with the ARMv8 A-profile NEON intrinsics, so it could share both the implementation and the tests with the NEON implementation. Would you be interested in providing guidance and review for this part of the task? My initial idea was to add an ACLE (Arm C Language Extensions) directory for the common NEON and Helium parts, but seeing that there is more potential for sharing code with other ISAs, it could be done that way instead. My only concern about this approach is how the unit tests would look if we went this route. It's pretty important for us to be able to test all code paths no matter the architecture. Our main goal is to share code between ARMv8-A/ARMv8-M and x86-64 to the degree possible, but we're aware our code base might be used on RISC-V/ARMv9 (SVE) too.

simd-everywhere / simde

Share code between implementations #1051