After a recent spate of commits, these examples are no longer scalarized, but they still all generate very different code from one another. PTAL at the latest lowerings and let me know if they seem reasonable to you.
Extended Description
With -msimd128 -O3, I would expect a __builtin_shufflevector that returns half the elements, followed by a __builtin_convertvector that extends each element (so the result is a 128-bit vector), to generate an i8x16.shuffle and an extend_low. Instead, it generates a bunch of extract_lane and replace_lane instructions.
Here are a couple of quick examples (Compiler Explorer: https://godbolt.org/z/EjbMqPhx1):
#include <wasm_simd128.h>

#pragma clang diagnostic ignored "-Wmissing-prototypes"

typedef int8_t i8x16 __attribute__((__vector_size__(16)));
typedef int16_t i16x8 __attribute__((__vector_size__(16)));
typedef int32_t i32x4 __attribute__((__vector_size__(16)));
typedef uint8_t u8x16 __attribute__((__vector_size__(16)));
typedef uint16_t u16x8 __attribute__((__vector_size__(16)));
typedef uint32_t u32x4 __attribute__((__vector_size__(16)));

i16x8 foo(i8x16 a) {
  return __builtin_convertvector(
    __builtin_shufflevector(a, a, 0, 2, 4, 6, 8, 10, 12, 14),
    i16x8
  );
}

v128_t foo_intrin(v128_t a) {
  return wasm_i16x8_extend_low_i8x16(
    wasm_i8x16_shuffle(a, a, 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15)
  );
}
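
i16x8 bar(i8x16 a) {
  /* The second operand is truncated in the report; the odd-lane form below is
     inferred from bar_intrin, which subtracts the odd lanes from the even ones. */
  return
    __builtin_convertvector(
      __builtin_shufflevector(a, a, 0, 2, 4, 6, 8, 10, 12, 14),
      i16x8
    )
    - __builtin_convertvector(
      __builtin_shufflevector(a, a, 1, 3, 5, 7, 9, 11, 13, 15),
      i16x8
    );
}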
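
i16x8 bar_intrin(v128_t a) {
  v128_t shuffled = wasm_i8x16_shuffle(
    a, a, 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15
  );
  /* v128_t has i32 lanes in wasm_simd128.h, so a plain `-` on two v128_t
     values would subtract as i32x4. Use the i16x8 subtraction explicitly. */
  return (i16x8)wasm_i16x8_sub(
    wasm_i16x8_extend_low_i8x16(shuffled),
    wasm_i16x8_extend_high_i8x16(shuffled)
  );
}
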
I think it's pretty reasonable to expect that foo and foo_intrin should generate roughly the same code (the upper half of the shuffle doesn't matter, so maybe all zeros or something).
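To make the "upper half doesn't matter" point concrete, here is a rough scalar model of the two functions in plain C. The names foo_ref and foo_intrin_ref are hypothetical, made up for this sketch, and it only models the lane semantics, not the codegen:

```c
#include <stdint.h>

/* Hypothetical scalar model of foo: take the even-indexed i8 lanes
   and sign-extend each one to i16. */
void foo_ref(const int8_t a[16], int16_t out[8]) {
    for (int i = 0; i < 8; ++i) {
        out[i] = (int16_t)a[2 * i];
    }
}

/* Hypothetical scalar model of foo_intrin: a full 16-lane shuffle
   (even lanes first, then odd lanes), followed by a sign-extension
   of the low 8 lanes only. */
void foo_intrin_ref(const int8_t a[16], int16_t out[8]) {
    static const int idx[16] = {0, 2, 4, 6, 8, 10, 12, 14,
                                1, 3, 5, 7, 9, 11, 13, 15};
    int8_t shuffled[16];
    for (int i = 0; i < 16; ++i) {
        shuffled[i] = a[idx[i]];
    }
    /* Lanes 8..15 of `shuffled` are never read here. */
    for (int i = 0; i < 8; ++i) {
        out[i] = (int16_t)shuffled[i];
    }
}
```

Since only lanes 0..7 of the shuffle result are ever read, whatever ends up in lanes 8..15 cannot affect the output, which is why any value there should be acceptable.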
I'd be very impressed, OTOH, if bar and bar_intrin generated the same code. I'm not sure how feasible that is, though.
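For what it's worth, the equivalence the compiler would have to discover for bar can be sketched with a scalar model too. This assumes bar is meant to compute "even lanes minus odd lanes", which is what bar_intrin does; bar_ref and bar_intrin_ref are hypothetical names for this sketch only:

```c
#include <stdint.h>

/* Hypothetical scalar model of bar: two separate half-width shuffles
   (even lanes, odd lanes), each sign-extended to i16, then subtracted. */
void bar_ref(const int8_t a[16], int16_t out[8]) {
    for (int i = 0; i < 8; ++i) {
        int16_t even = (int16_t)a[2 * i];
        int16_t odd = (int16_t)a[2 * i + 1];
        out[i] = (int16_t)(even - odd);
    }
}

/* Hypothetical scalar model of bar_intrin: one full shuffle that puts the
   even lanes in the low half and the odd lanes in the high half, then
   extend_low minus extend_high. */
void bar_intrin_ref(const int8_t a[16], int16_t out[8]) {
    static const int idx[16] = {0, 2, 4, 6, 8, 10, 12, 14,
                                1, 3, 5, 7, 9, 11, 13, 15};
    int8_t shuffled[16];
    for (int i = 0; i < 16; ++i) {
        shuffled[i] = a[idx[i]];
    }
    for (int i = 0; i < 8; ++i) {
        int16_t lo = (int16_t)shuffled[i];
        int16_t hi = (int16_t)shuffled[8 + i];
        out[i] = (int16_t)(lo - hi);
    }
}
```

Getting from the first form to the second means noticing that the two half-width shuffles can be merged into a single 16-lane shuffle whose low and high halves feed extend_low and extend_high respectively.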