Bugzilla Link	50807
Version	trunk
OS	Windows NT
Reporter	LLVM Bugzilla Contributor
CC	@tlively

Extended Description

With -msimd128 -O3, I would expect a __builtin_shufflevector which returns half the elements plus a __builtin_convertvector to extend each element (resulting in a 128-bit vector) to generate a v128.shuffle and an extend_low. Instead, it generates a bunch of extract_lane and replace_lane instructions.

Here are a couple of quick examples (Compiler Explorer: https://godbolt.org/z/EjbMqPhx1):

    #include <wasm_simd128.h>

    #pragma clang diagnostic ignored "-Wmissing-prototypes"

    typedef int8_t   i8x16 __attribute__((__vector_size__(16)));
    typedef int16_t  i16x8 __attribute__((__vector_size__(16)));
    typedef int32_t  i32x4 __attribute__((__vector_size__(16)));
    typedef uint8_t  u8x16 __attribute__((__vector_size__(16)));
    typedef uint16_t u16x8 __attribute__((__vector_size__(16)));
    typedef uint32_t u32x4 __attribute__((__vector_size__(16)));

    /* Even lanes of a, sign-extended to 16 bits. */
    i16x8 foo(i8x16 a) {
      return __builtin_convertvector(
        __builtin_shufflevector(a, a, 0, 2, 4, 6, 8, 10, 12, 14),
        i16x8
      );
    }

    /* The same operation written with the WebAssembly intrinsics. */
    v128_t foo_intrin(v128_t a) {
      return wasm_i16x8_extend_low_i8x16(
        wasm_i8x16_shuffle(a, a, 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15)
      );
    }

    /* Even lanes minus odd lanes, each sign-extended to 16 bits. */
    i16x8 bar(i8x16 a) {
      return
        __builtin_convertvector(
          __builtin_shufflevector(a, a, 0, 2, 4, 6, 8, 10, 12, 14),
          i16x8
        )
        -
        __builtin_convertvector(
          __builtin_shufflevector(a, a, 1, 3, 5, 7, 9, 11, 13, 15),
          i16x8
        );
    }

    /* One shuffle puts the even lanes in the low half and the odd lanes in the high half. */
    i16x8 bar_intrin(v128_t a) {
      v128_t shuffled = wasm_i8x16_shuffle(
        a, a, 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15
      );
      return wasm_i16x8_extend_low_i8x16(shuffled) - wasm_i16x8_extend_high_i8x16(shuffled);
    }

I think it's pretty reasonable to expect that foo and foo_intrin should generate roughly the same code (the upper half of the shuffle doesn't matter, so maybe all zeros or something).
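To illustrate why the upper half of the shuffle is a don't-care for foo, here is a portable scalar sketch of the two operations. The model_* names are hypothetical helpers written for this illustration and are not part of any header. The sketch assumes extend_low reads only lanes 0 to 7 of its input, which is how the instruction is specified.

```c
#include <stdint.h>

/* Scalar model of i8x16.shuffle with both operands equal to `in`.
   Indices 16..31 would select from the second operand, which is the
   same vector here, so the index is reduced modulo 16. */
static void model_shuffle(const int8_t in[16], const uint8_t idx[16],
                          int8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        out[i] = in[idx[i] % 16];
    }
}

/* Scalar model of i16x8.extend_low_i8x16_s: sign-extend lanes 0..7. */
static void model_extend_low(const int8_t in[16], int16_t out[8]) {
    for (int i = 0; i < 8; i++) {
        out[i] = (int16_t)in[i];
    }
}

/* Scalar model of foo_intrin, with the shuffle indices as a parameter
   so different upper halves can be compared. */
static void model_foo(const int8_t a[16], const uint8_t idx[16],
                      int16_t out[8]) {
    int8_t shuffled[16];
    model_shuffle(a, idx, shuffled);
    model_extend_low(shuffled, out);
}
```

Any index pattern whose first eight entries are 0, 2, 4, ..., 14 gives the same result in this model, so a backend would be free to pick whatever upper half is cheapest.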

I'd be very impressed, OTOH, if bar and bar_intrin generated the same code. I'm not sure how feasible that is, though.
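For reference, the equivalence between bar and bar_intrin can be checked with a scalar sketch. The model_* names are hypothetical, and the sketch assumes the subtraction is meant lane-wise on 16-bit lanes (that is, an i16x8 subtract).

```c
#include <stdint.h>

/* Scalar model of bar: widen the even and odd lanes separately,
   then subtract lane-wise. */
static void model_bar(const int8_t a[16], int16_t out[8]) {
    for (int i = 0; i < 8; i++) {
        out[i] = (int16_t)((int16_t)a[2 * i] - (int16_t)a[2 * i + 1]);
    }
}

/* Scalar model of bar_intrin: one shuffle moves the even lanes to the
   low half and the odd lanes to the high half, then the result is
   extend_low(shuffled) - extend_high(shuffled). */
static void model_bar_intrin(const int8_t a[16], int16_t out[8]) {
    static const uint8_t idx[16] = {0, 2, 4, 6, 8, 10, 12, 14,
                                    1, 3, 5, 7, 9, 11, 13, 15};
    int8_t shuffled[16];
    for (int i = 0; i < 16; i++) {
        shuffled[i] = a[idx[i]];
    }
    for (int i = 0; i < 8; i++) {
        out[i] = (int16_t)((int16_t)shuffled[i] - (int16_t)shuffled[i + 8]);
    }
}
```

Both models compute a[2i] - a[2i+1] in each 16-bit lane, so getting from bar to the bar_intrin form would mean merging the two half-width shuffles into a single full-width one.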

[SIMD] __builtin_shufflevector to 64-bit vector then extending not vectorized #50151
