llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[SIMD] __builtin_shufflevector to 64-bit vector then extending not vectorized #50151

Open llvmbot opened 3 years ago

llvmbot commented 3 years ago
Bugzilla Link: 50807
Version: trunk
OS: Windows NT
Reporter: LLVM Bugzilla Contributor
CC: @tlively

Extended Description

With -msimd128 -O3, I would expect a __builtin_shufflevector which returns half the elements plus a __builtin_convertvector to extend each element (resulting in a 128-bit vector) to generate a v128.shuffle and an extend_low. Instead, it generates a bunch of extract_lane and replace_lane instructions.

Here are a couple of quick examples (Compiler Explorer: https://godbolt.org/z/EjbMqPhx1):

#include <stdint.h>
#include <wasm_simd128.h>

#pragma clang diagnostic ignored "-Wmissing-prototypes"

typedef int8_t i8x16 __attribute__((__vector_size__(16)));
typedef int16_t i16x8 __attribute__((__vector_size__(16)));
typedef int32_t i32x4 __attribute__((__vector_size__(16)));
typedef uint8_t u8x16 __attribute__((__vector_size__(16)));
typedef uint16_t u16x8 __attribute__((__vector_size__(16)));
typedef uint32_t u32x4 __attribute__((__vector_size__(16)));

i16x8 foo(i8x16 a) {
    return __builtin_convertvector(
        __builtin_shufflevector(
            a, a,
            0, 2, 4, 6, 8, 10, 12, 14
        ),
        i16x8
    );
}

v128_t foo_intrin(v128_t a) {
    return wasm_i16x8_extend_low_i8x16(
        wasm_i8x16_shuffle(
            a, a,
            0, 2, 4, 6, 8, 10, 12, 14,
            1, 3, 5, 7, 9, 11, 13, 15
        )
    );
}

i16x8 bar(i8x16 a) {
    return
        __builtin_convertvector(
            __builtin_shufflevector(
                a, a,
                0, 2, 4, 6, 8, 10, 12, 14
            ),
            i16x8
        )
        -
        __builtin_convertvector(
            __builtin_shufflevector(
                a, a,
                1, 3, 5, 7, 9, 11, 13, 15
            ),
            i16x8
        );
}

i16x8 bar_intrin(v128_t a) {
    v128_t shuffled = wasm_i8x16_shuffle(
        a, a,
        0, 2, 4, 6, 8, 10, 12, 14,
        1, 3, 5, 7, 9, 11, 13, 15
    );
    return wasm_i16x8_extend_low_i8x16(shuffled) - wasm_i16x8_extend_high_i8x16(shuffled);
}

I think it's pretty reasonable to expect that foo and foo_intrin should generate roughly the same code (the upper half of the shuffle doesn't matter, so maybe all zeros or something).
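To make the "upper half doesn't matter" point concrete, here is a rough scalar model in plain C (no SIMD headers, so it builds anywhere). It is only a sketch of the lane semantics as I understand them, not what the backend does: the names foo_model, foo_intrin_model, foo_models_agree and shuffle16 are made up for illustration, and the shuffle model assumes both inputs are the same vector so indices stay in 0..15.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of a 16-lane byte shuffle where both inputs are the same vector,
   so every index is expected to be in 0..15 (an assumption of this sketch). */
static void shuffle16(const int8_t a[16], const int idx[16], int8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        assert(idx[i] >= 0 && idx[i] < 16);
        out[i] = a[idx[i]];
    }
}

/* Model of foo: pick the even lanes, then sign-extend each i8 to i16. */
void foo_model(const int8_t a[16], int16_t out[8]) {
    for (int i = 0; i < 8; i++) {
        out[i] = (int16_t)a[2 * i];
    }
}

/* Model of foo_intrin: full 16-lane shuffle, then extend_low, which reads
   only lanes 0..7. `upper` supplies the indices for lanes 8..15. */
void foo_intrin_model(const int8_t a[16], const int upper[8], int16_t out[8]) {
    int idx[16] = {0, 2, 4, 6, 8, 10, 12, 14};
    int8_t shuffled[16];
    for (int i = 0; i < 8; i++) {
        idx[8 + i] = upper[i];
    }
    shuffle16(a, idx, shuffled);
    for (int i = 0; i < 8; i++) {
        out[i] = (int16_t)shuffled[i];
    }
}

/* Returns 1 if both models produce identical lanes for this input. */
int foo_models_agree(const int8_t a[16], const int upper[8]) {
    int16_t expected[8], actual[8];
    foo_model(a, expected);
    foo_intrin_model(a, upper, actual);
    return memcmp(expected, actual, sizeof expected) == 0;
}
```

Calling foo_models_agree with the odd indices from foo_intrin or with all zeros in upper gives the same lanes either way, which is why any choice for the upper eight shuffle indices should be acceptable in the lowering.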

I'd be very impressed, OTOH, if bar and bar_intrin generated the same code. I'm not sure how feasible that is, though.
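For reference, the equivalence I have in mind between bar and bar_intrin can be sketched with a scalar model in plain C. This assumes the subtraction is done per 16-bit lane (what wasm_i16x8_sub would do); the function names here are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 16-lane byte shuffle of a vector with itself; indices are
   assumed to be in 0..15 for this sketch. */
static void shuffle_lanes(const int8_t a[16], const int idx[16], int8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        assert(idx[i] >= 0 && idx[i] < 16);
        out[i] = a[idx[i]];
    }
}

/* Model of bar: sign-extended even lanes minus sign-extended odd lanes. */
void bar_model(const int8_t a[16], int16_t out[8]) {
    for (int i = 0; i < 8; i++) {
        out[i] = (int16_t)((int16_t)a[2 * i] - (int16_t)a[2 * i + 1]);
    }
}

/* Model of bar_intrin: one shuffle that moves the even lanes to the low half
   and the odd lanes to the high half, then extend_low minus extend_high. */
void bar_intrin_model(const int8_t a[16], int16_t out[8]) {
    static const int idx[16] = {0, 2, 4, 6, 8, 10, 12, 14,
                                1, 3, 5, 7, 9, 11, 13, 15};
    int8_t shuffled[16];
    shuffle_lanes(a, idx, shuffled);
    for (int i = 0; i < 8; i++) {
        int16_t low = (int16_t)shuffled[i];       /* extend_low lane i  */
        int16_t high = (int16_t)shuffled[8 + i];  /* extend_high lane i */
        out[i] = (int16_t)(low - high);
    }
}

/* Returns 1 if both models produce identical lanes for this input. */
int bar_models_agree(const int8_t a[16]) {
    int16_t expected[8], actual[8];
    bar_model(a, expected);
    bar_intrin_model(a, actual);
    for (int i = 0; i < 8; i++) {
        if (expected[i] != actual[i]) {
            return 0;
        }
    }
    return 1;
}
```

The difference of two sign-extended i8 values always fits in an i16 lane, so the single-shuffle form loses nothing; the open question is only whether the optimizer can discover it.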

tlively commented 3 years ago

After a recent spate of commits, these examples are no longer scalarized, but they still all generate very different code from one another. PTAL at the latest lowerings and let me know if they seem reasonable to you.