llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Big endian vector intrinsics are not compatible with GCC #20136

Open jmolloy opened 10 years ago

jmolloy commented 10 years ago
Bugzilla Link 19762
Version trunk
OS Linux
CC @hfinkel,@jmolloy,@TNorthover

Extended Description

During a discussion with the GCC folks, two faults in Clang (big endian) were identified:

The lane index to the lane-based vector intrinsics (such as vget_lane) is being treated as the logical lane, not the architectural lane. Richard Earnshaw has confirmed that it should be the architectural lane "as if" loaded by LDR.

The LD1 intrinsic is a user override, and the compiler should not undo it. The LD1 intrinsic is currently lowered to a normal LOAD node, so the compiler treats it like any other load and ensures it acts as if the load had been performed by LDR. But LD1 should override this behaviour: the load should act as if it were performed by LD1, not LDR.

The following should be done to fix this:

Invert the LLVM-IR lane index created for all v*_lane functions.

Perform a reversal on the outcome of a vld1_ intrinsic. With this reversal, the compiler will do the right thing.

Bug 19392 (http://llvm.org/bugs/show_bug.cgi?id=19392) has been reopened for ARM64. This bug is for AArch32.

llvmbot commented 5 years ago

Coming back to the same:

So let the 128-bit vector being passed have i32 elements {a, b, c, d}.

What gcc does is return {a}

Clang does:

vrev64.32 q8, q0 generates {b, a, d, c} (d16 = {b, a}, d17 = {d, c}) and returns the last element (d17[1]), that is {c}, where it should return the second element (d16[1]), that is {a}.

llvmbot commented 5 years ago

Indeed true James, thanks. I was just confused that vrev was also reverting the bytes inside each element, as if it was converting memory to register representation. Reading the definition again I see I misread it.

jmolloy commented 5 years ago

"""
#include <arm_neon.h>

int foo(int32x4_t a) { return a[0]; }
"""

Note that, unlike NEON intrinsics, the semantics of the square-bracket notation aren't defined anywhere for ARM, which is why you end up with different code being generated.

jmolloy commented 5 years ago

Hi,

"""
We obtain:

vrev64.32 q8, q0
vmov.32 r0, d17[1]
bx lr

Where with GCC we obtain:

vmov.32 r0, d0[0]
bx lr
"""

These two sequences are equivalent. Clang reverses and then reads lane 3; GCC does not reverse and reads lane 0.

This is due to the way we represent lane indices in LLVM, and not caring enough to implement obvious fixup patterns (like rev/extract_elt(i) -> extract_elt(n-i-1)).

The rationale and design is documented here: http://llvm.org/docs/BigEndianNEON.html

Cheers,

James

llvmbot commented 5 years ago

Looking at the LLVM-IR generated with the command:

clang -emit-llvm --target=arm-arm-none-eabi -march=armv8-a -mfloat-abi=hard -c test.c -o - -S -O0 -mbig-endian


For the intrinsic we obtain:

define dso_local arm_aapcs_vfpcc i32 @foo(<4 x i32> %a) #0 {
entry:
  %a.addr = alloca <4 x i32>, align 8
  %__s0 = alloca <4 x i32>, align 8
  %__rev0 = alloca <4 x i32>, align 8
  %ret = alloca i32, align 4
  %tmp = alloca i32, align 4
  store <4 x i32> %a, <4 x i32>* %a.addr, align 8
  %0 = load <4 x i32>, <4 x i32>* %a.addr, align 8
  store <4 x i32> %0, <4 x i32>* %__s0, align 8
  %1 = load <4 x i32>, <4 x i32>* %__s0, align 8
  %2 = load <4 x i32>, <4 x i32>* %__s0, align 8
  %shuffle = shufflevector <4 x i32> %1, <4 x i32> %2, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  store <4 x i32> %shuffle, <4 x i32>* %__rev0, align 8
  %3 = load <4 x i32>, <4 x i32>* %__rev0, align 8
  %4 = bitcast <4 x i32> %3 to <16 x i8>
  %5 = bitcast <16 x i8> %4 to <4 x i32>
  %vget_lane = extractelement <4 x i32> %5, i32 0
  store i32 %vget_lane, i32* %ret, align 4
  %6 = load i32, i32* %ret, align 4
  store i32 %6, i32* %tmp, align 4
  %7 = load i32, i32* %tmp, align 4
  ret i32 %7
}

With an incorrect shufflevector there.

Where for returning the value by a[0], we obtain the code:

define dso_local arm_aapcs_vfpcc i32 @foo2(<4 x i32> %a) #0 {
entry:
  %a.addr = alloca <4 x i32>, align 8
  store <4 x i32> %a, <4 x i32>* %a.addr, align 8
  %0 = load <4 x i32>, <4 x i32>* %a.addr, align 8
  %vecext = extractelement <4 x i32> %0, i32 0
  ret i32 %vecext
}

The same as GCC.

llvmbot commented 5 years ago

Hi, I have an example of this bug. For the code:

#include <arm_neon.h>

int foo(int32x4_t a) { return vgetq_lane_s32(a, 0); }

Clang command: clang --target=arm-arm-none-eabi -march=armv8-a -mfloat-abi=hard -c test.c -o - -S -O3 -mbig-endian

We obtain:

vrev64.32 q8, q0
vmov.32 r0, d17[1]
bx lr

Where with GCC we obtain:

vmov.32 r0, d0[0]
bx lr

That seems to be a problem with the intrinsic, as when compiling the code:

#include <arm_neon.h>

int foo(int32x4_t a) { return a[0]; }

Clang gives the same result as gcc.

llvmbot commented 10 years ago

Hi,

Would you be able to provide a specific example (test case) that demonstrates the desired behaviour? As of now, LLVM (for AArch32) generates vld1 machine instructions from the vld1 intrinsic.

Cheers, Conny