Open jmolloy opened 10 years ago
Coming back to the same:
So let the 128-bit vector being passed have i32 elements: {a, b, c, d}.
What GCC does is return {a}.
Clang does:
vrev64.32 q8, q0
That generates {b, a, d, c}, i.e. d16 = {b, a} and d17 = {d, c}, and then returns the last element (d17[1]), which is {c}, where it should return the second element (d16[1]), which is {a}.
Indeed true, James, thanks. I was just confused: I thought vrev was also reversing the bytes inside each element, as if it were converting the memory representation to the register representation. Reading the definition again, I see I misread it.
"""
int foo(int32x4_t a) { return a[0]; }
"""
Note that, unlike the NEON intrinsics, the semantics of the square-bracket notation aren't defined anywhere for ARM, which is why you end up with different code being generated.
Hi,
""" We obtain: vrev64.32 q8, q0 vmov.32 r0, d17[1] bx
Where with GCC we obtain: vmov.32 r0, d0[0] bx lr """
These two sequences are equivalent: Clang reverses and then reads the 3rd lane; GCC does not reverse and reads the 0th lane.
This is due to the way we represent lane indices in LLVM, and not caring enough to implement obvious fixup patterns (like rev/extract_elt(i) -> extract_elt(n-i-1)).
The rationale and design is documented here: http://llvm.org/docs/BigEndianNEON.html
Cheers,
James
Looking at the LLVM IR generated with the command:
clang -emit-llvm --target=arm-arm-none-eabi -march=armv8-a -mfloat-abi=hard -c test.c -o - -S -O0 -mbig-endian
there is an incorrect shufflevector there.
Same as gcc.
Clang command: clang --target=arm-arm-none-eabi -march=armv8-a -mfloat-abi=hard -c test.c -o - -S -O3 -mbig-endian
We obtain:
vrev64.32 q8, q0
vmov.32 r0, d17[1]
bx lr
Clang gives the same result as gcc.
Hi,
Would you be able to provide a specific example (test case) that demonstrates the desired behavior? As of now, LLVM (for AArch32) generates vld1 machine instructions from the vld1 intrinsic.
Cheers, Conny
Extended Description
During a discussion with the GCC folks, two faults in Clang (big endian) were identified:
The lane index to the lane-based vector intrinsics (such as vget_lane) is being treated as the logical lane, not the architectural lane. Richard Earnshaw has confirmed that it should be the architectural lane "as if" loaded by LDR.
The LD1 intrinsic is a user override and the compiler should not undo the LD1. The LD1 intrinsic is lowered to a normal LOAD node, so the compiler treats it like any load and ensures it acts as if the load had been performed by LDR. But LD1 should override this behaviour, and the load should be performed as if it were loaded with LD1, not LDR.
The following should be done to fix this:
Invert the LLVM-IR lane index created for all v*_lane functions.
Perform a reversal on the outcome of a vld1_ intrinsic. With this reversal, the compiler will do the right thing.
Bug 19392 (http://llvm.org/bugs/show_bug.cgi?id=19392) has been reopened for ARM64. This bug is for AArch32.