llvm / llvm-project


[llvm] Wrong vector results when using -Os, -mfloat-abi=softfp, --target=armebv7-linux #102418

Closed. Zhenhang1213 closed this issue 5 days ago.

Zhenhang1213 commented 1 month ago

I find that the results on armebv7-linux are different from armv7. Code:

#include <stdio.h> 
volatile int one = 1;

int main (int argc, char *argv[]) {
    __attribute__((vector_size((8)*sizeof(short)))) short v0 = {one, 1, 2, 3, 4, 5, 6, 7};
    __attribute__((vector_size((8)*sizeof(short)))) short v1;
    v1 = v0 % 2; 
    for (int __i = 0; __i < 8; __i++) { 
        printf("v1: %d\n", (*((short *) &(v1) + __i))); 
    }
    return 0;
}

When __i is 4, the result is different. I think the instruction vrev32.16 d17, d18 is wrong; when I change it by hand to vmov.16 d17, d18, the result is right. However, I don't know how to fix this bug.

https://godbolt.org/z/EqYebY73z
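
Expected output, going by plain C semantics (the loop reads the __i-th short from memory, and element order in memory is the same on either endianness), is {1, 1, 2, 3, 4, 5, 6, 7} % 2, i.e.:

v1: 1
v1: 1
v1: 0
v1: 1
v1: 0
v1: 1
v1: 0
v1: 1

so the value printed at __i == 4 should be 0. A reproduction along these lines should work (the flags are the ones from the issue title; the exact triple spelling, sysroot, and runtime, e.g. qemu-armeb, are toolchain-dependent assumptions):

clang --target=armebv7-linux -Os -mfloat-abi=softfp test.c -o test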

llvmbot commented 1 month ago

@llvm/issue-subscribers-backend-arm

Author: Austin (Zhenhang1213)

davemgreen commented 1 month ago

#97782 recently fixed a similar-sounding issue, but it doesn't look like the code has changed on trunk: https://godbolt.org/z/7WYn38xTs.

Zhenhang1213 commented 1 month ago

> #97782 recently fixed a similar-sounding issue, but it doesn't look like the code has changed on trunk: https://godbolt.org/z/7WYn38xTs.

Yes, I compiled clang with this patch, and the generated code is the same.

hstk30-hw commented 1 month ago

Same problem, but a different situation. The vrev32.16 d17, d18 is unnecessary. Replacing

vmov.i32        d18, #0x10000
vrev32.16 d17, d18

with

vmov.i32        d17, #0x10000

is ok.
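
A rough walkthrough of what the two sequences leave in the destination register, using only the architectural semantics of vmov.i32 and vrev32.16 (which of the two values the surrounding code actually needs is exactly what this issue is about):

vmov.i32  d18, #0x10000    @ each 32-bit lane = 0x00010000, so d18 = 0x0001000000010000
vrev32.16 d17, d18         @ swaps the 16-bit halves inside each 32-bit lane
                           @ -> d17 = 0x0000000100000001

@ the suggested single-instruction replacement keeps the un-reversed value:
vmov.i32  d17, #0x10000    @ d17 = 0x0001000000010000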

Zhenhang1213 commented 1 month ago

> Same problem, but a different situation. The vrev32.16 d17, d18 is unnecessary. Replacing
>
> vmov.i32        d18, #0x10000
> vrev32.16 d17, d18
>
> with
>
> vmov.i32        d17, #0x10000
>
> is ok.

@davemgreen could you give me some ideas? I've been puzzled by this question for a long time.

davemgreen commented 1 month ago

Hi, it honestly looks like vector constants are generated incorrectly if they can be specified with a larger movi size (i.e. using a vmov.i32 for an i16 constant splat). https://godbolt.org/z/oPczbxbvq

That may be related to #68212; we perhaps should be passing IsBigEndian to isConstantSplat in LowerBUILD_VECTOR. The bitcast of the VMOVIMM also looks suspicious, and there might be a problem in PerformBITCASTCombine where it is ignoring casts that it should not. There might be cases where we get it wrong in two places that usually cancel each other out.
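
As a concrete illustration of that kind of constant (my own example, not taken from the godbolt link): the <8 x i16> constant {0, 1, 0, 1, 0, 1, 0, 1} has, in little-endian lane order, the 32-bit pattern 0x00010000 in every lane, so it can be materialized with a single wider immediate move:

vmov.i32 q0, #0x10000    @ covers the whole v8i16 constant in one instruction

On big endian the question is whether the 16-bit lanes inside each 32-bit lane still end up in the order the original BUILD_VECTOR specified.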

Zhenhang1213 commented 1 month ago

> Hi, it honestly looks like vector constants are generated incorrectly if they can be specified with a larger movi size (i.e. using a vmov.i32 for an i16 constant splat). https://godbolt.org/z/oPczbxbvq
>
> That may be related to #68212; we perhaps should be passing IsBigEndian to isConstantSplat in LowerBUILD_VECTOR. The bitcast of the VMOVIMM also looks suspicious, and there might be a problem in PerformBITCASTCombine where it is ignoring casts that it should not. There might be cases where we get it wrong in two places that usually cancel each other out.

However, I find that LLVM 15 is also wrong without this patch.

Zhenhang1213 commented 3 weeks ago

> Hi, it honestly looks like vector constants are generated incorrectly if they can be specified with a larger movi size (i.e. using a vmov.i32 for an i16 constant splat). https://godbolt.org/z/oPczbxbvq
>
> That may be related to #68212; we perhaps should be passing IsBigEndian to isConstantSplat in LowerBUILD_VECTOR. The bitcast of the VMOVIMM also looks suspicious, and there might be a problem in PerformBITCASTCombine where it is ignoring casts that it should not. There might be cases where we get it wrong in two places that usually cancel each other out.

https://github.com/llvm/llvm-project/blob/6f6422f4a2b8647a59936c131e50a79906d89510/llvm/lib/Target/ARM/ARMISelLowering.cpp#L7966-L7969

Is this code for endianness adjustment? In this scenario, should I adjust the lower part first and only fix up the byte order after the CONCAT_VECTORS at the end?

davemgreen commented 3 weeks ago

I don't believe that #68212 broke or fixed anything on its own as we don't set the new parameter. We might need to add the IsBigEndian parameter with the correct value in the call to isConstantSplat in ARMISelLowering. Then change the BITCAST you point to above into a VECTOR_REG_CAST and then see what else changes and if there are any other follow-on fixes that are needed. From what I remember from looking into it, the PerformBITCASTCombine function might need to be tightened up when there is BITCAST(VECTOR_REG_CAST). The important thing is to test it to make sure we are now getting the right results.

Does that sound do-able? Big-endian is unfortunately much less used than the rest so doesn't get as much testing.
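
A minimal sketch of what that suggestion could look like at the existing call site in ARMTargetLowering::LowerBUILD_VECTOR, assuming the current BuildVectorSDNode::isConstantSplat signature with its optional MinSplatBits/isBigEndian parameters (an illustration of the idea only, not a reviewed change):

// Sketch: report the splat value in the target's endianness so the
// immediate built later agrees with the lane order the lowering assumes.
BuildVectorSDNode *BVN = cast<BuildVectorSDNode>(Op.getNode());
APInt SplatBits, SplatUndef;
unsigned SplatBitSize;
bool HasAnyUndefs;
if (BVN->isConstantSplat(SplatBits, SplatUndef, SplatBitSize, HasAnyUndefs,
                         /*MinSplatBits=*/0,
                         /*isBigEndian=*/DAG.getDataLayout().isBigEndian())) {
  // ... existing VMOVIMM / VMVNIMM handling ...
}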

Zhenhang1213 commented 3 weeks ago

> I don't believe that #68212 broke or fixed anything on its own as we don't set the new parameter. We might need to add the IsBigEndian parameter with the correct value in the call to isConstantSplat in ARMISelLowering. Then change the BITCAST you point to above into a VECTOR_REG_CAST and then see what else changes and if there are any other follow-on fixes that are needed. From what I remember from looking into it, the PerformBITCASTCombine function might need to be tightened up when there is BITCAST(VECTOR_REG_CAST). The important thing is to test it to make sure we are now getting the right results.
>
> Does that sound do-able? Big-endian is unfortunately much less used than the rest so doesn't get as much testing.

Yes, I tried to modify it and some tests failed. I think there are some problems with the modified behavior, such as this case:

; RUN: llc -mtriple armeb-eabi -mattr=armv8.2-a,neon,fullfp16 -target-abi=aapcs-gnu -float-abi hard -o - %s | FileCheck %s
define void @conv_v4i16_to_v4f16( <4 x i16> %a, <4 x half>* %store ) {
; CHECK-LABEL: conv_v4i16_to_v4f16:
; CHECK:       @ %bb.0: @ %entry
; CHECK-NEXT:    vmov.i64 d16, #0xffff00000000ffff
; CHECK-NEXT:    vldr d17, [r0]
; CHECK-NEXT:    vrev64.16 d18, d0
; CHECK-NEXT:    vrev64.16 d17, d17
; CHECK-NEXT:    vrev64.16 d16, d16
; CHECK-NEXT:    vadd.i16 d16, d18, d16
; CHECK-NEXT:    vadd.f16 d16, d16, d17
; CHECK-NEXT:    vrev64.16 d16, d16
; CHECK-NEXT:    vstr d16, [r0]
; CHECK-NEXT:    bx lr
entry:
  %c = add <4 x i16> %a, <i16 -1, i16 0, i16 0, i16 -1>
  %v = bitcast <4 x i16> %c to <4 x half>
  %w = load <4 x half>, <4 x half>* %store
  %z = fadd <4 x half> %v, %w
  store <4 x half> %z, <4 x half>* %store
  ret void
}

and the result is

vmov.i64        d16, #0xffff00000000ffff
vldr            d17, [r0]
vrev64.16       d18, d0
vadd.i16        d16, d18, d16
vrev64.16       d17, d17
vadd.f16        d16, d16, d17
vrev64.16       d16, d16
vstr            d16, [r0]
bx              lr

The logic of the two parts is not the same. When I debug the PerformVECTOR_REG_CASTCombine function, Op->getOpcode() is not ARMISD::VECTOR_REG_CAST, so it returns SDValue().

Zhenhang1213 commented 3 weeks ago

> I don't believe that #68212 broke or fixed anything on its own as we don't set the new parameter. We might need to add the IsBigEndian parameter with the correct value in the call to isConstantSplat in ARMISelLowering. Then change the BITCAST you point to above into a VECTOR_REG_CAST and then see what else changes and if there are any other follow-on fixes that are needed. From what I remember from looking into it, the PerformBITCASTCombine function might need to be tightened up when there is BITCAST(VECTOR_REG_CAST). The important thing is to test it to make sure we are now getting the right results.
>
> Does that sound do-able? Big-endian is unfortunately much less used than the rest so doesn't get as much testing.

After adding the IsBigEndian parameter with the correct value to the isConstantSplat call in ARMISelLowering, and removing the big-endian adjustment for 64-bit VMOV in isVMOVModifiedImm, for this IR:

entry:
  %0 = bitcast <4 x i32> %src to <16 x i8>
  %r = and <16 x i8> %0, <i8 0, i8 0, i8 0, i8 1, i8 0, i8 0, i8 0, i8 1, i8 0, i8 0, i8 0, i8 1, i8 0, i8 0, i8 0, i8 1>
  ret <16 x i8> %r

I think the generated instructions are incorrect:


vrev64.8        q1, q0
vmov.i32        q0, #0x1
vrev32.8        q0, q0
vand            q1, q1, q0
vrev64.8        q0, q1
bx              lr

Could you give me more advice? Changing BITCAST to VECTOR_REG_CAST brings more failures. Thanks.

davemgreen commented 3 weeks ago

I was taking another look, and using the IsBigEndian parameter to isConstantSplat might have been a mistake to suggest. I think it would make more sense to always have the constants produced in the "vector-lane-order" (same as little-endian, lowest lane at the bottom), which means not passing isBigEndian through to isConstantSplat.

The constants we create then don't need to be bitcast to change the lane order, they just need to be vector_reg_cast. Apparently i64 is for some reason special and the vector elements get reversed in https://github.com/llvm/llvm-project/blob/dd90c72b05822927bf62724759f73c491166248c/llvm/lib/Target/ARM/ARMISelLowering.cpp#L7124; that should be removed so we are consistent about the order in which lanes appear.

Could you put a patch together with those changes? We can go through the test differences to make sure they all seem correct. I'm pretty sure that something will need to change in PerformBITCASTCombine in the optimization of bitcast(vector_reg_cast(vmovimm)).

Zhenhang1213 commented 3 weeks ago

> I was taking another look, and using the IsBigEndian parameter to isConstantSplat might have been a mistake to suggest. I think it would make more sense to always have the constants produced in the "vector-lane-order" (same as little-endian, lowest lane at the bottom), which means not passing isBigEndian through to isConstantSplat.
>
> The constants we create then don't need to be bitcast to change the lane order, they just need to be vector_reg_cast. Apparently i64 is for some reason special and the vector elements get reversed in https://github.com/llvm/llvm-project/blob/dd90c72b05822927bf62724759f73c491166248c/llvm/lib/Target/ARM/ARMISelLowering.cpp#L7124; that should be removed so we are consistent about the order in which lanes appear.
>
> Could you put a patch together with those changes? We can go through the test differences to make sure they all seem correct. I'm pretty sure that something will need to change in PerformBITCASTCombine in the optimization of bitcast(vector_reg_cast(vmovimm)).

I have tried these modifications.

diff --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index 4c24d7020932..d864ba00a02b 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -7011,19 +7011,6 @@ static SDValue isVMOVModifiedImm(uint64_t SplatBits, uint64_t SplatUndef,
       ImmMask <<= 1;
     }

-    if (DAG.getDataLayout().isBigEndian()) {
-      // Reverse the order of elements within the vector.
-      unsigned BytesPerElem = VectorVT.getScalarSizeInBits() / 8;
-      unsigned Mask = (1 << BytesPerElem) - 1;
-      unsigned NumElems = 8 / BytesPerElem;
-      unsigned NewImm = 0;
-      for (unsigned ElemNum = 0; ElemNum < NumElems; ++ElemNum) {
-        unsigned Elem = ((Imm >> ElemNum * BytesPerElem) & Mask);
-        NewImm |= Elem << (NumElems - ElemNum - 1) * BytesPerElem;
-      }
-      Imm = NewImm;
-    }
-
     // Op=1, Cmode=1110.
     OpCmode = 0x1e;
     VT = is128Bits ? MVT::v2i64 : MVT::v1i64;
@@ -7831,7 +7818,7 @@ SDValue ARMTargetLowering::LowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG,

       if (Val.getNode()) {
         SDValue Vmov = DAG.getNode(ARMISD::VMOVIMM, dl, VmovVT, Val);
-        return DAG.getNode(ISD::BITCAST, dl, VT, Vmov);
+        return DAG.getNode(ARMISD::VECTOR_REG_CAST, dl, VT, Vmov);
       }

       // Try an immediate VMVN.
@@ -18292,7 +18279,6 @@ static SDValue PerformBITCASTCombine(SDNode *N,
   if ((Src.getOpcode() == ARMISD::VMOVIMM ||
        Src.getOpcode() == ARMISD::VMVNIMM ||
        Src.getOpcode() == ARMISD::VMOVFPIMM) &&
-      SrcVT.getScalarSizeInBits() <= DstVT.getScalarSizeInBits() &&
       DAG.getDataLayout().isBigEndian())
     return DAG.getNode(ARMISD::VECTOR_REG_CAST, SDLoc(N), DstVT, Src);

I tested it, and big-endian-neon-fp16-bitconv.ll failed. I think PerformVECTOR_REG_CASTCombine should be changed next, right?
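
For inspecting the differences, something along these lines should work (paths are assumptions based on a typical build layout):

# run the single failing test through lit
./build/bin/llvm-lit -v llvm/test/CodeGen/ARM/big-endian-neon-fp16-bitconv.ll

# regenerate the CHECK lines to see the new codegen
python3 llvm/utils/update_llc_test_checks.py --llc-binary ./build/bin/llc \
    llvm/test/CodeGen/ARM/big-endian-neon-fp16-bitconv.ll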

Zhenhang1213 commented 3 weeks ago

> I was taking another look, and using the IsBigEndian parameter to isConstantSplat might have been a mistake to suggest. I think it would make more sense to always have the constants produced in the "vector-lane-order" (same as little-endian, lowest lane at the bottom), which means not passing isBigEndian through to isConstantSplat.
>
> The constants we create then don't need to be bitcast to change the lane order, they just need to be vector_reg_cast. Apparently i64 is for some reason special and the vector elements get reversed in https://github.com/llvm/llvm-project/blob/dd90c72b05822927bf62724759f73c491166248c/llvm/lib/Target/ARM/ARMISelLowering.cpp#L7124; that should be removed so we are consistent about the order in which lanes appear.
>
> Could you put a patch together with those changes? We can go through the test differences to make sure they all seem correct. I'm pretty sure that something will need to change in PerformBITCASTCombine in the optimization of bitcast(vector_reg_cast(vmovimm)).

After this patch, the vrev after every vmov is removed, so those tests fail, like this:

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple armeb-eabi -mattr=armv8.2-a,neon,fullfp16 -target-abi=aapcs-gnu -float-abi hard -o - %s | FileCheck %s
define void @conv_v4i16_to_v4f16( <4 x i16> %a, <4 x half>* %store ) {
; CHECK-LABEL: conv_v4i16_to_v4f16:
; CHECK:       @ %bb.0: @ %entry
; CHECK-NEXT:    vmov.i64 d16, #0xffff00000000ffff
; CHECK-NEXT:    vldr d17, [r0]
; CHECK-NEXT:    vrev64.16 d18, d0
; CHECK-NEXT:    vrev64.16 d17, d17
; CHECK-NEXT:    vrev64.16 d16, d16
; CHECK-NEXT:    vadd.i16 d16, d18, d16
; CHECK-NEXT:    vadd.f16 d16, d16, d17
; CHECK-NEXT:    vrev64.16 d16, d16
; CHECK-NEXT:    vstr d16, [r0]
; CHECK-NEXT:    bx lr
entry:
  %c = add <4 x i16> %a, <i16 -1, i16 0, i16 0, i16 -1>
  %v = bitcast <4 x i16> %c to <4 x half>
  %w = load <4 x half>, <4 x half>* %store
  %z = fadd <4 x half> %v, %w
  store <4 x half> %z, <4 x half>* %store
  ret void
}

the result is:

vmov.i64        d16, #0xffff00000000ffff
vldr    d17, [r0]
vrev64.16       d18, d0
vadd.i16        d16, d18, d16
vrev64.16       d17, d17
vadd.f16        d16, d16, d17
vrev64.16       d16, d16
vstr    d16, [r0]
bx      lr

davemgreen commented 3 weeks ago

Is that incorrect now, or was that vrev64.16 unnecessary?

The PerformBITCASTCombine part probably needs to be more restrictive, not less. If there is:

%a = v2f64 vmovimm
%b = v16i8 vector_reg_cast %a
%c = v2i64 bitcast %b

We need to keep the v16i8->v2i64 bitcast to keep the lane shuffling, even if the original type elt size is <= the final type elt size.
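
Annotating the example above with my understanding of the two cast kinds (VECTOR_REG_CAST reinterprets the register contents for free, while a big-endian BITCAST between types with different element sizes carries a VREV-style lane reorder):

%a = v2f64 vmovimm             ; immediate materialized in register lanes
%b = v16i8 vector_reg_cast %a  ; no instruction, just reinterprets the register
%c = v2i64 bitcast %b          ; on big-endian this implies a lane reorder,
                               ; so it must not be folded into another
                               ; vector_reg_cast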

Zhenhang1213 commented 3 weeks ago

> Is that incorrect now, or was that vrev64.16 unnecessary?
>
> The PerformBITCASTCombine part probably needs to be more restrictive, not less. If there is:
>
> %a = v2f64 vmovimm
> %b = v16i8 vector_reg_cast %a
> %c = v2i64 bitcast %b
>
> We need to keep the v16i8->v2i64 bitcast to keep the lane shuffling, even if the original type elt size is <= the final type elt size.

Compared to the original case, the vrev64.16 is unnecessary; in this case the result is the same: https://godbolt.org/z/zYv866sMY