Open gyuminb opened 11 months ago
@llvm/issue-subscribers-backend-powerpc
Author: None (gyuminb)
I've taken a quick look at the assembly for this test.
For O2 we produce the following code to compute the MIN for computedResultUll:
lwz 4, 100(1) # 4-byte Folded Reload
extsh 3, 10
cmpw 3, 4
isellt 4, 3, 4
addis 3, 2, computedResultUll@toc@ha
std 4, computedResultUll@toc@l(3)
For O3 we produce the following code:
extsh 4, 21
extsh 3, 3
cmpw 3, 4
isellt 4, 3, 4
addis 3, 2, computedResultUll@toc@ha
std 4, computedResultUll@toc@l(3)
For O1:
lhax 17, 5, 17
<... lots of code in here ...>
extsh 3, 11
addis 4, 2, computedResultUll@toc@ha
cmpw 3, 17
addi 4, 4, computedResultUll@toc@l
isellt 3, 3, 17
std 3, 0(4)
The bottom line is that for -O2 we seem to reload the value with a zero-extending load (lwz), which is why we get a different result.
Looks like we are changing LHAX to LWZ in the register allocator. Is this an attempt to reduce live range?
# *** IR Dump Before Greedy Register Allocator (greedy) ***:
<...>
1968B bb.3.for.cond.for.cond.cleanup_crit_edge:
; predecessors: %bb.6
successors: %bb.4(0x80000000); %bb.4(100.00%)
1984B %223:gprc_and_gprc_nor0 = EXTSH %4:gprc
2000B %224:crrc = CMPW %223:gprc_and_gprc_nor0, %222:gprc_and_gprc_nor0
2016B undef %232.sub_32:g8rc = ISEL %223:gprc_and_gprc_nor0, %222:gprc_and_gprc_nor0, %224.sub_lt:crrc
2048B %226:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @computedResultUll
2064B STD %232:g8rc, target-flags(ppc-toc-lo) @computedResultUll, %226:g8rc_and_g8rc_nox0, implicit $x2 :: (store (s64) into @computedResultUll, !tbaa !9)
<... Different BB ... >
2544B %222:gprc_and_gprc_nor0 = LHAX %78:g8rc_and_g8rc_nox0, %158:g8rc :: (load (s16) from %ir.arrayidx, !tbaa !5)
# *** IR Dump After Greedy Register Allocator (greedy) ***:
<...>
2000B bb.3.for.cond.for.cond.cleanup_crit_edge:
; predecessors: %bb.6
successors: %bb.4(0x80000000); %bb.4(100.00%)
2016B %223:gprc_and_gprc_nor0 = EXTSH %4:gprc
2024B %285:gprc_and_gprc_nor0 = LWZ 0, %stack.5 :: (load (s32) from %stack.5)
2032B %224:crrc = CMPW %223:gprc_and_gprc_nor0, %285:gprc_and_gprc_nor0
2048B undef %232.sub_32:g8rc = ISEL %223:gprc_and_gprc_nor0, %285:gprc_and_gprc_nor0, %224.sub_lt:crrc
2080B %226:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @computedResultUll
2096B STD %232:g8rc, target-flags(ppc-toc-lo) @computedResultUll, %226:g8rc_and_g8rc_nox0, implicit $x2 :: (store (s64) into @computedResultUll, !tbaa !9)
<... Different BB ... >
2608B %286:gprc_and_gprc_nor0 = LHAX %271:g8rc_and_g8rc_nox0, %158:g8rc :: (load (s16) from %ir.arrayidx, !tbaa !5)
2616B STW %286:gprc_and_gprc_nor0, 0, %stack.5 :: (store (s32) into %stack.5)
Should that be an STD and LD instead of STW and LWZ?
@gyuminb Did you check that the code is UB free using UBSan? Seems like
for (short index = 3; index < ((int) (short) largeNumber) - 1705/*7*/; index += 4) {
might be converting an unsigned value to a signed value that doesn't fit in a short.
Yes, when I checked the code using the -fsanitize=undefined option, there were no instances of undefined behavior (UB).
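For context on the question above: in C, converting an out-of-range value to a signed integer type such as short is implementation-defined rather than undefined, so a UBSan run would not be expected to flag the cast in the loop bound. The sketch below is only an illustration. The function name loop_bound and the type chosen for large_number are assumptions, since the PoC source is not reproduced in this thread, and the wrapped result assumes GCC or Clang, which reduce the value modulo 2^16.

```c
#include <assert.h>

/* Hypothetical stand-in for the loop bound expression quoted in the thread.
   The real type and value of largeNumber are not shown here, so
   unsigned long long is an assumption. */
static int loop_bound(unsigned long long large_number) {
    /* Out-of-range conversion to short is implementation-defined in C,
       not undefined. GCC and Clang wrap modulo 2^16. */
    short narrowed = (short)large_number;
    return (int)narrowed - 1705;
}

/* Usage example. The second case relies on the GCC/Clang wrapping rule. */
static void check_examples(void) {
    assert(loop_bound(2000ULL) == 295);    /* value fits in a short */
    assert(loop_bound(70000ULL) == 2759);  /* 70000 - 65536 = 4464, minus 1705 */
}
```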
This is a bug in PPCMIPeepholes. It is quite subtle, but a bug nonetheless. This is probably why it requires all the complexity in the test case. Here's the gist of it:
1. There are values of type i16 that are inputs to a select_cc, which is then sign extended to i64.
2. The inputs are sign extended with EXTSH and the result is extended with EXTSW_32_64.
3. The peephole sees that EXTSW_32_64 is fed by an operation that extends 32 to 64, and because the two EXTSH (or whatever they're transformed to) are marked as such, the EXTSW_32_64 is converted to an INSERT_SUBREG.
4. The register allocator spills the result of an EXTSH (or the LHA that it is converted to), and because it is a GPRC register, it is spilled (and reloaded) using STW/LWZ.
Ultimately, we have to remove all the SExt32To64 decorations from $LLVM_SRC/lib/Target/PowerPC/PPCInstrInfo.td because when they are spilled, they will be reloaded using a zero-extending load.
In order to prevent lost opportunities to remove redundant sign-extend instructions, perhaps the peephole can be modified to look at the use of instructions that sign-extend to 32-bits and if all the uses will then sign-extend, then convert the instruction to a sign-extend to 64-bits and update the uses to 64-bit uses. But that's a bit more involved.
I compared the asm code. The difference is as follows.
The erroneous output:
r10 holds the value of shortArray[3]. When it is stored to the stack, only 4 bytes are stored.
lha 10, 6(31) # 6(31) is shortArray[3]
.....
stw 10, 124(1) # 4-byte Folded Spill
When the value is loaded back from the stack, only 4 bytes are loaded.
L..BB0_11: # %for.cond.cleanup15
lwz 4, 124(1) # 4-byte Folded Reload, shortArray[3] = 0xFFFF
extsh 3, 12 # globalShortValue is 1
cmpw 3, 4
isellt 4, 3, 4
ld 3, L..C16(2) # @computedResultUll
std 4, 0(3) # store r4 to computedResultUll
ld 3, L..C2(2) # @_MergedGlobals
bl .printf[PR]
The correct one:
lha 4, 6(29) # @shortArray[3]
lha 3, 0(28) # content of @globalShortValue
cmpw 3, 4
isellt 30, 3, 4 # int tmp in r30
std 30, 0(11) # store r30 (tmp) into computedResultUll
std 30, 112(1) # 8-byte Folded Spill, int tmp in r30 goes to 112(1)
L..BB0_11: # %for.cond.cleanup16
ld 3, 120(1) # 8-byte Folded Reload
ld 4, 112(1) # 8-byte Folded Reload, shortArray[3]
addi 3, 3, 4
bl .printf[PR]
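The difference between the two listings above can be modeled in plain C. This is only a sketch: the helper names are made up for illustration, the input values (1 for globalShortValue, 0xFFFF for shortArray[3]) are taken from the comments in the listings, and the narrowing casts assume the usual two's complement wrapping that GCC and Clang implement.

```c
#include <assert.h>
#include <stdint.h>

/* lha / extsh: sign-extend a halfword into a 64-bit register. */
static uint64_t sext16_to_64(int16_t v) {
    return (uint64_t)(int64_t)v;
}

/* stw followed by lwz: only the low word reaches the stack slot,
   and lwz zero-fills the upper 32 bits on reload. */
static uint64_t spill_reload_word(uint64_t reg) {
    uint32_t slot = (uint32_t)reg;
    return (uint64_t)slot;
}

/* cmpw a, b ; isellt rt, a, b: compare the low words as signed 32-bit
   values, then copy the full 64-bit register of the selected operand. */
static uint64_t cmpw_isellt(uint64_t a, uint64_t b) {
    return ((int32_t)(uint32_t)a < (int32_t)(uint32_t)b) ? a : b;
}

/* Value that std would write to computedResultUll in each variant. */
static uint64_t min_without_spill(int16_t x, int16_t y) {
    return cmpw_isellt(sext16_to_64(x), sext16_to_64(y));
}

static uint64_t min_with_word_spill(int16_t x, int16_t y) {
    return cmpw_isellt(sext16_to_64(x), spill_reload_word(sext16_to_64(y)));
}

/* Usage example with the values from the listing comments. */
static void check_examples(void) {
    assert(min_without_spill(1, -1) == UINT64_MAX);       /* 0xFFFFFFFFFFFFFFFF */
    assert(min_with_word_spill(1, -1) == 0xFFFFFFFFull);  /* upper half lost */
}
```

In both variants cmpw makes the same decision, because it only looks at the low words. The results differ only in the upper 32 bits that isellt copies and std stores.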
The bug only happens in 64-bit mode. In the PPCMIPeepholes optimization, an extsw instruction (EXTSW_32_64) may be eliminated when either of the following conditions is met.
1. The instruction that defines the register RS used by extsw RA,RS is already a sign-extending instruction. For example:
LHA 4, 2(2) # if the memory at address 2(r2) holds -1, r4 will be 0xFFFFFFFFFFFFFFFF in 64-bit mode
EXTSW 5, 4 # so r5 will be 0xFFFFFFFFFFFFFFFF
ADDI 6, 5, 3 # the content of r6 is 2
LHA is lowered to lha in instruction selection. Since lha sign-extends 16 bits to 64 bits in a 64-bit mode application, after PPCMIPeepholes the code changes to
LHA 4, 2(2) # if the memory at address 2(r2) holds -1, r4 will be 0xFFFFFFFFFFFFFFFF
ADDI 6, 4, 3 # the content of r6 is 2
2. The instruction that defines the register RS used by extsw RA,RS is not a sign-extending instruction, but the content of RS is known to be sign-extended to 64 bits (this is deduced from the data flow of the instructions). For example:
LHA 4, 2(2) # if the memory at address 2(r2) holds -1, r4 will be 0xFFFFFFFFFFFFFFFF
EXTSH 3, 12 # if the content of r12 is 1
CMPW 3, 4 # compare r3 and r4
ISEL 4, 3, 4 # r4 will be 0xFFFFFFFFFFFFFFFF
EXTSW 5, 4 # r5 will be 0xFFFFFFFFFFFFFFFF
ADDI 6, 5, 3 # r6 will be 2
In this scenario, the extsw can still be eliminated by PPCMIPeepholes even though isellt is not a sign-extending instruction, because the registers used by isellt are defined by 64-bit sign-extending instructions: lha and extsh. The code can be optimized to
lha 4, 2(2) # if the memory at address 2(r2) holds -1, r4 will be 0xFFFFFFFFFFFFFFFF
extsh 3, 12 # if the content of r12 is 1
cmpw 3, 4 # compare r3 and r4
isellt 4, 3, 4 # r4 will be 0xFFFFFFFFFFFFFFFF
addi 6, 4, 3 # r6 will be 2
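The reasoning behind both conditions can be checked with a small C model. This is a sketch with made-up helper names: it verifies that extsw leaves a register unchanged whenever that register already holds a sign-extended halfword. Since isel only copies one of its inputs, the same holds for its result. The narrowing casts assume two's complement wrapping, as on GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* lha / extsh: sign-extend a halfword into a 64-bit register. */
static uint64_t sext16(int16_t v) {
    return (uint64_t)(int64_t)v;
}

/* extsw: sign-extend the low word of a register to 64 bits. */
static uint64_t extsw(uint64_t reg) {
    return (uint64_t)(int64_t)(int32_t)(uint32_t)reg;
}

/* Returns 1 if extsw is a no-op for every sign-extended halfword. */
static int extsw_is_noop_for_all_halfwords(void) {
    for (int32_t v = INT16_MIN; v <= INT16_MAX; ++v) {
        uint64_t reg = sext16((int16_t)v);
        if (extsw(reg) != reg) {
            return 0;
        }
    }
    return 1;
}

/* Usage example: the elimination is only safe while the upper half of
   the register really is a copy of bit 31. */
static void check_examples(void) {
    assert(extsw_is_noop_for_all_halfwords() == 1);
    assert(extsw(0x00000000FFFFFFFFull) != 0x00000000FFFFFFFFull);
}
```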
But in some special situations there is a problem when a spill happens. In example 1 (a code snippet from a function in 64-bit mode, with a spill of r4),
LHA 4, 2(2) # if the memory at address 2(r2) holds -1, r4 will be 0xFFFFFFFFFFFFFFFF
ADDI 6, 4, 3 # the content of r6 is 2 = 3 + (-1)
changes to the following code after the spill. r4 is spilled to memory (4 bytes) with STW, since LHA is a 32-bit pseudo instruction and its register is 32 bits wide:
LHA 4, 2(2) # if the memory at address 2(r2) holds -1, r4 will be 0xFFFFFFFFFFFFFFFF
STW 4, 8(1) # spill r4 to memory 8(1); LHA is a 32-bit pseudo instruction, so the register is 32 bits wide
# and 0xFFFFFFFF is stored to 8(1)
....
LWZ 4, 8(1) # reload r4 from memory; r4 will be 0x00000000FFFFFFFF
ADDI 6, 4, 3 # the content of r6 is 3 + 0x00000000FFFFFFFF, not 2 any more
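The arithmetic in example 1 can be reproduced with a short C model. It is a sketch under stated assumptions: the function names are hypothetical, and the spill slot is modeled as a plain integer of the given width. The 8-byte case corresponds to the STD/LD spill asked about earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Model of spilling a 64-bit register to a stack slot and reloading it.
   width_bytes == 4 models stw/lwz (the reload zero-extends),
   width_bytes == 8 models std/ld (all 64 bits survive). */
static uint64_t spill_reload(uint64_t reg, int width_bytes) {
    assert(width_bytes == 4 || width_bytes == 8);
    if (width_bytes == 4) {
        uint32_t slot = (uint32_t)reg;
        return (uint64_t)slot;
    }
    return reg;
}

/* addi rt, ra, imm: add a sign-extended 16-bit immediate (64-bit wraparound). */
static uint64_t addi(uint64_t ra, int16_t imm) {
    return ra + (uint64_t)(int64_t)imm;
}

/* lha, then a spill and reload of the given width, then addi with 3. */
static uint64_t lha_spill_addi3(int16_t mem, int width_bytes) {
    uint64_t r4 = (uint64_t)(int64_t)mem;  /* lha sign-extends to 64 bits */
    r4 = spill_reload(r4, width_bytes);
    return addi(r4, 3);
}

/* Usage example with the value -1 used in the thread. */
static void check_examples(void) {
    assert(lha_spill_addi3(-1, 8) == 2);                /* std/ld keeps -1 */
    assert(lha_spill_addi3(-1, 4) == 0x100000002ull);   /* stw/lwz gives 3 + 0xFFFFFFFF */
}
```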
To fix the problem, we need to promote LHA to LHA8, which uses a 64-bit register:
LHA 4, 2(2) # if the memory at address 2(r2) holds -1, r4 will be 0xFFFFFFFFFFFFFFFF
EXTSW 5, 4 # so r5 will be 0xFFFFFFFFFFFFFFFF
ADDI 6, 5, 3 # the content of r6 is 2
will be changed to
LHA8 4, 2(2) # if the memory at address 2(r2) holds -1, r4 will be 0xFFFFFFFFFFFFFFFF
ADDI 6, 4, 3 # the content of r6 is 2
If there is a spill between the LHA8 4, 2(2) and the ADDI 6, 4, 3, the code changes to the following after the spill. r4 is spilled to memory (8 bytes) with STD, since LHA8 is a 64-bit pseudo instruction and its register is 64 bits wide:
LHA8 4, 2(2) # if the memory at address 2(r2) holds -1, r4 will be 0xFFFFFFFFFFFFFFFF
STD 4, 8(1) # spill r4 to memory 8(1); LHA8 is a 64-bit instruction, so the register is 64 bits wide
# and 0xFFFFFFFFFFFFFFFF is stored to 8(1)
....
LD 4, 8(1) # reload r4 from memory; r4 will be 0xFFFFFFFFFFFFFFFF
ADDI 6, 4, 3 # the content of r6 is 2
In example 2:
LHA 4, 2(2) # if the memory at address 2(r2) holds -1, r4 will be 0xFFFFFFFFFFFFFFFF
EXTSH 3, 12 # if the content of r12 is 1
CMPW 3, 4 # compare r3 and r4
ISEL 4, 3, 4 # r4 will be 0xFFFFFFFFFFFFFFFF
EXTSW 5, 4 # r5 will be 0xFFFFFFFFFFFFFFFF
ADDI 6, 5, 3 # r6 will be 2
we do not know when a spill will happen, so all the instructions in the chain used to deduce the sign extension that allows the extsw to be eliminated need to be promoted to 64-bit pseudo instructions. We need to promote EXTSH, LHA, ISEL to EXTSH8, LHA8, ISEL8.
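As a rough illustration of the promotion step described above, the toy C model below maps each 32-bit opcode in a def chain to its 64-bit counterpart. The enum, the function names, and the flat array representation are all hypothetical and have nothing to do with the actual LLVM data structures. Only the opcode pairs (LHA/LHA8, EXTSH/EXTSH8, ISEL/ISEL8) come from the comment.

```c
#include <assert.h>
#include <stddef.h>

/* Toy opcode set, invented for this sketch only. */
typedef enum {
    OP_LHA, OP_EXTSH, OP_ISEL,     /* 32-bit register forms */
    OP_LHA8, OP_EXTSH8, OP_ISEL8,  /* 64-bit register forms */
    OP_OTHER
} Opcode;

/* Map a 32-bit form to its 64-bit counterpart; leave anything else alone. */
static Opcode promote_to_64(Opcode op) {
    switch (op) {
    case OP_LHA:   return OP_LHA8;
    case OP_EXTSH: return OP_EXTSH8;
    case OP_ISEL:  return OP_ISEL8;
    default:       return op;
    }
}

/* Promote every instruction in the chain that feeds the eliminated extsw.
   Returns the number of opcodes that were changed. */
static size_t promote_chain(Opcode *chain, size_t n) {
    size_t changed = 0;
    for (size_t i = 0; i < n; ++i) {
        Opcode promoted = promote_to_64(chain[i]);
        if (promoted != chain[i]) {
            chain[i] = promoted;
            ++changed;
        }
    }
    return changed;
}

/* Usage example: the chain from example 2 (LHA, EXTSH, ISEL). */
static void check_examples(void) {
    Opcode chain[] = { OP_LHA, OP_EXTSH, OP_ISEL };
    assert(promote_chain(chain, 3) == 3);
    assert(chain[0] == OP_LHA8 && chain[1] == OP_EXTSH8 && chain[2] == OP_ISEL8);
    assert(promote_chain(chain, 3) == 0);  /* already promoted */
}
```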
Description:
The Proof-of-Concept (PoC) code provided demonstrates an inconsistency in the computed results of an unsigned long long int variable when compiled using Clang-18 for the PowerPC64 architecture. The discrepancy is observed specifically under the optimization levels -O1 and -O2. The output of computedResultUll displays inconsistency as shown below:
Environment:
O1 and O2 optimization levels.
PoC:
Expected Behavior:
Regardless of the optimization level, the value of computedResultUll should be consistently and accurately computed as an unsigned long long int.
Observed Behavior:
When compiled with Clang-18 under -O1 and -O2 optimization levels, the computed value for computedResultUll shows inconsistency:
Analysis:
The inconsistency is identified under the following conditions:
unsigned long long int.
These conditions are specific and intricate, but the inconsistency is notable and is not attributed to Undefined Behavior.
Steps to Reproduce:
O1 and O2 optimization levels.
computedResultUll.
Conclusion:
The observed inconsistency in extending values to unsigned long long int, under specific conditions involving complex loop structures and type casting operations, when compiled using Clang-18 at -O1 and -O2 optimization levels, warrants further investigation and resolution.