llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.83k stars 11.46k forks source link

Arm regression for three way min/max #35223

Open 4c958a03-4ad7-407c-a85b-320d9a7221b0 opened 6 years ago

4c958a03-4ad7-407c-a85b-320d9a7221b0 commented 6 years ago
Bugzilla Link 35875
Version trunk
OS Linux
Attachments Reproducer
CC @topperc,@DimitryAndric,@RKSimon,@rotateright

Extended Description

This is llvm/llvm-project#35065 , part 2.

The regression is fixed for x86, we just need to do the same for Arm.

This is a piece of code that at it's heart does a three way min (also attached):

target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64" target triple = "thumbv8m.base-arm-none-eabi" define void @​test(i8 nocapture readonly, i8 nocapture) { %3 = getelementptr inbounds i8, i8 %0, i32 1 %4 = load i8, i8 %0, align 1 %5 = getelementptr inbounds i8, i8 %0, i32 2 %6 = load i8, i8 %3, align 1 %7 = load i8, i8 %5, align 1 %8 = xor i8 %4, -1 %9 = xor i8 %6, -1 %10 = xor i8 %7, -1 %11 = zext i8 %8 to i32 %12 = zext i8 %9 to i32 %13 = icmp ult i8 %6, %4 %14 = zext i8 %10 to i32 %15 = icmp ult i32 %11, %14 %16 = select i1 %15, i32 %11, i32 %14 %17 = icmp ult i32 %12, %14 %18 = select i1 %17, i32 %12, i32 %14 %19 = select i1 %13, i32 %16, i32 %18 %20 = trunc i32 %19 to i8 %21 = sub nsw i32 %11, %19 %22 = trunc i32 %21 to i8 %23 = sub nsw i32 %12, %19 %24 = trunc i32 %23 to i8 %25 = sub nsw i32 %14, %19 %26 = trunc i32 %25 to i8 %27 = getelementptr inbounds i8, i8 %1, i32 1 store i8 %20, i8 %1, align 1 %28 = getelementptr inbounds i8, i8 %1, i32 2 store i8 %22, i8 %27, align 1 %29 = getelementptr inbounds i8, i8 %1, i32 3 store i8 %24, i8 %28, align 1 store i8 %26, i8 %29, align 1 ret void }

Because i8 isn't a legal type we don't remove the zext/trunc's. And because the negated values are also stored we do not fold the 3 selects into 2.

rotateright commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#36036

rotateright commented 6 years ago

Let's test the limits of instcombine theory :) https://reviews.llvm.org/D44266

rotateright commented 6 years ago

Nice idea! Any plan for how we would get out of the local minimum here? With all these multiple uses?

It's a struggle. If there's no way to do it without tracking the min/max users to know that we can kill off all of the 'not' instructions, then this might be a job for -aggressive-instcombine.

4c958a03-4ad7-407c-a85b-320d9a7221b0 commented 6 years ago

Nice idea! Any plan for how we would get out of the local minimum here? With all these multiple uses?

rotateright commented 6 years ago

The canonical IR that we end up with if we can do this will have only 1 'not':

define void @​max(i8 %ld0, i8 %st0) { %ld1 = getelementptr inbounds i8, i8 %ld0, i64 1 %ld2 = getelementptr inbounds i8, i8 %ld0, i64 2 %x = load i8, i8 %ld0, align 1 %y = load i8, i8 %ld1, align 1 %z = load i8, i8* %ld2, align 1

; invert mins to maxs

%1 = icmp ugt i8 %x, %z %2 = select i1 %1, i8 %x, i8 %z %3 = icmp ugt i8 %2, %y %4 = select i1 %3, i8 %2, i8 %y

; the min is still used, so we need to generate it via 'not' %minxyz = xor i8 %4, -1

; reverse subtractions %xmin = sub i8 %4, %x %ymin = sub i8 %4, %y %zmin = sub i8 %4, %z

%st1 = getelementptr inbounds i8, i8 %st0, i64 1 %st2 = getelementptr inbounds i8, i8 %st0, i64 2 %st3 = getelementptr inbounds i8, i8 %st0, i64 3 store i8 %minxyz, i8 %st0, align 1 store i8 %xmin, i8 %st1, align 1 store i8 %ymin, i8 %st2, align 1 store i8 %zmin, i8* %st3, align 1 ret void }

rotateright commented 6 years ago

The min/max corollary of that: https://rise4fun.com/Alive/lOL

In the most basic case, that's not min/max dependent. It's just this: https://rise4fun.com/Alive/m3q

rotateright commented 6 years ago

I thought we had taken IR canonicalization as far as we could, but we missed one: https://reviews.llvm.org/rL326660

The min/max corollary of that: https://rise4fun.com/Alive/lOL

Justifying that with the multi-uses in this example may be a challenge, but I think we can do it...

rotateright commented 6 years ago

Here's where we are in optimized IR after the previous improvements:

define void @​min(i8 %ld0, i8 %st0) { %ld1 = getelementptr inbounds i8, i8 %ld0, i64 1 %ld2 = getelementptr inbounds i8, i8 %ld0, i64 2

%x = load i8, i8 %ld0, align 1 %y = load i8, i8 %ld1, align 1 %z = load i8, i8* %ld2, align 1

%xn = xor i8 %x, -1 %yn = xor i8 %y, -1 %zn = xor i8 %z, -1 %cmpxz = icmp ult i8 %xn, %zn %minxz = select i1 %cmpxz, i8 %xn, i8 %zn %cmpxyz = icmp ult i8 %minxz, %yn %minxyz = select i1 %cmpxyz, i8 %minxz, i8 %yn %xmin = sub i8 %xn, %minxyz %ymin = sub i8 %yn, %minxyz %zmin = sub i8 %zn, %minxyz

%st1 = getelementptr inbounds i8, i8 %st0, i64 1 %st2 = getelementptr inbounds i8, i8 %st0, i64 2 %st3 = getelementptr inbounds i8, i8* %st0, i64 3

store i8 %minxyz, i8 %st0, align 1 store i8 %xmin, i8 %st1, align 1 store i8 %ymin, i8 %st2, align 1 store i8 %zmin, i8 %st3, align 1 ret void }

rotateright commented 6 years ago

Here's the value tracking enhancement: https://reviews.llvm.org/rL322283

rotateright commented 6 years ago

Getting the 3 selects down to 2 should happen with the value tracking enhancement that I was originally planning in bug 35834.

We managed to grind the simpler case down without needing it, but this example with extra uses shows the shortcoming in not recognizing min/max in that form:

%nx = xor i8 %x, -1 %ny = xor i8 %y, -1 %nz = xor i8 %z, -1 %cmpxz = icmp ult i8 %nx, %nz %minxz = select i1 %cmpxz, i8 %nx, i8 %nz %cmpyz = icmp ult i8 %ny, %nz %minyz = select i1 %cmpyz, i8 %ny, i8 %nz %cmpyx = icmp ult i8 %y, %x %r = select i1 %cmpyx, i8 %minxz, i8 %minyz => %cmpxyz = icmp ult i8 %minxz, %ny %r = select i1 %cmpxyz, i8 %minxz, i8 %ny

https://rise4fun.com/Alive/uHV

rotateright commented 6 years ago

As noted in bug 35717, there's a question about shouldChangeType(): should we make an exception for i8/i16 because those are common types in source code?

There's a related question in bug 35792, comment 10.

We can reduce a test for the narrowing problem down to:

define i8 @​narrow_sub(i8 zeroext %x, i8 zeroext %y) { %x32 = zext i8 %x to i32 %y32 = zext i8 %y to i32 %sub = sub nsw i32 %x32, %y32 %tr = trunc i32 %sub to i8 ret i8 %tr }


If i8 is not legal via the datalayout, then instcombine does nothing with that.

But if we have legal i8 as on x86:

$ ./opt -instcombine -data-layout="n8:32" 35875.ll -S define zeroext i8 @​narrow_sub(i8 zeroext %x, i8 zeroext %y) { %sub = sub i8 %x, %y ret i8 %sub }

...we narrow the binop.

The asm output for the specified thumb target looks the same either way if we have 'zeroext' on the params and return value:

$ ./opt -instcombine -data-layout="n8:32" 35875.ll -S | ./llc -o - -mtriple=thumbv8m.base-arm-none-eabi

subs    r0, r0, r1
uxtb    r0, r0
bx  lr

If we don't have (the expected?) 'zeroext' on the params and return value, the IR with 32-bit sub has extra instructions to extend the params: uxtb r1, r1 uxtb r0, r0 subs r0, r0, r1 bx lr