llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.84k stars 11.47k forks source link

[ppc] inefficient code generated for std::max(float, float) #26819

Open weiguozhi opened 8 years ago

weiguozhi commented 8 years ago
Bugzilla Link 26445
Version trunk
OS Linux
CC @echristo,@hfinkel,@rotateright

Extended Description

Compile following code with options: $ ~/llvm/obj2/bin/clang++ --target=powerpc64le-grtev4-linux-gnu -m64 -mvsx -mcpu=power8 -O2 -c -o t9.o t9.cc -fno-unroll-loops

include

float foo(float* input, int s) { float max_value = input[0]; for (int j = 1; j <= s; ++j) max_value = std::max(max_value, input[j]);

return max_value; }

I got following code for the loop body

.LBB0_2: # %for.body

=>This Inner Loop Header: Depth=1

    lfsu 0, 4(3)
    fcmpu 0, 1, 0
    isel 5, 3, 4, 0
    lwz 5, 0(5)
    mtvsrd 34, 5
    stw 5, -12(1)
    xxsldwi 13, 34, 34, 1
    xscvspdpn 1, 13
    bdnz .LBB0_2

There are several problems in this code snippet

  1. instead of compare and choose maximum float value, the generated code uses isel to choose the address of larger value, then load it into integer register, then move it to fp register, then expand it to double type.

  2. no need to store the max value to memory.

It causes one of our internal applications more than 2x slower.

llvmbot commented 8 years ago

The approach to fix this is discussed in the mailing list http://lists.llvm.org/pipermail/llvm-dev/2016-March/096901.html. I will work on its implementation after my vacation.

llvmbot commented 8 years ago

There was an important typo in the previous comment

Incorrect: the fix does change the code pattern that we get in this case.

Correct: the fix does NOT change the code pattern that we get in this case.

llvmbot commented 8 years ago

Bitcasts introduced by instCombine are due to canonicalization of load

http://lists.llvm.org/pipermail/llvm-dev/2015-January/080956.html git commit id: b778cbc0c8f1

Hal suggested that while reducing the scope of canonicalization (so this code pattern is not canonicalized) may help PPC, it probably hurts most other platforms and so that is not recommended.

I also checked the fix in http://reviews.llvm.org/D14596#45a2ee0d for llvm/llvm-project#25716 . That is a different issue and the fix does change the code pattern that we get in this case.

The right approach for this is still under discussion.

llvmbot commented 8 years ago

InstCombine, introduces bitcasts that are not needed and prevent SROA from performing select speculation. If I disable instcombine the final code for the loop body looks like this:

1c: 10 00 00 48 b 2c <_Z3fooPfi+0x2c> 20: 90 08 00 fc fmr f0,f1 24: 20 00 40 4e bdzlr 28: 04 00 00 48 b 2c <_Z3fooPfi+0x2c> 2c: 04 00 23 c4 lfsu f1,4(r3) 30: 00 08 00 fc fcmpu cr0,f0,f1 34: ec ff 80 41 blt 20 <_Z3fooPfi+0x20> 38: 90 00 20 fc fmr f1,f0 3c: e4 ff ff 4b b 20 <_Z3fooPfi+0x20>

There is still room for improvement, but we have removed all extra load, store and direct moves.

I will focus on removing bitcasts introduced by InstCombine.

llvmbot commented 8 years ago

SROA contains code for "select speculation" which is exactly what we need to fix problem (2) that I mentioned in the beginning of comment 3. We probably need to focus on cleaning up unnecessary bitcasts before and after SROA.

llvmbot commented 8 years ago

Looking at the code after inlining I see two separate problems. These two problems, with some small changes, survive all the way to the binary code.

1- Unnecessary bitcasts. This causes load to GPR and move from GPR to VSX registers. 2- Complex data flow due to use of mem address in select. This causes additional load and store instructions.

It seems easier to start from (1). The bitcasts that we see in the code after inlining are generated earlier than that (apparently after Combine redundant instructions). The code does not really need those bitcasts and "ideally" should look like this (after inlining)

for.body: ; preds = %for.cond %idxprom = sext i32 %j.0 to i64 %arrayidx1 = getelementptr inbounds float, float %input, i64 %idxprom %5 = load float, float %max_value, align 4, !tbaa !​1 %6 = load float, float %arrayidx1, align 4, !tbaa !​1 %cmp.i = fcmp olt float %5, %6 %b.a.i = select i1 %cmp.i, float %arrayidx1, float %max_value %7 = load float, float %b.a.i, align 4, !tbaa !​1 store float %7, float* %max_value, align 4, !tbaa !​1 %inc = add nsw i32 %j.0, 1 br label %for.cond

So this may seem to suggest that we need to look at "Combine redundant instructions" and fix the issue. But if I manually change the IR and remove the bitcasts, and pass it through opt again, this time SROA generates some other bitcasts. So fixing the issue in "Combine redundant instructions" is probably not enough. At this point I am not sure which approach is the best to address this issue:

1- Fixing each transformation to avoid generating bitcasts. 2- Having a cleanup pass to remove bitcasts that are not needed. 3- Leaving the bitcasts and handle them properly during code gen. 4- Some other approach?

weiguozhi commented 8 years ago

The implementation of std::max is

template inline const _Tp& max(const _Tp& __a, const _Tp& b) {
if (
a < b) return b; return __a; }

In LLVM IR it is

define linkonce_odr nonnull dereferenceable(4) float @​_ZSt3maxIfERKT_S2S2(float readonly dereferenceable(4) %a, float* readonly dereferenceable(4) %b) #​2 comdat { entry: %0 = load float, float %__a, align 4, !tbaa !​1 %1 = load float, float %b, align 4, !tbaa !​1 %cmp = fcmp olt float %0, %1 %b.a = select i1 %cmp, float* %b, float %__a ret float %b.a }

And the loop body of the test case is:

for.body: ; preds = %for.cond %idxprom = sext i32 %j.0 to i64 %arrayidx1 = getelementptr inbounds float, float %input, i64 %idxprom %call = call dereferenceable(4) float @​_ZSt3maxIfERKT_S2S2(float nonnull dereferenceable(4) %max_value, float dereferenceable(4) %arrayidx1) %5 = bitcast float %call to i32 %6 = load i32, i32 %5, align 4, !tbaa !​1 %7 = bitcast float %max_value to i32 store i32 %6, i32 %7, align 4, !tbaa !​1 %inc = add nsw i32 %j.0, 1 br label %for.cond

After inlining, the loop body is changed to:

for.body: ; preds = %for.cond %idxprom = sext i32 %j.0 to i64 %arrayidx1 = getelementptr inbounds float, float %input, i64 %idxprom %5 = load float, float %max_value, align 4, !tbaa !​1 %6 = load float, float %arrayidx1, align 4, !tbaa !​1 %cmp.i = fcmp olt float %5, %6 %b.a.i = select i1 %cmp.i, float %arrayidx1, float %max_value %7 = bitcast float %b.a.i to i32 %8 = load i32, i32 %7, align 4, !tbaa !​1 %9 = bitcast float %max_value to i32 store i32 %8, i32* %9, align 4, !tbaa !​1 %inc = add nsw i32 %j.0, 1 br label %for.cond

And later LLVM failed to find out that value %8 is actually one of %5 or %6.

Any suggestion on how fix it?

weiguozhi commented 8 years ago

This should be a mid end problem, x86 back end generates similar inefficient code.

.LBB0_2: # %for.body

=>This Inner Loop Header: Depth=1

    movss   (%rdi), %xmm1           # xmm1 = mem[0],zero,zero,zero
    ucomiss %xmm0, %xmm1
    movq    %rdi, %rcx
    ja      .LBB0_4

BB#3: # %select.false

                                    #   in Loop: Header=BB0_2 Depth=1
    movq    %rax, %rcx

.LBB0_4: # %select.end

in Loop: Header=BB0_2 Depth=1

    movl    (%rcx), %ecx
    movl    %ecx, -4(%rsp)
    movd    %ecx, %xmm0
    addq    $4, %rdi
    decl    %esi
    jne     .LBB0_2
llvmbot commented 2 months ago

@llvm/issue-subscribers-backend-powerpc

Author: None (weiguozhi)

| | | | --- | --- | | Bugzilla Link | [26445](https://llvm.org/bz26445) | | Version | trunk | | OS | Linux | | CC | @echristo,@hfinkel,@rotateright | ## Extended Description Compile following code with options: $ ~/llvm/obj2/bin/clang++ --target=powerpc64le-grtev4-linux-gnu -m64 -mvsx -mcpu=power8 -O2 -c -o t9.o t9.cc -fno-unroll-loops #include <algorithm> float foo(float* input, int s) { float max_value = input[0]; for (int j = 1; j <= s; ++j) max_value = std::max(max_value, input[j]); return max_value; } I got following code for the loop body .LBB0_2: # %for.body # =>This Inner Loop Header: Depth=1 lfsu 0, 4(3) fcmpu 0, 1, 0 isel 5, 3, 4, 0 lwz 5, 0(5) mtvsrd 34, 5 stw 5, -12(1) xxsldwi 13, 34, 34, 1 xscvspdpn 1, 13 bdnz .LBB0_2 There are several problems in this code snippet 1. instead of compare and choose maximum float value, the generated code uses isel to choose the address of larger value, then load it into integer register, then move it to fp register, then expand it to double type. 2. no need to store the max value to memory. It causes one of our internal applications more than 2x slower.