llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.03k stars 11.96k forks source link

[X86] `sub`+`cmp`+`jmp` could be combined into `sub`+`jmp` #58636

Open chfast opened 2 years ago

chfast commented 2 years ago
void work(int);

void test(int g)
{
    g -= 3;
    if (g < 0) [[unlikely]]
        return;
    work(g);
}

produces

test(int):
        cmp     edi, 3
        jl      .LBB0_1
        add     edi, -3
        jmp     work(int)
.LBB0_1:
        ret

but the better would be (as done by GCC)

test(int):
        sub     edi, 3
        js      .L1
        jmp     work(int)
.L1:
        ret

https://godbolt.org/z/MjWe91cxc

llvmbot commented 2 years ago

@llvm/issue-subscribers-backend-x86

davidbolvansky commented 2 years ago

Regressed since llvm 10.

chfast commented 2 years ago

Workaround:

void workaround(int g)
{
    unsigned h;
    bool o = __builtin_usub_overflow(g, 3, &h);
    if (o) [[unlikely]]
        return;
    work(h);
}

https://godbolt.org/z/MjWe91cxc

e-kud commented 1 year ago

As @davidbolvansky noticed it is a regression since llvm10. I bisected it to commit https://github.com/llvm/llvm-project/commit/0f22e783a038b6983f0fe161eef6cf2add3a4156

Apparently this change has some mystery.

As i have briefly discusses in IRC with Tim, the original motivation can no longer be recovered, too much time has passed.

However i believe that the original fold was doing the right thing, we should be performing such a transformation even if the inner add will not go away - that will still unchain the comparison from add, it will no longer need to wait for add to compute.

This change combines cmp + add even if add has multiple users.

; Function Attrs: mustprogress uwtable
define dso_local void @_Z4testi(i32 noundef %0) local_unnamed_addr #0 {
  %2 = add nsw i32 %0, 3
  %3 = icmp slt i32 %2, 0
  %4 = zext i1 %3 to i64
  br i1 %3, label %6, label %5, !prof !6

5:                                                ; preds = %1
  call void @_Z4worki(i32 noundef %2)
  br label %6

6:                                                ; preds = %1, %5
  ret void
}
*** IR Dump After InstCombinePass on _Z4testi ***
; Function Attrs: mustprogress uwtable
define dso_local void @_Z4testi(i32 noundef %0) local_unnamed_addr #0 {
  %2 = icmp slt i32 %0, -3
  br i1 %2, label %5, label %3, !prof !6

3:                                                ; preds = %1
  %4 = add nsw i32 %0, 3
  call void @_Z4worki(i32 noundef %4)
  br label %5

5:                                                ; preds = %1, %3
  ret void
}

Then icmp and add are lowered into separate cmp edi, 3 and add edi, -3 as it is shown in the subject. It seems to me that codegen is too late to fix it – the original add has already been sunk into if block and we need to do some hoisting. That's why I decided to address to the original patch.

I think it works good for the cases as shown below. We have a comparison with a value and there is no much difference with which value we compare 100 or 92.

%tmp4.7 = add nuw nsw i64 %tmp3, 8
%tmp6.7 = icmp ult i64 %tmp4.7, 100

In case of comparison with non-zero, we impose extra arithmetic. However, many targets have an ability to compare effectively with zero after math operations. I propose to limit the original patch to comparisons with non-zero values. D144764.

It changes X86 as follows (even better than GCC):

# Before D144764
test(int):                               # @test(int)
        cmp     edi, -3
        jl      .LBB0_1
        add     edi, 3
        jmp     work(int)@PLT                    # TAILCALL
.LBB0_1:
        ret

# After D144764
test(int):                               # @test(int)
        addl    $-3, %edi                                  
        jns     work(int)@PLT                     # TAILCALL
        retq                           

For AArch64 we need to change g -= 3 -> g += 3 to see the effect.

// Before D144764
test(int):                               // @test(int)
        cmn     w0, #3
        b.lt    .LBB0_2
        add     w0, w0, #3
        b       work(int)
.LBB0_2:
        ret

// After D144764
test(int):                               // @test(int)
        adds    w0, w0, #3                         
        b.mi    .LBB0_2                            
        b       _Z4worki                           
.LBB0_2:
        ret                                        

@LebedevRI what do you think?

nikic commented 1 year ago

There already is a backend undo fold for this at https://github.com/llvm/llvm-project/blob/7910ed1d56c349b76c82d5ebe2f2590770955ff5/llvm/lib/CodeGen/CodeGenPrepare.cpp#L7956

For add, it's currently limited to equality comparisons though, so it doesn't handle this precise pattern yet.

chfast commented 1 year ago

@LebedevRI any update here? I can try to address in CodeGenPrepare as pointed by @nikic but there is also a fix draft by @e-kud in D144764.

chfast commented 1 year ago

@LebedevRI any update here? I can try to address in CodeGenPrepare as pointed by @nikic but there is also a fix draft by @e-kud in D144764.

nikic commented 1 year ago

If the backend undo fold can recover this, we'll want to do that. If not, we can start considering InstCombine limitations.

e-kud commented 1 year ago

Sorry, I haven't thought that there is still the interest to it.

I've made the change suggested by @nikic. It is here https://reviews.llvm.org/D145454. It works fine but it revealed a problem (my previous approach has the same problem, but it is hidden from tests).

Consider the following code.

void decref(long* refcount) {
   long count = *refcount;
   if (count == 1) { free_object() }
   else if (count > 1) { *refcount = count - 1; }
   else { /* immortal */ }
}

Here we can make only one comparison of count against 1 and use it for both branches. This optimization is implemented.

When we move to the usage of flags from add or sub we have the following situation:

    decl %eax
    je .LBB0_3
# %bb.1: # %bb2
    testl %eax, %eax
    jle .LBB0_4

And generally we can't eliminate test because of decl because test resets CF flag in contrast to add. The bummer here is that test appeared from CMP, 0, but what if it isn't?

I tried to use hasNoSignedWrap predicate but it is not propagated to CodeGen.

This is where I stuck, any ideas?