ml31415 opened 8 years ago
As @stuartarchibald pointed out in https://github.com/numba/numba/issues/2176#issuecomment-268667308, the slowdown is caused by PR #2050, which adds a common control-flow block to the end of the loop body.
This transforms the function nanmin_numbagg_1dim
from:
into:
(note: the new L75 block holds the decref calls)
However, the LLVM optimizer (even at O1) will split the loop into two, causing the slowdown.
In effect, the transformation is similar to:
import numba as nb
import numpy as np

@nb.njit
def nanmin_numbagg_1dim_pathological(a):
    amin = np.infty
    all_missing = 1
    for ai in a.flat:
        if ai <= amin:
            amin = ai
            all_missing = 0
        # dummy code; should have no side-effect
        all_missing
    if all_missing:
        amin = np.nan
    return amin
This also causes the loop-splitting and slowdown on the commit just before PR #2050.
I wonder if this is a pathological case for the LLVM optimizer.
I can replicate the problem in C with clang. See https://gist.github.com/sklam/11f11a410258ca191e6f263262a4ea65.
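For reference, a minimal sketch of the kind of C reproducer in that gist (the gist itself is authoritative; the function name here is mine and the exact gist code may differ): the same nanmin reduction, written so that both sides of the comparison fall through into a single shared block at the end of the loop body, which is the shape that PR #2050 introduces for the decref calls.

#include <math.h>

/* Sketch of the reproducer: same reduction as nanmin_numbagg_1dim,
   with a shared tail block at the bottom of the loop body. */
double nanmin_shared_tail(const double *arr, int size)
{
    double amin = INFINITY;
    int all_missing = 1;
    for (int i = 0; i < size; i++) {
        double ai = arr[i];
        if (ai <= amin) {
            amin = ai;
            all_missing = 0;
        }
        /* shared tail: empty here; in the Numba-generated IR this is
           where the decref calls are placed (the L75 block mentioned above). */
    }
    if (all_missing)
        amin = NAN;
    return amin;
}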
I have further narrowed down the problem to multiple calls to LLVM's simplifycfg pass. I am asking on the llvm-dev mailing list for the experts to take a look.
For reference, this is the mailing list entry:
https://groups.google.com/forum/#!topic/llvm-dev/7HnK9ehPzKc
It's suggested there to file a bug report, which I couldn't find. Though I may also have gotten the wrong keywords. @sklam, is this bug report already filed somewhere, as suggested on the mailing list? Otherwise I'd be fine with filing the bug report myself, copying your mailing list entry. Not sure if I could contribute something else to this issue.
This bug seems related:
@ml31415 , sorry, I keep getting distracted by other things and it has dropped off the radar. I still need to get a login for their bug tracker. If you file the issue, can you point me to it so I can comment on it?
Ok, I suppose you didn't have a look at the related bug I had posted then. I can't tell for sure due to a lack of assembler experience, but to me it looks quite like the same bug we found. The problem there is also about extra loads within loops since LLVM 3.9. If you agree, I would rather add your findings there instead of opening a duplicate.
Description Brian Rzycki 2016-09-19 17:33:29 PDT
The following commit causes an extra load inside a loop.
commit 808cdf24ff6941f8a2179abecb5c7e80a758a04a
Author: James Molloy <james.molloy@arm.com>
Date: Sun Sep 11 09:00:03 2016 +0000
[SimplifyCFG] Be even more conservative in SinkThenElseCodeToEnd
This should *actually* fix PR30244. This cranks up the workaround for PR30188 so that we never sink loads or stores of allocas.
The idea is that these should be removed by SROA/Mem2Reg, and any movement of them may well confuse SROA or just cause unwanted code churn. It's not ideal that the midend should be crippled like this, but that unwanted churn can really cause significant regressions in important workloads (tsan).
$ cat foo.cpp
#include <map>
int test(unsigned *keys, std::map<int, int> &m_map)
{
    int i, last_index, sane=0;
    for (i=0, last_index = 0; i<100; i++)
    {
        auto it = m_map.find(keys[last_index++]);
        if (it != m_map.end())
            sane += it->second;
    }
    return sane;
}
$ clang++ foo.cpp -O3 -S -o out.s
--- good.s 2016-09-19 17:25:03.708062780 -0500
+++ bad.s 2016-09-19 17:25:26.584666253 -0500
@@ -6,7 +6,7 @@
_Z4testPjRNSt3__13mapIiiNS0_4lessIiEENS0_9allocatorINS0_4pairIKiiEEEEEE: // @_Z4testPjRNSt3__13mapIiiNS0_4lessIiEENS0_9allocatorINS0_4pairIKiiEEEEEE
// BB#0: // %entry
ldr x9, [x1, #8]!
- cbz x9, .LBB0_9
+ cbz x9, .LBB0_11
// BB#1: // %for.body.preheader
mov x10, xzr
mov w8, wzr
@@ -14,40 +14,46 @@
// =>This Loop Header: Depth=1
// Child Loop BB0_3 Depth 2
ldr w12, [x0, x10, lsl #2]
+ add x10, x10, #1 // =1
mov x11, x1
mov x13, x9
.LBB0_3: // %while.body.i.i.i
// Parent Loop BB0_2 Depth=1
// => This Inner Loop Header: Depth=2
ldr w14, [x13, #28]
- add x15, x13, #8 // =8
cmp w14, w12
- csel x11, x11, x13, lt
- csel x13, x15, x13, lt
+ b.ge .LBB0_5
+// BB#4: // %if.else.i.i.i
+ // in Loop: Header=BB0_3 Depth=2
+ ldr x13, [x13, #8]
+ cbnz x13, .LBB0_3
+ b .LBB0_6
+.LBB0_5: // %if.then.i.i.i
+ // in Loop: Header=BB0_3 Depth=2
+ mov x11, x13
ldr x13, [x13]
cbnz x13, .LBB0_3
@ml31415, they seem like related bugs but the reason for the slowness could be different, probably due to architecture different (AArch64 vs AMD64). In our case, we are not seeing extra memory load. I suspect it is due to the inefficiency in the extra select-instruction.
The slow function (see my comments added below label LBB0_2):
_apple: ## @apple
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp0:
.cfi_def_cfa_offset 16
Ltmp1:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp2:
.cfi_def_cfa_register %rbp
movsd LCPI0_0(%rip), %xmm0 ## xmm0 = mem[0],zero
movl $1, %eax
xorl %ecx, %ecx
xorl %edx, %edx
jmp LBB0_1
.p2align 4, 0x90
LBB0_2: ## in Loop: Header=BB0_1 Depth=1
movsd (%rdi,%rdx,8), %xmm1 ## xmm1 = mem[0],zero
####################################
## I think the below corresponds to the select
ucomisd %xmm1, %xmm0
cmovael %ecx, %eax
movapd %xmm1, %xmm2
cmplesd %xmm0, %xmm2
andpd %xmm2, %xmm1
andnpd %xmm0, %xmm2
orpd %xmm1, %xmm2
incq %rdx
movapd %xmm2, %xmm0
LBB0_1: ## =>This Inner Loop Header: Depth=1
cmpl %esi, %edx
jl LBB0_2
## BB#3:
testl %eax, %eax
je LBB0_5
## BB#4:
movsd LCPI0_1(%rip), %xmm0 ## xmm0 = mem[0],zero
LBB0_5:
popq %rbp
retq
.cfi_endproc
The fast function:
_orange: ## @orange
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp3:
.cfi_def_cfa_offset 16
Ltmp4:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp5:
.cfi_def_cfa_register %rbp
movsd LCPI1_0(%rip), %xmm0 ## xmm0 = mem[0],zero
xorl %eax, %eax
movl $1, %ecx
jmp LBB1_1
.p2align 4, 0x90
LBB1_4: ## in Loop: Header=BB1_1 Depth=1
xorl %ecx, %ecx
movapd %xmm1, %xmm0
LBB1_1: ## %.outer
## =>This Loop Header: Depth=1
## Child Loop BB1_2 Depth 2
cltq
.p2align 4, 0x90
LBB1_2: ## Parent Loop BB1_1 Depth=1
## => This Inner Loop Header: Depth=2
cmpl %esi, %eax
jge LBB1_5
## BB#3: ## in Loop: Header=BB1_2 Depth=2
movsd (%rdi,%rax,8), %xmm1 ## xmm1 = mem[0],zero
incq %rax
ucomisd %xmm1, %xmm0
jb LBB1_2
jmp LBB1_4
LBB1_5:
testl %ecx, %ecx
je LBB1_7
## BB#6:
movsd LCPI1_1(%rip), %xmm0 ## xmm0 = mem[0],zero
LBB1_7:
popq %rbp
retq
.cfi_endproc
I had posted a comment at the LLVM bug tracker some hours ago. I suggest we continue the discussion here, at least for the LLVM parts of the issue:
Thanks! I will ask some SSE experts to look at the generated code.
Btw, I cannot chime in yet because I am still waiting for an account to the bug tracker.
Yeah, I remember that took a while. If I should post something there on your behalf, let me know!
From hiraditya in the LLVM bugtracker:
It seems that the patch (https://github.com/numba/numba/commit/e03a4170fdc59a87561394ccdfa0f4abfa7ec1ac) which canonicalizes the backedge was added in numba and caused the regression. Now it makes sense: looking at the IR of the apple function, it creates bad code because of the structure of the control flow graph. The patch is pessimizing the code. I would suggest reverting the patch if that is possible. From my analysis, getting rid of the selects and enabling proper vectorization would require undoing the canonicalization of that patch in the compiler, i.e., splitting the back-edge into multiple back-edges (which is fairly complicated to do in the compiler).
I am aware of the canonicalization triggering the performance regression. It is not easily revertible because it fixes a bug. Also, removing the canonicalization may improve the nanmin code, but not every case; the LLVM opt passes may canonicalize the backedges anyway. For example, if I modify the apple function to:
#include <math.h>

double apple(double *arr, int size) {
    double amin = INFINITY;
    int all_missing = 1;
    int i;
    double ai;
    for (i=0; i<size;) {
        ai = arr[i];
        if ( ai <= amin ) {
            amin = ai;
            all_missing = 0;
            ++i;
            continue; // backedge
        }else{
            ++i;
            continue; // backedge
        }
        break; // unreachable
    }
    if (all_missing) {
        amin = NAN;
    }
    return amin;
}
so that it has multiple backedges (use clang -emit-llvm -S nanmin.ll; opt -view-cfg nanmin.ll to verify), the speed regression still occurs at the 2nd application of the simplifycfg pass in the pass sequence -simplifycfg -sroa -simplifycfg. The 2nd simplifycfg merges the backedges.
I have also noticed that inserting a loop-simplify pass before the 2nd simplifycfg allows it to produce fast code. Try it with opt -simplifycfg -sroa -loop-simplify -simplifycfg nanmin.ll -o nanmin.o. At this point, the cfg looks like:
I suggest we open a new bug report at LLVM then, once your account is activated.
Yes, thanks for your help @ml31415 .
Nothing to thank me for, I was the guy with the issue after all, so big thanks for digging into this!
So, how to proceed? Is there something that can be done on the Numba side? As I understand the situation, it's about a bunch of optimizations that make sense in some cases but cause regressions in others, and there seems to be no simple way to separate those cases. Can that separation be improved? Would some heuristics make sense? Does this have to happen on the LLVM side, or maybe also on the Numba side? Is there some way Numba could provide extra information or optimization flags to achieve better optimized compilation?
In the long term, this has to be an LLVM fix; it is affecting clang performance as well.
In the short term, we can look at workarounds. We will have to use our benchmarks to drive this. Perhaps we could pick our own sets of optimization passes instead of using the equivalents of O2 and O3 in opt, or add some heuristics to alter the canonicalization of backedges. It'll be tricky though.
I've filed a LLVM bug: https://bugs.llvm.org/show_bug.cgi?id=32022
A short update on this: With LLVM 4.0 and numba 0.34, the problem still persists.
0.34.0
10000 loops, best of 3: 190 µs per loop
Still happens with 0.37.0 just prior to the 0.38.0 RC; LLVM 6 is now in place.
Unfortunately, no one at LLVM really seems to care :(
We are now at Numba 0.51.2 and llvmlite 0.34.0 with LLVM 10. Perhaps it is worth checking if these issues remain?
It seems this issue is solved in newer versions of LLVM. The apple and orange examples take roughly the same time to finish on my machine.
$ ./main
ra = -0.123213 | rb = -0.123213
apple 0.076935
orange 0.077305
Also, the Python script sent by @ml31415 runs faster on Numba 0.55:
import numpy as np
import timeit
import numba

print('Numba version: {0}'.format(numba.__version__))

def measure(func, *args):
    def setup():
        return func(*args)
    t = timeit.Timer(setup=setup)
    name = func.__name__
    _time = min(t.repeat(repeat=100, number=10))
    print('{0} took {1}'.format(name, _time))

def nanmin_numbagg_1dim(a):
    amin = np.infty
    all_missing = 1
    for ai in a.flat:
        if ai <= amin:
            amin = ai
            all_missing = 0
    if all_missing:
        amin = np.nan
    return amin

x = np.random.random(100000)
x[x>0.7] = np.nan
measure(nanmin_numbagg_1dim, x)
Numba version: 0.28.1
nanmin_numbagg_1dim took 4.0099985199049115e-07
Numba version: 0.37.0+653.g5168891b9
nanmin_numbagg_1dim took 2.8999966161791235e-07
Numba version: 0.55.1
nanmin_numbagg_1dim took 2.6100042305188254e-07
@guilhermeleobas Oh, that's good news! While you are at it, can you post the assembly code so we can compare it with the previous ones?
From the apple and orange example, or the Python function?
edit: here is the assembly code for the apple and orange example (Compiler Explorer)
.text
.file "b.cc"
.section .rodata.cst8,"aM",@progbits,8
.p2align 3 # -- Begin function _Z5applePdi
.LCPI0_0:
.quad 9221120237041090560 # double NaN
.LCPI0_1:
.quad 9218868437227405312 # double +Inf
.text
.globl _Z5applePdi
.p2align 4, 0x90
.type _Z5applePdi,@function
_Z5applePdi: # @_Z5applePdi
.cfi_startproc
# %bb.0:
testl %esi, %esi
jle .LBB0_9
# %bb.1:
movl %esi, %edx
leaq -1(%rdx), %rcx
movl %edx, %r8d
andl $3, %r8d
xorl %eax, %eax
cmpq $3, %rcx
jae .LBB0_3
# %bb.2:
movl $1, %ecx
movsd .LCPI0_1(%rip), %xmm1 # xmm1 = mem[0],zero
xorl %esi, %esi
jmp .LBB0_5
.LBB0_3:
subq %r8, %rdx
movl $1, %ecx
movsd .LCPI0_1(%rip), %xmm1 # xmm1 = mem[0],zero
xorl %esi, %esi
.p2align 4, 0x90
.LBB0_4: # =>This Inner Loop Header: Depth=1
movsd (%rdi,%rsi,8), %xmm0 # xmm0 = mem[0],zero
movsd 8(%rdi,%rsi,8), %xmm2 # xmm2 = mem[0],zero
ucomisd %xmm0, %xmm1
movapd %xmm0, %xmm3
cmpnlesd %xmm1, %xmm3
movapd %xmm3, %xmm4
andnpd %xmm0, %xmm4
andpd %xmm1, %xmm3
orpd %xmm4, %xmm3
cmovael %eax, %ecx
ucomisd %xmm2, %xmm3
movapd %xmm2, %xmm0
cmpnlesd %xmm3, %xmm0
movapd %xmm0, %xmm1
andpd %xmm3, %xmm1
andnpd %xmm2, %xmm0
orpd %xmm1, %xmm0
cmovael %eax, %ecx
movsd 16(%rdi,%rsi,8), %xmm1 # xmm1 = mem[0],zero
ucomisd %xmm1, %xmm0
movapd %xmm1, %xmm2
cmpnlesd %xmm0, %xmm2
movapd %xmm2, %xmm3
andpd %xmm0, %xmm3
andnpd %xmm1, %xmm2
orpd %xmm3, %xmm2
movsd 24(%rdi,%rsi,8), %xmm0 # xmm0 = mem[0],zero
cmovael %eax, %ecx
ucomisd %xmm0, %xmm2
cmovael %eax, %ecx
movapd %xmm0, %xmm1
cmpnlesd %xmm2, %xmm1
movapd %xmm1, %xmm3
andpd %xmm2, %xmm3
andnpd %xmm0, %xmm1
orpd %xmm3, %xmm1
addq $4, %rsi
cmpq %rsi, %rdx
jne .LBB0_4
.LBB0_5:
movapd %xmm1, %xmm0
testq %r8, %r8
je .LBB0_8
# %bb.6:
leaq (%rdi,%rsi,8), %rax
xorl %edx, %edx
xorl %esi, %esi
.p2align 4, 0x90
.LBB0_7: # =>This Inner Loop Header: Depth=1
movsd (%rax,%rsi,8), %xmm2 # xmm2 = mem[0],zero
ucomisd %xmm2, %xmm1
cmovael %edx, %ecx
movapd %xmm2, %xmm0
cmpnlesd %xmm1, %xmm0
movapd %xmm0, %xmm3
andnpd %xmm2, %xmm3
andpd %xmm1, %xmm0
orpd %xmm3, %xmm0
addq $1, %rsi
movapd %xmm0, %xmm1
cmpq %rsi, %r8
jne .LBB0_7
.LBB0_8:
testl %ecx, %ecx
je .LBB0_10
.LBB0_9:
movsd .LCPI0_0(%rip), %xmm0 # xmm0 = mem[0],zero
.LBB0_10:
retq
.Lfunc_end0:
.size _Z5applePdi, .Lfunc_end0-_Z5applePdi
.cfi_endproc
# -- End function
.section .rodata.cst8,"aM",@progbits,8
.p2align 3 # -- Begin function _Z6orangePdi
.LCPI1_0:
.quad 9221120237041090560 # double NaN
.LCPI1_1:
.quad 9218868437227405312 # double +Inf
.text
.globl _Z6orangePdi
.p2align 4, 0x90
.type _Z6orangePdi,@function
_Z6orangePdi: # @_Z6orangePdi
.cfi_startproc
# %bb.0:
testl %esi, %esi
jle .LBB1_9
# %bb.1:
movl %esi, %edx
leaq -1(%rdx), %rcx
movl %edx, %r8d
andl $3, %r8d
xorl %eax, %eax
cmpq $3, %rcx
jae .LBB1_3
# %bb.2:
movl $1, %ecx
movsd .LCPI1_1(%rip), %xmm1 # xmm1 = mem[0],zero
xorl %esi, %esi
jmp .LBB1_5
.LBB1_3:
subq %r8, %rdx
movl $1, %ecx
movsd .LCPI1_1(%rip), %xmm1 # xmm1 = mem[0],zero
xorl %esi, %esi
.p2align 4, 0x90
.LBB1_4: # =>This Inner Loop Header: Depth=1
movsd (%rdi,%rsi,8), %xmm0 # xmm0 = mem[0],zero
movsd 8(%rdi,%rsi,8), %xmm2 # xmm2 = mem[0],zero
ucomisd %xmm0, %xmm1
movapd %xmm0, %xmm3
cmpnlesd %xmm1, %xmm3
movapd %xmm3, %xmm4
andnpd %xmm0, %xmm4
andpd %xmm1, %xmm3
orpd %xmm4, %xmm3
cmovael %eax, %ecx
ucomisd %xmm2, %xmm3
movapd %xmm2, %xmm0
cmpnlesd %xmm3, %xmm0
movapd %xmm0, %xmm1
andpd %xmm3, %xmm1
andnpd %xmm2, %xmm0
orpd %xmm1, %xmm0
cmovael %eax, %ecx
movsd 16(%rdi,%rsi,8), %xmm1 # xmm1 = mem[0],zero
ucomisd %xmm1, %xmm0
movapd %xmm1, %xmm2
cmpnlesd %xmm0, %xmm2
movapd %xmm2, %xmm3
andpd %xmm0, %xmm3
andnpd %xmm1, %xmm2
orpd %xmm3, %xmm2
movsd 24(%rdi,%rsi,8), %xmm0 # xmm0 = mem[0],zero
cmovael %eax, %ecx
ucomisd %xmm0, %xmm2
leaq 4(%rsi), %rsi
cmovael %eax, %ecx
movapd %xmm0, %xmm1
cmpnlesd %xmm2, %xmm1
movapd %xmm1, %xmm3
andpd %xmm2, %xmm3
andnpd %xmm0, %xmm1
orpd %xmm3, %xmm1
cmpq %rsi, %rdx
jne .LBB1_4
.LBB1_5:
movapd %xmm1, %xmm0
testq %r8, %r8
je .LBB1_8
# %bb.6:
leaq (%rdi,%rsi,8), %rax
xorl %edx, %edx
xorl %esi, %esi
.p2align 4, 0x90
.LBB1_7: # =>This Inner Loop Header: Depth=1
movsd (%rax,%rsi,8), %xmm2 # xmm2 = mem[0],zero
ucomisd %xmm2, %xmm1
cmovael %edx, %ecx
movapd %xmm2, %xmm0
cmpnlesd %xmm1, %xmm0
movapd %xmm0, %xmm3
andnpd %xmm2, %xmm3
andpd %xmm1, %xmm0
orpd %xmm3, %xmm0
addq $1, %rsi
movapd %xmm0, %xmm1
cmpq %rsi, %r8
jne .LBB1_7
.LBB1_8:
testl %ecx, %ecx
je .LBB1_10
.LBB1_9:
movsd .LCPI1_0(%rip), %xmm0 # xmm0 = mem[0],zero
.LBB1_10:
retq
.Lfunc_end1:
.size _Z6orangePdi, .Lfunc_end1-_Z6orangePdi
.cfi_endproc
# -- End function
.ident "clang version 10.0.0-4ubuntu1 "
.section ".note.GNU-stack","",@progbits
.addrsig
@guilhermeleobas can you check what an older clang would do? There's a clang=4.0.1 on conda-forge. I am seeing that the newer clang produces equally slow code for both apple and orange.
On my machine, clang 4.0.1 gives:
ra = -0.123213 | rb = -0.123213
apple 0.230488
orange 0.044694
clang 13 gives:
% clang -O3 -o newmain main.c nanmin.c
% ./newmain
ra = -0.123213 | rb = -0.123213
apple 0.227005
orange 0.199745
% clang -O1 -o newmain main.c nanmin.c
% ./newmain
ra = -0.123213 | rb = -0.123213
apple 0.213189
orange 0.188846
% clang -O0 -o newmain main.c nanmin.c
% ./newmain
ra = -0.123213 | rb = -0.123213
apple 0.201056
orange 0.157053
less optimization is better?!
For reference, our previously created LLVM bug can be found here now: https://github.com/llvm/llvm-project/issues/31370
@sklam, I had to use the pre-built binary from the LLVM website; clang 4.0.1 on conda-forge is only available for OSX.
$ ./clang4/clang/bin/clang --version
clang version 4.0.1 (tags/RELEASE_401/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/guilhermeleobas/./clang4/clang/bin
$ ./clang4/clang/bin/clang -O3 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.078439
orange 0.044631
$ ./clang4/clang/bin/clang -O2 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.078722
orange 0.044328
$ ./clang4/clang/bin/clang -O0 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.101776
orange 0.101166
For comparison, system clang (10.0) gives the following results
$ clang --version
clang version 10.0.0-4ubuntu1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ clang -O3 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.078269
orange 0.077284
$ clang -O2 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.077202
orange 0.077039
$ clang -O1 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.077812
orange 0.078302
$ clang -O0 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.102861
orange 0.104594
edit: With clang 13, apple and orange take roughly the same time, regardless of the optimization level used
$ ./clang13/bin/clang --version
clang version 13.0.1
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/guilhermeleobas/./clang13/bin
$ ./clang13/bin/clang -O3 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.077494
orange 0.077163
$ ./clang13/bin/clang -O2 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.077468
orange 0.077152
$ ./clang13/bin/clang -O1 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.078136
orange 0.078680
$ ./clang13/bin/clang -O0 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.114825
orange 0.102888
@sklam, although LLVM seems to produce slow code for both the apple and orange examples, Numba 0.55 is faster than 0.28 at running the original example.
Right, so much has changed that the C code no longer represents the Numba code here.
As for the C code, I think the problem comes from LLVM choosing to SIMD-vectorize that code pattern even though the scalar path is better. I also think there is a CPU-architecture and OS component: I am guessing @guilhermeleobas is running Linux and I'm on OSX. Could it be the alignment of double arr[size] vs the SIMD misalignment penalty?
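One way to rule the alignment question in or out would be to force an aligned allocation in the benchmark driver. This is only a sketch under assumptions not in the thread: it presumes the kernels keep the double apple(double *arr, int size) signature shown earlier (and an analogous orange), that the driver may allocate with C11 aligned_alloc, and that timing would be added the same way as in the existing main.c.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Kernels from the gist; signatures assumed to match the apple() shown above. */
double apple(double *arr, int size);
double orange(double *arr, int size);

int main(void)
{
    int n = 1000000;  /* 8,000,000 bytes, a multiple of the 32-byte alignment */
    /* C11 aligned_alloc: a 32-byte aligned buffer removes any SIMD
       misalignment penalty from the apple-vs-orange comparison. */
    double *arr = aligned_alloc(32, (size_t)n * sizeof(double));
    if (arr == NULL)
        return 1;
    for (int i = 0; i < n; i++)
        arr[i] = (double)rand() / RAND_MAX - 0.5;
    printf("32-byte aligned: %d\n", (int)((uintptr_t)arr % 32 == 0));
    printf("apple  -> %f\n", apple(arr, n));
    printf("orange -> %f\n", orange(arr, n));
    free(arr);
    return 0;
}

If apple and orange still differ with the aligned buffer, alignment can be crossed off the list of suspects and the extra selects remain the main candidate.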
Referring to the discussion in #2176, it looks like there is a major speed regression from 0.28.1 to 0.29.0. The example function below and similar tight loops, doing little more than NaN checks, additions, etc., only run at a fraction of the initial speed. This heavily affects the usability of Numba as a replacement for some C implementations and renders 0.29.0 unusable for me. I'd also be happy to sponsor a bug bounty for this issue.