ml31415 opened 8 years ago
As @stuartarchibald pointed out in https://github.com/numba/numba/issues/2176#issuecomment-268667308, the slowdown is caused by PR #2050, which adds a common control-flow block to the end of the loop body.
This transforms the function nanmin_numbagg_1dim
from:
into:
(note: the new L75 block holds the decref calls)
However, the LLVM optimizer (even at O1) will split the loop into two, causing the slowdown.
In effect, the transformation is similar to:
import numba as nb
import numpy as np

@nb.njit
def nanmin_numbagg_1dim_pathological(a):
    amin = np.infty
    all_missing = 1
    for ai in a.flat:
        if ai <= amin:
            amin = ai
            all_missing = 0
        # dummy code; should have no side-effect
        all_missing
    if all_missing:
        amin = np.nan
    return amin
This also causes the loop-splitting and slowdown on the commit just before PR #2050.
I wonder if this is a pathological case for the LLVM optimizer.
I can replicate the problem in C with clang. See https://gist.github.com/sklam/11f11a410258ca191e6f263262a4ea65.
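For reference, a minimal sketch of the kind of C reproducer in that gist (the gist itself is authoritative; the function name here is mine and the exact gist code may differ): the same nanmin reduction, written so that both sides of the comparison fall through into a single shared block at the end of the loop body, which is the shape that PR #2050 introduces for the decref calls.

#include <math.h>

/* Sketch of the reproducer: same reduction as nanmin_numbagg_1dim,
   with a shared tail block at the bottom of the loop body. */
double nanmin_shared_tail(const double *arr, int size)
{
    double amin = INFINITY;
    int all_missing = 1;
    for (int i = 0; i < size; i++) {
        double ai = arr[i];
        if (ai <= amin) {
            amin = ai;
            all_missing = 0;
        }
        /* shared tail: empty here; in the Numba-generated IR this is
           where the decref calls are placed (the L75 block mentioned above). */
    }
    if (all_missing)
        amin = NAN;
    return amin;
}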
I have further narrowed down the problem to multiple calls to LLVM's simplifycfg pass. I am asking on the llvm-dev mailing list for the experts to take a look.
For reference, this is the mailing list entry:
https://groups.google.com/forum/#!topic/llvm-dev/7HnK9ehPzKc
It's suggested there to file a bug report, which I couldn't find. Though I may also have gotten the wrong keywords. @sklam, is this bug report already filed somewhere, as suggested on the mailing list? Otherwise I'd be fine with filing the bug report myself, copying your mailing list entry. Not sure if I could contribute something else to this issue.
This bug seems related:
@ml31415 , sorry, I keep getting distracted by other things and it has dropped off the radar. I still need to get a login for their bug tracker. If you file the issue, can you point me to it so I can comment on it?
Ok, I suppose you didn't have a look at the related bug I had posted then. I can't tell for sure due to a lack of assembler experience, but to me it looks quite like the same bug we found. The problem there is also about extra loads within loops since LLVM 3.9. If you agree, I would rather add your findings there instead of opening a duplicate.
Description Brian Rzycki 2016-09-19 17:33:29 PDT
The following commit causes an extra load inside a loop.
commit 808cdf24ff6941f8a2179abecb5c7e80a758a04a
Author: James Molloy <james.molloy@arm.com>
Date: Sun Sep 11 09:00:03 2016 +0000
[SimplifyCFG] Be even more conservative in SinkThenElseCodeToEnd
This should *actually* fix PR30244. This cranks up the workaround for PR30188 so that we never sink loads or stores of allocas.
The idea is that these should be removed by SROA/Mem2Reg, and any movement of them may well confuse SROA or just cause unwanted code churn. It's not ideal that the midend should be crippled like this, but that unwanted churn can really cause significant regressions in important workloads (tsan).
$ cat foo.cpp
#include <map>
int test(unsigned *keys, std::map<int, int> &m_map)
{
    int i, last_index, sane=0;
    for (i=0, last_index = 0; i<100; i++)
    {
        auto it = m_map.find(keys[last_index++]);
        if (it != m_map.end())
            sane += it->second;
    }
    return sane;
}
$ clang++ foo.cpp -O3 -S -o out.s
--- good.s 2016-09-19 17:25:03.708062780 -0500
+++ bad.s 2016-09-19 17:25:26.584666253 -0500
@@ -6,7 +6,7 @@
_Z4testPjRNSt3__13mapIiiNS0_4lessIiEENS0_9allocatorINS0_4pairIKiiEEEEEE: // @_Z4testPjRNSt3__13mapIiiNS0_4lessIiEENS0_9allocatorINS0_4pairIKiiEEEEEE
// BB#0: // %entry
ldr x9, [x1, #8]!
- cbz x9, .LBB0_9
+ cbz x9, .LBB0_11
// BB#1: // %for.body.preheader
mov x10, xzr
mov w8, wzr
@@ -14,40 +14,46 @@
// =>This Loop Header: Depth=1
// Child Loop BB0_3 Depth 2
ldr w12, [x0, x10, lsl #2]
+ add x10, x10, #1 // =1
mov x11, x1
mov x13, x9
.LBB0_3: // %while.body.i.i.i
// Parent Loop BB0_2 Depth=1
// => This Inner Loop Header: Depth=2
ldr w14, [x13, #28]
- add x15, x13, #8 // =8
cmp w14, w12
- csel x11, x11, x13, lt
- csel x13, x15, x13, lt
+ b.ge .LBB0_5
+// BB#4: // %if.else.i.i.i
+ // in Loop: Header=BB0_3 Depth=2
+ ldr x13, [x13, #8]
+ cbnz x13, .LBB0_3
+ b .LBB0_6
+.LBB0_5: // %if.then.i.i.i
+ // in Loop: Header=BB0_3 Depth=2
+ mov x11, x13
ldr x13, [x13]
cbnz x13, .LBB0_3
@ml31415, they seem like related bugs but the reason for the slowness could be different, probably due to architecture different (AArch64 vs AMD64). In our case, we are not seeing extra memory load. I suspect it is due to the inefficiency in the extra select-instruction.
The slow function (see my comments added below label LBB0_2):
_apple: ## @apple
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp0:
.cfi_def_cfa_offset 16
Ltmp1:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp2:
.cfi_def_cfa_register %rbp
movsd LCPI0_0(%rip), %xmm0 ## xmm0 = mem[0],zero
movl $1, %eax
xorl %ecx, %ecx
xorl %edx, %edx
jmp LBB0_1
.p2align 4, 0x90
LBB0_2: ## in Loop: Header=BB0_1 Depth=1
movsd (%rdi,%rdx,8), %xmm1 ## xmm1 = mem[0],zero
####################################
## I think the below corresponds to the select
ucomisd %xmm1, %xmm0
cmovael %ecx, %eax
movapd %xmm1, %xmm2
cmplesd %xmm0, %xmm2
andpd %xmm2, %xmm1
andnpd %xmm0, %xmm2
orpd %xmm1, %xmm2
incq %rdx
movapd %xmm2, %xmm0
LBB0_1: ## =>This Inner Loop Header: Depth=1
cmpl %esi, %edx
jl LBB0_2
## BB#3:
testl %eax, %eax
je LBB0_5
## BB#4:
movsd LCPI0_1(%rip), %xmm0 ## xmm0 = mem[0],zero
LBB0_5:
popq %rbp
retq
.cfi_endproc
The fast function:
_orange: ## @orange
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp3:
.cfi_def_cfa_offset 16
Ltmp4:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp5:
.cfi_def_cfa_register %rbp
movsd LCPI1_0(%rip), %xmm0 ## xmm0 = mem[0],zero
xorl %eax, %eax
movl $1, %ecx
jmp LBB1_1
.p2align 4, 0x90
LBB1_4: ## in Loop: Header=BB1_1 Depth=1
xorl %ecx, %ecx
movapd %xmm1, %xmm0
LBB1_1: ## %.outer
## =>This Loop Header: Depth=1
## Child Loop BB1_2 Depth 2
cltq
.p2align 4, 0x90
LBB1_2: ## Parent Loop BB1_1 Depth=1
## => This Inner Loop Header: Depth=2
cmpl %esi, %eax
jge LBB1_5
## BB#3: ## in Loop: Header=BB1_2 Depth=2
movsd (%rdi,%rax,8), %xmm1 ## xmm1 = mem[0],zero
incq %rax
ucomisd %xmm1, %xmm0
jb LBB1_2
jmp LBB1_4
LBB1_5:
testl %ecx, %ecx
je LBB1_7
## BB#6:
movsd LCPI1_1(%rip), %xmm0 ## xmm0 = mem[0],zero
LBB1_7:
popq %rbp
retq
.cfi_endproc
I had posted a comment at the LLVM bug tracker some hours ago. I suggest we continue the discussion here, at least for the LLVM parts of the issue:
Thanks! I will ask some SSE experts to look at the generated code.
Btw, I cannot chime in yet because I am still waiting for an account to the bug tracker.
Yeah, I remember that took a while. If I should post something there on your behalf, let me know!
From hiraditya in the LLVM bugtracker:
It seems that the patch (https://github.com/numba/numba/commit/e03a4170fdc59a87561394ccdfa0f4abfa7ec1ac) which canonicalizes the backedge was added in numba and caused the regression. Now it makes sense: looking at the IR of the apple function, it creates bad code because of the structure of the control flow graph. The patch is pessimizing the code. I would suggest reverting the patch if that is possible. From my analysis, getting rid of the selects and enabling proper vectorization would require undoing the canonicalization of that patch in the compiler, i.e., splitting the back-edge into multiple back-edges (which is fairly complicated to do in the compiler).
I am aware of the canonicalization triggering the performance regression. It is not easily revertible because it fixes a bug. Also, removing the canonicalization may improve the nanmin code, but not every case; the LLVM opt passes may canonicalize the backedges anyway. For example, if I modify the apple function to:
#include <math.h>

double apple(double *arr, int size) {
    double amin = INFINITY;
    int all_missing = 1;
    int i;
    double ai;
    for (i=0; i<size;) {
        ai = arr[i];
        if ( ai <= amin ) {
            amin = ai;
            all_missing = 0;
            ++i;
            continue; // backedge
        }else{
            ++i;
            continue; // backedge
        }
        break; // unreachable
    }
    if (all_missing) {
        amin = NAN;
    }
    return amin;
}
so that it has multiple backedges (use clang -emit-llvm -S nanmin.ll; opt -view-cfg nanmin.ll to verify), the speed regression still occurs at the 2nd application of the simplifycfg pass in the pass sequence -simplifycfg -sroa -simplifycfg. The 2nd simplifycfg merges the backedges.
I have also noticed that inserting a loop-simplify pass before the 2nd simplifycfg allows it to produce fast code. Try it with opt -simplifycfg -sroa -loop-simplify -simplifycfg nanmin.ll -o nanmin.o. At this point, the cfg looks like:
I suggest we open a new bug report at LLVM then, once your account is activated.
Yes, thanks for your help @ml31415 .
Nothing to thank me for, I was the guy with the issue after all, so big thanks for digging into this!
So, how to proceed? Is there something that can be done on the Numba side? As I understand the situation, it's about a bunch of optimizations that make sense in some cases but cause regressions in others, and there seems to be no simple way to separate those cases. Can that separation be improved? Would some heuristics make sense? Does this have to happen on the LLVM side, or maybe also on the Numba side? Is there some way Numba could provide extra information or optimization flags to achieve better optimized compilation?
In the long term, this has to be an LLVM fix; it is affecting clang performance as well.
In the short term, we can look at workarounds. We will have to use our benchmarks to drive this. Perhaps we could pick our own sets of optimization passes instead of using the equivalents of O2 and O3 in opt, or add some heuristics to alter the canonicalization of backedges. It'll be tricky though.
I've filed a LLVM bug: https://bugs.llvm.org/show_bug.cgi?id=32022
A short update on this: With LLVM 4.0 and numba 0.34, the problem still persists.
0.34.0
10000 loops, best of 3: 190 µs per loop
Still happens with 0.37.0 just prior to the 0.38.0 RC; LLVM 6 is now in place.
Unfortunately, no one at LLVM really seems to care :(
We are now at Numba 0.51.2 and llvmlite 0.34.0 with LLVM 10. Perhaps it is worth checking if these issues remain?
It seems this issue is solved in newer versions of LLVM. The apple and orange examples take roughly the same time to finish on my machine.
$ ./main
ra = -0.123213 | rb = -0.123213
apple 0.076935
orange 0.077305
Also, the Python script sent by @ml31415 runs faster on Numba 0.55:
import numpy as np
import timeit
import numba

print('Numba version: {0}'.format(numba.__version__))

def measure(func, *args):
    def setup():
        return func(*args)
    t = timeit.Timer(setup=setup)
    name = func.__name__
    _time = min(t.repeat(repeat=100, number=10))
    print('{0} took {1}'.format(name, _time))

def nanmin_numbagg_1dim(a):
    amin = np.infty
    all_missing = 1
    for ai in a.flat:
        if ai <= amin:
            amin = ai
            all_missing = 0
    if all_missing:
        amin = np.nan
    return amin

x = np.random.random(100000)
x[x>0.7] = np.nan
measure(nanmin_numbagg_1dim, x)
Numba version: 0.28.1
nanmin_numbagg_1dim took 4.0099985199049115e-07
Numba version: 0.37.0+653.g5168891b9
nanmin_numbagg_1dim took 2.8999966161791235e-07
Numba version: 0.55.1
nanmin_numbagg_1dim took 2.6100042305188254e-07
@guilhermeleobas Oh, that's good news! While you are at it, can you post the assembly code so we can compare it with the previous ones?
From the apple and orange example, or the Python function?
edit: here is the assembly code for the apple and orange example (Compiler Explorer)
.text
.file "b.cc"
.section .rodata.cst8,"aM",@progbits,8
.p2align 3 # -- Begin function _Z5applePdi
.LCPI0_0:
.quad 9221120237041090560 # double NaN
.LCPI0_1:
.quad 9218868437227405312 # double +Inf
.text
.globl _Z5applePdi
.p2align 4, 0x90
.type _Z5applePdi,@function
_Z5applePdi: # @_Z5applePdi
.cfi_startproc
# %bb.0:
testl %esi, %esi
jle .LBB0_9
# %bb.1:
movl %esi, %edx
leaq -1(%rdx), %rcx
movl %edx, %r8d
andl $3, %r8d
xorl %eax, %eax
cmpq $3, %rcx
jae .LBB0_3
# %bb.2:
movl $1, %ecx
movsd .LCPI0_1(%rip), %xmm1 # xmm1 = mem[0],zero
xorl %esi, %esi
jmp .LBB0_5
.LBB0_3:
subq %r8, %rdx
movl $1, %ecx
movsd .LCPI0_1(%rip), %xmm1 # xmm1 = mem[0],zero
xorl %esi, %esi
.p2align 4, 0x90
.LBB0_4: # =>This Inner Loop Header: Depth=1
movsd (%rdi,%rsi,8), %xmm0 # xmm0 = mem[0],zero
movsd 8(%rdi,%rsi,8), %xmm2 # xmm2 = mem[0],zero
ucomisd %xmm0, %xmm1
movapd %xmm0, %xmm3
cmpnlesd %xmm1, %xmm3
movapd %xmm3, %xmm4
andnpd %xmm0, %xmm4
andpd %xmm1, %xmm3
orpd %xmm4, %xmm3
cmovael %eax, %ecx
ucomisd %xmm2, %xmm3
movapd %xmm2, %xmm0
cmpnlesd %xmm3, %xmm0
movapd %xmm0, %xmm1
andpd %xmm3, %xmm1
andnpd %xmm2, %xmm0
orpd %xmm1, %xmm0
cmovael %eax, %ecx
movsd 16(%rdi,%rsi,8), %xmm1 # xmm1 = mem[0],zero
ucomisd %xmm1, %xmm0
movapd %xmm1, %xmm2
cmpnlesd %xmm0, %xmm2
movapd %xmm2, %xmm3
andpd %xmm0, %xmm3
andnpd %xmm1, %xmm2
orpd %xmm3, %xmm2
movsd 24(%rdi,%rsi,8), %xmm0 # xmm0 = mem[0],zero
cmovael %eax, %ecx
ucomisd %xmm0, %xmm2
cmovael %eax, %ecx
movapd %xmm0, %xmm1
cmpnlesd %xmm2, %xmm1
movapd %xmm1, %xmm3
andpd %xmm2, %xmm3
andnpd %xmm0, %xmm1
orpd %xmm3, %xmm1
addq $4, %rsi
cmpq %rsi, %rdx
jne .LBB0_4
.LBB0_5:
movapd %xmm1, %xmm0
testq %r8, %r8
je .LBB0_8
# %bb.6:
leaq (%rdi,%rsi,8), %rax
xorl %edx, %edx
xorl %esi, %esi
.p2align 4, 0x90
.LBB0_7: # =>This Inner Loop Header: Depth=1
movsd (%rax,%rsi,8), %xmm2 # xmm2 = mem[0],zero
ucomisd %xmm2, %xmm1
cmovael %edx, %ecx
movapd %xmm2, %xmm0
cmpnlesd %xmm1, %xmm0
movapd %xmm0, %xmm3
andnpd %xmm2, %xmm3
andpd %xmm1, %xmm0
orpd %xmm3, %xmm0
addq $1, %rsi
movapd %xmm0, %xmm1
cmpq %rsi, %r8
jne .LBB0_7
.LBB0_8:
testl %ecx, %ecx
je .LBB0_10
.LBB0_9:
movsd .LCPI0_0(%rip), %xmm0 # xmm0 = mem[0],zero
.LBB0_10:
retq
.Lfunc_end0:
.size _Z5applePdi, .Lfunc_end0-_Z5applePdi
.cfi_endproc
# -- End function
.section .rodata.cst8,"aM",@progbits,8
.p2align 3 # -- Begin function _Z6orangePdi
.LCPI1_0:
.quad 9221120237041090560 # double NaN
.LCPI1_1:
.quad 9218868437227405312 # double +Inf
.text
.globl _Z6orangePdi
.p2align 4, 0x90
.type _Z6orangePdi,@function
_Z6orangePdi: # @_Z6orangePdi
.cfi_startproc
# %bb.0:
testl %esi, %esi
jle .LBB1_9
# %bb.1:
movl %esi, %edx
leaq -1(%rdx), %rcx
movl %edx, %r8d
andl $3, %r8d
xorl %eax, %eax
cmpq $3, %rcx
jae .LBB1_3
# %bb.2:
movl $1, %ecx
movsd .LCPI1_1(%rip), %xmm1 # xmm1 = mem[0],zero
xorl %esi, %esi
jmp .LBB1_5
.LBB1_3:
subq %r8, %rdx
movl $1, %ecx
movsd .LCPI1_1(%rip), %xmm1 # xmm1 = mem[0],zero
xorl %esi, %esi
.p2align 4, 0x90
.LBB1_4: # =>This Inner Loop Header: Depth=1
movsd (%rdi,%rsi,8), %xmm0 # xmm0 = mem[0],zero
movsd 8(%rdi,%rsi,8), %xmm2 # xmm2 = mem[0],zero
ucomisd %xmm0, %xmm1
movapd %xmm0, %xmm3
cmpnlesd %xmm1, %xmm3
movapd %xmm3, %xmm4
andnpd %xmm0, %xmm4
andpd %xmm1, %xmm3
orpd %xmm4, %xmm3
cmovael %eax, %ecx
ucomisd %xmm2, %xmm3
movapd %xmm2, %xmm0
cmpnlesd %xmm3, %xmm0
movapd %xmm0, %xmm1
andpd %xmm3, %xmm1
andnpd %xmm2, %xmm0
orpd %xmm1, %xmm0
cmovael %eax, %ecx
movsd 16(%rdi,%rsi,8), %xmm1 # xmm1 = mem[0],zero
ucomisd %xmm1, %xmm0
movapd %xmm1, %xmm2
cmpnlesd %xmm0, %xmm2
movapd %xmm2, %xmm3
andpd %xmm0, %xmm3
andnpd %xmm1, %xmm2
orpd %xmm3, %xmm2
movsd 24(%rdi,%rsi,8), %xmm0 # xmm0 = mem[0],zero
cmovael %eax, %ecx
ucomisd %xmm0, %xmm2
leaq 4(%rsi), %rsi
cmovael %eax, %ecx
movapd %xmm0, %xmm1
cmpnlesd %xmm2, %xmm1
movapd %xmm1, %xmm3
andpd %xmm2, %xmm3
andnpd %xmm0, %xmm1
orpd %xmm3, %xmm1
cmpq %rsi, %rdx
jne .LBB1_4
.LBB1_5:
movapd %xmm1, %xmm0
testq %r8, %r8
je .LBB1_8
# %bb.6:
leaq (%rdi,%rsi,8), %rax
xorl %edx, %edx
xorl %esi, %esi
.p2align 4, 0x90
.LBB1_7: # =>This Inner Loop Header: Depth=1
movsd (%rax,%rsi,8), %xmm2 # xmm2 = mem[0],zero
ucomisd %xmm2, %xmm1
cmovael %edx, %ecx
movapd %xmm2, %xmm0
cmpnlesd %xmm1, %xmm0
movapd %xmm0, %xmm3
andnpd %xmm2, %xmm3
andpd %xmm1, %xmm0
orpd %xmm3, %xmm0
addq $1, %rsi
movapd %xmm0, %xmm1
cmpq %rsi, %r8
jne .LBB1_7
.LBB1_8:
testl %ecx, %ecx
je .LBB1_10
.LBB1_9:
movsd .LCPI1_0(%rip), %xmm0 # xmm0 = mem[0],zero
.LBB1_10:
retq
.Lfunc_end1:
.size _Z6orangePdi, .Lfunc_end1-_Z6orangePdi
.cfi_endproc
# -- End function
.ident "clang version 10.0.0-4ubuntu1 "
.section ".note.GNU-stack","",@progbits
.addrsig
@guilhermeleobas can you check what an older clang would do? There's a clang=4.0.1 on conda-forge. I am seeing that the newer clang produces equally slow code for both apple and orange.
On my machine, clang 4.0.1 gives:
ra = -0.123213 | rb = -0.123213
apple 0.230488
orange 0.044694
clang 13 gives:
% clang -O3 -o newmain main.c nanmin.c
% ./newmain
ra = -0.123213 | rb = -0.123213
apple 0.227005
orange 0.199745
% clang -O1 -o newmain main.c nanmin.c
% ./newmain
ra = -0.123213 | rb = -0.123213
apple 0.213189
orange 0.188846
% clang -O0 -o newmain main.c nanmin.c
% ./newmain
ra = -0.123213 | rb = -0.123213
apple 0.201056
orange 0.157053
less optimization is better?!
For reference, our previously created LLVM bug can be found here now: https://github.com/llvm/llvm-project/issues/31370
@sklam, I had to use the pre-built binary from the LLVM website; clang 4.0.1 on conda-forge is only available for OSX.
$ ./clang4/clang/bin/clang --version
clang version 4.0.1 (tags/RELEASE_401/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/guilhermeleobas/./clang4/clang/bin
$ ./clang4/clang/bin/clang -O3 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.078439
orange 0.044631
$ ./clang4/clang/bin/clang -O2 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.078722
orange 0.044328
$ ./clang4/clang/bin/clang -O0 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.101776
orange 0.101166
For comparison, system clang (10.0) gives the following results
$ clang --version
clang version 10.0.0-4ubuntu1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ clang -O3 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.078269
orange 0.077284
$ clang -O2 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.077202
orange 0.077039
$ clang -O1 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.077812
orange 0.078302
$ clang -O0 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.102861
orange 0.104594
edit: With clang 13, apple and orange take roughly the same time, regardless of the optimization level used
$ ./clang13/bin/clang --version
clang version 13.0.1
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/guilhermeleobas/./clang13/bin
$ ./clang13/bin/clang -O3 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.077494
orange 0.077163
$ ./clang13/bin/clang -O2 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.077468
orange 0.077152
$ ./clang13/bin/clang -O1 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.078136
orange 0.078680
$ ./clang13/bin/clang -O0 a.cc b.cc -o main && ./main
ra = -0.123213 | rb = -0.123213
apple 0.114825
orange 0.102888
@sklam, although LLVM seems to produce slow code for both the apple and orange examples, Numba 0.55 is faster than 0.28 at running the original example.
Right, so much has changed that the C code no longer represents the Numba code here.
As for the C code, I think the problem comes from LLVM choosing to SIMD-vectorize that code pattern even though the scalar path is better. I also think there is a CPU-architecture and OS component: I am guessing @guilhermeleobas is running Linux and I'm on OSX. Could it be the alignment of double arr[size] vs the SIMD misalignment penalty?
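One way to rule the alignment question in or out would be to force an aligned allocation in the benchmark driver. This is only a sketch under assumptions not in the thread: it presumes the kernels keep the double apple(double *arr, int size) signature shown earlier (and an analogous orange), that the driver may allocate with C11 aligned_alloc, and that timing would be added the same way as in the existing main.c.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Kernels from the gist; signatures assumed to match the apple() shown above. */
double apple(double *arr, int size);
double orange(double *arr, int size);

int main(void)
{
    int n = 1000000;  /* 8,000,000 bytes, a multiple of the 32-byte alignment */
    /* C11 aligned_alloc: a 32-byte aligned buffer removes any SIMD
       misalignment penalty from the apple-vs-orange comparison. */
    double *arr = aligned_alloc(32, (size_t)n * sizeof(double));
    if (arr == NULL)
        return 1;
    for (int i = 0; i < n; i++)
        arr[i] = (double)rand() / RAND_MAX - 0.5;
    printf("32-byte aligned: %d\n", (int)((uintptr_t)arr % 32 == 0));
    printf("apple  -> %f\n", apple(arr, n));
    printf("orange -> %f\n", orange(arr, n));
    free(arr);
    return 0;
}

If apple and orange still differ with the aligned buffer, alignment can be crossed off the list of suspects and the extra selects remain the main candidate.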
Referring to the discussion in #2176, it looks like there is a major speed regression from 0.28.1 to 0.29.0. The example function below and similar tight loops, doing little more than NaN checks, additions, etc., only run at a fraction of the initial speed. This heavily affects the usability of Numba as a replacement for some C implementations and renders 0.29.0 unusable for me. I'd also be happy to sponsor a bug bounty for this issue.