No performance improvement with Intel's SVML library

cmartinezdem commented 4 years ago

[ ] I am using the latest released version of Numba (most recent is visible in the change log: https://github.com/numba/numba/blob/master/CHANGE_LOG).
[x] I have included below a minimal working reproducer (if you are unsure how to write one see http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports).

Hello,

I am running Numba 0.48 (the latest version available on PyPI) on Windows 10 (64-bit) with Python 3.8.2, and I am not able to get better performance after installing Intel's icc_rt library. The SVML libraries seem to be detected by Numba, as can be seen in the numba -s output:

Time Stamp
2020-04-15 10:02:04.120355

Hardware Information
Machine : AMD64
CPU Name : skylake
CPU count : 8
CPU Features : 64bit adx aes avx avx2 bmi bmi2 clflushopt cmov cx16 f16c fma fsgsbase invpcid lzcnt mmx movbe pclmul popcnt prfchw rdrnd rdseed sahf sgx sse sse2 sse3 sse4.1 sse4.2 ssse3 xsave xsavec xsaveopt xsaves

OS Information
Platform : Windows-10-10.0.18362-SP0
Release : 10
System Name : Windows
Version : 10.0.18362
OS specific info : 1010.0.18362SP0

Python Information
Python Compiler : MSC v.1916 64 bit (AMD64)
Python Implementation : CPython
Python Version : 3.8.2
Python Locale : en_GB cp1252

LLVM information
LLVM version : 8.0.0

CUDA Information
Found 1 CUDA devices
id 0 b'GeForce MX150' [SUPPORTED]
compute capability: 6.1
pci device id: 0
pci bus id: 1
Summary:
1/1 devices are supported
CUDA driver version : 10000
CUDA libraries:
Error: Probing CUDA failed (device and driver present, runtime problem?)

ROC Information
ROC available : False
Error initialising ROC due to : No ROC toolchains found.
No HSA Agents found, encountered exception when searching:
Error at driver init:

HSA is not currently supported on this platform (win32).
:

SVML Information
SVML state, config.USING_SVML : True
SVML library found and loaded : True
llvmlite using SVML patched LLVM : True
SVML operational : True

Threading Layer Information
TBB Threading layer available : True
OpenMP Threading layer available : False
+--> Disabled due to : Unknown import problem.
Workqueue Threading layer available : True

Numba Environment Variable Information
None set.

Conda Information
Conda not present/not working.
Error was [WinError 2] The system cannot find the file specified

However, the following code takes the same amount of time to run with or without SVML libraries:

import numpy as np
import numba as nb

@nb.njit(fastmath={'nnan', 'ninf', 'nsz', 'arcp', 'contract'})
def foo(a, b):
    assert a.size == b.size

    error = 0.0
    for i in range(a.size):
        ratio = a[i] / b[i]

        magnitude_error = np.log10(np.abs(ratio))
        phase_error = ((np.angle(ratio) + np.pi) % (2.0 * np.pi)) - np.pi

        error += ((magnitude_error**2.0 + phase_error**2.0) / 2.0)

    return error

a = np.arange(1, 1000) + 1j * np.arange(1, 1000)
b = np.arange(1, 1000) + 1j * np.arange(1, 1000)

# Compile function
foo(a, b)

# Check performance
%timeit foo(a, b)

I also tried installing Numba and icc_rt with conda and got the same result. Shouldn't the code perform better with the SVML libraries?

Thank you so much for your support.

stuartarchibald commented 4 years ago

Thanks for the report. SVML should make your code faster, but only where vectorized math functions are applicable.

In the code sample provided, this line:

 ratio = a[i] / b[i]

appears to Numba as a division whose numerator and denominator are both complex scalars. Because Numba doesn't differentiate between NumPy and Python scalars, the division follows Python semantics, so a branch is added inside the loop to raise a ZeroDivisionError when b[i] is zero. That branch prevents the loop from vectorizing as well as it otherwise could. Further, because 'reassoc' is not among the fastmath flags, the loop cannot vectorize: the reduction on error has to be executed in order. The net result of these two issues is that no SVML instructions are selected.
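To illustrate why the reduction needs 'reassoc', here is a minimal pure-Python sketch (it does not use Numba, and the function names are made up for illustration). A vectorized reduction keeps one partial sum per SIMD lane and combines them at the end, which reorders the additions. Floating-point addition is not associative, so the compiler may only do this when it is allowed to reassociate.

```python
import math


def sum_in_order(values):
    """Strict left-to-right reduction, the order required without 'reassoc'."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_in_lanes(values, lanes=4):
    """Reduction in the shape a vectorized loop uses: one partial sum per
    lane, combined once at the end. This reorders the additions."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    total = 0.0
    for p in partial:
        total += p
    return total


# Values chosen so that rounding makes the addition order visible.
values = [1e16, 1.0, -1e16, 1.0] * 4

print("in order :", sum_in_order(values))
print("in lanes :", sum_in_lanes(values))
print("exact    :", math.fsum(values))
```

The two reductions disagree with each other, which is exactly the freedom that fastmath=True (or an explicit 'reassoc' flag) grants to the optimizer.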

"Fixing" the code to look like:

import numpy as np
import numba as nb

@nb.njit(fastmath=True)
# ^--- set to True, selectively applying parts of fastmath flags rarely goes well in practice.
# See https://github.com/numba/numba/pull/3847#issuecomment-472822965
def foo(a, b):
    assert a.size == b.size

    error = 0.0
    for i in range(a.size):
        ratio = np.divide(a[i], b[i]) # <--- explicit numpy divide

        magnitude_error = np.log10(np.abs(ratio))
        phase_error = ((np.angle(ratio) + np.pi) % (2.0 * np.pi)) - np.pi

        error += ((magnitude_error**2.0 + phase_error**2.0) / 2.0)

    return error

a = np.arange(1, 1000) + 1j * np.arange(1, 1000)
b = np.arange(1, 1000) + 1j * np.arange(1, 1000)

# Compile function
foo(a, b)

print(foo.inspect_asm(foo.signatures[0]))

should print out assembly that contains things like:

    movabsq $__svml_hypot2, %rax
    callq   *%rax
    movabsq $__svml_log102, %rax
    callq   *%rax

where the __svml_ prefix indicates use of an SVML intrinsic.
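Rather than reading the assembly by eye, the check can be automated. The following is a small sketch that only does text matching; find_svml_symbols is a hypothetical helper name, and the sample text is copied from the listing above.

```python
import re

# Any identifier starting with the SVML prefix.
SVML_SYMBOL = re.compile(r"__svml_\w+")


def find_svml_symbols(asm_text):
    """Return the sorted, de-duplicated SVML symbol names found in asm_text."""
    return sorted(set(SVML_SYMBOL.findall(asm_text)))


sample_asm = """
    movabsq $__svml_hypot2, %rax
    callq   *%rax
    movabsq $__svml_log102, %rax
    callq   *%rax
"""

print(find_svml_symbols(sample_asm))
```

With Numba, the text to scan would come from foo.inspect_asm(foo.signatures[0]); an empty list means no SVML call was selected for that signature.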

Hope this helps?

cmartinezdem commented 4 years ago

Hi @stuartarchibald.

I have tried what you suggested, but there is no __svml_ prefix in the assembly.

The following checks return False after compiling the function:

print('svml' in foo.inspect_asm(foo.signatures[0]))
print('intel_svmlcc' in foo.inspect_llvm(foo.signatures[0]))

Thank you so much!

stuartarchibald commented 4 years ago

Locally, I have this:

$ python -c 'import numba; print(numba.__version__)'
0.48.0

$ cat issue5562.py
import numpy as np
import numba as nb

@nb.njit(fastmath=True)
def foo(a, b):
    assert a.size == b.size

    error = 0.0
    for i in range(a.size):
        ratio = np.divide(a[i], b[i])

        magnitude_error = np.log10(np.abs(ratio))
        phase_error = ((np.angle(ratio) + np.pi) % (2.0 * np.pi)) - np.pi

        error += ((magnitude_error**2.0 + phase_error**2.0) / 2.0)

    return error

a = np.arange(1, 1000) + 1j * np.arange(1, 1000)
b = np.arange(1, 1000) + 1j * np.arange(1, 1000)

# Compile function
foo(a, b)

print(foo.inspect_asm(foo.signatures[0]))

$ python issue5562.py |grep svml
        movabsq $__svml_hypot2, %rax
        movabsq $__svml_log102, %rax
        movabsq $__svml_atan22, %rax

so it's working at least somewhere. Will try emulating your CPU shortly.

stuartarchibald commented 4 years ago

Emulating your skylake chip and using this code:

import numpy as np
import numba as nb
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')

@nb.njit(fastmath=True)
def foo(a, b):
    assert a.size == b.size

    error = 0.0
    for i in range(a.size):
        ratio = np.divide(a[i], b[i])

        magnitude_error = np.log10(np.abs(ratio))
        phase_error = ((np.angle(ratio) + np.pi) % (2.0 * np.pi)) - np.pi

        error += ((magnitude_error**2.0 + phase_error**2.0) / 2.0)

    return error

a = np.arange(1, 1000) + 1j * np.arange(1, 1000)
b = np.arange(1, 1000) + 1j * np.arange(1, 1000)

# Compile function
ty = nb.types.complex128[:]
foo.compile((ty, ty))

print(foo.inspect_asm(foo.signatures[0]))

and running like so:

NUMBA_CPU_NAME='skylake' NUMBA_CPU_FEATURES='' python issue5562.py |grep svml

gives:

LV: Interleaving because of reductions.
LV: Interleaving is not beneficial.
LV: Found a vectorizable loop (4) in foo
Setting best plan to VF=4, UF=1
LV: Interleaving disabled by the pass manager
LV(SVML): Vector call inst:  %46 = call fast <4 x double> @__svml_hypot4(<4 x double> %45, <4 x double> %43)
LV(SVML): Type Bit Width: 64
LV(SVML): Current VL: 4
LV(SVML): Vector Bit Width: 256
LV(SVML): Legal Target VL: 4
LV: Completed SVML legalization.
 LegalV:   %46 = call fast intel_svmlcc <4 x double> @__svml_hypot4(<4 x double> %45, <4 x double> %43)
LV(SVML): Vector call inst:  %47 = call fast <4 x double> @__svml_log104(<4 x double> %46)
LV(SVML): Type Bit Width: 64
LV(SVML): Current VL: 4
LV(SVML): Vector Bit Width: 256
LV(SVML): Legal Target VL: 4
LV: Completed SVML legalization.
 LegalV:   %47 = call fast intel_svmlcc <4 x double> @__svml_log104(<4 x double> %46)
LV(SVML): Vector call inst:  %48 = call fast <4 x double> @__svml_atan24(<4 x double> %43, <4 x double> %45)
LV(SVML): Type Bit Width: 64
LV(SVML): Current VL: 4
LV(SVML): Vector Bit Width: 256
LV(SVML): Legal Target VL: 4
LV: Completed SVML legalization.
 LegalV:   %48 = call fast intel_svmlcc <4 x double> @__svml_atan24(<4 x double> %43, <4 x double> %45)
        movabsq $__svml_atan24, %r13
        movabsq $__svml_hypot4, %rax
        movabsq $__svml_log104, %rax
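In the output above, the trailing digit of each symbol appears to be the vector length (hypot4 is called with <4 x double>, while the earlier SSE-width run produced hypot2). A sketch of a decoder follows, assuming that naming pattern holds; the list of base names is only what appears in this thread, and decode_svml_symbol is a hypothetical helper.

```python
import re

# Base names seen in this thread. A list is needed because names such as
# log10 and atan2 themselves end in digits.
KNOWN_FUNCTIONS = ("hypot", "log10", "atan2", "sqrt")


def decode_svml_symbol(symbol):
    """Split e.g. '__svml_log104' into ('log10', 4).

    Returns None when the symbol does not match a known base name.
    """
    for name in KNOWN_FUNCTIONS:
        match = re.fullmatch(rf"__svml_{name}(\d+)", symbol)
        if match:
            return name, int(match.group(1))
    return None


for sym in ("__svml_atan24", "__svml_hypot4", "__svml_log104", "__svml_hypot2"):
    print(sym, "->", decode_svml_symbol(sym))
```

A vector length of 4 with 64-bit elements matches the "Vector Bit Width: 256" lines in the debug output.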

stuartarchibald commented 4 years ago

Does a function like:

from numba import njit
import numpy as np

@njit(fastmath=True)
def foo(x):
    acc = 0
    for i in x:
        y = np.sqrt(i)
        acc += y
    return acc

foo(np.arange(100.))
print(foo.inspect_asm(foo.signatures[0]))

produce SVML calls for you?

cmartinezdem commented 4 years ago

Yes, it does. I can see the following line in the assembly code:

movabsq $__svml_sqrt4, %r12

cmartinezdem commented 4 years ago

Doing

import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')

# Compile function
ty = nb.types.complex128[:]
foo.compile((ty, ty))

print(foo.inspect_asm(foo.signatures[0]))

gives no SVML output in my case.

stuartarchibald commented 4 years ago

It looks like you are not using the Anaconda distribution of numba/llvmlite/llvm. I wonder if that's causing the difference we're seeing?

cmartinezdem commented 4 years ago

I have downloaded the official Anaconda distribution. This is the output of numba -s (there are some CUDA related errors, but I don't care about them):

Time Stamp
2020-04-15 18:55:32.326376

Hardware Information
Machine : AMD64
CPU Name : skylake
CPU count : 8
CPU Features : 64bit adx aes avx avx2 bmi bmi2 clflushopt cmov cx16 f16c fma fsgsbase invpcid lzcnt mmx movbe pclmul popcnt prfchw rdrnd rdseed sahf sgx sse sse2 sse3 sse4.1 sse4.2 ssse3 xsave xsavec xsaveopt xsaves

OS Information
Platform : Windows-10-10.0.18362-SP0
Release : 10
System Name : Windows
Version : 10.0.18362
OS specific info : 1010.0.18362SP0

Python Information
Python Compiler : MSC v.1916 64 bit (AMD64)
Python Implementation : CPython
Python Version : 3.7.6
Python Locale : en_GB cp1252

LLVM information
LLVM version : 8.0.0

CUDA Information
Found 1 CUDA devices
id 0 b'GeForce MX150' [SUPPORTED]
compute capability: 6.1
pci device id: 0
pci bus id: 1
Summary:
1/1 devices are supported
CUDA driver version : 10000
CUDA libraries:
Finding cublas from named cublas.dll
trying to open library... ERROR: failed to open cublas: [WinError 126] The specified module could not be found
Finding cusparse from named cusparse.dll
trying to open library... ERROR: failed to open cusparse: [WinError 126] The specified module could not be found
Finding cufft from named cufft.dll
trying to open library... ERROR: failed to open cufft: [WinError 126] The specified module could not be found
Finding curand from named curand.dll
trying to open library... ERROR: failed to open curand: [WinError 126] The specified module could not be found
Finding nvvm from named nvvm.dll
trying to open library... ERROR: failed to open nvvm: [WinError 126] The specified module could not be found
Finding libdevice from
searching for compute_20... ERROR: can't open libdevice for compute_20
searching for compute_30... ERROR: can't open libdevice for compute_30
searching for compute_35... ERROR: can't open libdevice for compute_35
searching for compute_50... ERROR: can't open libdevice for compute_50

ROC Information
ROC available : False
Error initialising ROC due to : No ROC toolchains found.
No HSA Agents found, encountered exception when searching:
Error at driver init:

HSA is not currently supported on this platform (win32).
:

SVML Information
SVML state, config.USING_SVML : True
SVML library found and loaded : True
llvmlite using SVML patched LLVM : True
SVML operational : True

Threading Layer Information
TBB Threading layer available : True
OpenMP Threading layer available : True
Workqueue Threading layer available : True

Numba Environment Variable Information
None set.

Conda Information
conda_build_version : 3.18.11
conda_env_version : 4.8.2
platform : win-64
python_version : 3.7.6.final.0
root_writable : True

After testing your previous example:

import numpy as np
import numba as nb
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')

@nb.njit(fastmath=True)
def foo(a, b):
    assert a.size == b.size

    error = 0.0
    for i in range(a.size):
        ratio = np.divide(a[i], b[i])

        magnitude_error = np.log10(np.abs(ratio))
        phase_error = ((np.angle(ratio) + np.pi) % (2.0 * np.pi)) - np.pi

        error += ((magnitude_error**2.0 + phase_error**2.0) / 2.0)

    return error

a = np.arange(1, 1000) + 1j * np.arange(1, 1000)
b = np.arange(1, 1000) + 1j * np.arange(1, 1000)

# Compile function
ty = nb.types.complex128[:]
foo.compile((ty, ty))

print(foo.inspect_asm(foo.signatures[0]))

the output is the same: it contains nothing about SVML. The LLVM debug output is the following:

LV: Checking a loop in "_ZN8__main__7foo$246E5ArrayI10complex128Li1E1A7mutable7alignedE5ArrayI10complex128Li1E1A7mutable7alignedE" from foo
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B36
LV: Found an induction variable.
LV: Found a non-intrinsic callsite.
LV: Can't vectorize the instructions or CFG
LV: Not vectorizing: Cannot prove legality.

LV: Checking a loop in "_ZN8__main__7foo$247E5ArrayI10complex128Li1E1A7mutable7alignedE5ArrayI10complex128Li1E1A7mutable7alignedE" from foo
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B36
LV: Found an induction variable.
LV: Found a non-intrinsic callsite.
LV: Can't vectorize the instructions or CFG
LV: Not vectorizing: Cannot prove legality.
[I 20:59:48.441 NotebookApp] Saving file at /notebooks/NUMBA_TESTS.ipynb

LV: Checking a loop in "_ZN8__main__7foo$248E5ArrayI10complex128Li1E1A7mutable7alignedE5ArrayI10complex128Li1E1A7mutable7alignedE" from foo
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B36
LV: Found an induction variable.
LV: Found a non-intrinsic callsite.
LV: Can't vectorize the instructions or CFG
LV: Not vectorizing: Cannot prove legality.

I don't know whether this tells you anything...

It seems that the issue is not related to whether the Anaconda distribution is used.

stuartarchibald commented 4 years ago

Thanks for reporting back. I think I know what is likely to be causing this. Just to confirm, could you please try running these:

import numpy as np
import numba as nb
import math
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')

@nb.njit(fastmath=True)
def foo_hypot(a,):

    error = 0.0
    for i in range(a.size):
        error += math.hypot(a[i].real, a[i].imag)

    return error

ty = nb.types.complex128[:]
foo_hypot.compile((ty,))

and then

import numpy as np
import numba as nb
import math
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')

@nb.njit(fastmath=True)
def foo_atan2(a,):

    error = 0.0
    for i in range(a.size):
        error += math.atan2(a[i].real, a[i].imag)

    return error

ty = nb.types.complex128[:]
foo_atan2.compile((ty,))

and then dump out the output here please?

On Windows I'd expect the first (foo_hypot) to say it "Found a non-intrinsic callsite" (which will prevent loop vectorization), and the second (foo_atan2) to produce a lot of output, with "We can vectorize this loop!" appearing near the top.

If this is the case, I'll explain why and then try and work out how to fix it. Thanks.
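When there is a lot of debug output, the two messages mentioned above can be picked out automatically. This is a rough sketch that assumes one "LV:" message per line; summarize_vectorizer_log is a hypothetical helper, and the sample lines are taken from outputs posted in this thread.

```python
def summarize_vectorizer_log(log_text):
    """Return a list of (label, verdict) pairs, one per loop the vectorizer checked.

    The label is whatever follows ' from ' on the 'Checking a loop' line
    (in this thread it matches the Python function name).
    """
    loops = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("LV: Checking a loop in"):
            label = line.rsplit(" from ", 1)[-1]
            loops.append([label, "unknown"])
        elif not loops:
            continue  # ignore anything before the first loop header
        elif line.startswith("LV: We can vectorize this loop!"):
            loops[-1][1] = "can vectorize"
        elif line.startswith("LV: Not vectorizing"):
            loops[-1][1] = "not vectorizing"
    return [tuple(item) for item in loops]


sample_log = """
LV: Checking a loop in "_ZN8__main__13foo_hypot$241E5ArrayI10complex128Li1E1A7mutable7alignedE" from foo_hypot
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B18.endif
LV: Found an induction variable.
LV: Found a non-intrinsic callsite.
LV: Can't vectorize the instructions or CFG
LV: Not vectorizing: Cannot prove legality.
LV: Checking a loop in "_ZN8__main__11foo_pow$243E5ArrayI10complex128Li1E1A7mutable7alignedE" from foo_pow
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B18
LV: Found an induction variable.
LV: We can vectorize this loop!
"""

for label, verdict in summarize_vectorizer_log(sample_log):
    print(label, "->", verdict)
```

Note that "can vectorize" only reflects the legality check; the cost model may still decide against vectorizing a loop.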

cmartinezdem commented 4 years ago

Hi @stuartarchibald.

This is the output of the first function:

LV: Checking a loop in "_ZN8__main__13foo_hypot$241E5ArrayI10complex128Li1E1A7mutable7alignedE" from foo_hypot
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B18.endif
LV: Found an induction variable.
LV: Found a non-intrinsic callsite.
LV: Can't vectorize the instructions or CFG
LV: Not vectorizing: Cannot prove legality.

LV: Checking a loop in "_ZN7cpython8__main__13foo_hypot$241E5ArrayI10complex128Li1E1A7mutable7alignedE" from foo_hypot
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B18.endif.i
LV: Found an induction variable.
LV: Found a non-intrinsic callsite.
LV: Can't vectorize the instructions or CFG
LV: Not vectorizing: Cannot prove legality.

The second one:

LV: Checking a loop in "_ZN8__main__13foo_atan2$241E5ArrayI10complex128Li1E1A7mutable7alignedE" from foo_atan2
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B18
LV: Found an induction variable.
LV: Found a non-intrinsic callsite.
LV: Can't vectorize the instructions or CFG
LV: Not vectorizing: Cannot prove legality.

LV: Checking a loop in "_ZN7cpython8__main__13foo_atan2$241E5ArrayI10complex128Li1E1A7mutable7alignedE" from foo_atan2
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B18.i
LV: Found an induction variable.
LV: Found a non-intrinsic callsite.
LV: Can't vectorize the instructions or CFG
LV: Not vectorizing: Cannot prove legality.

Actually, both outputs are pretty similar. The result is the same if I do it with Anaconda's Numba distribution.

Thank you for your support.

stuartarchibald commented 4 years ago

Thanks for trying those. I'm now convinced that there are multiple things going on here. Can you please try this one?

import numpy as np
import numba as nb
import math
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')

@nb.njit(fastmath=True)
def foo_pow(a,):

    error = 0.0
    for i in range(a.size):
        error += math.pow(a[i].real, a[i].imag)

    return error

ty = nb.types.complex128[:]
foo_pow.compile((ty,))

If this vectorizes then my hypothesis is correct and I'll explain everything. Thanks in advance.

cmartinezdem commented 4 years ago

This one works! These are the first lines of the output:

LV: Checking a loop in "_ZN8__main__11foo_pow$243E5ArrayI10complex128Li1E1A7mutable7alignedE" from foo_pow
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B18
LV: Found an induction variable.
LV: We can vectorize this loop!
LV: The Smallest and Widest types: 64 / 64 bits.
LV: The Widest register safe to use is: 256 bits.
LV: Found uniform instruction: %exitcond = icmp eq i64 %.155, %arg.a.2
LV: Found uniform instruction: %exitcond = icmp eq i64 %.155, %arg.a.2
LV: Scalarizing: %.238 = mul i64 %.76.04, %arg.a.6.0
LV: Scalarizing: %.240 = add i64 %.238, %.239
LV: Scalarizing: %.241 = inttoptr i64 %.240 to { double, double }*
LV: Scalarizing: %.242 = getelementptr inbounds { double, double }, { double, double }* %.241, i64 0, i32 0
LV: Scalarizing: %.243 = load double, double* %.242, align 8
LV: Scalarizing: %.244 = getelementptr inbounds { double, double }, { double, double }* %.241, i64 0, i32 1
LV: Scalarizing: %.245 = load double, double* %.244, align 8
LV: Scalarizing: %.315 = tail call fast double @llvm.pow.f64(double %.243, double %.245)
LV: Scalarizing: %.327 = fadd fast double %.315, %error.06

stuartarchibald commented 4 years ago

Great, thanks... so this is what I think is going on.

First, the key information: a non-intrinsic call in a loop will break the vectorizer, and vectorization across an intrinsic math function is a prerequisite for SVML instruction selection.

In the case of hypot, Windows has historically "spelled" the intrinsic function differently: https://github.com/numba/numba/blob/release0.48/numba/targets/mathimpl.py#L363-L369. This means it appears as a non-intrinsic call in the LLVM IR and breaks loop vectorization. Given that Windows has moved on a lot since that code was added (most probably to help deal with Python 2 and older versions of Visual Studio), I think it would be safe to change it to just use the LLVM intrinsic, but I will need to check. That would then permit the loop to vectorize and, because the intrinsic is "spelled" as hypot, SVML should be able to do instruction selection replacement: https://github.com/numba/llvmlite/blob/master/conda-recipes/D47188-svml-VF.patch#L182

In the case of atan2, 0.48 contains code that was historically needed to "fix" the function on Windows by replacing it with a version shipped by Numba: https://github.com/numba/numba/blob/release0.48/numba/targets/mathimpl.py#L323-L327. This has been removed in 0.49, as atan2 on Windows now seems to be fine. Again, the use of this function would prevent vectorization as it's non-intrinsic.

I'd expect pow to work, as nothing "special" had to be done to get it working on Windows.

I think I can probably fix the hypot problem, which in turn will fix your original code where this line:

magnitude_error = np.log10(np.abs(ratio))

is the cause of the vectorization breakage. This is because np.abs(complex value) is compiled as a hypot call and, as explained above, the non-intrinsic hypot function on Windows prevents loop vectorization and also prevents use of the SVML hypot.

Thanks for helping diagnose this.

cmartinezdem commented 4 years ago

OK, very interesting!

I usually work with complex numbers, so being able to vectorize these kinds of functions would be great.

Thank you so much.

cmartinezdem commented 4 years ago

After carrying out some tests, I have found a simple workaround for those who want to speed up np.abs() for complex numbers on Windows 10:

@nb.njit(fastmath=True)
def foo_abs(a):
    output = np.empty(a.size, dtype=np.float64)

    for i in range(a.size):
        output[i] = np.sqrt(a[i].real**2.0 + a[i].imag**2.0)

    return output

This code gets vectorized on Windows 10 with Numba 0.49 and Python 3.8.2. However, being able to vectorize the hypot function would still be great.
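One caveat with this kind of workaround is numerical range. The sketch below uses plain Python and the math module rather than Numba-compiled code, and the function names are made up for illustration. The sqrt form agrees with hypot for ordinary values, but squaring can overflow for very large components where hypot still returns a finite result.

```python
import math


def abs_via_hypot(z):
    """Magnitude computed the way a hypot call would."""
    return math.hypot(z.real, z.imag)


def abs_via_sqrt(z):
    """Magnitude via sqrt(re^2 + im^2), as in the workaround above.

    Multiplication is used instead of ** so that overflow yields inf
    rather than raising OverflowError in pure Python.
    """
    return math.sqrt(z.real * z.real + z.imag * z.imag)


z = complex(3.0, 4.0)
print(abs(z), abs_via_hypot(z), abs_via_sqrt(z))

big = complex(1e200, 1e200)
print(abs_via_hypot(big))  # finite
print(abs_via_sqrt(big))   # overflows to inf
```

For data whose components stay far from the float64 limits, the two forms should be interchangeable for practical purposes.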

Thank you for your support.

stuartarchibald commented 4 years ago

Great, glad something is working.

davo417 commented 3 years ago

I'm not having this problem, but I read the entire conversation just because @stuartarchibald cares so much and takes the time to explain the problem and the solution.

stuartarchibald commented 2 years ago

Thanks for your support @davidcode43w, much appreciated :-)

AdityaSoni19031997 commented 2 years ago

Even though I am not part of this thread, this is a beautiful conversation! Thanks!


# code to reproduce the behaviour while computing Euclidean Distances.

import numba
from numba import prange
import numpy as np

# generate
a = np.random.rand(4153344, 128).astype("float32")
b = np.random.rand(128).astype("float32")

print(type(a))
print(a.shape)
print(b.shape)

@numba.njit(fastmath=True, parallel=True)
def numba_ecd_dist(a, b):
    dist = np.zeros(a.shape[0])
    for r in prange(a.shape[0]):
        d = 0
        for c in prange(128):
            d += (b[c] - a[r, c])**2
        dist[r] = d
    return dist

# let it compile itself for the first time

import time
t = time.time()
res =  numba_ecd_dist(a, b)
elapsed = time.time() - t
print(res.shape)
print(elapsed)

t = time.time()
res = numba_ecd_dist(a, b)
elapsed = time.time() - t
print(res.shape)
print(elapsed)

# we are comparing the above with scikit's ED's.
from sklearn.metrics.pairwise import euclidean_distances as ED
t = time.time()
res = ED(a, b.reshape(1, -1))
elapsed = time.time() - t
print(res.shape)
print(elapsed)

# o/p

"""
wrt numba section ('0.54.1')
<class 'numpy.ndarray'>
(4153344, 128)
(128,)
(4153344,)
1.011427879333496
(4153344,)
0.276076078414917

wrt scikit's ED section
(4153344, 1)
1.041219711303711
"""
# print('svml' in numba_ecd_dist.inspect_asm(numba_ecd_dist.signatures[0]))              # False
# print('intel_svmlcc' in numba_ecd_dist.inspect_llvm(numba_ecd_dist.signatures[0])) # False
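The warm-up-then-measure pattern used above can be wrapped in a small helper so the first call (which includes JIT compilation for a Numba function) is reported separately from steady-state calls. This is only a sketch; time_call is a hypothetical helper, and the built-in sum stands in for the jitted function.

```python
import time


def time_call(fn, *args, repeats=3):
    """Call fn once as a warm-up, then return (warmup_seconds, best_seconds).

    best_seconds is the fastest of `repeats` further calls.
    """
    start = time.perf_counter()
    fn(*args)
    warmup = time.perf_counter() - start

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return warmup, best


# Stand-in workload; with Numba this would be time_call(numba_ecd_dist, a, b).
warmup, best = time_call(sum, range(100_000))
print(f"first call: {warmup:.6f} s, best of 3: {best:.6f} s")
```

Taking the best of several runs reduces the influence of one-off noise such as other processes competing for the CPU.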

So, my question is: can the Numba function be pushed further in terms of performance/speed? I am on macOS with an Intel i5, using Numba '0.54.1' and Python 3.7.x.

Thanks for the help!

PS: My Jupyter notebook also sometimes dies while computing this, but it runs without dying when I run it interactively in IPython.

cc @stuartarchibald ! (Thanks!)

numba -s output

~ numba -s
System info:
--------------------------------------------------------------------------------
__Time Stamp__
Report started (local time) : 2021-12-20 06:00:19.108374
UTC start time : 2021-12-20 00:30:19.108385
Running time (s) : 11.401785

__Hardware Information__
Machine : x86_64
CPU Name : icelake-client
CPU Count : 8
Number of accessible CPUs : ?
List of accessible CPUs cores : ?
CFS Restrictions (CPUs worth of runtime) : None
CPU Features : 64bit adx aes avx avx2 avx512bitalg avx512bw avx512cd avx512dq avx512f avx512ifma avx512vbmi avx512vbmi2 avx512vl avx512vnni avx512vpopcntdq bmi bmi2 clflushopt cmov cx16 cx8 f16c fma fsgsbase fxsr gfni invpcid lzcnt mmx movbe pclmul popcnt prfchw rdpid rdrnd rdseed sahf sgx sha sse sse2 sse3 sse4.1 sse4.2 ssse3 vaes vpclmulqdq xsave xsavec xsaveopt xsaves
Memory Total (MB) : 16384
Memory Available (MB) : 11364

__OS Information__
Platform Name : Darwin-19.6.0-x86_64-i386-64bit
Platform Release : 19.6.0
OS Name : Darwin
OS Version : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
OS Specific Version : 10.15.7 x86_64
Libc Version : ?

__Python Information__
Python Compiler : Clang 10.0.0
Python Implementation : CPython
Python Version : 3.7.9
Python Locale : None.UTF-8

__Numba Toolchain Versions__
Numba Version : 0.54.1
llvmlite Version : 0.37.0

__LLVM Information__
LLVM Version : 11.1.0

__CUDA Information__
CUDA Device Initialized : False
CUDA Driver Version : ?
CUDA Runtime Version : ?
CUDA Detect Output:
None
CUDA Libraries Test Output:
None

__SVML Information__
SVML State, config.USING_SVML : False
SVML Library Loaded : False
llvmlite Using SVML Patched LLVM : True
SVML Operational : False

__Threading Layer Information__
TBB Threading Layer Available : True
+-->TBB imported successfully.
OpenMP Threading Layer Available : True
+-->Vendor: Intel
Workqueue Threading Layer Available : True
+-->Workqueue imported successfully.

__Numba Environment Variable Information__
None found.

__Conda Information__
Conda Build : 3.21.4
Conda Env : 4.10.3
Conda Platform : osx-64
Conda Python Version : 3.8.8.final.0
Conda Root Writable : True

numba / numba

No performance improvement with Intel's SVML library #5562