mdmaas / julia-numpy-fortran-test

Comparing Julia vs Numpy vs Fortran for performance and code simplicity
https://www.matecdev.com/posts/numpy-julia-fortran.html
MIT License

Improve Python performance with Numba JIT #4

Open alkalinin opened 3 years ago

alkalinin commented 3 years ago

Introduction to Numba

I recommend using Numba to improve Python performance. Numba is a high-performance JIT compiler for numerical Python. In general you do not need to change your Python code; you only add some annotations to speed up the critical procedures.
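For instance, a minimal sketch of my own (not part of the benchmark below; the function name is made up) showing that the only change to plain Python is the decorator:

import numpy as np
from numba import njit

@njit  # the annotation: the body below stays plain Python/NumPy
def squared_norm(x):
    s = 0.0
    for i in range(x.size):
        s += x[i] * x[i]
    return s

squared_norm(np.ones(1000))  # the first call compiles; later calls run at native speed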

On my machine I got the following results.

Original NumPy timing

C:\Temp\julia-numpy-fortran-test>C:/Dev/Python/python-3.9.5/python.exe c:/Temp/julia-numpy-fortran-test/test.py
[0.0897615 0.28826523 0.58348584 1.00431323 1.70470142 2.31290269 3.16431308 4.26509619 5.34526372 6.57406735]

N = 10000, t=1.0

Numba NumPy timing

import time
import numpy as np
from numba import jit, int32, float64
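# note: passing an explicit signature to @jit compiles the function eagerly
# at import time, so the timed calls below do not pay the compilation cost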

@jit((int32, float64[:, :]))
def get_M_nb(k, A):
    M = np.exp(1j * k * np.sqrt(A ** 2 + A.T ** 2))
    return M

tests = np.arange(1, 11) * 1000
timings = np.zeros(tests.size)
for idx, N in enumerate(tests):
    a = np.linspace(0, 2 * np.pi, N)
    k = 100

    A = a[:, np.newaxis]

    start_time = time.time()
    M = get_M_nb(k, A)
    timings[idx] = time.time() - start_time
print(timings)
C:\Temp\julia-numpy-fortran-test>C:/Dev/Python/python-3.9.5/python.exe c:/Temp/julia-numpy-fortran-test/test_numba.py
[0.02992105 0.12367177 0.2922132  0.59727883 0.78190374 1.14347076 1.44413185 1.9568162  2.57067299 3.11078811]

N = 10000, t=0.47

Numba Parallel explicit loops timing

Numba supports automatic parallelization of explicit loops.

import time
import numpy as np
from numba import jit, int32, float64, prange

@jit((int32, float64[:]), parallel=True)
def get_M_nb_parallel(k, a):
    M = np.zeros((len(a), len(a)), dtype=np.complex128)
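    # prange distributes the iterations of this outer loop across threads
    # (enabled by parallel=True in the decorator)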
    for i in prange(len(a)):
        for j in range(len(a)):
            M[i, j] = np.exp(1j * k * np.sqrt(a[i] ** 2 + a[j] ** 2))

    return M

tests = np.arange(1, 11) * 1000
timings = np.zeros(tests.size)
for idx, N in enumerate(tests):
    a = np.linspace(0, 2 * np.pi, N)
    k = 100

    start_time = time.time()
    M = get_M_nb_parallel(k, a)
    timings[idx] = time.time() - start_time
print(timings)
C:\Temp\julia-numpy-fortran-test>C:/Dev/Python/python-3.9.5/python.exe c:/Temp/julia-numpy-fortran-test/test_numba_parallel.py
[0.02892065 0.09873509 0.17852354 0.20046306 0.32313585 0.47074032 0.68221402 1.00278997 1.01827526 1.38133717]
N = 10000, t=0.21
alkalinin commented 3 years ago

On my machine I use NumPy + OpenBLAS. In my tasks OpenBLAS slightly outperforms MKL.

elephaint commented 3 years ago

I made it slightly simpler and more performant by explicitly enabling fastmath and nopython mode. (Note also that the function is called once up front to JIT-compile it, so that compilation time is not included in the timings.)

Numba

import time
import numpy as np
from numba import njit

@njit(fastmath=True)
def get_M_nb(k, A):
    M = np.exp(1j * k * np.sqrt(A ** 2 + A.transpose() ** 2))
    return M

tests = np.arange(1, 11) * 1000
timings = np.zeros(tests.size)
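# warm-up: call the function once on a tiny input so JIT compilation
# happens here, not inside the timed loop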
a = np.linspace(0, 2 * np.pi, 1)
M = get_M_nb(100, a[:, np.newaxis])

for idx, N in enumerate(tests):
    a = np.linspace(0, 2 * np.pi, N)
    k = 100

    A = a[:, np.newaxis]

    start_time = time.time()
    M = get_M_nb(k, A)
    timings[idx] = time.time() - start_time
print(timings)

Numba parallel

import time
import numpy as np
from numba import njit, prange

@njit(fastmath=True, parallel=True)
def get_M_nb_parallel(k, a):
    M = np.zeros((len(a), len(a)), dtype=np.complex128)
    for i in prange(len(a)):
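        # hoist a[i] ** 2 out of the inner loop so it is computed once per row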
        ais = np.square(a[i])
        for j in range(len(a)):
            M[i, j] = np.exp(1j * k * np.sqrt(ais + a[j] ** 2))

    return M

tests = np.arange(1, 11) * 1000
timings = np.zeros(tests.size)
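# warm-up call to trigger JIT compilation before the timed loop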
a = np.linspace(0, 2 * np.pi, 1)
M = get_M_nb_parallel(100, a)

for idx, N in enumerate(tests):
    a = np.linspace(0, 2 * np.pi, N)
    k = 100

    start_time = time.time()
    M = get_M_nb_parallel(k, a)
    timings[idx] = time.time() - start_time
print(timings)

Benchmark results on my machine are below (the Julia vectorized version did not run on my machine); times are in seconds. Note that in general it would make sense to run each N multiple times, especially for smaller N; a sketch of such repeated timing follows the table.

| Backend \ N | 1000 | 2000 | 3000 | 4000 | 5000 | 6000 | 7000 | 8000 | 9000 | 10000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Julia | 0.005217 | 0.016317 | 0.032251 | 0.070325 | 0.157035 | 0.176432 | 0.161003 | 0.248487 | 0.269996 | 0.335104 |
| Numba | 0.005998 | 0.025001 | 0.056002 | 0.102285 | 0.160998 | 0.231559 | 0.337511 | 0.416000 | 0.591660 | 0.686170 |
| Numba parallel | 0.010999 | 0.019000 | 0.048000 | 0.088000 | 0.142512 | 0.260027 | 0.303027 | 0.390030 | 0.510026 | 0.606508 |
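As a sketch of repeated timing (my addition; it reuses the get_M_nb function defined above), each measurement can be repeated with timeit and the best run kept:

import timeit
import numpy as np

for N in np.arange(1, 11) * 1000:
    a = np.linspace(0, 2 * np.pi, N)
    A = a[:, np.newaxis]
    # run the single call 5 times and keep the minimum, the least noisy estimate
    best = min(timeit.repeat(lambda: get_M_nb(100, A), number=1, repeat=5))
    print(N, best)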

Note that in general this task should be left to GPUs, which run it more than 10x as fast as the CPU (measured with an RTX 2080Ti and PyTorch).
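For reference, a minimal sketch of the GPU variant (my addition, assuming PyTorch with a CUDA device is available; this is not the exact code behind the >10x number above):

import math
import torch

def get_M_torch(k, a):
    A = a[:, None]                             # column vector, shape (N, 1)
    phase = k * torch.sqrt(A ** 2 + A.T ** 2)  # broadcasts to an (N, N) grid
    return torch.exp(1j * phase)               # complex matrix, computed on the GPU

a = torch.linspace(0, 2 * math.pi, 10000, device="cuda")
M = get_M_torch(100.0, a)
torch.cuda.synchronize()  # GPU kernels run asynchronously; sync before reading timings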

mdmaas commented 3 years ago

Nice! Julia seems to be faster on my PC as well; the vectorized version by almost a factor of 3.

mdmaas commented 3 years ago

I believe LoopVectorization.jl requires Julia 1.6 or so. If you've installed Julia with tools like apt-get, you probably got version 1.0.