alkalinin opened this issue 3 years ago
On my machine I use NumPy + OpenBLAS. In my tasks OpenBLAS slightly outperforms MKL.
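For anyone who wants to check which BLAS backend their own NumPy build links against, NumPy can print its build configuration:

```python
import numpy as np

# Prints build information, including the BLAS/LAPACK libraries
# (e.g. OpenBLAS or MKL) this NumPy build links against.
np.show_config()
```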
I made it slightly simpler and more performant by explicitly enabling fastmath and nopython mode (note also that the function is called once up front to JIT-compile it, so that compilation time is not included in the timings).
Numba
```python
import time
import numpy as np
from numba import njit

@njit(fastmath=True)
def get_M_nb(k, A):
    M = np.exp(1j * k * np.sqrt(A ** 2 + A.transpose() ** 2))
    return M

tests = np.arange(1, 11) * 1000
timings = np.zeros(tests.size)

# warm-up call so compilation time is not measured
a = np.linspace(0, 2 * np.pi, 1)
M = get_M_nb(100, a[:, np.newaxis])

for idx, N in enumerate(tests):
    a = np.linspace(0, 2 * np.pi, N)
    k = 100
    A = a[:, np.newaxis]
    start_time = time.time()
    M = get_M_nb(k, A)
    timings[idx] = time.time() - start_time

print(timings)
```
Numba parallel
```python
import time
import numpy as np
from numba import njit, prange

@njit(fastmath=True, parallel=True)
def get_M_nb_parallel(k, a):
    M = np.zeros((len(a), len(a)), dtype=np.complex128)
    for i in prange(len(a)):
        ais = np.square(a[i])
        for j in range(len(a)):
            M[i, j] = np.exp(1j * k * np.sqrt(ais + a[j] ** 2))
    return M

tests = np.arange(1, 11) * 1000
timings = np.zeros(tests.size)

# warm-up call so compilation time is not measured
a = np.linspace(0, 2 * np.pi, 1)
M = get_M_nb_parallel(100, a)

for idx, N in enumerate(tests):
    a = np.linspace(0, 2 * np.pi, N)
    k = 100
    start_time = time.time()
    M = get_M_nb_parallel(k, a)
    timings[idx] = time.time() - start_time

print(timings)
```
Benchmark results

Timings in seconds on my machine (the vectorized Julia version did not run on my machine). Note that in general it would make sense to run every N multiple times, especially for smaller N.
| Backend | 1000 | 2000 | 3000 | 4000 | 5000 | 6000 | 7000 | 8000 | 9000 | 10000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Julia | 0.005217 | 0.016317 | 0.032251 | 0.070325 | 0.157035 | 0.176432 | 0.161003 | 0.248487 | 0.269996 | 0.335104 |
| Numba | 0.005998 | 0.025001 | 0.056002 | 0.102285 | 0.160998 | 0.231559 | 0.337511 | 0.416000 | 0.591660 | 0.686170 |
| Numba parallel | 0.010999 | 0.019000 | 0.048000 | 0.088000 | 0.142512 | 0.260027 | 0.303027 | 0.390030 | 0.510026 | 0.606508 |
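Running each N multiple times, as suggested above, can be done with `timeit.repeat`. A minimal sketch (using a hypothetical plain-NumPy `get_M` for brevity; substitute the Numba versions from above):

```python
import timeit
import numpy as np

# Hypothetical plain-NumPy version of the kernel, used only to
# illustrate repeated timing.
def get_M(k, A):
    return np.exp(1j * k * np.sqrt(A ** 2 + A.T ** 2))

a = np.linspace(0, 2 * np.pi, 500)
A = a[:, np.newaxis]

# repeat=5 runs the statement five times; the minimum is the
# figure least affected by system noise
times = timeit.repeat(lambda: get_M(100, A), number=1, repeat=5)
print(min(times))
```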
Note that in general this task should be left to GPUs, which do it more than 10x as fast as the CPU (RTX 2080 Ti with PyTorch).
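A hedged sketch of what the GPU version might look like with PyTorch (the function name `get_M_torch` is my own, and the snippet falls back to CPU when no GPU is available):

```python
import math
import torch

# Hypothetical sketch, not the benchmark code from this thread:
# build the same N x N matrix on GPU when available, else CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

def get_M_torch(k, a):
    A = a[:, None]
    # broadcasting (N, 1) against (1, N) gives the full N x N matrix;
    # multiplying by 1j promotes the result to a complex tensor
    return torch.exp(1j * k * torch.sqrt(A ** 2 + A.T ** 2))

a = torch.linspace(0.0, 2.0 * math.pi, 1000, device=device)
M = get_M_torch(100.0, a)
print(M.shape, M.device)
```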
Nice! Julia seems to be faster on my PC as well; the vectorized version, by almost a factor of 3.
I believe LoopVectorization.jl requires Julia 1.6 or so. If you've installed Julia with tools like apt-get you probably got version 1.0.
Introduction to Numba
I recommend using Numba to improve Python performance. Numba is a high-performance JIT compiler for numerical Python. In general you do not need to change your Python code; you only add some annotations to speed up critical procedures.
On my machine I got the following results.
- Original NumPy timing
- Numba NumPy timing
- Numba parallel explicit-loops timing
Numba supports automatic parallelization of explicit loops.