paugier opened this issue 3 years ago
Are you sure it is SIMD instructions? At least, removing the @simd from the Julia loop does not make a difference (does Julia use SIMD instructions on broadcasting?). Similarly, compiling with or without -DUSE_XSIMD does not seem to make a difference for the Pythran code.
One thing I notice is when I change:
```python
for i in range(dim):
    vector[i] = position0[i] - positions[index_p1, i]
...
for i in range(dim):
    accelerations[index_p0, i] -= coef * mass1 * vector[i]
    accelerations[index_p1, i] += coef * mass0 * vector[i]
```
to
```python
vector = position0 - positions[index_p1]
...
accelerations[index_p0] -= coef * mass1 * vector
accelerations[index_p1] += coef * mass0 * vector
```
I see a 10x slowdown.
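To make the comparison concrete, here is a minimal plain-NumPy sketch of the two variants (not Pythran-compiled; the helper names are just illustrative). Both compute the same update, but each line of the sliced version allocates small temporary arrays, which is the usual suspect for this kind of slowdown on tiny (length-3) operands:

```python
import numpy as np


def update_scalar(accelerations, positions, vector, index_p0, index_p1,
                  coef, mass0, mass1, dim=3):
    # Explicit scalar loops: no temporary arrays are allocated.
    position0 = positions[index_p0]
    for i in range(dim):
        vector[i] = position0[i] - positions[index_p1, i]
    for i in range(dim):
        accelerations[index_p0, i] -= coef * mass1 * vector[i]
        accelerations[index_p1, i] += coef * mass0 * vector[i]


def update_sliced(accelerations, positions, vector, index_p0, index_p1,
                  coef, mass0, mass1):
    # Whole-row expressions: each line creates length-3 temporaries.
    vector[:] = positions[index_p0] - positions[index_p1]
    accelerations[index_p0] -= coef * mass1 * vector
    accelerations[index_p1] += coef * mass0 * vector


rng = np.random.default_rng(0)
pos = rng.normal(size=(4, 3))
acc_a = np.zeros((4, 3))
acc_b = np.zeros((4, 3))
v = np.empty(3)
update_scalar(acc_a, pos, v, 0, 1, 2.0, 1.5, 0.5)
update_sliced(acc_b, pos, v, 0, 1, 2.0, 1.5, 0.5)
assert np.allclose(acc_a, acc_b)
```

In plain CPython the allocation overhead dominates either way; the interesting question is why the Pythran-compiled sliced version does not lower to the same loop nest.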
Can you give a try to the following code (both _opt and the base one) compiled with Clang, using that branch https://github.com/serge-sans-paille/pythran/pull/1712 and the -mfma flag?
```python
from math import sqrt

import numpy as np

dim = 3


#pythran export compute(float[:,3], float[:], float[:,3])
def compute(accelerations, masses, positions):
    nb_particules = masses.size
    vector = np.empty_like(positions[0])
    acc0 = np.empty_like(accelerations[0])
    for index_p0 in range(nb_particules - 1):
        np.copyto(acc0, accelerations[index_p0])
        for index_p1 in range(index_p0 + 1, nb_particules):
            np.copyto(vector, positions[index_p0] - positions[index_p1])
            distance = sqrt(np.sum(vector ** 2))
            coef = np.power(distance, -3)
            np.copyto(acc0, acc0 + coef * masses[index_p1] * vector)
            accelerations[index_p1] += coef * masses[index_p0] * vector
        accelerations[index_p0] = -acc0


#pythran export compute_opt(float[:,:], float[:], float[:,:])
def compute_opt(accelerations, masses, positions):
    nb_particules = masses.size
    vector = np.empty(dim)
    accs = np.empty(dim)
    for index_p0 in range(nb_particules - 1):
        position0 = positions[index_p0]
        mass0 = masses[index_p0]
        accs.fill(0)
        for index_p1 in range(index_p0 + 1, nb_particules):
            for i in range(dim):
                vector[i] = position0[i] - positions[index_p1, i]
            d2 = 0.
            for i in range(dim):
                d2 = d2 + vector[i] * vector[i]
            distance3 = d2 * sqrt(d2)
            coef1 = masses[index_p1] / distance3
            coef0 = mass0 / distance3
            for i in range(dim):
                accs[i] = accs[i] + coef1 * vector[i]
                accelerations[index_p1, i] = accelerations[index_p1, i] + coef0 * vector[i]
        accelerations[index_p0] -= accs
```
Not faster at home :-( (opt1 and opt2 are the functions in your previous message)
```
shape= (1024, 3)
compute : 1 * norm
norm = 0.00301 s
compute_opt : 0.998 * norm
compute_opt1 : 2.63 * norm
compute_opt2 : 1.02 * norm
```
Same here
```
shape= (1024, 3)
compute : 1 * norm
norm = 0.0029 s
compute_opt : 1 * norm
compute2 : 6.85 * norm
compute3 : 2.41 * norm
compute_opt3 : 1 * norm
```
(compute2 is the compute where I changed the loops over vector into vector operations; the functions numbered 3 are from your message.)
@cycomanic / @paugier even with -mfma?
@serge-sans-paille yes, -mfma did not make a difference (neither with gcc nor clang). Could it be an alignment issue, so that SIMD instructions are not used? Is there a way to check?
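One quick way to inspect alignment from the Python side is to look at the raw data pointer of the NumPy buffers (a sketch; the helper name and the 32-byte boundary, relevant for AVX loads, are just illustrative — NumPy only guarantees itemsize alignment for `np.empty`):

```python
import numpy as np


def alignment_offset(arr, boundary=32):
    # Offset of the data pointer relative to the given boundary;
    # 0 means the buffer starts exactly on that boundary.
    return arr.ctypes.data % boundary


a = np.empty((1024, 3))
print(a.flags["ALIGNED"])        # only checks itemsize alignment
print(alignment_offset(a, 32))   # 0 if 32-byte aligned, else the offset
```

If the buffers turn out to be unaligned, the compiled loops would fall back to unaligned loads, though on recent x86 hardware that alone rarely costs 30%.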
I confirm again: -mfma has no effect.
```
$ pythran -Ofast -march=native -mfma -vv microbench_pythran.py
INFO: sys file exists: /home/pierre/Dev/pythran/pythran/pythran.cfg
INFO: platform file exists: /home/pierre/Dev/pythran/pythran/pythran-linux.cfg
INFO: user file exists: /home/pierre/.pythranrc
INFO: pythranrc section [pythran] is valid and options are correct
INFO: pythranrc section [typing] is valid and options are correct
INFO: pythranrc section [compiler] is valid and options are correct
running build_ext
new_compiler returns <class 'distutils.unixccompiler.UnixCCompiler'>
building 'microbench_pythran' extension
C compiler: clang++-10 -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC
creating /tmp/tmp6jf_ia_m/tmp
compile options: '-DENABLE_PYTHON_MODULE -D__PYTHRAN__=3 -DPYTHRAN_BLAS_OPENBLAS -I/home/pierre/Dev/pythran/pythran -I/home/pierre/Dev/pythran/pythran/pythonic/patch -I/home/pierre/.pyenv/versions/3.8.2/lib/python3.8/site-packages/numpy/core/include -I/home/pierre/.pyenv/versions/3.8.2/include/python3.8 -c'
extra options: '-std=c++11 -fno-math-errno -w -fvisibility=hidden -fno-wrapv -Ofast -march=native -mfma'
clang++-10: /tmp/tmpu_itryoy.cpp
clang++-10 -pthread -shared -L/home/pierre/.pyenv/versions/3.8.2/lib -L/home/pierre/.pyenv/versions/3.8.2/lib /tmp/tmp6jf_ia_m/tmp/tmpu_itryoy.o -L/usr/lib/x86_64-linux-gnu -lopenblas -lopenblas -o /tmp/tmpuddb59eb/microbench_pythran.cpython-38-x86_64-linux-gnu.so -fvisibility=hidden -Wl,-strip-all -Ofast -march=native -mfma
INFO: Generated module: microbench_pythran
INFO: Output: /home/pierre/Dev/nbabel/py/microbench/__pythran__/microbench_pythran.cpython-38-x86_64-linux-gnu.so
```
Looking at the discussion of the Julia folks about optimising this here: should the arrays be transposed for better SIMD access? They report a significant benefit from laying the arrays out as Nx4; does that not mean the Python code should be laid out as 4xN?
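For what it's worth, a sketch of what the transposed, structure-of-arrays layout would look like in NumPy (names are illustrative). With a dim x N array each coordinate is contiguous across particles, so a loop over particles gets unit-stride loads, which is what auto-vectorizers want:

```python
import numpy as np

rng = np.random.default_rng(0)
positions = rng.normal(size=(1024, 3))   # current layout: N x 3 (AoS)

# Transposed layout: 3 x N (SoA). ascontiguousarray makes each
# coordinate row contiguous in memory instead of strided.
positions_t = np.ascontiguousarray(positions.T)
assert positions_t.flags["C_CONTIGUOUS"]

# Example: all difference vectors relative to particle 0; each
# coordinate is one contiguous row -> unit-stride over particles.
diffs = positions_t - positions_t[:, 0:1]
assert np.allclose(diffs.T, positions - positions[0])
```

Whether Pythran's generated loops actually vectorize better over this layout would still have to be measured.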
@paugier / @cycomanic can you do an extra run with an updated master?
Nope, it's not better with master.
This is my laptop, so slightly different numbers, but I still don't see a difference with and without -mfma:

```
$ transonic microbench_pythran.py -af "-march=native -Ofast"
1 files created or updated needs to be pythranized
$ python3 microbench_pythran.py
shape= (1024, 3)
compute : 1 * norm
norm = 0.0039 s
compute_opt : 1.02 * norm
compute_opt1 : 6.13 * norm
compute_opt2 : 1.04 * norm

$ transonic microbench_pythran.py -af "-march=native -Ofast -mfma"
1 files created or updated needs to be pythranized
$ python3 microbench_pythran.py
shape= (1024, 3)
compute : 1 * norm
norm = 0.00402 s
compute_opt : 0.995 * norm
compute_opt1 : 6 * norm
compute_opt2 : 1 * norm
```
@cycomanic, it would be interesting to compare with what you get on your computer with Julia. What is the output for you of make julia_tuple?
@paugier sorry yes forgot.
```
$ julia microbench_ju_tuple.jl
With Tuple{Float64, Float64, Float64}:
  2.862 ms (0 allocations: 0 bytes)
With Tuple{Float64, Float64, Float64, Float64}:
  2.644 ms (0 allocations: 0 bytes)
```
So I don't see the large difference between the 3- and 4-element methods with Julia on this computer (I recall it was larger on my desktop at home; I will update once I get home). Both methods are about 40% faster than Pythran.
As a follow-up, on my home desktop (Ryzen 3600), compared to my laptop (i7), Pythran fares slightly better, but there is also no difference with and without -mfma:
```
$ julia microbench_ju_tuple.jl
With Tuple{Float64, Float64, Float64}:
  2.669 ms (0 allocations: 0 bytes)
With Tuple{Float64, Float64, Float64, Float64}:
  2.125 ms (0 allocations: 0 bytes)

$ transonic microbench_pythran.py -af "-march=native -Ofast"
1 files created or updated needs to be pythranized
$ python microbench_pythran.py
shape= (1024, 3)
compute : 1 * norm
norm = 0.00288 s
compute_opt : 1 * norm
compute_opt1 : 4.34 * norm
compute_opt2 : 1 * norm

$ transonic microbench_pythran.py -af "-march=native -Ofast"
1 files created or updated needs to be pythranized
$ python microbench_pythran.py
shape= (1024, 3)
compute : 1 * norm
norm = 0.00304 s
compute_opt : 1.01 * norm
compute_opt1 : 3.61 * norm
compute_opt2 : 1.05 * norm
```
I got a new PR in the project https://github.com/paugier/nbabel for a more efficient Julia implementation. It's now 30% faster than what we get with Pythran. 30% is of course not that big, and for real codes it would not matter. However, since I started to write a reply to the original paper that should be published somewhere, in terms of communication, 30% is a bit too much. Moreover, it would be interesting to understand.
Of course, all the time is spent in one function (compute_acceleration). I wrote a small microbenchmark to show the performance issue (see https://github.com/paugier/nbabel/blob/master/py/microbench/microbench_ju_tuple.jl and https://github.com/paugier/nbabel/blob/master/py/microbench/microbench_pythran.py).
The particularity of the Julia implementation is to use a Tuple{Float64, Float64, Float64, Float64} to represent a point in 3D (a Julia Tuple is immutable). More SIMD instructions can be used, which makes the function 20% faster than with a Tuple of 3 floats.

With Pythran, with dim = 3, it gives (30% slower than Julia with 4 floats):

(There are 2 slightly different implementations to ease experiments.)
And with dim = 4 (this value has to be modified directly in the file), Pythran is slower (40% slower than Julia with 4 floats):

@serge-sans-paille do you understand why Pythran is a bit slower? Do we face a case where Python-NumPy is really not expressive enough?
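The Julia trick of padding a 3D point to 4 floats can be mimicked on the NumPy side (a sketch with illustrative names; the idea is that a zero fourth component leaves the physics unchanged while giving the compiler power-of-two sized, register-friendly rows):

```python
import numpy as np

rng = np.random.default_rng(0)
positions3 = rng.normal(size=(8, 3))

# Pad each 3D point with a zero fourth component -> shape (N, 4).
positions4 = np.zeros((positions3.shape[0], 4))
positions4[:, :3] = positions3

# The squared distance is unchanged, since the padded component is 0.
v3 = positions3[0] - positions3[1]
v4 = positions4[0] - positions4[1]
assert np.isclose(np.sum(v3 ** 2), np.sum(v4 ** 2))
```

The extra component costs 33% more memory traffic, so the layout only pays off if the compiler actually emits full-width SIMD loads over the 4-wide rows.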
A remark that I guess is not the problem, but still: I was surprised by the fact that pythran -P gives something like:

I guess the C++ compiler is smart enough to simplify that, but it's a bit strange...