serge-sans-paille / pythran

Ahead of Time compiler for numeric kernels
https://pythran.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Nbabel again: Julia ~30% faster than Pythran (Julia using 4D points instead of 3D points) #1709

Open · paugier opened this issue 3 years ago

paugier commented 3 years ago

I got a new PR in the project https://github.com/paugier/nbabel with a more efficient Julia implementation. It's now 30% faster than what we get with Pythran. 30% is of course not that big, and for real codes it would not matter. However, since I started to write a reply to the original paper that should be published somewhere, in terms of communication 30% is a bit too much. Moreover, it would be interesting to understand.

Of course, all the time is spent in one function (compute_acceleration). I wrote a small microbenchmark to show the performance issue (see https://github.com/paugier/nbabel/blob/master/py/microbench/microbench_ju_tuple.jl and https://github.com/paugier/nbabel/blob/master/py/microbench/microbench_pythran.py).

The particularity of the Julia implementation is that it uses a Tuple{Float64, Float64, Float64, Float64} to represent a point in 3D (a Julia Tuple is immutable). With the extra padding component, more SIMD instructions can be used, which makes the function 20% faster than with a Tuple of 3 floats.

$ make julia_tuple
julia microbench_ju_tuple.jl
With Tuple{Float64, Float64, Float64}:
  2.344 ms (0 allocations: 0 bytes)
With Tuple{Float64, Float64, Float64, Float64}:
  1.855 ms (0 allocations: 0 bytes)
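(For illustration only, not code from the thread: the same padding idea expressed with NumPy. The array names are hypothetical; the point is that a fourth, always-zero component lets each point fill exactly one 4-wide float64 SIMD register.)

import numpy as np

positions3 = np.random.rand(1024, 3)

# Pad each 3D point with a fourth component that stays zero: it drops out
# of the sums of squares and of the accumulated accelerations, but each
# point now occupies 32 contiguous bytes (4 x float64, one AVX register).
positions4 = np.zeros((1024, 4))
positions4[:, :3] = positions3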

With Pythran and dim = 3, it gives (30% slower than Julia with 4 floats):

make pythran
python microbench_pythran.py
shape= (1024, 3)
compute                          :     1 * norm
norm = 0.00277 s
compute_opt                      :     1 * norm

(There are 2 slightly different implementations to ease experiments.)

And with dim = 4 (this value has to be modified directly in the file), Pythran is slower (40% slower than Julia with 4 floats):

make pythran
python microbench_pythran.py
shape= (1024, 4)
compute                          :     1 * norm
norm = 0.00308 s
compute_opt                      :  1.18 * norm

@serge-sans-paille do you understand why Pythran is a bit slower? Do we face a case where Python-NumPy is really not expressive enough?

A remark that I guess is not the problem, but still: I was surprised by the fact that pythran -P gives something like:

coef = (1.0 / (np.square(math.sqrt(builtins.sum(np.square(vector)))) * math.sqrt(builtins.sum(np.square(vector)))))

I guess the C++ compiler is smart enough to simplify that (algebraically it is just coef = 1 / distance**3, with the same sqrt computed twice), but it's a bit strange...
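(A quick sanity check, not from the thread, that the expanded expression is indeed just distance**-3; the value of vector is an arbitrary example.)

import math
import numpy as np

vector = np.array([0.3, -1.2, 0.5])

# the expression produced by pythran -P
distance = math.sqrt(np.sum(np.square(vector)))
coef = 1.0 / (np.square(distance) * distance)

# the algebraic simplification
assert math.isclose(coef, distance ** -3)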

cycomanic commented 3 years ago

Are you sure it is SIMD instructions? At least, removing the @simd from the Julia loop does not make a difference (does Julia use SIMD instructions on broadcasting?). Similarly, compiling with or without -DUSE_XSIMD does not seem to make a difference for the Pythran code.

One thing I notice is that when I change:

for i in range(dim):
    vector[i] = position0[i] - positions[index_p1, i]
...
for i in range(dim):
    accelerations[index_p0, i] -= coef * mass1 * vector[i]
    accelerations[index_p1, i] += coef * mass0 * vector[i]

to

vector = position0 - positions[index_p1]
...
accelerations[index_p0] -= coef * mass1 * vector
accelerations[index_p1] += coef * mass0 * vector

I see a 10x slowdown.
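(A possible explanation, stated here as an assumption rather than a finding from the thread: in plain NumPy the sliced form allocates a fresh temporary array on every inner iteration, while the explicit loop writes into a preallocated buffer. Whether Pythran's expression templates elide that temporary is exactly what is in question.)

import numpy as np

positions = np.random.rand(1024, 3)
vector = np.empty(3)

# explicit loop: writes into the preallocated buffer, no allocation
for i in range(3):
    vector[i] = positions[0, i] - positions[1, i]

# sliced form: in plain NumPy this builds a new length-3 array each time
vector2 = positions[0] - positions[1]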

serge-sans-paille commented 3 years ago

Can you try the following code (both the _opt and the base version), compiled with clang using the branch from https://github.com/serge-sans-paille/pythran/pull/1712 and with the -mfma flag?

from math import sqrt
import numpy as np

dim = 3

#pythran export compute(float[:,3], float[:], float[:,3])
def compute(accelerations, masses, positions):
    nb_particules = masses.size
    vector = np.empty_like(positions[0])
    acc0 = np.empty_like(accelerations[0])
    for index_p0 in range(nb_particules - 1):
        np.copyto(acc0, accelerations[index_p0])
        for index_p1 in range(index_p0 + 1, nb_particules):
            np.copyto(vector, positions[index_p0] - positions[index_p1])
            distance = sqrt(np.sum(vector ** 2))
            coef = np.power(distance, -3)
            np.copyto(acc0, acc0 + coef * masses[index_p1] * vector)
            accelerations[index_p1] += coef * masses[index_p0] * vector
        accelerations[index_p0] = -acc0

#pythran export compute_opt(float[:,:], float[:], float[:,:])
def compute_opt(accelerations, masses, positions):
    nb_particules = masses.size
    vector = np.empty(dim)
    accs = np.empty(dim)
    for index_p0 in range(nb_particules - 1):
        position0 = positions[index_p0]
        mass0 = masses[index_p0]
        accs.fill(0)
        for index_p1 in range(index_p0 + 1, nb_particules):
            for i in range(dim):
                vector[i] = position0[i] - positions[index_p1, i]
            d2 = 0.
            for i in range(dim):
                d2 = d2 + vector[i] * vector[i]
            distance3 = d2 * sqrt(d2)
            coef1 = masses[index_p1] / distance3
            coef0 = mass0 / distance3
            for i in range(dim):
                accs[i] = accs[i] + coef1 * vector[i]
                accelerations[index_p1, i] = accelerations[index_p1, i] + coef0 * vector[i]
        accelerations[index_p0] -= accs
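(Not part of the original comment: a minimal timing harness, with assumed shapes and iteration counts, showing how the norm-relative numbers quoted in this thread can be reproduced once the module has been compiled with Pythran.)

from timeit import timeit

import numpy as np

from microbench_pythran import compute, compute_opt  # the compiled extension

nb_particles = 1024
positions = np.random.rand(nb_particles, 3)
masses = np.random.rand(nb_particles)
accelerations = np.zeros_like(positions)

norm = timeit(lambda: compute(accelerations, masses, positions), number=10) / 10
t_opt = timeit(lambda: compute_opt(accelerations, masses, positions), number=10) / 10
print(f"norm = {norm:.5f} s")
print(f"compute_opt : {t_opt / norm:.3g} * norm")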

paugier commented 3 years ago

Not faster at home :-( (opt1 and opt2 are the functions from your previous message)

shape= (1024, 3)
compute                          :     1 * norm
norm = 0.00301 s
compute_opt                      : 0.998 * norm
compute_opt1                     :  2.63 * norm
compute_opt2                     :  1.02 * norm

cycomanic commented 3 years ago

Same here

shape= (1024, 3)
compute                          :     1 * norm
norm = 0.0029 s
compute_opt                      :     1 * norm
compute2                         :  6.85 * norm
compute3                         :  2.41 * norm
compute_opt3                     :     1 * norm

(compute2 is the compute where I changed the loops over vector to vector operations; compute3 and compute_opt3 are the functions from your message.)

serge-sans-paille commented 3 years ago

@cycomanic / @paugier even with -mfma?

cycomanic commented 3 years ago

@serge-sans-paille yes, -mfma did not make a difference (neither with gcc nor with clang). Could it be an alignment issue, so that SIMD instructions are not used? Is there a way to check?
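(One way to check, offered as a suggestion rather than something from the thread: disassemble the compiled extension and look for packed FMA instructions on ymm registers, which only show up when the loop was actually vectorized.)

import subprocess

# path taken from the build log below; adjust to your own output
so = "microbench_pythran.cpython-38-x86_64-linux-gnu.so"
asm = subprocess.run(["objdump", "-d", so], capture_output=True, text=True).stdout
fma = [line for line in asm.splitlines() if "vfmadd" in line and "ymm" in line]
print(f"{len(fma)} packed FMA instructions found")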

paugier commented 3 years ago

I confirm again. -mfma has no effect.

$ pythran -Ofast -march=native -mfma -vv microbench_pythran.py                                
INFO:      sys file exists: /home/pierre/Dev/pythran/pythran/pythran.cfg
INFO: platform file exists: /home/pierre/Dev/pythran/pythran/pythran-linux.cfg
INFO:     user file exists: /home/pierre/.pythranrc
INFO: pythranrc section [pythran] is valid and options are correct
INFO: pythranrc section [typing] is valid and options are correct
INFO: pythranrc section [compiler] is valid and options are correct
running build_ext
new_compiler returns <class 'distutils.unixccompiler.UnixCCompiler'>
building 'microbench_pythran' extension
C compiler: clang++-10 -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC

creating /tmp/tmp6jf_ia_m/tmp
compile options: '-DENABLE_PYTHON_MODULE -D__PYTHRAN__=3 -DPYTHRAN_BLAS_OPENBLAS -I/home/pierre/Dev/pythran/pythran -I/home/pierre/Dev/pythran/pythran/pythonic/patch -I/home/pierre/.pyenv/versions/3.8.2/lib/python3.8/site-packages/numpy/core/include -I/home/pierre/.pyenv/versions/3.8.2/include/python3.8 -c'
extra options: '-std=c++11 -fno-math-errno -w -fvisibility=hidden -fno-wrapv -Ofast -march=native -mfma'
clang++-10: /tmp/tmpu_itryoy.cpp
clang++-10 -pthread -shared -L/home/pierre/.pyenv/versions/3.8.2/lib -L/home/pierre/.pyenv/versions/3.8.2/lib /tmp/tmp6jf_ia_m/tmp/tmpu_itryoy.o -L/usr/lib/x86_64-linux-gnu -lopenblas -lopenblas -o /tmp/tmpuddb59eb/microbench_pythran.cpython-38-x86_64-linux-gnu.so -fvisibility=hidden -Wl,-strip-all -Ofast -march=native -mfma
INFO: Generated module: microbench_pythran
INFO: Output: /home/pierre/Dev/nbabel/py/microbench/__pythran__/microbench_pythran.cpython-38-x86_64-linux-gnu.so

cycomanic commented 3 years ago

Looking at the discussion of the Julia folks about optimising this here: should the arrays be transposed for better SIMD access? They discuss that they get a significant benefit from laying the arrays out as Nx4; does that not mean the Python code should be laid out as 4xN?
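(A quick layout check, added for illustration and not from the thread: NumPy is row-major while Julia is column-major, so a NumPy (N, 4) array already has the same memory layout as a Julia 4xN matrix, with the 4 components of each point contiguous.)

import numpy as np

# (N, 4) in C order: the 4 components of each point are contiguous
positions = np.zeros((1024, 4))
assert positions.strides == (32, 8)

# (4, N) in C order: one coordinate axis is contiguous instead
positions_t = np.zeros((4, 1024))
assert positions_t.strides == (8192, 8)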

serge-sans-paille commented 3 years ago

@paugier / @cycomanic can you do an extra run with an updated master?

paugier commented 3 years ago

Nope, it's not better with master.

cycomanic commented 3 years ago

This is my laptop, so the numbers are slightly different, but I still don't see a difference with and without -mfma:

transonic microbench_pythran.py -af "-march=native -Ofast"
1 files created or updated needs to be pythranized
python3 microbench_pythran.py
shape= (1024, 3)
compute                          :     1 * norm
norm = 0.0039 s
compute_opt                      :  1.02 * norm
compute_opt1                     :  6.13 * norm
compute_opt2                     :  1.04 * norm
transonic microbench_pythran.py -af "-march=native -Ofast -mfma"
1 files created or updated needs to be pythranized
python3 microbench_pythran.py
shape= (1024, 3)
compute                          :     1 * norm
norm = 0.00402 s
compute_opt                      : 0.995 * norm
compute_opt1                     :     6 * norm
compute_opt2                     :     1 * norm

paugier commented 3 years ago

@cycomanic, it would be interesting to compare with what you get with Julia on your computer. What is the output of make julia_tuple for you?

cycomanic commented 3 years ago

@paugier sorry, yes, I forgot.

julia microbench_ju_tuple.jl
With Tuple{Float64, Float64, Float64}:
  2.862 ms (0 allocations: 0 bytes)
With Tuple{Float64, Float64, Float64, Float64}:
  2.644 ms (0 allocations: 0 bytes)

So I don't see the large difference between the 3D and 4D methods with Julia on this computer (I recall it was larger on my desktop at home; I will update once I get home). Both methods are about 40% faster than Pythran.

cycomanic commented 3 years ago

As a follow-up: on my home desktop (Ryzen 3600), compared to my laptop (i7), Pythran fares slightly better, but there is again no difference with and without -mfma.

julia microbench_ju_tuple.jl
With Tuple{Float64, Float64, Float64}:
  2.669 ms (0 allocations: 0 bytes)
With Tuple{Float64, Float64, Float64, Float64}:
  2.125 ms (0 allocations: 0 bytes)
transonic microbench_pythran.py -af "-march=native -Ofast"
1 files created or updated needs to be pythranized
python microbench_pythran.py
shape= (1024, 3)
compute                          :     1 * norm
norm = 0.00288 s
compute_opt                      :     1 * norm
compute_opt1                     :  4.34 * norm
compute_opt2                     :     1 * norm
transonic microbench_pythran.py -af "-march=native -Ofast"
1 files created or updated needs to be pythranized
python microbench_pythran.py
shape= (1024, 3)
compute                          :     1 * norm
norm = 0.00304 s
compute_opt                      :  1.01 * norm
compute_opt1                     :  3.61 * norm
compute_opt2                     :  1.05 * norm