serge-sans-paille / pythran

Ahead of Time compiler for numeric kernels
https://pythran.readthedocs.io
BSD 3-Clause "New" or "Revised" License

timings of 4 simple functions implementation #382

Open nbecker opened 9 years ago

nbecker commented 9 years ago

This is a benchmark of a simple function (clip) written in 4 ways:

  1. numpy / c++ (my limit function is written in c++)

  2. pythran without vector

  3. pythran with vector (but using slicing)

  4. c++

1.

import numpy as np
from limit import Limit
def clip (z, _max):
    mask = np.abs(z) > _max
    ## print ('#clipped:', np.sum(mask))
    z[mask] = (Limit (z) * _max)[mask]
    return z

2.

import numpy as np

def limit1 (x, epsilon=1e-6):
    if abs(x) < epsilon:
        return 0
    else:
        return x / abs(x)

#pythran export clip(complex128[], float64)
#pythran export clip(complex128[::], float64)
def clip (z, _max):
    out = np.empty (z.shape, dtype=z.dtype)
    for i in range (len(z)):
        if abs(z[i]) > _max:
            out[i] = limit1 (z[i]) * _max
        else:
            out[i] = z[i]

    return out

3.

import numpy as np

#pythran export limit (complex128[])
def limit (x, epsilon=1e-6):
    out = np.empty (shape=x.shape, dtype=x.dtype)
    #out = np.empty_like (x, dtype=x.dtype)
    mask1 = np.abs(x) < epsilon
    out[mask1] = 0
    mask2 = np.logical_not(mask1)
    out[mask2] = x[mask2] / np.abs(x[mask2])
    return out

#pythran export clip(complex128[], float64)
def clip (z, _max):
    mask = np.abs(z) > _max
    ## print ('#clipped:', np.sum(mask))
    z[mask] = (limit (z[mask]) * _max)
    return z
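One difference between the variants matters when timing them: the masked versions (1 and 3) assign into z and so modify the caller's array, while the loop version (2) fills a fresh out array. Below is a minimal NumPy-only sketch of the two behaviours. The names clip_inplace and clip_copy are illustrative, not from the original files, and the unit-magnitude division stands in for the limit function, assuming _max is larger than epsilon so the epsilon branch never triggers for clipped elements.

```python
import numpy as np

def clip_inplace(z, _max):
    # Masked assignment, as in variants 1 and 3: writes into the caller's array.
    mask = np.abs(z) > _max
    z[mask] = z[mask] / np.abs(z[mask]) * _max
    return z

def clip_copy(z, _max):
    # Same arithmetic on a copy, so the input is left alone (as with variant 2).
    return clip_inplace(z.copy(), _max)

a = np.array([3 + 4j, 0.3 + 0.4j])   # magnitudes 5 and 0.5
b = a.copy()

out_copy = clip_copy(a, 1.0)         # a is left untouched
out_inplace = clip_inplace(b, 1.0)   # b itself is overwritten

print(np.abs(a))          # original magnitudes
print(np.abs(b))          # magnitudes after in-place clipping
print(out_inplace is b)   # the returned array is the input array
```

When a benchmark calls an in-place variant repeatedly on the same array, every call after the first sees data that has already been clipped.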

4. c++ using ndarray library from ndarray.googlecode.com

timings are:

python test_clip.py
numpy               2.94420599937
pythran non-vector  0.991260051727
pythran vector      1.71353602409
c++                 0.41252207756

pbrunet commented 9 years ago

Can you specify the size of your inputs?

nbecker commented 9 years ago

Here is test program:

import numpy as np

from limit import Limit
def clip (z, _max):
    mask = np.abs(z) > _max
    ## print ('#clipped:', np.sum(mask))
    z[mask] = (Limit (z) * _max)[mask]
    return z

from timeit import timeit

u = np.ones (1000000, dtype=complex)

print 'numpy'
print timeit ('clip(u, 10)', 'from __main__ import u, clip', number=100)

from clip import clip as clip2

print 'pythran non-vector'
print timeit ('clip2(u, 10)', 'from __main__ import u, clip2', number=100)

print 'pythran vector'
from clip2 import clip as clip3
print timeit ('clip3(u, 10)', 'from __main__ import u, clip3', number=100)

print 'c++'
from limit import clip as clip4
print timeit ('clip4(u, 10)', 'from __main__ import u, clip4', number=100)

## print timeit ('clip(u[::2], 10)', 'from __main__ import u, clip', number=100)
## print timeit ('clip2(u[::2], 10)', 'from __main__ import u, clip2', number=100)

pbrunet commented 9 years ago

Also, can you replace out[mask2] = x[mask2] / np.abs(x[mask2]) with out[mask2] = np.sign(x[mask2]) ?

nbecker commented 9 years ago

That's not the same thing when x is complex, is it?
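For a complex input, x / np.abs(x) keeps the phase and normalises the magnitude to 1. What np.sign returns for complex arguments has depended on the NumPy version (releases before 2.0 documented a result based on the sign of the real part), so the two are not interchangeable in general. A small sketch with arbitrary sample values, showing only the division form:

```python
import numpy as np

x = np.array([3 + 4j, -1j, 2 - 2j])

# Unit-magnitude value with the same phase as x (x contains no zeros here).
unit = x / np.abs(x)

print(unit)
print(np.abs(unit))                   # all approximately 1
print(np.angle(unit) - np.angle(x))   # phases are unchanged
```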

pbrunet commented 9 years ago

You are right. Sorry, I didn't look at the typing information. For now, I don't know why we don't perform as well as the C++ version, but thank you for the feedback.

Did you use special compilation flags for the C++ version? Optimisation? OpenMP? autovectorization?

nbecker commented 9 years ago

This is the compile command (the acml part probably does nothing):

g++ -o limit.os -c -g -DBOOST_DISABLE_THREADS -O3 -march=native -ftree-vectorize -fstrict-aliasing -ffast-math -DNDEBUG -DBOOST_DISABLE_ASSERTS -std=c++1y -Wall -Wno-unused-local-typedefs -std=c++1y -fPIC -DHAVE_UNURAN=1 -DHAVE_CONSTRAINED=1 -DHAVE_TWISTER_SERIALIZATION=0 -I. -I/usr/local/src/ndarray/include -I/usr/include -I/usr/include/python2.7 -I/home/nbecker/.local/lib/python2.7/site-packages/numpy/core/include -I/usr/include/eigen3 -I/opt/intel/composerxe/ipp/include -I/opt/intel/composerxe/mkl/include -I/opt/acml5.3.0/gfortran64/include limit.cc

nbecker commented 9 years ago

g++ --version
g++ (GCC) 4.9.2 20141101 (Red Hat 4.9.2-1)

nbecker commented 9 years ago

Results persist when I compile the pythran code using the same options:

pythran -v -O3 -march=native -ftree-vectorize -fstrict-aliasing -ffast-math clip.py
pythran -v -O3 -march=native -ftree-vectorize -fstrict-aliasing -ffast-math clip2.py

python test_clip.py
numpy               2.94298911095
pythran non-vector  0.988628864288
pythran vector      1.74669694901
c++                 0.410514116287

joker-eph commented 9 years ago

Can you try removing fast-math and vectorize for GCC?

nbecker commented 9 years ago

python test_clip.py
numpy               2.87930011749
pythran non-vector  0.987450122833
pythran vector      1.73527002335
c++                 0.831676959991

But why did pythran not improve when I added these flags to it?

joker-eph commented 9 years ago

Did you check which of fast-math or vectorize brought the boost? (It may be only the two together.)

Pythran does not improve because the generated C++ code can't be vectorized by gcc (I haven't looked into the details of why). But at least it gives some information on where to look.

nbecker commented 9 years ago

It seems -ffast-math is what makes it 2x faster (0.41s); -ftree-vectorize has little effect (0.83s).

joker-eph commented 9 years ago

Note that fast-math is wrong for any respectable numerical code anyway :)
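One reason for the caution: -ffast-math permits the compiler to reassociate floating-point operations, and floating-point addition is not associative, so a result can depend on the order the compiler picks. The arithmetic can be illustrated in plain Python with arbitrary values (this shows only the underlying effect, not what GCC does to any particular loop):

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)

# With very different magnitudes, a small term can vanish entirely.
big = 1e16
print((big + 1.0) - big)
print((big - big) + 1.0)
```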

serge-sans-paille commented 9 years ago

Thanks a lot, Neal, for your benchmark! I had a quick look: it appears the masked expressions are not very efficient in Pythran (but still better than NumPy, as shown by your bench).

I rewrote your numpy expression as the following:

def limit (x, epsilon=1e-6):
    # note the use of where here
    return np.where( np.abs(x) < epsilon, 0, x/np.abs(x))

#pythran export clip0(complex128[], float64)
def clip0 (z, _max):
    mask = np.abs(z) > _max
    ## print ('#clipped:', np.sum(mask))
    z[mask] = (limit (z[mask]) * _max)
    return z

Under NumPy, it's not as clever as it could be, as x/np.abs(x) is computed for all x values. Under Pythran, however, we have the opportunity to lazily evaluate the expression so that only the relevant parts are computed. This is not done yet because of a not-so-good implementation of numpy.where, but we have an opportunity here.
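To make the eager evaluation concrete, here is a plain NumPy sketch comparing the np.where form with a hand-written masked form that divides only where the result is kept. The names limit_where and limit_masked and the sample values are illustrative; the masked form is only a stand-in for what lazy evaluation would achieve, not Pythran's implementation.

```python
import numpy as np

def limit_where(x, epsilon=1e-6):
    # Eager: x / np.abs(x) is evaluated for every element, zeros included,
    # before np.where selects; errstate silences the resulting warnings.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(np.abs(x) < epsilon, 0, x / np.abs(x))

def limit_masked(x, epsilon=1e-6):
    # Hand-written "lazy" form: divide only where the result is kept.
    out = np.zeros_like(x)
    keep = np.abs(x) >= epsilon
    out[keep] = x[keep] / np.abs(x[keep])
    return out

x = np.array([0j, 3 + 4j, 1e-9 + 0j, -2j])
print(limit_where(x))
print(limit_masked(x))
```

Both forms return the same values; the difference is that the first one performs (and then discards) the division for the two near-zero entries.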

@pbrunet: do you see what I mean? As you wrote np.where, do you want to have a look, or should I add it to my TODO list?

pbrunet commented 9 years ago

I see exactly what you mean :-) I didn't make the modification because np.where is a trinary_expr, so the fix could be much more generic than just improving np.where. For example, numpy.clip could be seen as a trinary_expr too, with two of its arguments being scalars. We can add to this list at least: around (binary), angle (binary) and isclose (Quatrary_expr?).
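As a rough illustration of numpy.clip as a three-operand element-wise expression, the sketch below rewrites it with nested np.where for real-valued input, assuming lo <= hi. The helper name clip_as_where is hypothetical, and this shows only the semantics, not how Pythran represents the expression internally.

```python
import numpy as np

def clip_as_where(x, lo, hi):
    # Element-wise selection among three candidates: lo, hi, or x itself.
    return np.where(x < lo, lo, np.where(x > hi, hi, x))

x = np.array([-2.0, -0.5, 0.0, 0.7, 3.0])
print(clip_as_where(x, -1.0, 1.0))
print(np.clip(x, -1.0, 1.0))
```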

I may try to have a look at this (it has been on my todo list since 2014-10-05), but for now I am working on fast subscript detection, so I will not have time for it right now.

serge-sans-paille commented 7 years ago

For the record, once #686 is merged, numpy.where will run significantly faster, and the following implementation:

#pythran export clip0(complex128[], float64)
def clip0 (z, _max):
    return np.where( np.abs(z) > _max, np.where( np.abs(z) < 1e-6, 0, z/np.abs(z)) * _max, z)

runs as fast as the loop version.
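A plain NumPy sanity check that the nested np.where formulation agrees with the loop version from earlier in the thread is sketched below. It uses arbitrary random test data with a fixed seed, and clip_loop is an illustrative name for the loop variant. It checks results only and says nothing about speed, which requires the compiled Pythran modules to measure.

```python
import numpy as np

def limit1(x, epsilon=1e-6):
    if abs(x) < epsilon:
        return 0
    return x / abs(x)

def clip_loop(z, _max):
    # Loop version: one element at a time, writing into a new array.
    out = np.empty(z.shape, dtype=z.dtype)
    for i in range(len(z)):
        out[i] = limit1(z[i]) * _max if abs(z[i]) > _max else z[i]
    return out

def clip0(z, _max):
    # Nested np.where version.
    return np.where(np.abs(z) > _max,
                    np.where(np.abs(z) < 1e-6, 0, z / np.abs(z)) * _max,
                    z)

rng = np.random.default_rng(0)   # fixed seed, arbitrary test data
z = rng.normal(size=1000) + 1j * rng.normal(size=1000)

print(np.allclose(clip_loop(z, 1.0), clip0(z, 1.0)))
print(np.abs(clip0(z, 1.0)).max())   # largest magnitude after clipping
```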