jakeret / hope

HOPE: A Python Just-In-Time compiler for astrophysical computations
GNU General Public License v3.0
382 stars 27 forks source link

ln_python 10th polynomial benchmark variant #52

Open pigay opened 8 years ago

pigay commented 8 years ago

Hi,

I would like to contribute a variant on ln_python benchmark.

Instead of using numpy slice, we directly access to array elements inside a loop. This is of course terribly slow for interpreted Python but seems to be faster with compiled code like generated with HOPE and numba. In my tests, it happens to be fairly faster than the numpy slice version.

In the following code, I had to add a N parameter beacuse HOPE doesn't support np.ndarray.size attribute. I also use a Horner schema for the polynomial evaluation.

def loop_ln_python(X,Y, N):
    for i in xrange(N):
        xm1 = X[i] - 1.
        tmp = xm1 * (1./9.) - 1./8.
        tmp = tmp * xm1 + 1./7.
        tmp = tmp * xm1 - 1./6.
        tmp = tmp * xm1 + 1./5.
        tmp = tmp * xm1 - 1./4.
        tmp = tmp * xm1 + 1./3.
        tmp = tmp * xm1 - 1./2.
        tmp = tmp * xm1 + 1.
        Y[i] = tmp * xm1

loop_ln_hope = hope.jit(loop_ln_python)
loop_ln_numba = numba.autojit(loop_ln_python)

I test with the following:

X = np.random.random(10000).astype(np.float64)
Y = np.ones_like(X)

loop_ln_hope(X, Y, len(X))
loop_ln_numba(X, Y, len(X))

%timeit loop_ln_python(X, Y, len(X))
%timeit loop_ln_hope(X, Y, len(X))
%timeit loop_ln_numba(X, Y, len(X))

On my laptop, I get the following timings:

100 loops, best of 3: 15.1 ms per loop
10000 loops, best of 3: 46.1 µs per loop
100000 loops, best of 3: 12.7 µs per loop

To be compared to the ln_python_exp() timings:

1000 loops, best of 3: 257 µs per loop
10000 loops, best of 3: 76.8 µs per loop
10000 loops, best of 3: 110 µs per loop

I hope it will be of some interest for you.

For the sake of completeness, we can write a Horner schema version that doesn't add much to the compiled performances but which is more fair for numpy interpreted code:

def ln_python_horner(X, Y):
    Xm1 = X - 1
    Y[:] = Xm1 / 9. - 1./8.
    Y[:] = Y * Xm1 + 1./7.
    Y[:] = Y * Xm1 - 1./6.
    Y[:] = Y * Xm1 + 1./5.
    Y[:] = Y * Xm1 - 1./4.
    Y[:] = Y * Xm1 + 1./3.
    Y[:] = Y * Xm1 - 1./2.
    Y[:] = Y * Xm1 + 1.
    Y[:] = Y * Xm1

And with the timings:

10000 loops, best of 3: 169 µs per loop
10000 loops, best of 3: 104 µs per loop
10000 loops, best of 3: 119 µs per loop

Anyway, thanks for this nice package.

Cheers,

Pierre

cosmo-ethz commented 8 years ago

Hi Pierre Thanks for contributing this. Nice idea with the Horner schema!

I'm wondering how numba manages to squeeze out this additional factor of 4. As soon as I find some time I will investigate a bit more.

Best Joel

pigay commented 8 years ago

Hi Joel,

Here is a native version of the Horner schema:

#define PY_ARRAY_UNIQUE_SYMBOL loop_ln_ARRAY_API
#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
#include <Python.h>
#include <numpy/arrayobject.h>
#include <numpy/arrayscalars.h>
#include <cmath>
#include <tuple>
#include <numeric>
#include <cstdint>
#include <cstdlib>
#include <exception>
#include <functional>
#include <type_traits>
struct PyObj {
    typedef PyObject * ptr_t;
    typedef PyArrayObject * arrptr_t;
    PyObj(): dec(false), ptr(NULL) {}
    PyObj(ptr_t p): dec(false), ptr(p) {}
    ~PyObj() { if(dec) Py_DECREF(ptr); }
    PyObj & operator=(ptr_t p) { if(dec) Py_DECREF(ptr); ptr = p; dec = false; return *this; }
    PyObj & incref(ptr_t p) { if(dec) Py_DECREF(ptr); ptr = p; dec = (p != NULL); return *this; }
    operator bool() const { return ptr; }
    operator ptr_t() const { return ptr; }
    operator arrptr_t() const { return (arrptr_t)ptr; }
    bool dec;
    ptr_t ptr;
};

inline void loop_ln_d1d1(PyObject * pX, npy_intp const * __restrict__ sX, npy_double * __restrict__ cX, 
                                                    PyObject * pY, npy_intp const * __restrict__ sY, npy_double * __restrict__ cY);

inline void loop_ln_d1d1(PyObject * pX, npy_intp const * __restrict__ sX, npy_double * __restrict__ cX, 
                                                    PyObject * pY, npy_intp const * __restrict__ sY, npy_double * __restrict__ cY){

    for (npy_intp i0 = 0; i0 < sY[0] - 0; ++i0) {
        npy_double xm1, tmp;
        xm1 = cX[i0] - 1.;
        tmp = xm1 * (1./9.) - 1./8.;
        tmp = tmp * xm1 + 1./7.;
        tmp = tmp * xm1 - 1./6.;
        tmp = tmp * xm1 + 1./5.;
        tmp = tmp * xm1 - 1./4.;
        tmp = tmp * xm1 + 1./3.;
        tmp = tmp * xm1 - 1./2.;
        tmp = tmp * xm1 + 1.;
        cY[i0] = tmp * xm1;
    }
}

void sighandler(int sig);
#include <signal.h>
extern "C" {
    PyObject * create_signature;
    struct sigaction slot;
    PyObject * set_create_signature(PyObject * self, PyObject * args) {
        if (!PyArg_ParseTuple(args, "O", &create_signature)) {
            PyErr_SetString(PyExc_ValueError, "Invalid Argument to set_create_signature!");
            return NULL;
        }
        Py_INCREF(create_signature);
        memset(&slot, 0, sizeof(slot));
        slot.sa_handler = &sighandler;
        sigaction(SIGSEGV, &slot, NULL);
        sigaction(SIGBUS, &slot, NULL);
        Py_INCREF(Py_None);
        return Py_None;
    }
    PyObject * run(PyObject * self, PyObject * args) {
        {
            PyObj pX;
            PyObj pY;
            if (
                PyTuple_CheckExact(args) and PyTuple_GET_SIZE(args) == 2
                and (pX = PyTuple_GET_ITEM(args, 0)) and PyArray_CheckExact(pX)
                and PyArray_TYPE((PyArrayObject *)pX) == NPY_FLOAT64 and PyArray_NDIM((PyArrayObject *)pX) == 1
                and (pY = PyTuple_GET_ITEM(args, 1)) and PyArray_CheckExact(pY)
                and PyArray_TYPE((PyArrayObject *)pY) == NPY_FLOAT64 and PyArray_NDIM((PyArrayObject *)pY) == 1
            ) {
                if (!(pX.incref((PyObject *)PyArray_GETCONTIGUOUS((PyArrayObject *)pX)))) {
                    PyErr_SetString(PyExc_ValueError, "Invalid Argument type on X!");
                    return NULL;
                }
                if (!(pY.incref((PyObject *)PyArray_GETCONTIGUOUS((PyArrayObject *)pY)))) {
                    PyErr_SetString(PyExc_ValueError, "Invalid Argument type on Y!");
                    return NULL;
                }
                try {
                    loop_ln_d1d1(
                          pX, PyArray_SHAPE((PyArrayObject *)pX), (npy_double *)PyArray_DATA((PyArrayObject *)pX)
                        , pY, PyArray_SHAPE((PyArrayObject *)pY), (npy_double *)PyArray_DATA((PyArrayObject *)pY)
                    );
                    Py_INCREF(Py_None);
                    return Py_None;
                } catch (...) {
                    return NULL;
                }
            } else
                PyErr_Clear();
        }
        PyObject * signatures = Py_BuildValue("(sO)", "(lp0\n(lp1\nccopy_reg\n_reconstructor\np2\n(chope._ast\nVariable\np3\nc__builtin__\nobject\np4\nNtp5\nRp6\n(dp7\nS'name'\np8\nS'X'\np9\nsS'dtype'\np10\nS'float64'\np11\nsS'dims'\np12\nI1\nsbag2\n(g3\ng4\nNtp13\nRp14\n(dp15\ng8\nS'Y'\np16\nsg10\ng11\nsg12\nI1\nsbaa.", args);
        if (!signatures) {
            PyErr_SetString(PyExc_ValueError, "Error building signature string for loop_ln");
            return NULL;
        }
        return PyObject_Call(create_signature, signatures, NULL);
    }
    PyMethodDef loop_lnMethods[] = {
        { "set_create_signature", (PyCFunction)set_create_signature, METH_VARARGS },
        { "run", (PyCFunction)run, METH_VARARGS },
        { NULL, NULL }
    };
    PyMODINIT_FUNC initloop_ln(void) {
        import_array();
        PyImport_ImportModule("numpy");
        (void)Py_InitModule("loop_ln", loop_lnMethods);
    }
}
#include <string>
#include <sstream>
#include <iostream>
#include <cxxabi.h>
#include <execinfo.h>
void sighandler(int sig) {
    std::ostringstream buffer;
    buffer << "Abort by " << (sig == SIGSEGV ? "segfault" : "bus error") << std::endl;
    void * stack[64];
    std::size_t depth = backtrace(stack, 64);
    if (!depth)
        buffer << "  <empty stacktrace, possibly corrupt>" << std::endl;
    else {
        char ** symbols = backtrace_symbols(stack, depth);
        for (std::size_t i = 1; i < depth; ++i) {
            std::string symbol = symbols[i];
                if (symbol.find_first_of(' ', 59) != std::string::npos) {
                    std::string name = symbol.substr(59, symbol.find_first_of(' ', 59) - 59);
                    int status;
                    char * demangled = abi::__cxa_demangle(name.c_str(), NULL, NULL, &status);
                    if (!status) {
                        buffer << "    " 
                            << symbol.substr(0, 59) 
                            << demangled
                            << symbol.substr(59 + name.size())
                            << std::endl;
                        free(demangled);
                    } else
                        buffer << "    " << symbol << std::endl;
                } else
                    buffer << "    " << symbol << std::endl;
            }
            free(symbols);
        }
        std::cerr << buffer.str();
        std::exit(EXIT_FAILURE);
    }

Using llvm clang to compile it on my laptop, I runs:

100000 loops, best of 3: 11.3 µs per loop

Pretty interesting how numba gets very close to native performance. Unfortunately, I can't read assembly code to understand that.

Regards, Pierre