Performance with array-like classes defined with hpy?

paugier commented 2 years ago

This is a long term question/issue related to the piconumpy experiment and the performance results that I get with HPy and PyPy. These results are now so and so.

I thought piconumpy could be used to demonstrate the potential of HPy for numerical computing. With good results on small examples related to numerical computing with Python, I'm convinced that a lot of funding can easily be obtained for HPy. But we would need impressive results.

For numerical computing with Python, we lack in pure Python homogeneous ndarrays of native types and of user defined classes. Numpy provides homogeneous ndarrays of native types and inhomogeneous ndarrays of Python objects. Great! But Numpy is written in C and thus low level code using Numpy can't be accelerated with a JIT accelerating pure Python code. As soon as the C/Python border is crossed too much, it becomes super slow for any Python implementations, even with a good JIT.

The available solutions are based on moving the C/Python border outside of the hot code by reimplementing Numpy/Python (as in Pythran, Numba or NumPyPy).

A more general solution without reimplementation would be a game changer. I'd like to understand if HPy could go in this direction.

This very simple microbenchmark on just summing floats is sufficient to explain what I mean here. With Piconumpy now ported to HPy, we can start to see what it could give. The function benchmarked is just:

def sum_loop(arr):
    result = 0.0
    for value in arr:
        result += value
    return result

The results can be summarized as follow. I normalized by the time obtained with Julia.

With PyPy3.7

{'cache_tag': 'pypy37',
 'version': sys.pypy_version_info(major=7, minor=3, micro=6, releaselevel='final', serial=0)}
list                          : 1.34e-05 s (  1.3 * Julia)
piconumpy.purepy              : 1.33e-05 s (  1.3 * Julia)
numpy                         : 4.00e-03 s (376.6 * Julia)
_piconumpy_hpy                : 1.99e-04 s ( 18.8 * Julia)
_piconumpy_cpython_capi       : 1.27e-03 s (119.5 * Julia)

With CPython

{'cache_tag': 'cpython-39',
 'version': sys.version_info(major=3, minor=9, micro=6, releaselevel='final', serial=0)}
list                          : 2.62e-04 s ( 24.6 * Julia)
piconumpy.purepy              : 1.25e-03 s (117.5 * Julia)
numpy                         : 7.35e-04 s ( 69.2 * Julia)
_piconumpy_hpy                : 4.26e-04 s ( 40.2 * Julia)
_piconumpy_cpython_capi       : 3.52e-04 s ( 33.1 * Julia)

Comments on the results

PyPy is fast with lists (1.3 * Julia!) and ~ as fast for a purepython piconumpy array based on a list ("piconumpy.purepy", zero cost abstraction!)
Numpy and _piconumpy_cpython_capi are both much slower with PyPy than with Cpython. We can guess that the Numpy port to HPy would fix that.
piconumpy_hpy is a bit faster with PyPy (19 Julia) than with CPython (40 Julia), however, we see that PyPy does not strongly accelerate piconumpy_hpy (19 Julia = 10 piconumpy_list).

I'm not familiar enough with PyPy traces to understand how the JIT handles the hot loop in the case of piconumpy_hpy (the traces are here), but the performance is consistent with a very small acceleration by the JIT. It seems that PyPy JIT does not "understand" the "nature of piconumpy_hpy array" and is not able to use its good characteristics for performance (fixed size, homogeneous in type, aligned native array, ...).

Few questions:

Are there potentials for improvement of acceleration in the case of the piconumpy_hpy array ? How much could we win (keeping the the same HPy code in piconumpy)?
Would there be some needs of an extended API to be able to describe in C array-like types leading in PyPy to results similar to homogeneous lists.

steve-s commented 2 years ago

Fundamentally no JIT can "see though" a native code of an extension, so it cannot optimize it and it cannot inline such code into the surrounding Python code to optimized the whole thing as one compilation unit.

GraalPython has a mode where it does not execute the actual native code, but LLVM bitcode inside the same JIT as the Python code. So there we can theoretically get on par or close. I changed your benchmark a bit, because GraalPython needs bit longer to warm up, so I excluded first few iterations. Here are the results of GraalPython:

list: 0.00012 s
numpy: 0.34357 s
piconumpy HPy (universal mode): 0.00215 s

We do not get on par, because there is also one more issue of the API itself. Consider the underlying C function that gets called in the loop:

HPyDef_SLOT(Array_item, Array_item_impl, HPy_sq_item)
HPy Array_item_impl(HPyContext *ctx, HPy h_arr, HPy_ssize_t index) {  
  ArrayObject *arr = ArrayObject_AsStruct(ctx, h_arr);
  if (index < 0 || index >= arr->size) {
    HPyErr_SetString(ctx, ctx->h_IndexError, "index out of range");
    return HPy_NULL;
  }
  HPy item = HPyFloat_FromDouble(ctx, arr->data[index]);
  return item;
};

The issue is that:

HPyErr_SetString(ctx, ctx->h_IndexError, "index out of range");

really needs to create the exception unlike if it is pure Python code with a Python list, where we can optimize the exception away. In theory we should be able to optimize it away across the language boundary if both languages are compiled with the same JIT, but in practice it turns out to be much more difficult. Among other things we'd need to analyze the C code between HPyErr_SetString and return HPy_NULL to see that it's actually not doing anything with the exception.

When I hacked GraalPython to not create the full Python level exception in this case (which is not correct in general), the performance gets closer:

piconumpy HPy (with IndexError hack): 0.00013 s

I think that even for Pypy it would be useful if it knew that it actually does not have to create the exception. If we agree this is really important use case, we could, for example, provide some token that, if returned, would indicate that the HPy function wishes to exit with an exception of type IndexError, so the code would be:

HPyDef_SLOT(Array_item, Array_item_impl, HPy_sq_item)
HPy Array_item_impl(HPyContext *ctx, HPy h_arr, HPy_ssize_t index) {  
  ArrayObject *arr = ArrayObject_AsStruct(ctx, h_arr);
  if (index < 0 || index >= arr->size) {
    return HPy_ErrorResult_IndexError;
  }
  HPy item = HPyFloat_FromDouble(ctx, arr->data[index]);
  return item;
};

hodgestar commented 2 years ago

@steve-s Thank you for the write-up and for looking into this.

I don't understand how HPyErr_SetString affects this benchmark. Presumably the index is always in range in the benchmark and so HPyErr_String is never executed? So it's mere presence affects something? There is something I am not understanding yet.

steve-s commented 2 years ago

The iteration goes all the way until it is outside of the range and then it stops. (maybe there is already some way to implement iterations without exceptions on the C level that piconumpy does not support? If it did not execute at all, the profiles would remove that branch from the JIT compilation in GraalPython's case).

hodgestar commented 2 years ago

Ah, this is because tp_iter is not defined so it falls back to PySeqIter_Type which just starts at zero and continues until IndexError or StopIteration is raised. This is not the path taken for lists.

Perhaps as step one we can implement a tp_iter that does better?

paugier commented 2 years ago

list: 0.00012 s

piconumpy HPy (with IndexError hack): 0.00013 s

This is a great and very promising result! I'm very impressed. It seems that GraalPython manages in this very simple case to correctly optimized Python + C together. If I understand correctly, one could get the same kind of result without the IndexError hack by implementing tp_iter in piconumpy, isn't it?

GraalPython has a mode where it does not execute the actual native code, but LLVM bitcode inside the same JIT as the Python code.

Is it a global mode or can it be activated only for some functions? Or it is used automatically when needed? Is this mode documented somewhere? Would GraalPython need indications that this mode should be used for some particular functions? Typically, this could be useful for functions used for indexing.

Could the same kind of method be used for PyPy? I guess PyPy would need to produce something like Python code from HPy C code for small functions like those used for indexing. Could something like this be possible?

steve-s commented 2 years ago

If I understand correctly, one could get the same kind of result without the IndexError hack by implementing tp_iter in piconumpy, isn't it?

Seems so. I need to take a look at that.

Is it a global mode or can it be activated only for some functions? Or it is used automatically when needed? Is this mode documented somewhere? Would GraalPython need indications that this mode should be used for some particular functions? Typically, this could be useful for functions used for indexing.

This is the default mode and for now it's all or nothing. GraalPython can also run CPython API in this mode (and only in this mode), but HPy performs better. HPy extensions can be run also natively, but this is work in progress. The drawbacks of the LLVM bitcode mode is that you need to install the extensions from source as we need to compile it to LLVM bitcode, and that it needs more time to JIT compile the LLVM bitcode at runtime before it gets to good performance.

paugier commented 2 years ago

Just a crazy and naive idea / question: would it be possible / useful to be able to declare in HPy that some functions could potentially be "compiled and inlined by Python JITs" so that for these functions, their C code would be stored in the universal extension?

Such mechanism could even be extended to also provide a Python version of the code for these functions which could be used by Python JIT that works better with Python (like PyPy's JIT if I understand well). Such Python implementation would need to use native C objects (like arr->data in the example) but I remember that PyPy can be very good to accelerate such code when written with cffi.

steve-s commented 2 years ago

If I understand correctly, one could get the same kind of result without the IndexError hack by implementing tp_iter in piconumpy, isn't it?

Seems so. I need to take a look at that.

Actually no, because the contract of tp_iter is the same: it'll end up throwing an exception once the iteration is done. This is a bit unfortunate (from the JIT POV) contract in Python that end of an iteration is indicated with regular exception that needs to have all the semantics like any other exception.

Just a crazy and naive idea / question: would it be possible / useful to be able to declare in HPy that some functions could potentially be "compiled and inlined by Python JITs" so that for these functions, their C code would be stored in the universal extension?

I think that is/could be the aim with Cython, but others who understand Cython better can comment on this.

paugier commented 2 years ago

I updated the benchmarks to include different use cases of low level Python code (https://github.com/hpyproject/piconumpy/tree/microbench_loop_sum/bench/microbench_low_level) that should be IMO important for scientific / data applications. If such code could be accelerated by some Python implementations, it would be a real game changer.

Here are some interesting results:

sum_loop
sum_loop_index
init_zeros
cort (normalized cosine similarity measure between derivatives)

I also adapted the benchmark code to be fully compatible with PyPy and GraalPython. The results obtained for GraalPython are obtained with piconumpy installed from source.

Indeed, we see very promising accelerations with GraalPython.

So there are 3 different problems:

The extension needs to be installed from source. This won't be the case in most cases for for example Numpy written in HPy. It would be nice to find a general mechanism to avoid this need.
HPyErr_SetString(ctx, ctx->h_IndexError, "index out of range"); seems to be a real issue to get very good acceleration. It would be nice to have a mechanism to avoid it.
PyPy is unable to accelerate low level code written in HPy. It would be great to find a mechanism to help PyPy's JIT.

If GraalPython/PyPy are able get good results on these types of benchmarks, I think the scientific / data Python community could really love HPy!

mattip commented 2 years ago

PyPy is unable to accelerate low level code written in HPy. It would be great to find a mechanism to help PyPy's JIT.

Isn't that a bit pessimistic? If I am reading correctly, the benchmark for _piconumpy_hpy has PyPy at 18x which is faster than CPython's _piconumpy_cpython_capi (at 33x). It is also 10x PyPy's _piconumpy_cpython_capi, so quite a win.

With that, PyPy could probably do better with _piconumpy_hpy.

paugier commented 2 years ago

Isn't that a bit pessimistic?

Sorry, I don't understand where your numbers come from :slightly_smiling_face:

For sum_loop:

GOOD: PyPy piconumpy.hpy is 26% faster than CPython piconumpy.cpython_capi (3.73e-04 s versus 5.07e-04 s) and 80% faster than PyPy piconumpy.cpython_capi (3.73e-04 s versus 1.75e-03 s).
BAD: PyPy piconumpy.hpy is ~ 14x slower than PyPy piconumpy.purepy (a Python object containing a list) (3.73e-04 s versus 2.60e-05 s).

I think that for real applications, the last ratio (14 x slower than PyPy piconumpy.purepy) is more meaningful. The performance of PyPy piconumpy.purepy is I think comparable (same order of magnitude) to what one can get with other standard solutions (Numba, Pythran, Cython or even Julia).

Targeting for HPy extensions (piconumpy, then numpy) the performance reached by PyPy piconumpy.purepy would be very useful to convince the scientific Python community that HPy has to be largely founded.

The results mentioned by @steve-s seem to indicate that it's doable but that some functions (get_item, set_item, iter) have to be inlined to "optimized the whole thing as one compilation unit". This is why I was thinking about providing in the HPy extension for few functions, Python equivalent code available for JITs that work at the Python level and cannot work with the C code or with the binary.

mattip commented 2 years ago

Here is my spin on the numbers, in terms of julia = 1.0 interpreter	benchmark	"julia factor"	Notes
CPython	_piconumpy_cpython_capi	33.1x
PyPy	_piconumpy_cpython_capi	119.5x	This is what happens today with common c-extensions
PyPy	_piconumpy_hpy	18.8x	7x faster than using the PyPy C-API, close to 2x faster than using CPython
PyPy	piconumpy.purepy	1.3x	Great, but how do we convince people to rewrite their code to get this?

So by moving to HPy, which has buy-in from the CPython community PyPy gets a 7x speedup. Could we do better? Sure. But more radical solutions (moving to Julia, rewriting code in pure python) are much more "expensive" in terms of developer buy-in.

Personally I don't think HPy is the last word in this journey. It is a convenient tool to target the massive amount of existing code, a step away from the C-API, and a way to convince people to start using PyPy/GraalPython. Once we get the ball rolling, we will have more opportunities to take the next steps, whatever they will be.

paugier commented 2 years ago

Ah sorry, now I understand where "your" numbers come from (from my first message :slightly_smiling_face:). It does not change the reasoning but these numbers should be a bit more accurate (more stable benchmarks):

interpreter	benchmark	"julia factor"	Notes
CPython	_piconumpy_cpython_capi	38x
PyPy	_piconumpy_cpython_capi	132x	This is what happens today with common c-extensions
PyPy	_piconumpy_hpy	28x	5x faster than using the PyPy C-API, 1.3x faster than using CPython
PyPy	piconumpy.purepy	2x	Great, but how do we convince people to rewrite their code to get this?

paugier commented 2 years ago

I added another micro benchmark: board is a function with few indexing, simple float computations with sin/cos and instantiation of a small array. It seems to be important for the global performance of the main benchmark used for piconumpy.

The results are in this file result_board.md.

For PyPy, it gives:

bench board
hostname: voyage
{'cache_tag': 'pypy37',
 'version': sys.pypy_version_info(major=7, minor=3, micro=7, releaselevel='final', serial=0)}
list                          : 3.21e-07 s (  0.9 * Julia)
piconumpy.purepy              : 1.37e-05 s ( 36.9 * Julia)
numpy                         : 1.18e-04 s (316.6 * Julia)
piconumpy.hpy                 : 1.26e-05 s ( 33.8 * Julia)
piconumpy.cpython_capi        : 5.52e-05 s (148.6 * Julia)

It's interesting to see that except for PyPy with the method with list (which is super efficient), the implementations with a JIT aren't efficient (in particular PyPy piconumpy.purepy is 41x slower than the solution with list).

hpyproject / hpy