numpy / numpy

The fundamental package for scientific computing with Python.
https://numpy.org

ENH: Add support SLEEF for transcendental functions #23068

Open yamadafuyuka opened 1 year ago

yamadafuyuka commented 1 year ago

Functions such as sin and log use libm except on AVX512_SKX, and at least in my environment SIMD instructions were not used. I therefore added an implementation that uses the SIMD library SLEEF ( https://sleef.org/ ) and measured the execution time of several functions. My branch: ( https://github.com/yamadafuyuka/numpy/tree/add_SLEEF )

I graphed the results. We also confirmed that using SVE intrinsics, as in ( PR-22265 ), gives a further speedup (the log10 function is about 4 times faster). I would like to add SLEEF support, but I am not sure which part of NumPy is the best place to implement it. Could you please advise?
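The kind of measurement described above can be sketched as a small timing script. This is a hypothetical reconstruction, not the author's actual benchmark: the array size, dtype, and function list are assumptions.

```python
# Hypothetical benchmark sketch: time a few transcendental ufuncs on a
# large float32 array. Array size and function choices are illustrative.
import timeit
import numpy as np

x = np.linspace(0.1, 10.0, 1_000_000, dtype=np.float32)

for func in (np.sin, np.log, np.log10, np.exp):
    # average wall time over 20 calls
    t = timeit.timeit(lambda: func(x), number=20) / 20
    print(f"{func.__name__}: {t * 1e3:.2f} ms per call")
```

Comparing the same script against a SLEEF-enabled build would show whether the ufunc loops are actually dispatching to vectorized kernels.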

mattip commented 1 year ago

We are trying to move towards using universal intrinsics inside NumPy, so I am not sure we would want to mix in a whole new library with a different paradigm.

What version of NumPy did you test? On what platform?

kawakami-k commented 1 year ago

Hi, @mattip. I'm implementing SVE support for NumPy with @yamadafuyuka.

The motivation of this issue is to improve the speed of transcendental functions such as sin, cos, tan, log2, log10, and exp. For x64 on Linux, NumPy can be built with SVML, and these calculations are vectorized. In my understanding, for non-x64 CPUs, the compiler links NumPy against libm.so, which provides non-vectorized (scalar) transcendental functions. Is this right?

SLEEF is a vectorized math library. It supports multiple architectures, as shown in Table 1.1 of its documentation. Because SLEEF's function names follow a consistent naming convention, it is easy to abstract the function name and write source code for multiple architectures and instruction sets. Below are example function names; u10 means the function achieves 1.0-ULP accuracy.

| Transcendental function | Data type | ISA | SLEEF function name |
|---|---|---|---|
| sine | float | Arm NEON | Sleef_sinf4_u10 |
| sine | float | Arm SVE | Sleef_sinfx_u10sve |
| sine | float | x64 AVX512 | Sleef_sinf16_u10 |
| sine | double | Arm NEON | Sleef_sind2_u10 |
| sine | double | Arm SVE | Sleef_sindx_u10sve |
| sine | double | x64 AVX512 | Sleef_sind8_u10 |
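The regularity of the naming convention in the table above can be made concrete with a small helper. This is purely illustrative code, not part of SLEEF or NumPy; the lane counts are taken from the table (SVE is vector-length-agnostic, so SLEEF uses `x` instead of a lane count).

```python
# Illustrative helper: build a SLEEF symbol name from the convention
# Sleef_<func><type letter><lanes>_<ulp>[<isa suffix>] shown above.
def sleef_name(func, dtype, isa, ulp="u10"):
    lanes = {
        ("float", "neon"): "4",   ("double", "neon"): "2",
        ("float", "sve"): "x",    ("double", "sve"): "x",
        ("float", "avx512"): "16", ("double", "avx512"): "8",
    }[(dtype, isa)]
    letter = "f" if dtype == "float" else "d"
    suffix = "sve" if isa == "sve" else ""
    return f"Sleef_{func}{letter}{lanes}_{ulp}{suffix}"

print(sleef_name("sin", "float", "neon"))     # Sleef_sinf4_u10
print(sleef_name("sin", "double", "sve"))     # Sleef_sindx_u10sve
print(sleef_name("sin", "double", "avx512"))  # Sleef_sind8_u10
```

This regularity is what would make it straightforward to generate dispatch code for several ISAs from one template.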

NumPy has universal intrinsics for multiple ISAs, so in principle those could be used to implement transcendental functions in a unified way. However, implementing the many transcendental functions that way would be time-consuming and difficult. I think it would be a good idea to reuse SLEEF instead.

Since transcendental function processing is vectorized by SLEEF, the expected performance gain is close to N, where N is the number of SIMD lanes. In practice the gain will be smaller than N due to Python and other overhead.
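The effect of that overhead can be modeled with Amdahl's law: only the vectorizable fraction of the work speeds up N-fold. The fractions below are illustrative assumptions, not measurements.

```python
# Rough Amdahl's-law model of the point above: a fixed non-vectorizable
# overhead fraction caps the end-to-end gain below N SIMD lanes.
def effective_speedup(n_lanes, vectorizable_fraction):
    serial = 1.0 - vectorizable_fraction
    return 1.0 / (serial + vectorizable_fraction / n_lanes)

# Assuming 95% of the time is spent in the vectorized kernel:
for n in (4, 8, 16):
    print(f"N={n:2d} lanes -> ~{effective_speedup(n, 0.95):.2f}x")
```

Even with 95% of the time in the kernel, 16 lanes yield well under 16x, which matches the qualitative claim above.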

Thank you.

mattip commented 1 year ago

What version of NumPy did you test to get your performance graphs? On what platform? We have already moved some of these functions to universal intrinsics, which is why I ask for exact platform and version information. It would be great if you could report `import sys, numpy; print(numpy.__version__); print(sys.version)`. If you are running NumPy 1.24+, also show `print(numpy.show_runtime())`.
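The requested diagnostics can be run as one snippet; the version guard is a convenience added here, not part of the original request.

```python
# Report the NumPy and Python versions, plus the detected SIMD features
# on NumPy 1.24+ (numpy.show_runtime prints its report directly).
import sys
import numpy

print(numpy.__version__)
print(sys.version)

major, minor = (int(p) for p in numpy.__version__.split(".")[:2])
if (major, minor) >= (1, 24):
    numpy.show_runtime()
```

The `show_runtime()` output includes which SIMD extensions were found at runtime, which is exactly what is needed to tell whether the universal-intrinsics loops are active.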

yamadafuyuka commented 1 year ago

Thank you for your comment. Sorry for the late reply. The environment is as follows:

yamadafuyuka commented 1 year ago

I am sorry for the insufficient explanation.

For the functions defined in numpy/numpy/core/src/umath/loops_umath_fp.dispatch.c.src, I want to use SLEEF on other architectures the same way AVX512 uses SVML. In the current implementation, except for AVX512, NumPy uses the functions from `#include <math.h>`, which are scalar, right?

https://github.com/numpy/numpy/blob/172a1942893b5ce55abccd836fdd9f00235a6767/numpy/core/src/umath/loops_umath_fp.dispatch.c.src#L216-L238
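The scalar-fallback behavior described above can be illustrated from Python: calling libm's `sin` once per element is what the non-SIMD loop effectively does, whereas the ufunc makes one call over the whole array and can dispatch to a SIMD kernel where one exists. This is a sketch of the contrast, not the dispatch code itself.

```python
# Scalar per-element libm calls vs. one vectorized ufunc call.
import math
import numpy as np

x = np.linspace(0.0, 1.0, 8)

scalar = np.array([math.sin(v) for v in x])  # one libm call per element
vectorized = np.sin(x)                       # single ufunc call

# Both paths compute the same values; only the dispatch differs.
assert np.allclose(scalar, vectorized)
```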

mattip commented 1 year ago

We discussed approaches to using SIMD intrinsics in NEP 38; specifically, it has a section about code enhancements. We did not really apply that section in the discussion to add SVML (PR #19478), other than to note:

> Getting SVML with BSD license is great deal, and it gonna be good base for start replacing them to universal intrinsics. Thank you!

There was a brief mention of SLEEF in that PR, but we did not consider using SLEEF instead of, or in addition to, SVML.

Looking back over the mailing list, there is the discussion in 2015 mentioned in the SVML PR, and a recent mail from Chris Sidebottom about an effort to target aarch64.

I am not sure how I feel about integrating yet another vendored library for accelerated operations. On the one hand, we already have precedent with SVML, and integrating SLEEF would improve performance on other platforms. On the other hand, SLEEF's sources are twice as large as SVML's, and its scope is larger. Would we then declare that we are not going to move these functions to universal intrinsics? What would we do with the code from #17587, #18101, and more? Could we do something more generic, so that people who wished to could switch out SVML entirely, or use VOLK (GPL3) or simd or another library?

Maybe I am overthinking this, and we should just move forward since there is a contributor willing to do the work. I do think this should hit the mailing list.

kawakami-k commented 1 year ago

@mattip Thank you for letting me know about the previous discussions. I would consider discussing this on the mailing list.

yamadafuyuka commented 1 year ago

@mattip Thank you very much for your comment. I would consider it with @kawakami-k .

Mousius commented 1 year ago

@mattip is it worth revisiting this, given that the universal intrinsics work is likely to be fairly long-lived (https://github.com/numpy/numpy/pull/23603 has been open for a month now with no activity)? SLEEF could provide a short-term boost, though from my initial look I don't think it handles errors correctly.

mattip commented 1 year ago

I don't think SLEEF is a step in the right direction, I think we should close this PR.