yamadafuyuka opened this issue 1 year ago
We are trying to move towards using universal intrinsics inside NumPy, so I am not sure we would want to mix in a whole new library with a different paradigm.
What version of NumPy did you test? On what platform?
Hi @mattip, I'm implementing SVE support for NumPy together with @yamadafuyuka.
The motivation of this issue is to improve the calculation speed of transcendental functions such as sin/cos/tan/log2/log10/exp. For x64 on Linux, NumPy can be built with SVML, and those calculations are vectorized. In my understanding, for non-x64 CPUs the compiler links NumPy against libm.so for the transcendental functions, which provides only scalar (non-vectorized) implementations. Is this right?
SLEEF is a vectorized math library. It supports multiple architectures, as shown in Table 1.1 of the SLEEF documentation. Because SLEEF's function names follow a consistent naming convention, it is easy to abstract over the function name and write source code for multiple architectures and instruction sets. Below is a set of example function names; the u10 suffix means the function achieves 1.0-ULP calculation accuracy. (A minimal sketch of how this convention maps onto a loop follows the table.)
Transcendental function | Data type | ISA | SLEEF function name |
---|---|---|---|
sine | float | Arm NEON | Sleef_sinf4_u10 |
sine | float | Arm SVE | Sleef_sinfx_u10sve |
sine | float | x64 AVX512 | Sleef_sinf16_u10 |
sine | double | Arm NEON | Sleef_sind2_u10 |
sine | double | Arm SVE | Sleef_sindx_u10sve |
sine | double | x64 AVX512 | Sleef_sind8_u10 |
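To show how the naming convention maps onto a loop, here is a minimal sketch of a NEON kernel calling the 4-lane single-precision sine from the table. The function name `sinf_neon_sleef` and the scalar tail via libm's `sinf` are illustrative choices of mine, not code from NumPy or our branch.

```c
#include <stddef.h>
#include <math.h>      /* sinf, for the scalar remainder */
#include <arm_neon.h>
#include <sleef.h>

/* Apply 1.0-ULP sine to a float buffer, 4 NEON lanes at a time. */
void sinf_neon_sleef(const float *src, float *dst, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t v = vld1q_f32(src + i);
        vst1q_f32(dst + i, Sleef_sinf4_u10(v));  /* name from the table above */
    }
    for (; i < n; i++)
        dst[i] = sinf(src[i]);  /* scalar tail via libm */
}
```

For SVE or AVX512 the loop keeps the same shape; only the load/store intrinsics, the lane count, and the SLEEF symbol (`Sleef_sinfx_u10sve`, `Sleef_sinf16_u10`) change, which is what makes a thin per-ISA abstraction practical.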
NumPy has universal intrinsics for multiple ISAs, so in principle they could be used to implement the transcendental functions in a unified way. However, implementing the many transcendental functions that way would be time-consuming and difficult. I think it would be a good idea to reuse SLEEF instead.
Since SLEEF vectorizes the transcendental-function processing, the expected performance gain is close to N, where N is the number of SIMD lanes. In practice the gain will be smaller than N due to Python and other overhead.
Thank you.
What version of NumPy did you test to get your performance graphs? On what platform? We have already moved some of these functions to universal intrinsics, which is why I ask for exact platform and version information. It would be great if you could report `import sys, numpy; print(numpy.__version__); print(sys.version)`.
If you are running NumPy 1.24+, also show `print(numpy.show_runtime())`.
Thank you for your comment. Sorry for the late reply. The environment is as follows:

```
>>> print(numpy.__version__)
0+untagged.28802.g57e71fd
>>> print(sys.version)
3.10.7 (main, Oct 4 2022, 00:38:28) [GCC 11.3.0]
```

(Edited: the build corresponds to the "1.23.3-release" branch at commit e47cbb69b.)
I am sorry for the insufficient explanation.
For the functions defined in `numpy/core/src/umath/loops_umath_fp.dispatch.c.src`, I want to use SLEEF on other architectures the same way the AVX512 build uses SVML. In the current implementation, except for AVX512, NumPy uses the functions from `#include <math.h>`, which are scalar functions, right? (A schematic sketch of that pattern is below.)
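To be concrete, the pattern I mean looks roughly like this (a schematic sketch of mine, not NumPy's actual source, which generates these loops from a `.c.src` template): the AVX512 path calls the vectorized SVML symbol, and every other build falls back to scalar libm.

```c
#include <stddef.h>
#include <math.h>

#if defined(__AVX512F__)
#include <immintrin.h>
/* 16-lane single-precision sine provided by the vendored SVML */
__m512 __svml_sinf16(__m512 x);
#endif

static void sinf_loop(const float *src, float *dst, size_t n)
{
    size_t i = 0;
#if defined(__AVX512F__)
    for (; i + 16 <= n; i += 16)
        _mm512_storeu_ps(dst + i, __svml_sinf16(_mm512_loadu_ps(src + i)));
#endif
    for (; i < n; i++)
        dst[i] = sinf(src[i]);  /* non-AVX512 builds only ever take this scalar path */
}
```

The proposal is essentially to add a SLEEF-backed branch next to the SVML one for the ISAs that SLEEF covers.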
We discussed approaches to using SIMD intrinsics in NEP 38; specifically, it has a section about code enhancements. We did not really apply that section in the discussion to add SVML (PR #19478), other than to note:

> Getting SVML with BSD license is great deal, and it gonna be good base for start replacing them to universal intrinsics. Thank you!
There was a brief mention of SLEEF in that PR, but we did not consider using SLEEF instead of, or in addition to, SVML.
Looking back over the mailing list, there is the 2015 discussion mentioned in the SVML PR, and a recent mail from Chris Sidebottom about an effort to target aarch64.
I am not sure how I feel about integrating yet another vendored library for accelerated operations. On the one hand, we already have precedent with SVML, and integrating SLEEF would improve performance on other platforms. On the other hand, SLEEF's sources are twice as large as SVML's, and its scope is larger. Would we then declare that we are not going to move these functions to universal intrinsics? What would we do with the code from #17587, #18101, and more? Could we do something more generic (sketched below), so that people who wished to could switch out SVML entirely, or use VOLK (GPL3), or simd, or another library?
Maybe I am overthinking this, and we should just move forward since there is a contributor willing to do the work. I do think this should hit the mailing list.
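To make the "more generic" question concrete, here is a hedged sketch of one possible indirection; all names here are hypothetical, not NumPy API. Each backend (SVML, SLEEF, universal intrinsics, plain libm) would install its own kernel into a dispatch table.

```c
#include <stddef.h>
#include <math.h>

/* Hypothetical: signature for a unary float inner loop. */
typedef void (*unary_f32_loop)(const float *src, float *dst, size_t n);

/* Baseline scalar kernel; always available. */
static void sinf_scalar(const float *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = sinf(src[i]);
}

/* During CPU dispatch, a backend would overwrite this pointer
 * with its accelerated kernel. */
static unary_f32_loop sinf_kernel = sinf_scalar;

void float_sin_loop(const float *src, float *dst, size_t n)
{
    sinf_kernel(src, dst, n);
}
```

Whether such an indirection is worth maintaining, versus committing fully to universal intrinsics, is exactly the kind of trade-off the mailing list should weigh.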
@mattip Thank you for letting me know about the previous discussions. I will consider discussing this on the mailing list.
@mattip Thank you very much for your comment. I will consider it with @kawakami-k.
@mattip is it worth revisiting this, given that the universal-intrinsics work is likely to be fairly long-lived (https://github.com/numpy/numpy/pull/23603 has been open for a month now with no activity)? SLEEF could provide a short-term boost, though from my initial look I don't think it handles errors correctly.
I don't think SLEEF is a step in the right direction; I think we should close this PR.
Functions such as sin and log use libm except on AVX512_SKX, and at least in my environment SIMD instructions were not used. Therefore, I added an implementation that uses the SIMD library SLEEF (https://sleef.org/) and measured the calculation time of some functions (my branch: https://github.com/yamadafuyuka/numpy/tree/add_SLEEF), and I graphed the results. We also confirmed that using SVE intrinsics as in PR #22265 gives a further speedup (the log10 function is about 4 times faster); a hedged sketch of such an SVE loop follows. I would like to add SLEEF support, but I am not sure which part of NumPy is the best place to implement it. Could you please advise?
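For reference, an SVE loop over SLEEF's vector-length-agnostic functions would look something like the sketch below. This is a minimal sketch under my assumptions: it follows SLEEF's documented SVE naming convention (`Sleef_log10fx_u10sve`, where `x` stands for the hardware vector length), and it is not the code from the linked branch.

```c
#include <stddef.h>
#include <stdint.h>
#include <arm_sve.h>
#include <sleef.h>

/* Apply 1.0-ULP log10 to a float buffer, one hardware vector at a time. */
void log10f_sve_sleef(const float *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i += svcntw()) {
        /* Predicate over the remaining lanes; this also handles the tail. */
        svbool_t pg = svwhilelt_b32_u64((uint64_t)i, (uint64_t)n);
        svfloat32_t v = svld1_f32(pg, src + i);
        svst1_f32(pg, dst + i, Sleef_log10fx_u10sve(v));
    }
}
```

Because SVE's vector length is a run-time property, SLEEF's `x`-suffixed VLA entry points avoid hard-coding a lane count, so the same binary benefits from wider hardware.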