Pow function and SLEEF - Githubissues

google / highway

Performance-portable, length-agnostic SIMD with runtime dispatch

Apache License 2.0

4.21k stars 319 forks source link

Pow function and SLEEF #1650

Open bnprks opened 1 year ago

bnprks commented 1 year ago

Hi, I'm considering implementing Pow function pull request for highway and wanted to check scope/goals before jumping in too much further.

My rough plan would be:

Start from SLEEF's scalar pow implementation in sleefsp.c and sleefdp.c, and translate into highway vector ops.
As preliminary work, Implement the required subsets of SLEEF's double-double precision arithmetic including the high-precision log and exp required for its pow implementation.

I know a mechanical translation of SLEEF's functions has been discussed in this numpy issue, but I would be proposing a manual translation here, though sticking to the high-level outline of the SLEEF code.

Questions:

What is highway's policy towards correctly handling Inf and NaNs vs. skipping checks? Some relevant cases here are 3^-Inf = 0, 0.4^Inf = 0, and -2^3.5=NaN. Should all those be handled as listed, or should some return garbage in order to eliminate a few checks?
Does the plan to model off of SLEEF's scalar rather than vector code seem acceptable?
Would this conflict with any larger plan to port SLEEF's high-precision math ops to highway?

jan-wassenberg commented 1 year ago

Hi @bnprks , thank you for reaching out. This sounds very interesting and we are very happy to collaborate. Translating some of the math functions from SLEEF sounds like a great approach. FYI here's a list of the ones we don't have implemented yet, indeed Pow() is still missing.

Numpy is also interested in accurate and robust implentations. I'd suggest favoring both high precision, and correctly handling edge cases. For pow, their relative cost would be low, and we can provide some compile-time template argument or #if to opt out of them for anyone who wants.

If scalar is easier to translate for you, sure, why not. Do you see any potential concerns or tradeoffs there? (I know the many suffixes in their vector code are hard to read.)

There is currently no active effort to extend our math functions. As mentioned, it may be interesting for numpy in future.

Happy to discuss further. FYI I am out of office next week.

bnprks commented 1 year ago

Thanks for the reply! I've been doing some exploratory work looking into the Pow function and the potential for a larger automated translation of the SLEEF math functions.

What I've done so far is:

Use py-tree-sitter to do some basic data exploration of the SIMD library callgraph, as well as proof-of-concept AST-based code translation (e.g. translating a + b to Add(a, b) for arbitrary expressions a and b, with correct operator precedence)
Try manually searching for sleef <-> highway equivalents in all sleef functions that are required to implement xpowf

Translation seems tricky but possible, and I am excited about the prospect of having Sleef's high quality math implementations available in Highway.

If scalar is easier to translate for you, sure, why not. Do you see any potential concerns or tradeoffs there? (I know the many suffixes in their vector code are hard to read.)

I've been looking both at the simd and scalar sources some, and I think this is a key discussion point for adapting the sleef algorithms, especially if an automated translation might be desired.

Here's my take:

If the goal is translating the algorithms/formulas used by sleef, the scalar source is probably easier to work from since there is much less digging required to get to the base operations with known semantics (e.g. casting, arithmetic, etc.).
- It appears to me that sleef's simd source is essentially a line-by-line translation of the scalar source, at least for the xpowf function
The sleef simd version incorporates some tricks that don't come across in the scalar code (e.g. the simd version of this scalar line takes advantage of the fact that a masking operation with all 1 bits is equivalent to NaN assignment).

After more review, I think I will try to translate starting from the simd source, due to the presence of simd-specific efficiency tricks. My approach will be to make semantic translations between sleef's simd abstraction layer into highway's functions which will capture some but not all of the simd optimizations in sleef.

In some cases, this will result in different underlying instructions running (e.g. sleef's vtruncate_vi2_vf(v) would translate to ConvertTo(di, v), even though Highway's version for avx512 and other x86 targets includes additional overflow-handling logic that sleef's avx12 doesn't perform.
I will avoid direct calls to simd intrinsics at all costs, as I want to make the port fit naturally into highway

Assuming this approach continues to make sense to you, I'll start by trying to develop a semi-automated translation system to translate the Pow function, then see from there if it seems possible to expand to the rest of the sleef math functions.

jan-wassenberg commented 1 year ago

Great to hear your progress. This sounds very exciting indeed! Starting from SIMD makes sense - although harder to read, there might be other tricks in there as well, for example vpternlog.

Interesting that SLEEF does not check/fix the return value of _mm512_cvttps_epi32. I'd consider its 0x80..00 to be less useful than INT*_MAX. The cost of potential bugs/misbehavior seems much higher to me than a few extra instructions in already expensive math routines.

Please don't hesitate to let us know if you'd like to use additional intrinsics. We are happy to add ops wherever something reasonable can be done for other platforms.

Assuming this approach continues to make sense to you, I'll start by trying to develop a semi-automated translation system to translate the Pow function, then see from there if it seems possible to expand to the rest of the sleef math functions.

Sounds great!

jan-wassenberg commented 1 year ago

Hi @bnprks , I am curious how it's going with the Pow function?

bnprks commented 1 year ago

I haven't done much with this lately, so I'm still at the stage of viable-seeming strategy but definitely not something working. I solved my immediate use-case by disabling SIMD for Pow (wasn't that critical), but I might take a second look at this since I think the sleef translation is an interesting and useful problem. Judging by my slow progress to date, I wouldn't expect much to come of this soon.

If you're interested in cleaning up outstanding issues we can close this and I'll reopen if I get around to porting a full Pow implementation.

If you or someone else wants to tackle this instead definitely don't wait for me. I'd be happy to share notes + the code snippets I was working on for parsing + transforming the sleef source code if desired.

jan-wassenberg commented 1 year ago

Thanks for the update. I agree this would be useful and interesting. No worries/rush, we can keep this open :)

bnprks commented 10 months ago

Hi @jan-wassenberg, I've finally gotten time to take a proper look at this over the past few weeks, and have made some good initial results!

I've started off by making a separate repository to handle the translation code, results, and testing which is available at bnprks/highway-sleef.

Most of the details are in the README there, but for a quick summary:

Current status:

All of highway's single-precision functions with 1 argument have translated equivalents from sleef
All results from the translation are bitwise identical to original SLEEF, with the exception of some NaN outputs (I've noticed some sign-mismatches on NaN's which seem to be dependent on optimization level)
- This is only with testing on AVX2, so is not guaranteed to hold in other instruction sets
See src/gen-bindings/README.md for a description of the translation code

Next steps:

Implement the rest of the SLEEF operations including double-precision
- See the SLEEF documentation for a full list of these operations. Highlights include trig functions with arguments automatically multiplied by pi, along with error and gamma function
Develop testing methodology for functions that cannot be exhaustively tested for all inputs
Test accuracy and performance on more platforms and Hwy configurations
Try to merge into upstream hwy contrib library.

Feedback areas from you

Strategy for stripped-down accuracy tests that are suitably fast to run as part of the main highway/contrib CI tests
- I'm thinking of testing a 1k-10k random log-spaced inputs, along with all combinations of per-function "special values" (e.g. Inf, 0, NaN, or pi) that might be expected to cause special problems.
Strategy for testing on other platforms (and if testing on vector types other than ScalableTag is needed)
Deciding in which cases (if any) it's worth replacing an existing function in math-inl.h with the Sleef translations
Licensing considerations.
- SLEEF is licensed under Boost 1.0, so I think a semi-automated translation should probably preserve the original license and copyright notices (along with adding new ones for the translation work). That could modestly complicate Highway's licensing situation if we merge into highway/contrib, though I'm hoping it's not a deal breaker.

I've included a lot of accuracy + performance results in the repo, but my high-level summary would be that the SLEEF functions tend to provide more accurate options than Hwy at the cost of slower execution speed. When similar-precision options are available, SLEEF and Hwy are within ~25% speed of each other, but high-precision SLEEF options can become much slower (3x baseline, but some outliers at 10-30x slower).

jan-wassenberg commented 10 months ago

Wow, looks like the automation has paid off in that you've been able to port a long list of functions. Nice work!

1k-10k random log-spaced inputs, along with all combinations of per-function "special values"

Sounds good. We can use AdjustedReps() to reduce the number of iterations in debug builds, then it should be fine.

if testing on vector types other than ScalableTag is needed

Testing on partial vectors can expose some bugs. I'm not sure we have to run for all powers of two, but ForGEVectors<128, Test> might be a good compromise. Our Github and internal CI takes care of running on various platforms.

Deciding in which cases (if any) it's worth replacing an existing function in math-inl.h with the Sleef translations

I'd trust your benchmarks and suggest the general principle that we replace any existing functions with something more exact, and no more than 10% slower. For example Exp(). For the higher-precision ones, which some users have indeed requested regardless of cost, we can put them perhaps in another file and with a name indicating the tradeoff.

On licensing, are you in contact with Prof. Shibata? If they could re-license to BSD3 or Apache2, that would be very helpful.

bnprks commented 10 months ago

Wow, looks like the automation has paid off in that you've been able to port a long list of functions. Nice work!

Thanks! Aside from special cases like SinCos returning two arguments, the incremental work is basically 1 line of config per Sleef function that needs translating (including SIMD ops and intermediate functions)

I'll try getting a faster testing setup established, though testing platforms though personally I'll probably only be able to check Intel AVX2 and below and ARM NEON with the computers I have access to. I might look into qemu testing later.

On licensing, are you in contact with Prof. Shibata? If they could re-license to BSD3 or Apache2, that would be very helpful.

I have not been in contact with Prof. Shibata. Do you think a github issue on sleef or an email to Prof. Shibata would be best?

It looks like the files used for translation have ~10 distinct contributors, though most besides Prof. Shibata's to do not touch the actual function implementations. Prof. Shibata has the vast majority of contributions, and the most active current maintainer on github appears to be Pierre Blanchard working at ARM in Manchester, UK.

EDIT: To add one question -- what would your goals/constraints be on licensing? Even if Sleef added e.g. BSD3 licensing, that would still presumably result in the derived math functions requiring special licensing treatment compared to the rest of Highway being dual-licensed.

jan-wassenberg commented 9 months ago

the incremental work is basically 1 line of config

Nice and efficient :)

I'll probably only be able to check Intel AVX2 and below and ARM NEON with the computers I have access to.

Sure, that's fine.

Do you think a github issue on sleef or an email to Prof. Shibata would be best?

I figure email is better if you have it.

On licensing, it would be nice if the ported file at least matches one of the existing Highway licenses, so that users of that file would just choose that one. Does that make sense?

bnprks commented 9 months ago

Since it's been a week thought I'd post a quick update -- I'm doing a bit more work on expanding the translations, particularly starting to tackle double-precision functions and checking non-AVX2 instruction sets. A couple new hiccups needed to be addressed, but so far nothing that seems like a real blocker. When I'm satisfied my translations are correct and cover most of Sleef's functions then we can discuss more concretely how to upstream functionality (and which parts).

I've copied a few more licensing thoughts below, but probably nothing too useful to discuss unless right now unless you happen to have knowledge of Google's legal policies regarding the Boost license used by Sleef (namely what code counts as a "derivative work" of another author's code)

Licensing thoughts

I had one additional idea, which would be just copying the [full boost license](https://www.boost.org/LICENSE_1_0.txt) into the translated header file. Boost doesn't require providing attribution with compiled binaries, so I think as long as downstream users don't delete the license/copyright comment from the source code there wouldn't be additional obligations on users of the highway library. I will admit that I'm not sure what would qualify something as becoming a "derivative work" of Boost-licensed software. If "derivative work" just covers the translated code isolated to a single file, then this solution would seem quite plausible. If "derivative work" would eventually cover large chunks of the highway library that could be problematic. (I am not a copyright lawyer, and am only familiar with US law) For the feasibility of getting a version of Sleef under BSD or Apache licensing, my main worry is the logistical difficulties of getting permission from all the contributors to relicense which might be a large ask. With permission from just Prof. Shibata, I am not sure from a legal perspective which code could be used without requiring input from others, given that Sleef is licensed under Boost but with copyright maintained by all the original contributors. Would it be code where each line has been reverted to the most recent commit from Prof. Shibata, or would it be a 2017 copy of Sleef before the first other contributors show up in the commit history?

jan-wassenberg commented 9 months ago

Thanks for sharing. On licensing, I understand the difficulty of asking all contributors. We have done this in the past, but it takes some time. Under these circumstances, I think a separate file with separate (Boost) license is reasonable.