Impl special functions for SIMD

programmerjake commented 3 years ago

Need all of:

[x] div_euclid/rem_euclid
[x] clamp
[x] max/min
[ ] rotate_left/rotate_right
[x] swap_bytes/reverse_bits
[x] saturating_add/saturating_sub
[x] saturating_neg/saturating_abs
[ ] saturating_mul
[x] ~~wrapping_add/wrapping_sub/wrapping_mul/wrapping_pow~~
[x] ~~wrapping_div/wrapping_rem/wrapping_div_euclid/wrapping_rem_euclid~~
[x] ~~wrapping_neg/wrapping_abs~~
[ ] overflowing_add/overflowing_sub
[ ] overflowing_mul
[ ] overflowing_div/overflowing_div_euclid
[ ] overflowing_rem/overflowing_rem_euclid
[ ] overflowing_neg/overflowing_abs
[ ] overflowing_shl/overflowing_shr
[ ] from_be/from_le/to_be/to_le
[x] to_be_bytes/to_le_bytes/from_be_bytes/from_le_bytes
[X] {to,from}_ne_bytes

for integers:

[x] leading_zeros/trailing_zeros
[x] leading_ones/trailing_ones
[ ] count_ones/count_zeros
[x] ~~pow~~
[x] ~~overflowing_pow~~
[x] ~~saturating_pow~~
[x] ~~wrapping_shl/wrapping_shr~~

for floats:

[ ] trig./hyperbolic functions: #6
[x] recip
[X] mul_add
[ ] powi/powf
[x] to_int_unchecked
[x] to_degrees/to_radians
[x] sqrt
[ ] cbrt
[ ] hypot
[x] exp/exp2/ln/log/log2/log10
[ ] exp_m1/ln_1p

for signed integers and floats:

[x] abs
[X] signum
[X] copysign
[x] is_positive/is_negative

See also #109

Lokathor commented 3 years ago

I don't believe we have non-overflowing/non-wrapping ops actually.

That is, we only have the wrapping version.

programmerjake commented 3 years ago

Having ops that panic on overflow (like Rust's standard integer ops in debug mode) seems like something that would be useful for debugging, even if it has a runtime penalty. It could be disabled by Release mode, like usual.

calebzulawski commented 3 years ago

I would like to add as_slice and as_array functions to this list.

I think we need to be careful with rotate_left and rotate_right--it's unfortunate that std uses the same name for rotating slice elements and rotating bits (both of these cases apply to SIMD vectors)

programmerjake commented 3 years ago

how about naming them rotate_lanes_left/right and rotate_bits_left/right?

Lokathor commented 3 years ago

Update: removing floor/ceil/round/trunc/fract from the list, opened https://github.com/rust-lang/stdsimd/issues/23 instead.

thomcc commented 3 years ago

for floats:

* [ ]  trig./hyperbolic functions: #6

* ...

* [ ]  cbrt

* ...

* [ ]  exp/exp2/ln/log/log2/log10

* [ ]  exp_m1/ln_1p

So... Is there a reason that these are considered required rather than nice to have? Are there architectures that offer this?

I'm not opposed to it (I was working on a SSE cbrt yesterday, so I agree these aren't useless, but there's also a lot of work, and users quite reasonably might want to make different performance/accuracy tradeoffs here. Also, god, properly supporting rounding modes in these is a whole damn can of worms — but hopefully we'll just continue with the good ol rust standby of pretending rounding mode can never change).

Anyway, if we're going for ieee754 recommended operations, there are some missing from the recommended set as of 754-2019. I've attached a screenshot of the relevant table.

Screenshot of IEEE754-2019 additional recommended operations

Table 9.1—Additional mathematical operations

Table 9.1—Additional mathematical operations (continued)

Note: rSqrt there is the accurately-rounded of inverse sqrt — specifically, it's not an equivalent to _mmN_rsqrt_ps (it is equivalent to the _mmN_invsqrt_ps that you can get in some places), which is approximated (but we should still expose an approximate rsqrt, since e.g. intel supports it and inverse sqrt is a very common operation in some areas).

programmerjake commented 3 years ago

for floats:
* [ ]  trig./hyperbolic functions: #6

* ...

* [ ]  cbrt

* ...

* [ ]  exp/exp2/ln/log/log2/log10

* [ ]  exp_m1/ln_1p
So... Is there a reason that these are considered required rather than nice to have? Are there architectures that offer this?

I just went down the list of functions on f32.

Libre-SOC may potentially provide vector instructions for all of the functions you mentioned, we are almost certainly providing instructions for exp, exp2, ln, log2. IIRC AMDGPU provides some exponential and logarithm functions.

thomcc commented 3 years ago

Hm, okay. Some concerns I'd have, mostly since you mentioned GPUs (which tend to answer these questions by picking whatever is fastest — and honestly somewhat fairly, a lot of these are super expensive to handle correctly in SIMD code):

Are non-finite inputs handled properly? If not, how improper?
- -ffast-math-style UB?
- consistent-but-garbage results?
- consistent-but-fixable results? (e.g. wrong sign when returning nan or whatever)
Ditto, but for other out-of-domain inputs — like neg inputs to sqrt.
Are denormals (other than zero) handled properly?
- Here proper just means "correct result".
- I'm only excluding zero because it's unfathomable that it would be broken on 0.0 (assuming 0.0 is part of the function's domain).
Is the current rounding mode respected?
- If applicable, are other relevant aspects of the fp env respected?
- Note: This is probably not relevant on GPUs, but it is for us (I think? *).
Does the function produce a precise (max error within 1ulp) result, or is it approximated?

And if not, what do we do?

Also relevant to our fallback: I don't think I've ever seen SIMD implementations of this stuff that actually is true for all these. The vectorclass code linked elsewhere appears not to handle all of this (but I didn't look too closely and perhaps it's doing it by structuring the code so it's handled automatically), and IIRC, sleef didn't used to but maybe it does now.

And to be clear, I'm not saying our fallback implementation has to handle these issues (although certainly we would in an ideal world), but if it doesn't that should be intentional.

Also, I guess the fallback could just be extracting each lane and calling libm on it (although this would either require rust libm, which is pretty slow, or force this stuff into libstd).

* Regarding 4, I vaguely remember hearing it was UB in rust to change the float env? Possibly because LLVM can't fully handle it, or constant propagation, or who knows. Perhaps we don't really need to handle this if that's the case. I also don't know if this is actually true.

Lokathor commented 3 years ago

yeah llvm ignores floating point environment currently during optimization, so if we do anything other than the same we get code that changes based on optimization level, which is classic UB.

they're developing alternative llvm ir that would let you follow the fp environment, but currently it's not ready (last i heard around the start of the year).

thomcc commented 3 years ago

Personally, IME changing fpenv is a huge headache and you're better off structuring your code so that it's not needed, even if that means you have to do some computations negated or whatever.

This is the one of these I'm least willing to go to bat for as something we should support at all (in truth, I'd be happy for someone to tell me it's totally unsupported and code can assume default rounding mode). This certainly makes the impl of these functions simpler / easier to test).

That said IDK, the Rust libm seems to handle it... I assume we need to also. (And I mean, it might be a part of floating point I don't like, but it is a part of it)

... Also, I just realized I forgot to mention fp status registers and triggering the right fp exceptions, if relevant. Anyway, just assume that list of concerns is #[non_exhaustive]

Lokathor commented 3 years ago

Oh, libm is just wrong in that area. Most of our libm code is just blindly copied from C. The thing is that libm gets too little attention for anyone to care, so oh well.

programmerjake commented 3 years ago

AMDGPU supports infinities, NaNs (though I don't know which values it produces), signed zeros, and different rounding modes. It has 1ULP accuracy for exp2, and log2. Other exp/log instructions are implemented in terms of those.

Libre-SOC will have at least 2 modes, one which is only as accurate as Vulkan requires (though if we can provide more than that without much more hardware, we probably will), and one which is supposed to provide the correctly-rounded results specified by IEEE 754 for all supported rounding modes. The second mode may just trap to a software implementation for some of the more complex instructions though, so could be very slow. We haven't decided yet.

workingjubilee commented 3 years ago

I think it makes sense to right now go with saying that exposing special float ops on SIMD types should currently be a relatively strong statement of "you probably can't beat this speed/accuracy tradeoff" and then implementing the rest (and weighing different speed/accuracy tradeoffs) can be its own ongoing/extended discussion.

So if all the relevant vector processors reasonably consistently provide fast and accurate exp/log functions, then we want to expose those right away, and start to set aside other things we know will require more thought.

workingjubilee commented 3 years ago

I was not able to find integral pow functions on Intel or Arm intrinsic lists, and so have struck them from the lists. There are hardware accelerated floating point operations for this, of course.

Lokathor commented 3 years ago

I think we should have Pow on the extended list, wherever that is, even if it is always "library provided" and never actually hardware.

workingjubilee commented 3 years ago

It would be useful to carve up things between what we can expect to have efficient/fast hardware acceleration for and those that are reasonable but software-only, yes, for the sake of prioritization.

TennyZhuang commented 2 years ago

Why wrap* were removed here? In my opinion, wrap* should always do overflow check and return an Option<Simd<_>>, which is different from the behavior of primitive ops (not check on release, check and may panic on debug).

workingjubilee commented 2 years ago

Simd<T, N> is implicitly Simd<Wrapping<T>, N>. What you describe is the behavior of the checked_* ops.

ghost commented 2 years ago

(I didn't find a tracking issue for checked_*, which is where I would have commented; should you open one?)

It is quite reasonably expected that checked_* operations would be slower than the wrapping equivalents, but I'm not sure what implementations you all have in mind for most checked_* operations?

E.g. addition, checked_add(x, y) should only perform an estimated ~3-5 extra operations, given that the branch statement is if SIMD::saturating_add(x, y) == x + y?

programmerjake commented 1 year ago

bitwise rotate left/right came up in https://github.com/rust-lang/portable-simd/issues/328#issuecomment-1414482407 (actually most of that issue was discussing rotations rather than chacha20)

avhz commented 6 months ago

Any updates on this issue? I would offer to help, but I suspect it's above my skill level. But I'm particularly interested in using special functions for SIMD floats (exp, log, etc).

calebzulawski commented 6 months ago

No updates--the place to start will be adding more intrinsics to the compiler and then use them in the StdFloat trait.

avhz commented 6 months ago

Can't promise anything, but I'll take a look at what's currently done, and if it seems achievable I'll have a go.

rust-lang / portable-simd

Impl special functions for SIMD #14