Open programmerjake opened 3 years ago
Funding available! Provisionally assigning EUR 2000, to be adjusted as needed.
From the Libre-SOC bugtracker:
We'd need this library because a SimpleV-capable math library doesn't exist, we need to build one. We can share the implementation costs with Rust's std::simd (which also wants a vector math library without libc, libm, or OS dependencies for use in Rust's
core
library). In particular, this means the math library will have generic vector-length-independent implementations written in Rust.
I'm unfamiliar with Kazan, but if both projects use LLVM perhaps most of the implementation work here should be handled in LLVM?
It seems to me like the most universal approach would be to provide a library as part of the LLVM project that implements these functions, similar to (or maybe even part of?) compiler-rt.
We'd also want it to work with cranelift, so having it be part of LLVM may not be best.
For clarity: It would still be written in Rust?
@thomcc That's the plan...Rust seems like the best language for core::simd
.
Right, I just wasn't sure it met your other requirements.
Kazan is also written in Rust.
there can either be a vector abstraction layer that compiles out to nothing in std::simd
and compiles to code for generating IR in Kazan, or Kazan can just use the spir-v backend for rustc, or save the generated llvm & cranelift ir.
A rough sketch of how I think we could have the vector math library's API that the vector math functions would be implemented using:
pub struct F16(u16); // to be replaced by built-in type when Rust gains support
/// reference used to build IR for Kazan; an empty type for `core::simd`
pub trait Context: Copy {
// todo: add missing type conversions such as i32 -> f64 and i64 -> u8
// scalar types
type Bool: Bool + Make<Self, Prim = bool> + Select<Self::Bool>;
type U8: UInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = u8>;
type I8: SInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = i8>;
type U16: UInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = u16>;
type I16: SInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = i16>;
type F16: Float + Compare<Bool = Self::Bool> + Make<Self, Prim = F16>;
type U32: UInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = u32>;
type I32: SInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = i32>;
type F32: Float + Compare<Bool = Self::Bool> + Make<Self, Prim = f32>;
type U64: UInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = u64>;
type I64: SInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = i64>;
type F64: Float + Compare<Bool = Self::Bool> + Make<Self, Prim = f64>;
// Vector types
type VecBool: From<Self::Bool> + Bool + Make<Self, Prim = bool> + Select<Self::VecBool>;
type VecU8: From<Self::U8> + UInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = u8>;
type VecI8: From<Self::I8> + SInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = i8>;
type VecU16: From<Self::U16> + UInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = u16>;
type VecI16: From<Self::I16> + SInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = i16>;
type VecF16: From<Self::F16> + Float + Compare<Bool = Self::VecBool> + Make<Self, Prim = F16>;
type VecU32: From<Self::U32> + UInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = u32>;
type VecI32: From<Self::I32> + SInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = i32>;
type VecF32: From<Self::F32> + Float + Compare<Bool = Self::VecBool> + Make<Self, Prim = f32>;
type VecU64: From<Self::U64> + UInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = u64>;
type VecI64: From<Self::I64> + SInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = i64>;
type VecF64: From<Self::F64> + Float + Compare<Bool = Self::VecBool> + Make<Self, Prim = f64>;
fn make<T: Make<Self>>(self, v: T::Prim) -> T {
T::make(self, v)
}
}
pub trait Make<Context>: Sized {
type Prim;
fn make(ctx: Context, v: Self::Prim) -> Self;
}
pub trait Number: Compare + Add<Output = Self> + Sub<Output = Self> + Mul<Output = Self> + Div<Output = Self> + Rem<Output = Self> + AddAssign + SubAssign + MulAssign + DivAssign + RemAssign {}
pub trait BitOps: Copy + And<Output = Self> + Or<Output = Self> + Xor<Output = Self> + Not<Output = Self> + AndAssign + OrAssign + XorAssign {}
pub trait Int<ShiftRhs>: Number + BitOps + Shl<ShiftRhs, Output = Self> + Shr<ShiftRhs, Output = Self> + ShlAssign<ShiftRhs> + ShrAssign<ShiftRhs> {}
pub trait UInt<ShiftRhs>: Int<ShiftRhs> {}
pub trait SInt<ShiftRhs>: Int<ShiftRhs> + Neg<Output = Self> {}
pub trait Float: Number + Neg<Output = Self> {
fn abs(self) -> Self;
fn trunc(self) -> Self;
fn ceil(self) -> Self;
fn floor(self) -> Self;
fn round(self) -> Self;
fn fma(self, a: Self, b: Self) -> Self;
fn is_nan(self) -> Self::Bool;
fn is_infinity(self) -> Self::Bool;
fn is_finite(self) -> Self::Bool;
}
pub trait Bool: BitOps {}
pub trait Select<T>: Bool {
fn select(self, true_v: T, false_v: T) -> T;
}
pub trait Compare: Copy {
type Bool: Bool + Select<Self>;
fn eq(self, rhs: Self) -> Self::Bool;
fn ne(self, rhs: Self) -> Self::Bool;
fn lt(self, rhs: Self) -> Self::Bool;
fn gt(self, rhs: Self) -> Self::Bool;
fn le(self, rhs: Self) -> Self::Bool;
fn ge(self, rhs: Self) -> Self::Bool;
}
A math function would be implemented like so:
pub fn sincospi<C: Context>(ctx: C, mut v: C::VecF64) -> (C::VecF64, C::VecF64) {
// todo handle non-finite
v *= ctx.make(0.5);
v -= v.floor();
v *= ctx.make(4.0);
let quadrant = v.floor().as_i64();
v -= v.floor();
// v now in range of 0 to 90 deg
// use first few terms of taylor series of sin(x*pi/2) and cos(x*pi/2) -- needs adjusting for accuracy; numbers likely incorrect
let v_sq = v * v;
let s = ctx.make(-0.004681754135318687);
let s = s * v_sq + ctx.make(0.07969262624616703);
let s = s * v_sq + ctx.make(-0.6459640975062462);
let s = s * v_sq + ctx.make(1.570796326794897);
let s = s * v;
// s is now sin(v * pi / 2)
let c = ctx.make(-0.02086348076335296);
let c = c * v_sq + ctx.make(0.253669507901048);
let c = c * v_sq + ctx.make(-1.23370055013617);
let c = c * v_sq + ctx.make(1.0);
// c is now cos(v * pi / 2)
let bit0 = (quadrant & ctx.make(1)).eq(ctx.make(1));
let bit1 = (quadrant & ctx.make(2)).eq(ctx.make(2));
let c_neg = bit0 ^ bit1;
let s_neg = bit1;
let swap = bit0;
let abs_sin = swap.select(s, c);
let abs_cos = swap.select(c, s);
let cos = c_neg.select(-abs_cos, abs_cos);
let sin = c_neg.select(-abs_sin, abs_sin);
(sin, cos)
}
Started an implementation at https://salsa.debian.org/Kazan-team/vector-math
not sure about the exact requirements you have, but rust-gpu settled on glam.
requirements are to implement all scalar functions required by Vulkan, vectorized (e.g. let x: f32 = f32::sin(y)
would be vectorized into something like let x: f32x16 = f32x16::sin_with_mask(y, mask);
), for f16, f32, and f64 in terms of other vector functions so they ultimately get translated to fadd, fsub, fmul, fdiv, fma, fsqrt, fp compares, and integer ops. The full list is given here (only the scalar fp ops need to be implemented):
https://www.khronos.org/registry/spir-v/specs/1.0/GLSL.std.450.html
additional requirements are to implement all fp functions desired by std::simd
.
nice to have: also implement (needed for OpenCL support):
*pi
variants of trigonometric functions (e.g. sinpi
), they will be much easier to implement correctly than the non-*pi
functions since the non-*pi
functions require complicated high-precision argument reduction (x mod 2*pi
) to give the correct answer for large inputs.log2
, log10
, log1p
, exp2
, exp10
, expm1
cbrt
All functions must at least meet Vulkan accuracy requirements: https://www.khronos.org/registry/vulkan/specs/1.2/html/chap36.html#spirvenv-precision-operation
not sure about the exact requirements you have, but rust-gpu settled on glam.
from what I can tell, the only functions glam has implemented that would meet the above requirements is exp and pow, except that those (or at least exp, didn't check pow's implementation) are implemented by just doing a series of scalar exp operations, defeating the point of having a faster-than-scalar math library. Also, glam only supports up to vec4, whereas Kazan needs generic vector lengths of up to 64.
Thanks anyway!
Got all the vector traits wired up for use with scalars (where all vectors are length 1 for testing purposes or otherwise) and with generating demo compiler IR. Still need to wire up std::simd
support.
https://salsa.debian.org/Kazan-team/vector-math/-/blob/7975aa9639f3a5a702b130a7cf992ffe71c86e2a/src/ir.rs#L1547 The following Rust:
fn f<Ctx: Context>(ctx: Ctx, a: Ctx::VecU8, b: Ctx::VecF32) -> Ctx::VecF64 {
let a: Ctx::VecF32 = a.into();
(a - (a + b - ctx.make(5f32)).floor()).to()
}
(the to()
function is a replacement for the as
keyword)
Generates the following demo IR:
function(in<arg_0>: vec<U8>, in<arg_1>: vec<F32>) -> vec<F64> {
op_0: vec<F32> = Cast in<arg_0>
op_1: vec<F32> = Add op_0, in<arg_1>
op_2: vec<F32> = Sub op_1, splat(0x40A00000_f32)
op_3: vec<F32> = Floor op_2
op_4: vec<F32> = Sub op_0, op_3
op_5: vec<F64> = Cast op_4
Return op_5
}
Opinions on ease of use for writing functions like f
?
Implemented ilogb
. Started implementing the std::simd
bindings, only to discover that I forgot that boolean vectors need different types for different element sizes.
I added bindings for std::simd
:
https://salsa.debian.org/Kazan-team/vector-math/-/commit/d77893b257c86d7ddc81f3772b5e143ce768d291
Looking at the assembly generated for stdsimd::tests::do_ilogb_f32x4
, it looks pretty reasonable, everything's inlined, however there is still some branching that llvm didn't optimize out:
stdsimd::tests::do_ilogb_f32x4
Compiled using:
cargo rustc --release --tests --features=ir,fma,stdsimd -- --emit=asm -C target-cpu=native
on a Ryzen 3900X on Ubuntu
I added implementations of sin_pi
and cos_pi
for f16
and f32
, I tested all possible inputs, the functions are accurate to 2ULP though aren't guaranteed to give the correct sign for zeros.
Testing the f32
functions required writing a custom reference implementation of sin_cos_pi
since (x * f64::consts::PI).sin_cos()
isn't accurate enough when the exact mathematical value of either sin
or cos
is close to zero (e.g. sin_pi(1.0) == 0.0
but f64::consts::PI.sin() != 0.0
due to round-off error).
I added sin_pi and cos_pi for f64, as well as adding abs, copy_sign, and trunc for all of f16/f32/f64.
I added round_to_nearest_ties_to_even, ceil, floor, and tan_pi
Added count_leading_zeros, count_trailing_zeros, and count_ones for all of u8/16/32/64 and i8/16/32/64
What do the counting functions have to do with the transcendental float functions? LLVM has ctlz etc that we should be using. Do we have any reason to believe that goes to libc?
What do the counting functions have to do with the transcendental float functions? LLVM has ctlz etc that we should be using. Do we have any reason to believe that goes to libc?
Well, I'm hoping to implement more than just transcendental float functions, I'm aiming more for all the non-trivial library functions (more or less). Also, idk if cranelift has built-in support for bit counting functions. The bit counting functions are in the list in #14...
Also, idk if cranelift has built-in support for bit counting functions.
Cranelift has the clz and ctz instructions. They don't seem to support vectors yet though. Wasm doesn't have SIMD bit counting functions yet: https://github.com/WebAssembly/simd/issues/6
Apologies for popping in as the new guy.
If you are interested. I'm writing a crate called doctor_syn
as part of the extendr
R language extension project.
This is a small computer algebra system that operates on Rust expressions and amongst other things generates polynomial approximations for arbitrary expressions, such as sin or cos, or anything you can express as a series or integral.
The intention is to generate sets of trancendental and stats functions to variable precision.
The nice thing is that the compiler can generate functions "to order" in procedural macros. But you could just run it to a fixed number of terms for this limited use case.
In games we often had "high throughput" and "low latency" variants.
I think we have two choices here--we can implement bit-counting functions directly in rust if we don't think any architectures have optimized SIMD implementations--but then that should be implemented directly in stdsimd. If there are specific instructions for them we should use the compiler codegen, which means cranelift would need to implement it for SIMD eventually. Either way, I don't think it belongs in a separate library, which just acts as a libc alternative.
I seem to recall from before that a vector math library called SLEEF was also seeking to be integrated into LLVM. How is vector-math
going to be different from it?
Our requirements (primarily the part where we want rustc to be able to inline the functions) lead to it needing to be implemented in Rust, which is a big difference.
Also, there's an abstraction layer allowing the vector math functions to generate compiler IR instead of doing the operations directly (not required for core::simd
, but very useful for Kazan and other projects that implement compilers)
Sorry to pop up again.
I'm trying to refine my best attempt at the trig functions which are very easy as they are periodic and well behaved, ie. converge quickly as McLaurin series.
Here is a version for f32, but SIMD versions should be the same.
fn sin(x: f32) -> f32 {
let x = x * (1.0 / (std::f32::consts::PI * 2.0));
let x = x - x.floor() - 0.5;
12.268859941019306_f32
.mul_add(x * x, -41.216241051002875_f32)
.mul_add(x * x, 76.58672703334098_f32)
.mul_add(x * x, -81.59746095374902_f32)
.mul_add(x * x, 41.34151143437585_f32)
.mul_add(x * x, -6.283184525811273_f32)
* x
}
This gives a short instruction sequence with moderate latency (All 4 cycles per instruction on Skylake).
example::sin:
vmulss xmm0, xmm0, dword ptr [rip + .LCPI0_0]
vroundss xmm1, xmm0, xmm0, 9
vsubss xmm0, xmm0, xmm1
vaddss xmm0, xmm0, dword ptr [rip + .LCPI0_1]
vmulss xmm1, xmm0, xmm0
vmovss xmm2, dword ptr [rip + .LCPI0_2]
vfmadd213ss xmm2, xmm1, dword ptr [rip + .LCPI0_3]
vfmadd213ss xmm2, xmm1, dword ptr [rip + .LCPI0_4]
vfmadd213ss xmm2, xmm1, dword ptr [rip + .LCPI0_5]
vfmadd213ss xmm2, xmm1, dword ptr [rip + .LCPI0_6]
vfmadd213ss xmm2, xmm1, dword ptr [rip + .LCPI0_7]
vmulss xmm0, xmm0, xmm2
ret
ie. About 48 cycle latency with a 6 cycle throughput.
The method uses the Newton polynomial method I mentioned with special attention to the endpoints.
The table-makers dilemma limits us to 2-3 ulp (measured from the maximum of +/- 1.0) and adding more terms does not improve this - and in fact may make it worse.
The same kernel executed in f64, convered to f32 gives < 0.5 ulp for the full 0..PI*2 range. Adding more terms to the f64 expansion will improve f32 ULP, but not f64 ulp which requires some nasty tricks to solve the TMD.
I've been planning to write a paper on this and may get around to it in some future life.
Note that we can approximate the whole range in a single polynomial. Using sin-cos and select shortens the range but reduces throughput.
It is expected that in a loop the compiler will put together 4 or more similar operations:
pub fn f(x: &mut [f32]) {
for v in x {
*v = sin(*v);
}
}
Does in fact do this.
Let me know what you think. I can make some similar kernels for the other "standard" functions if you are interested.
Implemented sqrt_fast_f16
/f32
/f64
which are accurate to 2/3/2 ULPs respectively. No libm needed!
I am working on sin, cos, exp, log.
I'm hoping we can do tan, sec, csc without divides.
For inverse trig, a good atan2 is a good place to start as all the rest can be derived from this.
I can go through the doctor_syn code if you want to do some more.
Could someone help me to add these implementations. I am also very happy to support other doing this,
Where should they go? How do we test them formally etc.
Any suggestions welcome.
I've added a PR #126 @programmerjake @calebzulawski
I've only added the sin(f32) function as a placeholder, but I can generate others using libmgen
in the doctor_syn
crate (contributors welcome).
Could someone help me to add these implementations. I am also very happy to support other doing this,
Where should they go?
I'd advocate for them going in https://salsa.debian.org/Kazan-team/vector-math which I'm planning on merging into core
kinda like compiler-builtins is. They should be built to use the abstraction layer in crate::traits::Context
(best viewed using rustdocs), example:
https://salsa.debian.org/Kazan-team/vector-math/-/blob/a43a29ccd5824b9b6dcb45baa3ccb3f8c5ea72e1/src/algorithms/base.rs#L57
How do we test them formally etc.
For f16
/f32
functions, it's feasible to test all possible inputs for unary functions; for f64
, I'm just making-do with testing every Nth input bit-pattern.
https://salsa.debian.org/Kazan-team/vector-math/-/blob/a43a29ccd5824b9b6dcb45baa3ccb3f8c5ea72e1/src/algorithms/trig_pi.rs#L884
I have a dedicated CI runner for vector-math (shared between all Kazan and Libre-SOC projects), so I just run the tests for an hour. https://salsa.debian.org/Kazan-team/vector-math/-/pipelines/257995/builds
Started thread on llvm-dev: https://lists.llvm.org/pipermail/llvm-dev/2021-June/150965.html
We now have most of the basic libm functions in scalar form.
https://github.com/extendr/doctor-syn/blob/main/tests/libm.rs
I need to add a scalar to simd transform, more accurate versions, and some of the "pedantry" of the various standards as optional extras.
Examples are preserving the input of sin(x) for low magnitude of x. ie. sin(x) = x for small x including -0 and +0.
Different standards disagree on where the errors should occur and others leave it quite open. The game industry, for example, does not need small value corrections.
I have also considered generating LLVM code to generate the IR.
is there a testsuite that compares the Rust Vector Math library against glibc/musl/Microsoft runtime versions? for example in speed and accuracy, on multiple platforms?
We now have most of the basic libm functions in scalar form.
https://github.com/extendr/doctor-syn/blob/main/tests/libm.rs
Sorry, some of those functions are horribly broken:
exp2(384.0)
returns -Inf
! it should never return a negative value, much less negative infinity! you can also get it to produce NaN and most any f32 value.
is there a testsuite that compares the Rust Vector Math library against glibc/musl/Microsoft runtime versions? for example in speed and accuracy, on multiple platforms?
not yet...there is a test suite that compares it's accuracy against the correct mathematical results (or as close as is easyish to get -- it tests f16
/f32
against the f64
function from std
rounded to f16
/f32
, the f64
versions are tested against the rug
bindings to MPFR, which is known to produce correct results). The functions are documented how accurate they are -- if they don't have documentation they should be 100% accurate (such as copy_sign
), assuming I didn't miss anything.
On the accuracy front, do we have any plans to support different levels of accuracy? I'm aware that users may have different requirements depending on their domain -- e.g. scientific/engineering vs machine learning.
Having different implementations available could be quite useful, e.g.
For context: I'm relatively inexperienced with floating point approximations, but I have previously done some hacking here and here on a fast exp
implementation useful for imprecise things like calculating a softmax quickly. This example is probably somewhere in the fast
or really fast and dirty
region, and I wouldn't suggest re-using it as corners have been cut :)
I believe the approach, like Andy's one, is scalable in terms of time/accuracy by including more coefficients.
We should support different accuracies and levels of pedantry if we want game Devs to use the functions.
My current libm is pedantry free and good to a fixed max error which is expected game dev behaviour.
We can add more cycles to make the library POSIX compliant, but this is not desirable behaviour for the expert user and so should be configurable.
An example is cos and sin which use a lookup table in POSIX impls to handle four quadrants. This carries a very high cost to Get one more bit of accuracy, possibly four times slower.
On Sun, 13 Jun 2021, 11:48 Michael Barber, @.***> wrote:
On the accuracy front, do we have any plans to support different levels of accuracy? I'm aware that users may have different requirements depending on their domain -- e.g. scientific/engineering vs machine learning.
Having different implementations available could be quite useful, e.g.
- accurate
- fast
- really fast and dirty
For context: I'm relatively inexperienced with floating point approximations, but I have previously done some hacking here https://github.com/mike-barber/rust-fast-linear-estimator/blob/master/fast-linear-estimator/src/exp_approx_avx.rs and here https://github.com/mike-barber/rust-fast-linear-estimator/blob/master/fast-linear-estimator/src/exp_approx.rs on a fast exp implementation useful for imprecise things like calculating a softmax quickly. This example is probably somewhere in the fast or really fast and dirty region, and I wouldn't suggest re-using it as corners have been cut :)
I believe the approach, like Andy's one, is scalable in terms of time/accuracy by including more coefficients.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/rust-lang/stdsimd/issues/109#issuecomment-860190337, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAL36XFFZ6HLGF7YGK4PZRLTSSEH7ANCNFSM43RZE7PA .
An example is cos and sin which use a lookup table in POSIX impls to handle four quadrants. This carries a very high cost to Get one more bit of accuracy, possibly four times slower.
The sin_pi
and related functions I have implemented so far use two polynomial evaluations, then use select
to get the right sign and to switch between sin
and cos
. On CPUs with a pipelined FMA or pipelined mul and add units (basically everything modern other than microcontrollers), the polynomials can run in just 2-3 more cycles (not tested, but I am a cpu designer for Libre-SOC) than a single polynomial evaluation, since the FMAs run interleaved with each other.
You are right, this is a good way to get an accurate cos and sin and indeed has lower latency. If you are calling a function occasionally, you want to optimise for latency over throughput.
If you are in a loop that has been unrolled, each of the operations will be done four or more times by the compiler to fill in the gaps between the instructions (latency/cpi) increasing throughput. For this case, having fewer instructions is best and we may be willing to sacrifice the last two bits.
I wouldn't want to call what would be preferred. Even the modulus and scaling steps would be considered optional if you stored Euler angles as small integers, for example.
In practice, making a super fast function will be rendered useless by the memory access speed which will always slow you down in the loop case and the I-cache performance in the "single" use case. Big game engines are nearly always I-cache bound, I discovered at Sony. We spent a lot of time making the instructions shorter, but LLVM does not do this because all their tests are small loops.
I was inspired by @programmerjake to try the quadrant version of sin/cos
The conditional negation of the cos and sin results is a bit hacky and could be replaced by val ^ ((x & 1) << 31)
As a result of the shorter range, we can also reduce the number of terms and so it may be as fast as the "fast" version anyway!
The godbolt output looks not too bad, although LLVM is not scheduling the result optimally.
I also need to add a simdifying transform to the functions.
fn sin(x: f32) -> f32 {
let x = x * (1.0 / (std::f32::consts::PI));
let xh = x + 0.5;
let xr = x.round();
let xhr = xh.round();
let s = x - xr;
let c = xh - xhr;
let sr = (-0.5878112252588481_f32)
.mul_add(s * s, 2.5496484487855455_f32)
.mul_add(s * s, -5.167705042516404_f32)
.mul_add(s * s, 3.1415926342503133_f32)
* s;
let cr = (0.23132925050778008_f32)
.mul_add(c * c, -1.335050718961723_f32)
.mul_add(c * c, 4.0587078433143_f32)
.mul_add(c * c, -4.934802176219238_f32)
.mul_add(c * c, 0.9999999999999999_f32);
let ss = if (xr as i32) & 1 == 0 { sr } else { -sr };
let cs = if (xhr as i32 & 1) == 0 { -cr } else { cr };
if s.abs() <= 0.25 {
ss
} else {
cs
}
}
WIP implementation: https://salsa.debian.org/Kazan-team/vector-math
Problem Statement: Vector Math functions are not widely available and can’t be used from
core
due to needing to link to the external vector math library, therefore we are building a math library for Rust that can be used everywhere and doesn’t require external dependencies allowing it to be used incore
.This allows
SimdF32::sin()
and similar to then work everywhereSimdF32
or similar works (basically everywhere). This includes WebAssembly (with or without thesimd128
extension), Microcontrollers, etc. Also, where actual SIMD instructions are available, it will be faster than just callingf32::sin
a bunch of times.We will want to extend LLVM to use our Vector Math library as the fallback implementation for LLVM's vector math intrinsics (e.g.
llvm.sin.v8f32
). The reason we want to go through all the hassle of getting LLVM to generate calls to our library is then because LLVM can then generate the nativesin
instruction(s) where supported and call our library otherwise. Leaving it up to LLVM to decide is by far the best option since it can do const-propagation and other optimizations based on its knowledge of howsin
ought to behave, and LLVM is the best spot to make target-specific decisions. This means we can’t just use LLVM’s existing features for a vector math library but require it to learn how to generate calls to our Vector Math library when native instructions aren't available. Example of adding libcall support to LLVM: https://reviews.llvm.org/D53927Implementation: https://salsa.debian.org/Kazan-team/vector-math
The Vector Math library is being written in
#![no_std]
Rust, along with an abstraction layer allowing the implemented functions to also be used in Kazan (a Vulkan GPU driver being developed for Libre-SOC, who is helping funding development, see bug on Libre-SOC's bugtracker).The abstraction layer will have four implementations, the first three of which are currently implemented:
core::simd
.Kazan needs all the functions that Vulkan requires (basically all the sin, cos, tan, atan, sinpi, cospi, exp, log2, pow, etc. functions), if we work together on the same library we could also use it for Rust's
std::simd
, potentially saving a bit of work. Both Kazan and rustc share backends (LLVM and cranelift) and would like to support having vector math functions inlined into calling code, giving a potentially substantial performance boost.Implemented functions so far:
f16
/f32
/f64
:abs
,copy_sign
,trunc
,round_to_nearest_ties_to_even
,floor
,ceil
ilogb
sin_pi
,cos_pi
,sin_cos_pi
,tan_pi
sqrt_fast
u8
/u16
/u32
/u64
/i8
/i16
/i32
/i64
:count_leading_zeros
count_trailing_zeros
count_ones