Create a Vector Math library to allow `SimdF32::sin` and similar to work in `core`

programmerjake commented 3 years ago

WIP implementation: https://salsa.debian.org/Kazan-team/vector-math

Problem Statement: Vector Math functions are not widely available and can’t be used from core due to needing to link to the external vector math library, therefore we are building a math library for Rust that can be used everywhere and doesn’t require external dependencies allowing it to be used in core.

This allows SimdF32::sin() and similar to then work everywhere SimdF32 or similar works (basically everywhere). This includes WebAssembly (with or without the simd128 extension), Microcontrollers, etc. Also, where actual SIMD instructions are available, it will be faster than just calling f32::sin a bunch of times.

We will want to extend LLVM to use our Vector Math library as the fallback implementation for LLVM's vector math intrinsics (e.g. llvm.sin.v8f32). The reason we want to go through all the hassle of getting LLVM to generate calls to our library is then because LLVM can then generate the native sin instruction(s) where supported and call our library otherwise. Leaving it up to LLVM to decide is by far the best option since it can do const-propagation and other optimizations based on its knowledge of how sin ought to behave, and LLVM is the best spot to make target-specific decisions. This means we can’t just use LLVM’s existing features for a vector math library but require it to learn how to generate calls to our Vector Math library when native instructions aren't available. Example of adding libcall support to LLVM: https://reviews.llvm.org/D53927

Implementation: https://salsa.debian.org/Kazan-team/vector-math

The Vector Math library is being written in #![no_std] Rust, along with an abstraction layer allowing the implemented functions to also be used in Kazan (a Vulkan GPU driver being developed for Libre-SOC, who is helping funding development, see bug on Libre-SOC's bugtracker).

The abstraction layer will have four implementations, the first three of which are currently implemented:

A scalar math implementation for testing purposes.
A implementation wrapping over core::simd.
A demo implementation of generating compiler IR.
An implemententation that generates Kazan's compiler IR, allowing integration with Kazan.

Kazan needs all the functions that Vulkan requires (basically all the sin, cos, tan, atan, sinpi, cospi, exp, log2, pow, etc. functions), if we work together on the same library we could also use it for Rust's std::simd, potentially saving a bit of work. Both Kazan and rustc share backends (LLVM and cranelift) and would like to support having vector math functions inlined into calling code, giving a potentially substantial performance boost.

Implemented functions so far: f16/f32/f64:

abs, copy_sign, trunc, round_to_nearest_ties_to_even, floor, ceil
ilogb
sin_pi, cos_pi, sin_cos_pi, tan_pi
sqrt_fast

u8/u16/u32/u64/i8/i16/i32/i64:

count_leading_zeros
count_trailing_zeros
count_ones

programmerjake commented 3 years ago

Funding available! Provisionally assigning EUR 2000, to be adjusted as needed.

programmerjake commented 3 years ago

From the Libre-SOC bugtracker:

We'd need this library because a SimpleV-capable math library doesn't exist, we need to build one. We can share the implementation costs with Rust's std::simd (which also wants a vector math library without libc, libm, or OS dependencies for use in Rust's core library). In particular, this means the math library will have generic vector-length-independent implementations written in Rust.

calebzulawski commented 3 years ago

I'm unfamiliar with Kazan, but if both projects use LLVM perhaps most of the implementation work here should be handled in LLVM?

It seems to me like the most universal approach would be to provide a library as part of the LLVM project that implements these functions, similar to (or maybe even part of?) compiler-rt.

programmerjake commented 3 years ago

We'd also want it to work with cranelift, so having it be part of LLVM may not be best.

thomcc commented 3 years ago

For clarity: It would still be written in Rust?

programmerjake commented 3 years ago

@thomcc That's the plan...Rust seems like the best language for core::simd.

thomcc commented 3 years ago

Right, I just wasn't sure it met your other requirements.

programmerjake commented 3 years ago

Kazan is also written in Rust.

there can either be a vector abstraction layer that compiles out to nothing in std::simd and compiles to code for generating IR in Kazan, or Kazan can just use the spir-v backend for rustc, or save the generated llvm & cranelift ir.

programmerjake commented 3 years ago

A rough sketch of how I think we could have the vector math library's API that the vector math functions would be implemented using:

pub struct F16(u16); // to be replaced by built-in type when Rust gains support

/// reference used to build IR for Kazan; an empty type for `core::simd`
pub trait Context: Copy {
    // todo: add missing type conversions such as i32 -> f64 and i64 -> u8
    // scalar types
    type Bool: Bool + Make<Self, Prim = bool> + Select<Self::Bool>;
    type U8: UInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = u8>;
    type I8: SInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = i8>;
    type U16: UInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = u16>;
    type I16: SInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = i16>;
    type F16: Float + Compare<Bool = Self::Bool> + Make<Self, Prim = F16>;
    type U32: UInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = u32>;
    type I32: SInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = i32>;
    type F32: Float + Compare<Bool = Self::Bool> + Make<Self, Prim = f32>;
    type U64: UInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = u64>;
    type I64: SInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = i64>;
    type F64: Float + Compare<Bool = Self::Bool> + Make<Self, Prim = f64>;
    // Vector types
    type VecBool: From<Self::Bool> + Bool + Make<Self, Prim = bool> + Select<Self::VecBool>;
    type VecU8: From<Self::U8> + UInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = u8>;
    type VecI8: From<Self::I8> + SInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = i8>;
    type VecU16: From<Self::U16> + UInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = u16>;
    type VecI16: From<Self::I16> + SInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = i16>;
    type VecF16: From<Self::F16> + Float + Compare<Bool = Self::VecBool> + Make<Self, Prim = F16>;
    type VecU32: From<Self::U32> + UInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = u32>;
    type VecI32: From<Self::I32> + SInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = i32>;
    type VecF32: From<Self::F32> + Float + Compare<Bool = Self::VecBool> + Make<Self, Prim = f32>;
    type VecU64: From<Self::U64> + UInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = u64>;
    type VecI64: From<Self::I64> + SInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = i64>;
    type VecF64: From<Self::F64> + Float + Compare<Bool = Self::VecBool> + Make<Self, Prim = f64>;
    fn make<T: Make<Self>>(self, v: T::Prim) -> T {
        T::make(self, v)
    }
}

pub trait Make<Context>: Sized {
    type Prim;
    fn make(ctx: Context, v: Self::Prim) -> Self;
}

pub trait Number: Compare + Add<Output = Self> + Sub<Output = Self> + Mul<Output = Self> + Div<Output = Self> + Rem<Output = Self> + AddAssign + SubAssign + MulAssign + DivAssign + RemAssign {}

pub trait BitOps: Copy + And<Output = Self> + Or<Output = Self> + Xor<Output = Self> + Not<Output = Self> + AndAssign + OrAssign + XorAssign {}

pub trait Int<ShiftRhs>: Number + BitOps + Shl<ShiftRhs, Output = Self> + Shr<ShiftRhs, Output = Self> + ShlAssign<ShiftRhs> + ShrAssign<ShiftRhs> {}

pub trait UInt<ShiftRhs>: Int<ShiftRhs> {}

pub trait SInt<ShiftRhs>: Int<ShiftRhs> + Neg<Output = Self> {}

pub trait Float: Number + Neg<Output = Self> {
    fn abs(self) -> Self;
    fn trunc(self) -> Self;
    fn ceil(self) -> Self;
    fn floor(self) -> Self;
    fn round(self) -> Self;
    fn fma(self, a: Self, b: Self) -> Self;
    fn is_nan(self) -> Self::Bool;
    fn is_infinity(self) -> Self::Bool;
    fn is_finite(self) -> Self::Bool;
}

pub trait Bool: BitOps {}

pub trait Select<T>: Bool {
    fn select(self, true_v: T, false_v: T) -> T;
}

pub trait Compare: Copy {
    type Bool: Bool + Select<Self>;
    fn eq(self, rhs: Self) -> Self::Bool;
    fn ne(self, rhs: Self) -> Self::Bool;
    fn lt(self, rhs: Self) -> Self::Bool;
    fn gt(self, rhs: Self) -> Self::Bool;
    fn le(self, rhs: Self) -> Self::Bool;
    fn ge(self, rhs: Self) -> Self::Bool;
}

A math function would be implemented like so:

pub fn sincospi<C: Context>(ctx: C, mut v: C::VecF64) -> (C::VecF64, C::VecF64) {
    // todo handle non-finite
    v *= ctx.make(0.5);
    v -= v.floor();
    v *= ctx.make(4.0);
    let quadrant = v.floor().as_i64();
    v -= v.floor();
    // v now in range of 0 to 90 deg
    // use first few terms of taylor series of sin(x*pi/2) and cos(x*pi/2) -- needs adjusting for accuracy; numbers likely incorrect
    let v_sq = v * v;
    let s = ctx.make(-0.004681754135318687);
    let s = s * v_sq + ctx.make(0.07969262624616703);
    let s = s * v_sq + ctx.make(-0.6459640975062462);
    let s = s * v_sq + ctx.make(1.570796326794897);
    let s = s * v;
    // s is now sin(v * pi / 2)
    let c = ctx.make(-0.02086348076335296);
    let c = c * v_sq + ctx.make(0.253669507901048);
    let c = c * v_sq + ctx.make(-1.23370055013617);
    let c = c * v_sq + ctx.make(1.0);
    // c is now cos(v * pi / 2)
    let bit0 = (quadrant & ctx.make(1)).eq(ctx.make(1));
    let bit1 = (quadrant & ctx.make(2)).eq(ctx.make(2));
    let c_neg = bit0 ^ bit1;
    let s_neg = bit1;
    let swap = bit0;
    let abs_sin = swap.select(s, c);
    let abs_cos = swap.select(c, s);
    let cos = c_neg.select(-abs_cos, abs_cos);
    let sin = c_neg.select(-abs_sin, abs_sin);
    (sin, cos)
}

programmerjake commented 3 years ago

Started an implementation at https://salsa.debian.org/Kazan-team/vector-math

dvc94ch commented 3 years ago

not sure about the exact requirements you have, but rust-gpu settled on glam.

programmerjake commented 3 years ago

requirements are to implement all scalar functions required by Vulkan, vectorized (e.g. let x: f32 = f32::sin(y) would be vectorized into something like let x: f32x16 = f32x16::sin_with_mask(y, mask);), for f16, f32, and f64 in terms of other vector functions so they ultimately get translated to fadd, fsub, fmul, fdiv, fma, fsqrt, fp compares, and integer ops. The full list is given here (only the scalar fp ops need to be implemented): https://www.khronos.org/registry/spir-v/specs/1.0/GLSL.std.450.html

additional requirements are to implement all fp functions desired by std::simd.

nice to have: also implement (needed for OpenCL support):

*pi variants of trigonometric functions (e.g. sinpi), they will be much easier to implement correctly than the non-*pi functions since the non-*pi functions require complicated high-precision argument reduction (x mod 2*pi) to give the correct answer for large inputs.
log2, log10, log1p, exp2, exp10, expm1
cbrt

All functions must at least meet Vulkan accuracy requirements: https://www.khronos.org/registry/vulkan/specs/1.2/html/chap36.html#spirvenv-precision-operation

programmerjake commented 3 years ago

not sure about the exact requirements you have, but rust-gpu settled on glam.

from what I can tell, the only functions glam has implemented that would meet the above requirements is exp and pow, except that those (or at least exp, didn't check pow's implementation) are implemented by just doing a series of scalar exp operations, defeating the point of having a faster-than-scalar math library. Also, glam only supports up to vec4, whereas Kazan needs generic vector lengths of up to 64.

Thanks anyway!

programmerjake commented 3 years ago

Got all the vector traits wired up for use with scalars (where all vectors are length 1 for testing purposes or otherwise) and with generating demo compiler IR. Still need to wire up std::simd support.

https://salsa.debian.org/Kazan-team/vector-math/-/blob/7975aa9639f3a5a702b130a7cf992ffe71c86e2a/src/ir.rs#L1547 The following Rust:

fn f<Ctx: Context>(ctx: Ctx, a: Ctx::VecU8, b: Ctx::VecF32) -> Ctx::VecF64 {
    let a: Ctx::VecF32 = a.into();
    (a - (a + b - ctx.make(5f32)).floor()).to()
}

(the to() function is a replacement for the as keyword)

Generates the following demo IR:

function(in<arg_0>: vec<U8>, in<arg_1>: vec<F32>) -> vec<F64> {
    op_0: vec<F32> = Cast in<arg_0>
    op_1: vec<F32> = Add op_0, in<arg_1>
    op_2: vec<F32> = Sub op_1, splat(0x40A00000_f32)
    op_3: vec<F32> = Floor op_2
    op_4: vec<F32> = Sub op_0, op_3
    op_5: vec<F64> = Cast op_4
    Return op_5
}

Opinions on ease of use for writing functions like f?

programmerjake commented 3 years ago

Implemented ilogb. Started implementing the std::simd bindings, only to discover that I forgot that boolean vectors need different types for different element sizes.

programmerjake commented 3 years ago

I added bindings for std::simd: https://salsa.debian.org/Kazan-team/vector-math/-/commit/d77893b257c86d7ddc81f3772b5e143ce768d291 Looking at the assembly generated for stdsimd::tests::do_ilogb_f32x4, it looks pretty reasonable, everything's inlined, however there is still some branching that llvm didn't optimize out:

assembly for stdsimd::tests::do_ilogb_f32x4

```asm _ZN11vector_math7stdsimd5tests14do_ilogb_f32x417h16d8f5e11aac2a22E: .cfi_startproc subq $24, %rsp .cfi_def_cfa_offset 32 vbroadcastss .LCPI82_0(%rip), %xmm1 xorl %eax, %eax vmovaps %xmm1, (%rsp) .p2align 4, 0x90 .LBB82_1: cmpq $16, %rax je .LBB82_4 cmpl $32, (%rsp,%rax) leaq 4(%rax), %rax jb .LBB82_1 jmp .LBB82_3 .LBB82_4: xorl %eax, %eax vmovaps %xmm1, (%rsp) .p2align 4, 0x90 .LBB82_5: cmpq $16, %rax je .LBB82_7 cmpl $32, (%rsp,%rax) leaq 4(%rax), %rax jb .LBB82_5 jmp .LBB82_3 .LBB82_7: xorl %eax, %eax vmovaps %xmm1, (%rsp) .p2align 4, 0x90 .LBB82_8: cmpq $16, %rax je .LBB82_10 cmpl $32, (%rsp,%rax) leaq 4(%rax), %rax jb .LBB82_8 .LBB82_3: leaq .L__unnamed_116(%rip), %rdi leaq .L__unnamed_117(%rip), %rdx movl $30, %esi callq *_ZN4core9panicking5panic17h3de4db67bd397eb3E@GOTPCREL(%rip) ud2 .LBB82_10: vbroadcastss .LCPI82_1(%rip), %xmm1 vbroadcastss .LCPI82_3(%rip), %xmm5 vbroadcastss .LCPI82_4(%rip), %xmm6 vbroadcastss .LCPI82_5(%rip), %xmm7 vpbroadcastd .LCPI82_2(%rip), %xmm2 vxorps %xmm3, %xmm3, %xmm3 vcmpunordps %xmm3, %xmm0, %xmm4 vmulps %xmm1, %xmm0, %xmm1 vblendvps %xmm4, %xmm5, %xmm6, %xmm4 vandps %xmm6, %xmm0, %xmm6 vpbroadcastd .LCPI82_5(%rip), %xmm5 vcmpltps %xmm7, %xmm6, %xmm6 vpsrld $23, %xmm0, %xmm7 vpsrld $23, %xmm1, %xmm1 vpand %xmm2, %xmm1, %xmm1 vpand %xmm2, %xmm7, %xmm2 vpbroadcastd .LCPI82_6(%rip), %xmm7 vpand %xmm5, %xmm0, %xmm5 vcmpeqps %xmm3, %xmm0, %xmm0 vpcmpeqd %xmm3, %xmm5, %xmm5 vpaddd %xmm7, %xmm2, %xmm2 vblendvps %xmm6, %xmm2, %xmm4, %xmm2 vpbroadcastd .LCPI82_7(%rip), %xmm6 vbroadcastss .LCPI82_8(%rip), %xmm4 vpaddd %xmm6, %xmm1, %xmm1 vblendvps %xmm0, %xmm4, %xmm1, %xmm0 vblendvps %xmm5, %xmm0, %xmm2, %xmm0 vmovaps %xmm0, (%rdi) addq $24, %rsp .cfi_def_cfa_offset 8 retq ```

Compiled using:

cargo rustc --release --tests --features=ir,fma,stdsimd -- --emit=asm -C target-cpu=native

on a Ryzen 3900X on Ubuntu

programmerjake commented 3 years ago

I added implementations of sin_pi and cos_pi for f16 and f32, I tested all possible inputs, the functions are accurate to 2ULP though aren't guaranteed to give the correct sign for zeros.

Testing the f32 functions required writing a custom reference implementation of sin_cos_pi since (x * f64::consts::PI).sin_cos() isn't accurate enough when the exact mathematical value of either sin or cos is close to zero (e.g. sin_pi(1.0) == 0.0 but f64::consts::PI.sin() != 0.0 due to round-off error).

https://salsa.debian.org/Kazan-team/vector-math/-/blob/d79f43bed2398cbc4f6b75b8e55ee317289599a1/src/algorithms/trig_pi.rs#L180

programmerjake commented 3 years ago

I added sin_pi and cos_pi for f64, as well as adding abs, copy_sign, and trunc for all of f16/f32/f64.

programmerjake commented 3 years ago

I added round_to_nearest_ties_to_even, ceil, floor, and tan_pi

programmerjake commented 3 years ago

Added count_leading_zeros, count_trailing_zeros, and count_ones for all of u8/16/32/64 and i8/16/32/64

calebzulawski commented 3 years ago

What do the counting functions have to do with the transcendental float functions? LLVM has ctlz etc that we should be using. Do we have any reason to believe that goes to libc?

programmerjake commented 3 years ago

What do the counting functions have to do with the transcendental float functions? LLVM has ctlz etc that we should be using. Do we have any reason to believe that goes to libc?

Well, I'm hoping to implement more than just transcendental float functions, I'm aiming more for all the non-trivial library functions (more or less). Also, idk if cranelift has built-in support for bit counting functions. The bit counting functions are in the list in #14...

bjorn3 commented 3 years ago

Also, idk if cranelift has built-in support for bit counting functions.

Cranelift has the clz and ctz instructions. They don't seem to support vectors yet though. Wasm doesn't have SIMD bit counting functions yet: https://github.com/WebAssembly/simd/issues/6

andy-thomason commented 3 years ago

Apologies for popping in as the new guy.

If you are interested. I'm writing a crate called doctor_syn as part of the extendr R language extension project.

This is a small computer algebra system that operates on Rust expressions and amongst other things generates polynomial approximations for arbitrary expressions, such as sin or cos, or anything you can express as a series or integral.

The intention is to generate sets of trancendental and stats functions to variable precision.

The nice thing is that the compiler can generate functions "to order" in procedural macros. But you could just run it to a fixed number of terms for this limited use case.

In games we often had "high throughput" and "low latency" variants.

calebzulawski commented 3 years ago

I think we have two choices here--we can implement bit-counting functions directly in rust if we don't think any architectures have optimized SIMD implementations--but then that should be implemented directly in stdsimd. If there are specific instructions for them we should use the compiler codegen, which means cranelift would need to implement it for SIMD eventually. Either way, I don't think it belongs in a separate library, which just acts as a libc alternative.

Artoria2e5 commented 3 years ago

I seem to recall from before that a vector math library called SLEEF was also seeking to be integrated into LLVM. How is vector-math going to be different from it?

thomcc commented 3 years ago

Our requirements (primarily the part where we want rustc to be able to inline the functions) lead to it needing to be implemented in Rust, which is a big difference.

programmerjake commented 3 years ago

Also, there's an abstraction layer allowing the vector math functions to generate compiler IR instead of doing the operations directly (not required for core::simd, but very useful for Kazan and other projects that implement compilers)

andy-thomason commented 3 years ago

Sorry to pop up again.

I'm trying to refine my best attempt at the trig functions which are very easy as they are periodic and well behaved, ie. converge quickly as McLaurin series.

Here is a version for f32, but SIMD versions should be the same.

fn sin(x: f32) -> f32 {
    let x = x * (1.0 / (std::f32::consts::PI * 2.0));
    let x = x - x.floor() - 0.5;
    12.268859941019306_f32
        .mul_add(x * x, -41.216241051002875_f32)
        .mul_add(x * x, 76.58672703334098_f32)
        .mul_add(x * x, -81.59746095374902_f32)
        .mul_add(x * x, 41.34151143437585_f32)
        .mul_add(x * x, -6.283184525811273_f32)
        * x
}

This gives a short instruction sequence with moderate latency (All 4 cycles per instruction on Skylake).

example::sin:
        vmulss  xmm0, xmm0, dword ptr [rip + .LCPI0_0]
        vroundss        xmm1, xmm0, xmm0, 9
        vsubss  xmm0, xmm0, xmm1
        vaddss  xmm0, xmm0, dword ptr [rip + .LCPI0_1]
        vmulss  xmm1, xmm0, xmm0
        vmovss  xmm2, dword ptr [rip + .LCPI0_2]
        vfmadd213ss     xmm2, xmm1, dword ptr [rip + .LCPI0_3]
        vfmadd213ss     xmm2, xmm1, dword ptr [rip + .LCPI0_4]
        vfmadd213ss     xmm2, xmm1, dword ptr [rip + .LCPI0_5]
        vfmadd213ss     xmm2, xmm1, dword ptr [rip + .LCPI0_6]
        vfmadd213ss     xmm2, xmm1, dword ptr [rip + .LCPI0_7]
        vmulss  xmm0, xmm0, xmm2
        ret

ie. About 48 cycle latency with a 6 cycle throughput.

The method uses the Newton polynomial method I mentioned with special attention to the endpoints.

The table-makers dilemma limits us to 2-3 ulp (measured from the maximum of +/- 1.0) and adding more terms does not improve this - and in fact may make it worse.

The same kernel executed in f64, convered to f32 gives < 0.5 ulp for the full 0..PI*2 range. Adding more terms to the f64 expansion will improve f32 ULP, but not f64 ulp which requires some nasty tricks to solve the TMD.

I've been planning to write a paper on this and may get around to it in some future life.

Note that we can approximate the whole range in a single polynomial. Using sin-cos and select shortens the range but reduces throughput.

It is expected that in a loop the compiler will put together 4 or more similar operations:

pub fn f(x: &mut [f32]) {
    for v in x {
        *v = sin(*v);
    }
}

Does in fact do this.

Let me know what you think. I can make some similar kernels for the other "standard" functions if you are interested.

programmerjake commented 3 years ago

Implemented sqrt_fast_f16/f32/f64 which are accurate to 2/3/2 ULPs respectively. No libm needed!

andy-thomason commented 3 years ago

I am working on sin, cos, exp, log.

I'm hoping we can do tan, sec, csc without divides.

For inverse trig, a good atan2 is a good place to start as all the rest can be derived from this.

I can go through the doctor_syn code if you want to do some more.

andy-thomason commented 3 years ago

Could someone help me to add these implementations. I am also very happy to support other doing this,

Where should they go? How do we test them formally etc.

Any suggestions welcome.

andy-thomason commented 3 years ago

I've added a PR #126 @programmerjake @calebzulawski

I've only added the sin(f32) function as a placeholder, but I can generate others using libmgen in the doctor_syn crate (contributors welcome).

programmerjake commented 3 years ago

Could someone help me to add these implementations. I am also very happy to support other doing this,

Where should they go?

I'd advocate for them going in https://salsa.debian.org/Kazan-team/vector-math which I'm planning on merging into core kinda like compiler-builtins is. They should be built to use the abstraction layer in crate::traits::Context (best viewed using rustdocs), example: https://salsa.debian.org/Kazan-team/vector-math/-/blob/a43a29ccd5824b9b6dcb45baa3ccb3f8c5ea72e1/src/algorithms/base.rs#L57

How do we test them formally etc.

For f16/f32 functions, it's feasible to test all possible inputs for unary functions; for f64, I'm just making-do with testing every Nth input bit-pattern. https://salsa.debian.org/Kazan-team/vector-math/-/blob/a43a29ccd5824b9b6dcb45baa3ccb3f8c5ea72e1/src/algorithms/trig_pi.rs#L884

I have a dedicated CI runner for vector-math (shared between all Kazan and Libre-SOC projects), so I just run the tests for an hour. https://salsa.debian.org/Kazan-team/vector-math/-/pipelines/257995/builds

programmerjake commented 3 years ago

Started thread on llvm-dev: https://lists.llvm.org/pipermail/llvm-dev/2021-June/150965.html

andy-thomason commented 3 years ago

We now have most of the basic libm functions in scalar form.

https://github.com/extendr/doctor-syn/blob/main/tests/libm.rs

I need to add a scalar to simd transform, more accurate versions, and some of the "pedantry" of the various standards as optional extras.

Examples are preserving the input of sin(x) for low magnitude of x. ie. sin(x) = x for small x including -0 and +0.

Different standards disagree on where the errors should occur and others leave it quite open. The game industry, for example, does not need small value corrections.

I have also considered generating LLVM code to generate the IR.

LowLevelMahn commented 3 years ago

is there a testsuite that compares the Rust Vector Math library against glibc/musl/Microsoft runtime versions? for example in speed and accuracy, on multiple platforms?

programmerjake commented 3 years ago

We now have most of the basic libm functions in scalar form.

https://github.com/extendr/doctor-syn/blob/main/tests/libm.rs

Sorry, some of those functions are horribly broken: exp2(384.0) returns -Inf! it should never return a negative value, much less negative infinity! you can also get it to produce NaN and most any f32 value.

programmerjake commented 3 years ago

is there a testsuite that compares the Rust Vector Math library against glibc/musl/Microsoft runtime versions? for example in speed and accuracy, on multiple platforms?

not yet...there is a test suite that compares it's accuracy against the correct mathematical results (or as close as is easyish to get -- it tests f16/f32 against the f64 function from std rounded to f16/f32, the f64 versions are tested against the rug bindings to MPFR, which is known to produce correct results). The functions are documented how accurate they are -- if they don't have documentation they should be 100% accurate (such as copy_sign), assuming I didn't miss anything.

mike-barber commented 3 years ago

On the accuracy front, do we have any plans to support different levels of accuracy? I'm aware that users may have different requirements depending on their domain -- e.g. scientific/engineering vs machine learning.

Having different implementations available could be quite useful, e.g.

accurate
fast
really fast and dirty

For context: I'm relatively inexperienced with floating point approximations, but I have previously done some hacking here and here on a fast exp implementation useful for imprecise things like calculating a softmax quickly. This example is probably somewhere in the fast or really fast and dirty region, and I wouldn't suggest re-using it as corners have been cut :)

I believe the approach, like Andy's one, is scalable in terms of time/accuracy by including more coefficients.

andy-thomason commented 3 years ago

We should support different accuracies and levels of pedantry if we want game Devs to use the functions.

My current libm is pedantry free and good to a fixed max error which is expected game dev behaviour.

We can add more cycles to make the library POSIX compliant, but this is not desirable behaviour for the expert user and so should be configurable.

An example is cos and sin which use a lookup table in POSIX impls to handle four quadrants. This carries a very high cost to Get one more bit of accuracy, possibly four times slower.

On Sun, 13 Jun 2021, 11:48 Michael Barber, @.***> wrote:

On the accuracy front, do we have any plans to support different levels of accuracy? I'm aware that users may have different requirements depending on their domain -- e.g. scientific/engineering vs machine learning.

Having different implementations available could be quite useful, e.g.

accurate

fast

really fast and dirty

For context: I'm relatively inexperienced with floating point approximations, but I have previously done some hacking here https://github.com/mike-barber/rust-fast-linear-estimator/blob/master/fast-linear-estimator/src/exp_approx_avx.rs and here https://github.com/mike-barber/rust-fast-linear-estimator/blob/master/fast-linear-estimator/src/exp_approx.rs on a fast exp implementation useful for imprecise things like calculating a softmax quickly. This example is probably somewhere in the fast or really fast and dirty region, and I wouldn't suggest re-using it as corners have been cut :)

I believe the approach, like Andy's one, is scalable in terms of time/accuracy by including more coefficients.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/rust-lang/stdsimd/issues/109#issuecomment-860190337, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAL36XFFZ6HLGF7YGK4PZRLTSSEH7ANCNFSM43RZE7PA .

programmerjake commented 3 years ago

An example is cos and sin which use a lookup table in POSIX impls to handle four quadrants. This carries a very high cost to Get one more bit of accuracy, possibly four times slower.

The sin_pi and related functions I have implemented so far use two polynomial evaluations, then use select to get the right sign and to switch between sin and cos. On CPUs with a pipelined FMA or pipelined mul and add units (basically everything modern other than microcontrollers), the polynomials can run in just 2-3 more cycles (not tested, but I am a cpu designer for Libre-SOC) than a single polynomial evaluation, since the FMAs run interleaved with each other.

andy-thomason commented 3 years ago

You are right, this is a good way to get an accurate cos and sin and indeed has lower latency. If you are calling a function occasionally, you want to optimise for latency over throughput.

If you are in a loop that has been unrolled, each of the operations will be done four or more times by the compiler to fill in the gaps between the instructions (latency/cpi) increasing throughput. For this case, having fewer instructions is best and we may be willing to sacrifice the last two bits.

I wouldn't want to call what would be preferred. Even the modulus and scaling steps would be considered optional if you stored Euler angles as small integers, for example.

In practice, making a super fast function will be rendered useless by the memory access speed which will always slow you down in the loop case and the I-cache performance in the "single" use case. Big game engines are nearly always I-cache bound, I discovered at Sony. We spent a lot of time making the instructions shorter, but LLVM does not do this because all their tests are small loops.

andy-thomason commented 3 years ago

I was inspired by @programmerjake to try the quadrant version of sin/cos

The conditional negation of the cos and sin results is a bit hacky and could be replaced by val ^ ((x & 1) << 31)

As a result of the shorter range, we can also reduce the number of terms and so it may be as fast as the "fast" version anyway!

The godbolt output looks not too bad, although LLVM is not scheduling the result optimally.

I also need to add a simdifying transform to the functions.

fn sin(x: f32) -> f32 {
    let x = x * (1.0 / (std::f32::consts::PI));
    let xh = x + 0.5;
    let xr = x.round();
    let xhr = xh.round();
    let s = x - xr;
    let c = xh - xhr;
    let sr = (-0.5878112252588481_f32)
        .mul_add(s * s, 2.5496484487855455_f32)
        .mul_add(s * s, -5.167705042516404_f32)
        .mul_add(s * s, 3.1415926342503133_f32)
        * s;
    let cr = (0.23132925050778008_f32)
        .mul_add(c * c, -1.335050718961723_f32)
        .mul_add(c * c, 4.0587078433143_f32)
        .mul_add(c * c, -4.934802176219238_f32)
        .mul_add(c * c, 0.9999999999999999_f32);
    let ss = if (xr as i32) & 1 == 0 { sr } else { -sr };
    let cs = if (xhr as i32 & 1) == 0 { -cr } else { cr };
    if s.abs() <= 0.25 {
        ss
    } else {
        cs
    }
}

rust-lang / portable-simd

Create a Vector Math library to allow `SimdF32::sin` and similar to work in `core` #109