raphlinus / font-rs

Apache License 2.0
753 stars 49 forks source link

Rewrite SIMD intrinsic with Rust? #9

Closed louy2 closed 5 years ago

louy2 commented 7 years ago

Ref: https://github.com/rust-lang/rfcs/issues/1639

kpp commented 6 years ago

@AdamNiederer that's a good challenge for your faster library!

6D65 commented 5 years ago

This is a pretty much mot-a-mot translation into Rust SS3 intrinsics. It compiles in stable and passes all the accumulate tests.

Can make a pull request, or @rtsuk || @raphlinus can add this to make it quicker. Want to make an AVX2 version of this, to accumulate 8 values at once, don't see any reason it won't work.


    use std::mem;

    #[cfg(target_arch = "x86_64")]
     use std::arch::x86_64::*;

    #[cfg(target_arch = "x86")]
    use std::arch::x86::*;

    macro_rules! _mm_shuffle {
        ($z:expr, $y:expr, $x:expr, $w:expr) => {
            ($z << 6) | ($y << 4) | ($x << 2) | $w
        };
    }

    #[inline]
    #[target_feature(enable = "sse3")]
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    #[allow(unsafe_code)]
    pub unsafe fn accumulate_sse(input: &[f32], out: &mut Vec<u8>, n: usize) {
        let mut offset = _mm_setzero_ps();
        let sign_mask = _mm_set1_ps(-0.);
        let mask = _mm_set1_epi32(0x0c080400);

        for i in (0..n).step_by(4) {
            let mut x = _mm_loadu_ps(&input[i]);
            x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
            x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
            x = _mm_add_ps(x, offset);

            let mut y = _mm_andnot_ps(sign_mask, x); // fabs(x)
            y = _mm_min_ps(y, _mm_set1_ps(1.0));
            y = _mm_mul_ps(y, _mm_set1_ps(255.0));

            let mut z = _mm_cvttps_epi32(y);
            z = _mm_shuffle_epi8(z, mask);

            _mm_store_ss(mem::transmute(&out[i]), _mm_castsi128_ps(z));
            offset = _mm_shuffle_ps(x, x, _mm_shuffle!(3, 3, 3, 3));
        }
    }

    fn accumulate(src: &[f32]) -> Vec<u8> {
        let len = src.len();
        let n = (len + 3) & !3; // align data
        let mut dst: Vec<u8> = vec![0; n]; // Vec::with_capacity(n) won't work here
        unsafe {
            accumulate_sse(src, &mut dst, n);
            dst.set_len(len); // we must return vec of the same length as src.len()
        }
        dst
    }

or merge the simd function with the top level one

use std::mem;

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[cfg(target_arch = "x86")]
use std::arch::x86::*;

macro_rules! _mm_shuffle {
    ($z:expr, $y:expr, $x:expr, $w:expr) => {
        ($z << 6) | ($y << 4) | ($x << 2) | $w
    };
}

#[inline]
#[cfg(feature = "sse")]
#[allow(unsafe_code)]
pub unsafe fn accumulate(src: &[f32]) -> Vec<u8> {
    // SIMD instructions force us to align data since we iterate each 4 elements
    // So:
    // n (0) => 0
    // n (1 or 2 or 3 or 4) => 4,
    // n (5) => 8
    // and so on
    let len = src.len();
    let n = (len + 3) & !3; // align data
    let mut dst: Vec<u8> = vec![0; n];
    let mut offset = _mm_setzero_ps();
    let sign_mask = _mm_set1_ps(-0.);
    let mask = _mm_set1_epi32(0x0c080400);

    for i in (0..n).step_by(4) {
        let mut x = _mm_loadu_ps(&src[i]);
        x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
        x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
        x = _mm_add_ps(x, offset);

        let mut y = _mm_andnot_ps(sign_mask, x); // fabs(x)
        y = _mm_min_ps(y, _mm_set1_ps(1.0));
        y = _mm_mul_ps(y, _mm_set1_ps(255.0));

        let mut z = _mm_cvttps_epi32(y);
        z = _mm_shuffle_epi8(z, mask);

        _mm_store_ss(mem::transmute(&dst[i]), _mm_castsi128_ps(z));
        offset = _mm_shuffle_ps(x, x, _mm_shuffle!(3, 3, 3, 3));
    }

    dst.set_len(len); // we must return vec of the same length as src.len()

    dst
}
rtsuk commented 5 years ago

I defer to @raphlinus on this one.

raphlinus commented 5 years ago

@6D65 On a quick skim, that looks good. I would definitely prefer a PR rather than trying to adapt it from this issue. Maybe the 128-bit one first, with a followup for the AVX (including benchmarks); the latter will certainly require more sophisticated run-time capability testing.

6D65 commented 5 years ago

@raphlinus submitted a pull request from my other account. I have left the feature sse in place, I assume it's handy for benchmarking simd vs non-simd, but other than that, can't think of a reason to keep it.

Also, I modified the render example to dump bmp files, as pgm is not quite supported on windows. Can make a pull request for that as well.

johannesvollmer commented 5 years ago

May I ask why this Issue is still open? Are we waiting for avx support before closing? Changing the render example from pgm to bmp should probably be a separate issue.

louy2 commented 5 years ago

Sorry, didn't realize this is fixed. Closing.