Closed: louy2 closed this issue 5 years ago.
This is pretty much a word-for-word translation into Rust SSE3/SSSE3 intrinsics. It compiles on stable and passes all the accumulate tests.
I can open a pull request, or @rtsuk or @raphlinus can add it directly if that's quicker. I also want to make an AVX2 version of this, to accumulate 8 values at once; I don't see any reason it wouldn't work.
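For context, the scalar version this mirrors is essentially the following (paraphrased from the non-SIMD path; the exact upstream code may differ slightly, and the name accumulate_scalar is just for this write-up):

fn accumulate_scalar(src: &[f32]) -> Vec<u8> {
    // running prefix sum of coverage deltas; |sum| is clamped to [0, 1]
    // and scaled to a byte
    let mut acc = 0.0f32;
    src.iter()
        .map(|&c| {
            acc += c;
            (255.0 * acc.abs().min(1.0)) as u8
        })
        .collect()
}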
use std::mem;

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;
#[cfg(target_arch = "x86")]
use std::arch::x86::*;

macro_rules! _mm_shuffle {
    ($z:expr, $y:expr, $x:expr, $w:expr) => {
        ($z << 6) | ($y << 4) | ($x << 2) | $w
    };
}

#[inline]
#[target_feature(enable = "ssse3")] // _mm_shuffle_epi8 is SSSE3, not SSE3
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[allow(unsafe_code)]
pub unsafe fn accumulate_sse(input: &[f32], out: &mut Vec<u8>, n: usize) {
    let mut offset = _mm_setzero_ps(); // running total carried across iterations
    let sign_mask = _mm_set1_ps(-0.);
    let mask = _mm_set1_epi32(0x0c080400); // picks byte 0 of each 32-bit lane
    for i in (0..n).step_by(4) {
        // caveat: when input.len() is not a multiple of 4, the last load
        // reads up to 3 floats past the end, so the input must be padded
        let mut x = _mm_loadu_ps(&input[i]);
        // two shift-and-add steps compute an inclusive prefix sum:
        // [a, b, c, d] -> [a, a+b, a+b+c, a+b+c+d]
        // (see the scalar model after this listing)
        x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
        x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
        x = _mm_add_ps(x, offset);
        // clamp |x| to [0, 1] and scale to a byte value
        let mut y = _mm_andnot_ps(sign_mask, x); // fabs(x)
        y = _mm_min_ps(y, _mm_set1_ps(1.0));
        y = _mm_mul_ps(y, _mm_set1_ps(255.0));
        let mut z = _mm_cvttps_epi32(y);
        z = _mm_shuffle_epi8(z, mask); // pack the four low bytes together
        _mm_store_ss(mem::transmute(&mut out[i]), _mm_castsi128_ps(z)); // write 4 bytes
        offset = _mm_shuffle_ps(x, x, _mm_shuffle!(3, 3, 3, 3)); // broadcast lane 3
    }
}
fn accumulate(src: &[f32]) -> Vec<u8> {
    let len = src.len();
    let n = (len + 3) & !3; // round up to a multiple of 4
    // Vec::with_capacity(n) won't work here: accumulate_sse indexes the
    // buffer, so it must be initialized, not merely allocated
    let mut dst: Vec<u8> = vec![0; n];
    unsafe {
        accumulate_sse(src, &mut dst, n);
        dst.set_len(len); // we must return a vec of the same length as src
    }
    dst
}
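The two shift-and-add steps at the top of the loop are a log-step inclusive scan. Modeled in plain Rust (illustrative only, not part of the patch):

// scalar model of the in-register prefix sum over one group of 4
fn prefix_sum4(v: [f32; 4]) -> [f32; 4] {
    // step 1: add the vector shifted left by one lane (zeros shifted in):
    // [a, b, c, d] -> [a, a+b, b+c, c+d]
    let s = [v[0], v[0] + v[1], v[1] + v[2], v[2] + v[3]];
    // step 2: add the vector shifted left by two lanes:
    // -> [a, a+b, a+b+c, a+b+c+d]
    [s[0], s[1], s[0] + s[2], s[1] + s[3]]
}

The offset register then carries lane 3 (the group total) into the next iteration, so the scan continues across the whole slice.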
Alternatively, merge the SIMD function with the top-level one:
use std::mem;

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;
#[cfg(target_arch = "x86")]
use std::arch::x86::*;

macro_rules! _mm_shuffle {
    ($z:expr, $y:expr, $x:expr, $w:expr) => {
        ($z << 6) | ($y << 4) | ($x << 2) | $w
    };
}

#[inline]
#[cfg(feature = "sse")]
#[allow(unsafe_code)]
pub unsafe fn accumulate(src: &[f32]) -> Vec<u8> {
    // SIMD instructions force us to align the length, since we process
    // 4 elements per iteration. So:
    //   len 0          => n = 0
    //   len 1, 2, 3, 4 => n = 4
    //   len 5          => n = 8
    // and so on
    let len = src.len();
    let n = (len + 3) & !3; // round up to a multiple of 4
    let mut dst: Vec<u8> = vec![0; n];
    let mut offset = _mm_setzero_ps(); // running total carried across iterations
    let sign_mask = _mm_set1_ps(-0.);
    let mask = _mm_set1_epi32(0x0c080400); // picks byte 0 of each 32-bit lane
    for i in (0..n).step_by(4) {
        // caveat: when len is not a multiple of 4, the last load reads up
        // to 3 floats past src.len(), so the caller must pad the input
        let mut x = _mm_loadu_ps(&src[i]);
        // two shift-and-add steps compute an inclusive prefix sum:
        // [a, b, c, d] -> [a, a+b, a+b+c, a+b+c+d]
        x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
        x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
        x = _mm_add_ps(x, offset);
        // clamp |x| to [0, 1] and scale to a byte value
        let mut y = _mm_andnot_ps(sign_mask, x); // fabs(x)
        y = _mm_min_ps(y, _mm_set1_ps(1.0));
        y = _mm_mul_ps(y, _mm_set1_ps(255.0));
        let mut z = _mm_cvttps_epi32(y);
        z = _mm_shuffle_epi8(z, mask); // pack the four low bytes together
        _mm_store_ss(mem::transmute(&mut dst[i]), _mm_castsi128_ps(z)); // write 4 bytes
        offset = _mm_shuffle_ps(x, x, _mm_shuffle!(3, 3, 3, 3)); // broadcast lane 3
    }
    dst.set_len(len); // we must return a vec of the same length as src
    dst
}
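A quick sanity check against the scalar version might look like this (a sketch using the accumulate_scalar function above; the tolerance allows for the different float summation order):

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn sse_matches_scalar() {
        // a multiple of 4, so the last unaligned load stays in bounds
        let src: Vec<f32> = (0..64).map(|i| (i as f32 * 0.37).sin() * 0.25).collect();
        let got = unsafe { accumulate(&src) };
        let want = accumulate_scalar(&src);
        assert_eq!(got.len(), want.len());
        for (g, w) in got.iter().zip(want.iter()) {
            // SIMD adds in a different association than the scalar loop,
            // so allow the result to differ by one gray level
            assert!((*g as i16 - *w as i16).abs() <= 1, "{} vs {}", g, w);
        }
    }
}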
I defer to @raphlinus on this one.
@6D65 On a quick skim, that looks good. I would definitely prefer a PR rather than trying to adapt it from this issue. Maybe the 128-bit one first, with a followup for the AVX (including benchmarks); the latter will certainly require more sophisticated run-time capability testing.
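For reference, the core of an 8-wide AVX2 version could look roughly like this; an untested sketch following the same call style as the SSE code above, with the same padding caveats (n must be a multiple of 8, and the tail load can read past input.len()):

#[target_feature(enable = "avx2")]
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[allow(unsafe_code)]
pub unsafe fn accumulate_avx2(input: &[f32], out: &mut [u8], n: usize) {
    let mut offset = _mm256_setzero_ps();
    let sign_mask = _mm256_set1_ps(-0.);
    let mask = _mm256_set1_epi32(0x0c080400);
    for i in (0..n).step_by(8) {
        let mut x = _mm256_loadu_ps(&input[i]);
        // prefix sum within each 128-bit lane; _mm256_slli_si256 shifts
        // the two lanes independently, so this is the SSE trick twice
        x = _mm256_add_ps(x, _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 4)));
        x = _mm256_add_ps(x, _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 8)));
        // carry the low lane's total into the high lane
        let totals = _mm256_shuffle_ps(x, x, 0xff); // element 3 of each lane
        let carry = _mm256_permute2f128_ps(totals, totals, 0x08); // [zeros | low total]
        x = _mm256_add_ps(x, carry);
        x = _mm256_add_ps(x, offset);
        // clamp |x| to [0, 1] and scale to bytes
        let mut y = _mm256_andnot_ps(sign_mask, x);
        y = _mm256_min_ps(y, _mm256_set1_ps(1.0));
        y = _mm256_mul_ps(y, _mm256_set1_ps(255.0));
        let mut z = _mm256_cvttps_epi32(y);
        z = _mm256_shuffle_epi8(z, mask); // 4 packed bytes per 128-bit lane
        // move the two packed dwords next to each other and store 8 bytes
        let packed = _mm256_permutevar8x32_epi32(z, _mm256_setr_epi32(0, 4, 0, 0, 0, 0, 0, 0));
        _mm_storel_epi64(out.as_mut_ptr().add(i) as *mut __m128i, _mm256_castsi256_si128(packed));
        // broadcast element 7 (the running total) as the next offset
        offset = _mm256_permutevar8x32_ps(x, _mm256_set1_epi32(7));
    }
}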
@raphlinus Submitted a pull request from my other account. I have left the sse feature in place; I assume it's handy for benchmarking SIMD vs. non-SIMD, but other than that I can't think of a reason to keep it.
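If the feature flag does go away, runtime dispatch on stable could look roughly like this, reusing the accumulate_sse and accumulate_scalar sketches above in place of the feature-gated function (the input-padding caveat still applies):

pub fn accumulate(src: &[f32]) -> Vec<u8> {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // _mm_shuffle_epi8 needs SSSE3, so that is what we detect
        if is_x86_feature_detected!("ssse3") {
            let len = src.len();
            let n = (len + 3) & !3;
            let mut dst = vec![0u8; n];
            unsafe {
                accumulate_sse(src, &mut dst, n);
                dst.set_len(len);
            }
            return dst;
        }
    }
    accumulate_scalar(src) // fall back to the scalar version
}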
Also, I modified the render example to dump BMP files, since PGM is not well supported on Windows. I can make a pull request for that as well.
May I ask why this issue is still open? Are we waiting for AVX support before closing? Changing the render example from PGM to BMP should probably be a separate issue.
Sorry, didn't realize this was fixed. Closing.
Ref: https://github.com/rust-lang/rfcs/issues/1639