p12tic / libsimdpp

Portable header-only C++ low level SIMD library
Boost Software License 1.0

optimized widening/narrowing operations #92

Open peabody-korg opened 6 years ago

peabody-korg commented 6 years ago

Request: support for effectively doing float32<2> <-> float64<2> conversions, but actually operating on the 2 lower lanes of a float32<4>. Currently the narrowest conversion operation is float32<4> <-> float64<4>, which emits extra instructions on SSE and NEON when all you want is the 2 lower lanes.

p12tic commented 6 years ago

Proper solution for this issue would be to implement float32<2> support, which is quite a lot of changes. As a workaround a function that just converts lower lanes of float32<4> to float64<2> could be written, but it would make the public API inconsistent. If included into the library, the function should go into simdpp::unsupported or similar namespace :-)

p12tic commented 6 years ago

Opened issue #93 for small vector support.

peabody-korg commented 6 years ago

Small vector support would be useful for all kinds of things! No rush, though. I have a VectorUtil namespace full of my own extensions; here's what I have for this particular case:

namespace VectorUtil {
    // return { x[0], x[1] }
    inline simdpp::float64<2> HalfVectorToFloat64(simdpp::float32<4> x)
    {
    #if SIMDPP_USE_SSE2     // also catches AVX
        return _mm_cvtps_pd(x.native());
    #elif SIMDPP_USE_NEON64
        return vcvt_f64_f32(vget_low_f32(x.native()));
    #else
        simdpp::float64<4> u = simdpp::to_float64(x);
        simdpp::float64<2> r, dummy;
        simdpp::split(u, r, dummy);
        return r;
    #endif
    }

    // return { x[0], x[1], -, - }
    inline simdpp::float32<4> HalfVectorToFloat32(simdpp::float64<2> x)
    {
    #if SIMDPP_USE_SSE2     // also catches AVX
        return _mm_cvtpd_ps(x.native());
//  #elif SIMDPP_USE_NEON64
        // unable to find an A64 intrinsic version that looks any better than the default
    #else
        return simdpp::to_float32(simdpp::combine(x,x));
    #endif
    }
}
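For readers without the library at hand, here is a minimal, self-contained illustration of what the two SSE2 fast paths above do, using only the raw intrinsics and `<immintrin.h>` (x86 only; the function names `widen_lo` and `narrow_lo` are just illustrative, not part of the library):

```cpp
#include <immintrin.h>
#include <array>

// Widen the two lower float lanes to double, as _mm_cvtps_pd does.
std::array<double, 2> widen_lo(const std::array<float, 4>& in)
{
    __m128d d = _mm_cvtps_pd(_mm_loadu_ps(in.data()));
    std::array<double, 2> out;
    _mm_storeu_pd(out.data(), d);
    return out;
}

// Narrow two doubles to float; _mm_cvtpd_ps zeroes the two upper lanes.
std::array<float, 4> narrow_lo(const std::array<double, 2>& in)
{
    __m128 f = _mm_cvtpd_ps(_mm_loadu_pd(in.data()));
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(), f);
    return out;
}
```

Note that `_mm_cvtpd_ps` (CVTPD2PS) defines the two upper destination lanes as zero, which is why the `// return { x[0], x[1], -, - }` contract above is cheap to satisfy on SSE2.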

It's sufficient for our project's needs, and has been tested with clang (Intel) and gcc (Intel, A64). So I wouldn't actually need anything added to the library for this right now.

Regarding the NEON64 HalfVectorToFloat32() case, I just gave up after a while; I'm not that skilled with NEON intrinsics yet. There ought to be an equivalent of SSE2's _mm_cvtpd_ps(), but the conversions between float32x2_t and float32x4_t thwarted every attempt at optimization. I did manage to get gcc to emit a single 2-lane narrowing instruction, but it was surrounded by what looked like a lot of unnecessary move instructions.
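For reference, the direct A64 spelling one would expect for the missing `#elif` branch is the `vcvt_f32_f64` narrowing intrinsic (FCVTN) followed by a `vcombine_f32` to widen the register back to four lanes. All three intrinsics are standard ACLE, but this sketch is untested in this codebase and may well be the version that compiles with the extra moves mentioned above:

```cpp
#elif SIMDPP_USE_NEON64
    // Narrow both double lanes to a float32x2_t, then place them in the
    // low half of a float32x4_t; the upper two lanes are set to zero.
    return vcombine_f32(vcvt_f32_f64(x.native()), vdup_n_f32(0.0f));
```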