Open peabody-korg opened 7 years ago
A proper solution for this issue would be to implement float32<2> support, which is quite a lot of changes. As a workaround, a function that just converts the lower lanes of float32<4> to float64<2> could be written, but it would make the public API inconsistent. If included in the library, the function should go into simdpp::unsupported or a similar namespace :-)
Opened issue #93 for small vector support.
Small vector support would be useful for all kinds of things! No rush though. I have a VectorUtil namespace full of my own extensions. Here's what I have for this particular case:
#include <simdpp/simd.h>

namespace VectorUtil {

    // return { x[0], x[1] } widened to double; the upper two lanes of x are ignored
    inline simdpp::float64<2> HalfVectorToFloat64(simdpp::float32<4> x)
    {
    #if SIMDPP_USE_SSE2 // also catches AVX
        return _mm_cvtps_pd(x.native());
    #elif SIMDPP_USE_NEON64
        return vcvt_f64_f32(vget_low_f32(x.native()));
    #else
        // generic path: convert all four lanes, then keep the lower half
        simdpp::float64<4> u = simdpp::to_float64(x);
        simdpp::float64<2> r, dummy;
        simdpp::split(u, r, dummy);
        return r;
    #endif
    }

    // return { x[0], x[1], -, - }
    // the upper two lanes are unspecified: zero on SSE2, a copy of the lower lanes otherwise
    inline simdpp::float32<4> HalfVectorToFloat32(simdpp::float64<2> x)
    {
    #if SIMDPP_USE_SSE2 // also catches AVX
        return _mm_cvtpd_ps(x.native());
    // #elif SIMDPP_USE_NEON64
    // unable to find an A64 intrinsic version that looks any better than the default
    #else
        return simdpp::to_float32(simdpp::combine(x, x));
    #endif
    }

}
It's sufficient for our project needs, and has been tested on clang (intel) and gcc (intel, a64). So I wouldn't actually need anything added to the library for this for now.
Regarding the NEON64 HalfVectorToFloat32() case, I just gave up after a while. I'm not that skilled at NEON intrinsics yet. Seems like there ought to be something equivalent to SSE2 _mm_cvtpd_ps(), but the conversions between float32x2_t and float32x4_t were thwarting any attempt at optimization. Although I managed to get gcc to emit a single 2-lane narrowing instruction, it was surrounded by what looked like a lot of unnecessary move instructions.
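For anyone picking up the NEON64 case: A64 does have a 2-lane narrowing intrinsic, vcvt_f32_f64(), and vcombine_f32() can put its result back into a 4-lane register. Below is a rough standalone sketch of that idea using raw intrinsics rather than libsimdpp types. The names narrow_low2 and narrow_low2_self_check are made up for illustration, the std::array load/store is only there to keep the snippet self-contained, and I have not checked whether compilers emit fewer moves for this than for the default path.

```cpp
#include <array>
#include <cassert>

#if defined(__aarch64__)
#include <arm_neon.h>
#elif defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper (not part of libsimdpp): narrow two doubles to floats
// and return them in lanes 0 and 1 of a 4-lane result. Lanes 2 and 3 carry
// no meaning and differ between the branches below.
inline std::array<float, 4> narrow_low2(const std::array<double, 2>& x)
{
    std::array<float, 4> out{};
#if defined(__aarch64__)
    float64x2_t d  = vld1q_f64(x.data());
    float32x2_t lo = vcvt_f32_f64(d);             // 2 x f64 -> 2 x f32
    vst1q_f32(out.data(), vcombine_f32(lo, lo));  // duplicate into the upper half
#elif defined(__SSE2__)
    __m128d d = _mm_loadu_pd(x.data());
    _mm_storeu_ps(out.data(), _mm_cvtpd_ps(d));   // upper two lanes are zeroed
#else
    // scalar fallback for targets without either instruction set
    out[0] = static_cast<float>(x[0]);
    out[1] = static_cast<float>(x[1]);
#endif
    return out;
}

// Small usage check: only the two lower lanes are compared.
inline void narrow_low2_self_check()
{
    const std::array<double, 2> in = {1.5, -2.25};
    const std::array<float, 4> r = narrow_low2(in);
    assert(r[0] == 1.5f);
    assert(r[1] == -2.25f);
}
```

Inside HalfVectorToFloat32() the NEON64 branch would presumably reduce to the vcvt_f32_f64() plus vcombine_f32() pair applied to x.native(), but that is an assumption and untested against libsimdpp.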
Support for effectively doing float32<2> <-> float64<2> conversions, but actually operating on the 2 lower lanes of a float32<4>. Currently the narrowest conversion operation is float32<4> <-> float64<4>, which results in extra instructions being emitted for SSE and NEON if all you want is the 2 lower lanes.
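To make the request concrete, here is a minimal standalone sketch of the widening direction done directly on the 2 lower lanes, with raw intrinsics instead of libsimdpp types. The names widen_low2 and widen_low2_self_check are hypothetical, and the std::array load/store is only there so the snippet compiles on its own. On SSE2 and NEON64 the conversion itself is a single intrinsic, with no 4-lane intermediate and no split.

```cpp
#include <array>
#include <cassert>

#if defined(__aarch64__)
#include <arm_neon.h>
#elif defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper (not part of libsimdpp): widen lanes 0 and 1 of a
// 4-lane float vector to two doubles. Lanes 2 and 3 of the input are ignored.
inline std::array<double, 2> widen_low2(const std::array<float, 4>& x)
{
    std::array<double, 2> out{};
#if defined(__aarch64__)
    float32x4_t v = vld1q_f32(x.data());
    vst1q_f64(out.data(), vcvt_f64_f32(vget_low_f32(v)));  // lower 2 x f32 -> 2 x f64
#elif defined(__SSE2__)
    __m128 v = _mm_loadu_ps(x.data());
    _mm_storeu_pd(out.data(), _mm_cvtps_pd(v));            // converts only the lower 2 lanes
#else
    // scalar fallback for targets without either instruction set
    out[0] = static_cast<double>(x[0]);
    out[1] = static_cast<double>(x[1]);
#endif
    return out;
}

// Small usage check: the upper two input lanes must not affect the result.
inline void widen_low2_self_check()
{
    const std::array<float, 4> in = {0.5f, 3.0f, 7.0f, 9.0f};
    const std::array<double, 2> r = widen_low2(in);
    assert(r[0] == 0.5);
    assert(r[1] == 3.0);
}
```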