p12tic / libsimdpp

Portable header-only C++ low level SIMD library
Boost Software License 1.0

Unroll loops over vector parts #171

Closed p12tic closed 5 months ago

p12tic commented 5 months ago

Even though SIMD algorithms use native vector width most of the time (e.g. float32<8> on AVX2), there are plenty of times when one needs to use a wider vector. This happens even on architectures with a single vector width for all types.

For example, consider an algorithm that operates on uint16<16> for most of its execution, but needs uint32<16> at the end. On AVX2 this uses a single native vector for the uint16<16> part, but two native vectors for the uint32<16> part. The developer should be able to simply use uint32<16> and get optimal performance even when the type maps to more than one native vector.
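The arithmetic behind this example can be sketched in a few lines. This is only an illustration: `native_parts` is a hypothetical helper, not a libsimdpp function, and the 32-byte register width is an assumption matching 256-bit integer registers.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of libsimdpp): number of native registers
// needed to hold N elements of type T when one register is NativeBytes wide.
template<class T, std::size_t N, std::size_t NativeBytes>
constexpr std::size_t native_parts()
{
    // Round up so that a partially filled register still counts as one.
    return (sizeof(T) * N + NativeBytes - 1) / NativeBytes;
}

// Assumption: 256-bit (32-byte) integer registers.
static_assert(native_parts<std::uint16_t, 16, 32>() == 1,
              "16 x 16-bit elements fit in one 256-bit register");
static_assert(native_parts<std::uint32_t, 16, 32>() == 2,
              "16 x 32-bit elements need two 256-bit registers");
```

With 128-bit registers the same helper yields 2 and 4 parts respectively, which is why code that loops over parts shows up on every architecture.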

The previous implementation, which accessed native vector sub-parts by index, is not properly optimized in many cases. Unrolling the loops over vector parts via templates fixes this completely.
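A minimal sketch of what unrolling via templates can look like is given below. It is not the code from this pull request: `NativeU32x8`, `WideU32`, `AddUnroll`, `add` and `example_usage` are hypothetical names, and a plain array stands in for a native register so that the example compiles without any SIMD headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for one native register holding 8 x uint32.
struct NativeU32x8 {
    std::uint32_t lane[8];
};

// Element-wise add of two native parts. A real library would use an
// intrinsic here; the scalar loop only keeps the sketch portable.
inline NativeU32x8 add_native(const NativeU32x8& a, const NativeU32x8& b)
{
    NativeU32x8 r;
    for (int i = 0; i < 8; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

// Hypothetical wide vector made of Parts native registers.
template<std::size_t Parts>
struct WideU32 {
    NativeU32x8 part[Parts];
};

// Recursive template: handles part I, then instantiates itself for I + 1.
// Every part index is a compile-time constant, so no runtime loop over
// parts remains in the generated code.
template<std::size_t I, std::size_t Parts>
struct AddUnroll {
    static void run(WideU32<Parts>& r, const WideU32<Parts>& a,
                    const WideU32<Parts>& b)
    {
        r.part[I] = add_native(a.part[I], b.part[I]);
        AddUnroll<I + 1, Parts>::run(r, a, b);
    }
};

// Terminating specialization: I == Parts, nothing left to process.
template<std::size_t Parts>
struct AddUnroll<Parts, Parts> {
    static void run(WideU32<Parts>&, const WideU32<Parts>&,
                    const WideU32<Parts>&) {}
};

// Public entry point: adds two wide vectors part by part.
template<std::size_t Parts>
WideU32<Parts> add(const WideU32<Parts>& a, const WideU32<Parts>& b)
{
    WideU32<Parts> r;
    AddUnroll<0, Parts>::run(r, a, b);
    return r;
}

// Usage: a 16-element vector built from two 8-element parts.
inline void example_usage()
{
    WideU32<2> a{}, b{};
    for (std::size_t p = 0; p < 2; ++p)
        for (std::size_t i = 0; i < 8; ++i) {
            a.part[p].lane[i] = static_cast<std::uint32_t>(p * 8 + i);
            b.part[p].lane[i] = 100;
        }
    WideU32<2> r = add(a, b);
    assert(r.part[0].lane[0] == 100);
    assert(r.part[1].lane[7] == 115);
}
```

The design point is that the part index never exists as a runtime value, so the result does not depend on the compiler deciding to unroll a loop and resolve indexed accesses on its own.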