Even though SIMD algorithms use the native vector width most of the time (e.g. float32<8> on AVX2), there are plenty of cases where one needs to use a wider vector. This happens even on architectures with a single vector width for all types.
For example, consider an algorithm that uses uint16<16> for most of the execution, but needs uint32<16> at the end. On AVX this uses a single native vector for uint16<16>, but two native vectors for the uint32<16> part. The developer should be able to just use uint32<16> and get optimal performance even when the type maps to more than one native vector.
The previous implementation of accessing native vector sub-parts by index was not properly optimized in many cases. Unrolling via templates fixes this completely.