Even though SIMD algorithms use the native vector width most of the time (e.g. float32<8> on AVX2), there are plenty of cases where one needs to use a wider vector. This happens even on architectures with a single vector width for all types.
For example, consider an algorithm that uses uint16<16> for most of the execution, but needs uint32<16> at the end. On AVX this uses a single native vector for uint16<16>, but two native vectors for the uint32<16> part. The developer should be able to just use uint32<16> and get optimal performance even when the type maps to more than one native vector.
The previous implementation of accessing native vector sub-parts by index was not properly optimized in many cases. Unrolling via templates fixes this completely.