This PR lowers the requirement for 256-bit wide vectors on x86/x86_64 platforms from AVX2 to AVX. #86 mistakenly assumed that none of the required operations were available before AVX2, when in reality the operations working on `__m256d` have been available since AVX. The main difference is the type itself, but the code never uses floating-point operations on it, so the two can be treated the same.
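As a rough illustration of why AVX alone is enough, the sketch below ANDs two 256-bit blocks using only AVX intrinsics on `__m256d`. The helper names (`and_avx`, `and_scalar`, `and_block`), the `[u64; 4]` layout, and the runtime feature detection are assumptions made for this example and are not taken from the PR's code.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m256d, _mm256_and_pd, _mm256_loadu_pd, _mm256_storeu_pd};

/// Bitwise AND of two 256-bit blocks using AVX-only intrinsics.
/// Hypothetical helper for illustration; not the PR's implementation.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn and_avx(a: &[u64; 4], b: &[u64; 4]) -> [u64; 4] {
    // Load the raw bits as packed doubles. Unaligned loads copy the bits
    // unchanged; no floating-point arithmetic is performed on them.
    let va: __m256d = _mm256_loadu_pd(a.as_ptr() as *const f64);
    let vb: __m256d = _mm256_loadu_pd(b.as_ptr() as *const f64);
    // `_mm256_and_pd` is a plain bitwise AND over the whole register
    // and requires only AVX, not AVX2.
    let vr = _mm256_and_pd(va, vb);
    let mut out = [0u64; 4];
    _mm256_storeu_pd(out.as_mut_ptr() as *mut f64, vr);
    out
}

/// Portable fallback, also used as a reference result.
fn and_scalar(a: &[u64; 4], b: &[u64; 4]) -> [u64; 4] {
    [a[0] & b[0], a[1] & b[1], a[2] & b[2], a[3] & b[3]]
}

/// Uses the AVX path when the CPU supports it, otherwise the scalar one.
pub fn and_block(a: &[u64; 4], b: &[u64; 4]) -> [u64; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // Safe to call: AVX support was just checked at runtime.
            return unsafe { and_avx(a, b) };
        }
    }
    and_scalar(a, b)
}
```

The same pattern applies to `_mm256_or_pd`, `_mm256_xor_pd`, and `_mm256_andnot_pd`, all of which are AVX intrinsics.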
Performance-wise, benchmarks showed no difference between AVX and AVX2, apart from some incidental speedups in non-set/batch operations.
I also used the opportunity to clean up the block directory, deduplicate repeated code, and tidy up some of the cfg attributes. The new compilation configurations have been added to CI.