Have we considered optimizing the distance calculation with BLAS & OpenMP instead of SSE/AVX?
The BLAS (e.g., MKL) operations are deeply optimized, and they might outperform our current SSE/AVX code, given that it includes ad hoc dimension checks (divisible by 4/16).
Yes, this can be done, but I do not remember if I've tried it.
On the downside, this would add a big dependency.
If somebody could do it as an optional feature, that would be awesome!