It is interesting to observe that clang produces code that is ~6X faster than the GCC result on a simple benchmark that measures the speed of the best_index_iq4n function (which is the bottleneck during IQ4_K quantization). But when this is used in practice in quantize_row_iq4_k_impl_bs16, the clang executable is actually worse than the GCC executable. Either way, both compilers need a hand, so this PR gives it to them. This gives us a ~2X speedup in the IQ4_K quantization.
It is interesting to observe that
clang
produces code that is ~6X faster than theGCC
result on a simple benchmark that measures the speed of thebest_index_iq4n
function (which is the bottleneck duringIQ4_K
quantization). But when this is used in practice inquantize_row_iq4_k_impl_bs16
, theclang
executable is actually worse than theGCC
executable. Either way, both compilers need a hand, so this PR gives it to them. This gives us a ~2X speedup in theIQ4_K
quantization.