Hello,
While working on performance for the llama example in candle, we noticed that the performance of the matmul bits based on this crate was a lot worse for `f16` than for `f32` (by a factor of ~4x). This seems to be due to no vectorization being done on the `f16` side, and `pack_lhs` was the worst affected part in the flamegraphs we generated.
This PR updates the `f16` version of `pack_generic_inner_loop` with optimizations similar to the version from `gemm_common`, using the vectorized slice conversions from the `half` crate (docs). This results in a ~3x speedup of token generation in our llama example (going from ~5.9s to ~2.0s per token on a Ryzen 5 2600X with 32G of memory, single thread, llama 7B), with the results unchanged.
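For context, here is a minimal sketch of the idea (not the actual patch; `pack_f16_run` and its signature are made up for illustration). Converting a whole slice at once lets the `half` crate dispatch to hardware conversion instructions rather than doing a scalar per-element conversion inside the packing loop:

```rust
// Cargo.toml: half = "2"
use half::f16;
use half::slice::HalfFloatSliceExt;

/// Hypothetical helper: pack one contiguous run of `f16` values into an
/// `f32` packing buffer in a single vectorized pass, instead of converting
/// element by element.
fn pack_f16_run(dst: &mut [f32], src: &[f16]) {
    // One call converts the whole slice; `half` uses SIMD conversion
    // instructions (e.g. F16C / NEON) where the CPU supports them.
    src.convert_to_f32_slice(dst);
}

fn main() {
    let src: Vec<f16> = (0..8).map(|i| f16::from_f32(i as f32)).collect();
    let mut dst = vec![0.0f32; src.len()];
    pack_f16_run(&mut dst, &src);
    assert_eq!(dst[3], 3.0);
    println!("{dst:?}");
}
```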
I'm not very familiar with the codebase of this crate, so there might be good reasons not to have this kind of optimization, but if it's easy to get it in, that would be great, as it would put our example close to llama.cpp in terms of performance.
Tagging @narsil.