sarah-quinones / gemm


F16 vectorize pack #8

Closed LaurentMazare closed 7 months ago

LaurentMazare commented 1 year ago

Hello, while working on performance for the llama example in candle, we noticed that the matmul bits based on this crate were a lot slower for f16 than for f32 (by a factor of ~4x). This seems to be due to no vectorization being done on the f16 side, and pack_lhs was the worst affected part in the flamegraphs we generated.

This PR updates the f16 version of pack_generic_inner_loop with optimizations similar to the version from gemm_common, using the vectorized slice conversions from the half crate (docs). This results in a ~3x speedup of token generation in our llama example (going from ~5.9s to ~2.0s per token on a Ryzen 5 2600X with 32G of memory, single thread, llama 7B), with the results unchanged.

I'm certainly not familiar with the codebase of this crate, so there might be good reasons not to have this kind of optimization, but if it's easy to get in, that would be great as it would put our example close to llama.cpp in terms of performance.
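
For context, here is a minimal sketch of the kind of conversion this relies on, using the half crate's `HalfFloatSliceExt::convert_to_f32_slice`; the actual change to pack_generic_inner_loop may differ in detail:

```rust
use half::f16;
use half::slice::HalfFloatSliceExt;

/// Widen a row of f16 values into an f32 packing buffer in one call.
/// `convert_to_f32_slice` uses hardware intrinsics (e.g. F16C) when the
/// target supports them, instead of converting element by element.
fn pack_row(src: &[f16], dst: &mut [f32]) {
    // Panics if the destination slice is shorter than the source.
    src.convert_to_f32_slice(&mut dst[..src.len()]);
}

fn main() {
    let row = vec![f16::from_f32(1.5); 8];
    let mut packed = vec![0.0f32; 8];
    pack_row(&row, &mut packed);
    assert_eq!(packed[0], 1.5);
}
```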

Tagging @narsil.

sarah-quinones commented 1 year ago

thanks for the PR! those speedups look impressive. i'm a bit busy with other things at the moment, but i can take a look at this in a couple weeks

LaurentMazare commented 7 months ago

This ended up being merged independently, so all good now.