huggingface / candle

Minimalist ML framework for Rust

*Major T/s improvement* Use the Metal qmatmul MM kernels #2615

Open EricLBuehler opened 1 week ago

EricLBuehler commented 1 week ago

This PR adds automatic usage of the Metal GGML quantized mat-mat (MM) kernels instead of always using the mat-vec (MV) kernels, and upstreams a few related/necessary changes.

Before this change, Candle's Metal decoding performance was on par with MLX and llama.cpp, but its prompt-processing performance was insufficient. After this change, prompt processing (on the benchmark) is about 2.5x faster than MLX and within 10% of llama.cpp, a speedup of almost 6x over Candle's previous prompt performance.

This PR switches to using the MV kernels only when the D::Minus2 dimension of the xs input tensor is equal to 1 (a single row, as in token-by-token decoding); otherwise the MM kernels are dispatched. This mirrors the logic in GGML; a sketch of the idea follows below.
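
For illustration, here is a minimal Rust sketch of that dispatch decision. The helper name `use_mat_vec_kernel` and the plain shape-slice argument are hypothetical and not Candle's actual API; they only demonstrate the shape check described above.

```rust
/// Hypothetical helper (not Candle's actual API): pick between the quantized
/// Metal mat-vec (MV) and mat-mat (MM) kernels from the shape of `xs`,
/// mirroring GGML's dispatch logic.
fn use_mat_vec_kernel(xs_dims: &[usize]) -> bool {
    // D::Minus2 refers to the second-to-last dimension. When it is 1, the
    // multiplication has a single row (e.g. decoding one token), so the MV
    // kernel is the better fit; otherwise the MM kernel wins.
    let rank = xs_dims.len();
    rank >= 2 && xs_dims[rank - 2] == 1
}

fn main() {
    // Decoding a single token: (batch, 1, hidden) -> MV kernel.
    assert!(use_mat_vec_kernel(&[1, 1, 4096]));
    // Prompt processing: (batch, seq_len, hidden) with seq_len > 1 -> MM kernel.
    assert!(!use_mat_vec_kernel(&[1, 128, 4096]));
    println!("dispatch check behaves as expected");
}
```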

Besides utilizing the MM kernels, this PR also upstreams some required changes.

EricLBuehler commented 1 week ago

@LaurentMazare if you could review, that would be great!

More benchmarks with some smaller models can be found here: https://github.com/EricLBuehler/mistral.rs/issues/903#issuecomment-2477442513