This PR adds the automatic usage of Metal GGML quantized mat-mat kernels instead of always using the mat-vec kernels and upstreams a few related/necessary changes.
Before this change, Candle's Metal decoding performance was on par with MLX and llama.cpp, but its prompt-processing performance lagged behind. After this change, prompt processing (on the benchmark) is about 2.5x faster than MLX and within 10% of llama.cpp - an improvement of almost 6x over the previous Candle performance.
This PR switches to using the MV kernels only when the `D::Minus2` dimension of the `xs` input tensor equals 1. This mirrors the logic in GGML.
Besides utilizing the MM kernels, this PR also upstreams some required changes: