This PR adds the automatic usage of Metal GGML quantized mat-mat kernels instead of always using the mat-vec kernels and upstreams a few related/necessary changes.
Before this change, Candle's Metal decoding performance was on par with MLX and llama.cpp, but its prompt-processing performance lagged behind. After this change, prompt processing (on the benchmark) is about 2.5x faster than MLX and within 10% of llama.cpp - an improvement of almost 6x over the previous Candle performance.
This PR switches to using the MV kernels only when the `D::Minus2` dimension of the `xs` input tensor equals 1. This mirrors the logic in GGML.
Besides utilizing the MM kernels, this PR also upstreams some required changes: