huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

Adding direct-F16 quantization #2136

Closed. EricLBuehler closed this issue 2 weeks ago

EricLBuehler commented 2 weeks ago

Hello all,

During our work on mistral.rs, we have noticed that Candle only dequantizes to F32, whereas llama.cpp can dequantize to F16. This affects performance because, on certain hardware, the Turing matmul kernels are used instead of the slower Volta ones when the inputs are F16. Are there any plans to add support for dequantizing to arbitrary floating point datatypes in the future?

For reference, here is our tracking issue: https://github.com/EricLBuehler/mistral.rs/issues/153

Thank you!

lucasavila00 commented 2 weeks ago

Some extra context (the numbers are from an RTX 2070):

A prompt of 512 tokens is processed at ~600 t/s using the MMQ kernels.

If I force it to dequantize first, convert the result to f16, do the matmuls in f16, and then convert the output back to f32, I can get candle to use the same kernels llama.cpp uses for prompt processing (I think? The names are almost the same). This runs at ~700 t/s.

With this latter approach, 25% of the GPU time is spent doing the f32 -> f16 conversion. Ideally we'd dequantize directly to f16 to reduce some of that workload.
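To make the difference between the two paths concrete, here is a minimal CPU-side sketch in plain Rust. Everything in it is an assumption made for illustration: the `Block` layout (one f32 scale plus 32 signed 8-bit quants), the `FromF32` trait, the `Half` newtype, and the hand-rolled `f32_to_f16_bits` conversion are hypothetical and are not Candle's or llama.cpp's actual types or kernels. A real project would more likely rely on an existing f16 implementation such as the `half` crate. The point is only that the direct path writes each value once in the target type, while the two-step path first materializes a full f32 buffer and then makes a second pass over it.

```rust
/// Hypothetical simplified quantized block: one f32 scale and 32 signed
/// 8-bit quants. This is NOT the real GGML/Candle block layout.
struct Block {
    scale: f32,
    quants: [i8; 32],
}

/// Hypothetical trait for "a float type we can dequantize into".
trait FromF32: Copy {
    fn from_f32(v: f32) -> Self;
}

impl FromF32 for f32 {
    fn from_f32(v: f32) -> Self {
        v
    }
}

/// IEEE 754 binary16 value stored as raw bits (illustrative stand-in for f16).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Half(u16);

impl FromF32 for Half {
    fn from_f32(v: f32) -> Self {
        Half(f32_to_f16_bits(v))
    }
}

/// Hand-rolled f32 -> f16 conversion with round-to-nearest-even, written out
/// only so that the example needs no external crate.
fn f32_to_f16_bits(value: f32) -> u16 {
    let x = value.to_bits();
    let sign = (x & 0x8000_0000) >> 16;
    let exp = x & 0x7F80_0000;
    let man = x & 0x007F_FFFF;
    if exp == 0x7F80_0000 {
        // Infinity or NaN: keep NaN-ness by forcing a mantissa bit.
        let nan_bit = if man == 0 { 0 } else { 0x0200 };
        return (sign | 0x7C00 | nan_bit | (man >> 13)) as u16;
    }
    let half_exp = ((exp >> 23) as i32) - 127 + 15;
    if half_exp >= 0x1F {
        return (sign | 0x7C00) as u16; // overflow to infinity
    }
    if half_exp <= 0 {
        if 14 - half_exp > 24 {
            return sign as u16; // too small: signed zero
        }
        // Subnormal result: shift in the implicit leading one.
        let man = man | 0x0080_0000;
        let mut half_man = man >> (14 - half_exp);
        let round_bit = 1u32 << (13 - half_exp);
        if (man & round_bit) != 0 && (man & (3 * round_bit - 1)) != 0 {
            half_man += 1;
        }
        return (sign | half_man) as u16;
    }
    let bits = sign | ((half_exp as u32) << 10) | (man >> 13);
    let round_bit = 0x0000_1000u32;
    if (man & round_bit) != 0 && (man & (3 * round_bit - 1)) != 0 {
        (bits + 1) as u16 // a carry into the exponent is the correct rounding
    } else {
        bits as u16
    }
}

/// Direct path: each value is written exactly once, already in type `T`.
fn dequantize<T: FromF32>(blocks: &[Block]) -> Vec<T> {
    let mut out = Vec::with_capacity(blocks.len() * 32);
    for b in blocks {
        for &q in b.quants.iter() {
            out.push(T::from_f32(b.scale * q as f32));
        }
    }
    out
}

/// Two-step path: build a full f32 buffer, then convert it in a second pass.
fn dequantize_then_convert(blocks: &[Block]) -> Vec<Half> {
    let tmp: Vec<f32> = dequantize::<f32>(blocks);
    tmp.into_iter().map(Half::from_f32).collect()
}

fn main() {
    let mut quants = [0i8; 32];
    quants[0] = 2;
    quants[1] = -4;
    let blocks = [Block { scale: 0.5, quants }];

    let direct: Vec<Half> = dequantize(&blocks);
    // Both paths produce the same values; only the memory traffic differs.
    assert_eq!(direct, dequantize_then_convert(&blocks));
    assert_eq!(direct[0], Half(0x3C00)); // 0.5 * 2 = 1.0
    assert_eq!(direct[1], Half(0xC000)); // 0.5 * -4 = -2.0
}
```

On a GPU the same idea would apply per kernel launch rather than per `Vec`: the two-step variant costs an extra read and write of the whole f32 tensor, which is the conversion overhead described above.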

This PR https://github.com/EricLBuehler/mistral.rs/pull/238 implements what I described above, and it contains comparisons between llama-bench and mistralrs-bench, and nvidia profiles of both applications.

These lines of llama.cpp do the same f32 -> f16 conversion and matmul: https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu#L1232-L1270. They are called from https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu#L1959

LaurentMazare commented 2 weeks ago

That sounds like some pretty neat speedup to get. Is it just useful for cuda or also for cpu/metal?

EricLBuehler commented 2 weeks ago

I think this would be an optimization for CUDA.

LaurentMazare commented 2 weeks ago

Ok thanks, let me have a quick look. I don't think that the kernels do any float-specific magic, so the conversion shouldn't be tricky.

LaurentMazare commented 2 weeks ago

See #2137. I'm just going to add a bit of testing, but this should hopefully be all fine.

LaurentMazare commented 2 weeks ago

#2137 has been merged. I'll also put some small changes in #2138 so that it's easier to control which version gets used.

lucasavila00 commented 2 weeks ago

After direct f16 dequantization we're at 1000 t/s: https://github.com/EricLBuehler/mistral.rs/pull/238#issuecomment-2081624082

Thank you!
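As a rough sanity check on the figures quoted in this thread (600, 700 and 1000 t/s on an RTX 2070, with 25% of GPU time spent in the f32 -> f16 conversion), the following sketch estimates what removing the conversion alone would give. It assumes the run is entirely GPU-bound and that the conversion time disappears completely, which is a simplification; the function name `throughput_without` is made up for this example.

```rust
/// Throughput expected if `removed_fraction` of the run time disappears,
/// assuming the remaining work is unchanged.
fn throughput_without(tokens_per_s: f64, removed_fraction: f64) -> f64 {
    tokens_per_s / (1.0 - removed_fraction)
}

/// Relative gain of `new` over `old`, in percent.
fn gain_percent(old: f64, new: f64) -> f64 {
    (new / old - 1.0) * 100.0
}

fn main() {
    // Figures reported in the thread, in tokens per second.
    let mmq = 600.0;
    let f32_then_f16 = 700.0;
    let direct_f16 = 1000.0;

    // Dropping a step that takes 25% of the time at 700 t/s.
    let estimate = throughput_without(f32_then_f16, 0.25);
    println!("estimate without conversion: {:.0} t/s", estimate); // 933
    println!("gain over MMQ: {:.0}%", gain_percent(mmq, direct_f16)); // 67
}
```

The estimate of roughly 933 t/s is in the same range as the observed 1000 t/s. This simple model does not account for the remaining difference.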

EricLBuehler commented 2 weeks ago

@LaurentMazare, thank you for adding this! We observe about a 60% performance increase for prompt processing.

It seems, though, that the Candle matmul kernels here are about 60% slower than the llama.cpp ones overall, which correlates with our prompt processing deficit relative to llama.cpp, also about 60%.

lucasavila00 commented 2 weeks ago

I created a new issue about the different kernels: https://github.com/huggingface/candle/issues/2139