huggingface / candle

Minimalist ML framework for Rust

Quantized models on Cuda #1250

Open · EmilLindfors opened this issue 1 year ago

EmilLindfors commented 1 year ago

Hello! Are there any plans to implement quantized models on CUDA devices? It would be great to be able to run the forthcoming 14B Mistral on a 3090 with e.g. q_8.

LaurentMazare commented 1 year ago

Hello, yes, there is a plan to support this, though it's certainly at least a couple of weeks away, and it's good to know that there is some demand for it. If other people also think it would be useful, please comment below so that we can bump the priority (though it will have to wait at least until I get my desktop computer back in ~10 days).

trigger-happy commented 1 year ago

Commenting here, I'd love to have CUDA + quantization support as well.

LLukas22 commented 1 year ago

I already created another issue for this some time ago: https://github.com/huggingface/candle/issues/655

I'm also very interested in getting CUDA acceleration working for quantized tensors, but I think it would be wise to wait for https://github.com/huggingface/candle/pull/1230 to mature a bit, as it already adds the Device scaffolding to the quantized implementation, which we will also need to support CUDA acceleration.

Other than that, this should theoretically be relatively simple, as the quantized CUDA kernels already exist in the ggml / llama.cpp projects. They even have some matmul kernels now, in addition to the older vecdot kernels.
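
For anyone unfamiliar with what those vecdot kernels compute, here is a minimal, self-contained sketch of the idea, assuming the ggml Q8_0 format (blocks of 32 weights sharing one scale). All names are illustrative rather than candle's or ggml's actual code, and the scale is kept as f32 here for simplicity (ggml stores it as f16):

```rust
// Sketch of a ggml-style Q8_0 block: 32 int8 weights share one scale,
// and the dequantized value is simply scale * q.
const QK8_0: usize = 32;

struct BlockQ8_0 {
    d: f32,          // per-block scale (f16 in ggml; f32 here for simplicity)
    qs: [i8; QK8_0], // 8-bit quantized weights
}

/// Dot product of one quantized block against 32 f32 activations:
/// the core operation behind the vecdot kernels mentioned above.
fn vec_dot_q8_0(block: &BlockQ8_0, x: &[f32; QK8_0]) -> f32 {
    let mut sum = 0.0f32;
    for i in 0..QK8_0 {
        sum += block.d * block.qs[i] as f32 * x[i];
    }
    sum
}

fn main() {
    let block = BlockQ8_0 { d: 0.5, qs: [1i8; QK8_0] };
    let x = [2.0f32; QK8_0];
    // 32 elements, each contributing 0.5 * 1 * 2.0 = 1.0
    assert_eq!(vec_dot_q8_0(&block, &x), 32.0);
}
```

The matmul kernels generalize this by computing many such block dot products per thread block instead of one row at a time.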

danielclough commented 12 months ago

Commenting to bump priority, as requested.

> Hello, yes, there is a plan to support this, though it's certainly at least a couple of weeks away, and it's good to know that there is some demand for it. If other people also think it would be useful, please comment below so that we can bump the priority (though it will have to wait at least until I get my desktop computer back in ~10 days).

Related open issues

- Support for quantisation: https://github.com/huggingface/candle/issues/359
- CUDA support for QMatMul: https://github.com/huggingface/candle/issues/655
- Error: no cuda implementation for qmatmul: https://github.com/huggingface/candle/issues/696
- You are here: https://github.com/huggingface/candle/issues/1250

miketang84 commented 10 months ago

Come on, we need it, thanks.

np33kf commented 10 months ago

This will be a game changer for running bigger LLMs on consumer-grade GPUs, since memory is the main constraint. Big thanks for everyone's efforts. This is an awesome framework! Love Rust and can't stand coding in Python...

EricLBuehler commented 10 months ago

Looks like an exciting development.

miketang84 commented 9 months ago

Hi guys, may I politely ask about the progress on supporting quantization on CUDA? Is there any news? In the coming days I have some time to help, if needed.

LLukas22 commented 9 months ago

@miketang84

Due to time constraints, I haven't been able to dive deeper into this. However, enabling gguf quantizations with CUDA essentially requires three steps (see the sketch after this list):

  1. Implement QCudaStorage, similar to how QMetalStorage was implemented for Metal (see QMetalStorage).
  2. Port the CUDA kernels from ggml-cuda.cu to candle-kernels and integrate them properly into the build process.
  3. Implement cuda_fwd for QTensor, akin to the existing metal_fwd implementation (see metal_fwd).
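
To make the shape of those three steps concrete, here is a hypothetical, self-contained sketch; none of these names (QStorage, QCudaStorage, GgmlDType, qmatmul_fwd) are candle's real API, they only illustrate how a Cuda variant slots in next to the existing backends:

```rust
#[derive(Debug, Clone, Copy)]
enum GgmlDType {
    Q4_0,
    Q8_0,
}

// Step 1: a CUDA-backed quantized storage, analogous to QMetalStorage.
struct QCudaStorage {
    dtype: GgmlDType,
    raw: Vec<u8>, // stand-in for a CUDA device buffer of quantized blocks
}

enum QStorage {
    Cpu(Vec<u8>),
    Cuda(QCudaStorage),
}

impl QStorage {
    // Step 3: the quantized matmul forward dispatches on the backend;
    // the Cuda arm would launch the kernels ported from ggml-cuda.cu
    // in step 2.
    fn qmatmul_fwd(&self) {
        match self {
            QStorage::Cpu(_) => println!("existing CPU vecdot path"),
            QStorage::Cuda(s) => println!("launch CUDA kernel for {:?}", s.dtype),
        }
    }
}

fn main() {
    let q = QStorage::Cuda(QCudaStorage { dtype: GgmlDType::Q8_0, raw: Vec::new() });
    q.qmatmul_fwd();
}
```
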
akhildevelops commented 9 months ago

Please link any ongoing PR / branch for this feature, if work on it has started.

LaurentMazare commented 8 months ago

You can check out #1754, which contains a first implementation of CUDA support for quantized models. It's certainly not optimal in terms of performance and there are a bunch of optimizations/kernels to be added, but I hope to merge a first cut later today.
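
For readers arriving later, a hedged sketch of what loading a quantized GGUF tensor directly onto a CUDA device might look like once that PR lands. The `cuda` feature flag, the file path `model-q8_0.gguf`, and the tensor name `token_embd.weight` are all assumptions here; verify the exact API against the code merged in #1754:

```rust
// Hedged sketch: loading a GGUF tensor onto a CUDA device.
// Assumes candle is built with the `cuda` feature.
use candle_core::quantized::gguf_file;
use candle_core::{Device, Result};

fn main() -> Result<()> {
    let device = Device::new_cuda(0)?; // first CUDA GPU
    let mut file = std::fs::File::open("model-q8_0.gguf")?; // placeholder path
    let content = gguf_file::Content::read(&mut file)?;
    // With CUDA support, the quantized blocks land in device memory
    // instead of host memory.
    let qtensor = content.tensor(&mut file, "token_embd.weight", &device)?;
    println!("loaded {:?} on the GPU", qtensor.shape());
    Ok(())
}
```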