Open EmilLindfors opened 1 year ago
Hello! Are there any plans to implement quantized models on CUDA devices? It would be great to be able to run the forthcoming 14b Mistral on a 3090 with e.g. q_8.
Hello, yes, there is a plan to have this supported, though it's certainly a couple of weeks away at least. Good to know that there is some demand for it, though. If other people also think that it would be useful, please comment below so that we can bump the priority for this (though it will have to wait at least until I get my desktop computer back in ~10 days).
Commenting here, I'd love to have CUDA + quantization support as well.
I already created another issue for this some time ago: https://github.com/huggingface/candle/issues/655
I'm also very interested in getting CUDA acceleration working for quantized tensors, but I think it would be wise to wait for https://github.com/huggingface/candle/pull/1230 to mature a bit, as it already adds the whole `Device` scaffolding to the quantized implementation, which we will also need to support CUDA acceleration.
Other than that, this should theoretically be relatively simple, as the quantized CUDA kernels already exist in the ggml / llama.cpp projects. They even have some `matmul` kernels now, in addition to the older `vecdot` kernels.
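To make that concrete, here is a minimal, hypothetical sketch of what the `Device` scaffolding buys us: quantized storage becomes an enum over backends, so CUDA support amounts to a new variant plus the kernels behind it. Every type and method name below is an illustrative placeholder, not candle's actual API.

```rust
#![allow(dead_code)]
// Illustrative only: all names here are placeholders, not candle's real types.

struct MetalBuf; // stand-in for a Metal buffer
struct CudaBuf;  // stand-in for a cudarc device allocation

/// Quantized storage as an enum over backends, in the spirit of the
/// Device scaffolding from PR #1230.
enum QStorage {
    Cpu(Vec<u8>),
    Metal(MetalBuf),
    Cuda(CudaBuf),
}

impl QStorage {
    /// Which kernel family a quantized matmul would dispatch to.
    fn matmul_kernel(&self) -> &'static str {
        match self {
            QStorage::Cpu(_) => "cpu simd vecdot",
            QStorage::Metal(_) => "existing metal quantized kernels",
            QStorage::Cuda(_) => "ggml/llama.cpp matmul or vecdot kernels",
        }
    }
}

fn main() {
    let storage = QStorage::Cuda(CudaBuf);
    println!("would dispatch to: {}", storage.matmul_kernel());
}
```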
Commenting to bump priority, as requested.
Support for quantisation: https://github.com/huggingface/candle/issues/359
CUDA support for QMatMul: https://github.com/huggingface/candle/issues/655
Error: no cuda implementation for qmatmul: https://github.com/huggingface/candle/issues/696
You are here: https://github.com/huggingface/candle/issues/1250
come on, need it, thanks.
This will be a game changer for running bigger LLMs on consumer-grade GPUs, as memory is the main constraint. Big thanks for everyone's efforts. This is an awesome framework! Love Rust and can't stand to code in Python...
Looks like an exciting development.
Hi, may I politely ask about the progress on supporting quantization on CUDA? Is there any new info? I have some time to help over the next few days, if needed.
@miketang84 Due to time constraints, I wasn't able to dive deeper into this. However, for enabling gguf quantizations with CUDA, essentially three steps are required (rough sketch below):
1. A `QCudaStorage`, similar to how `QMetalStorage` was implemented for Metal, as seen here: QMetalStorage.
2. A `cuda_fwd` for `QTensor`, akin to the `metal_fwd` implementation found here: metal_fwd.
3. The quantized CUDA kernels themselves, which, as noted above, can likely be ported from the ggml / llama.cpp projects.

Please link any ongoing PR / branch for this feature, if work on it has started.
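Here is a self-contained, illustrative sketch of steps 1 and 2, modeled loosely on the Metal path mentioned above. `QCudaStorage`, `GgmlDType`, and the `fwd` signature are all placeholders rather than candle's real types, and the kernel launch (step 3) is stubbed out so the example compiles without a GPU.

```rust
#![allow(dead_code, non_camel_case_types)]
// Placeholder types throughout; not candle's actual API.

/// Small subset of gguf block types, for illustration.
#[derive(Clone, Copy, Debug)]
enum GgmlDType {
    Q4_0,
    Q8_0,
}

/// Step 1: a CUDA-side container for packed quantized blocks,
/// the counterpart of QMetalStorage.
struct QCudaStorage {
    data: Vec<u8>, // stand-in for a cudarc device allocation
    dtype: GgmlDType,
}

impl QCudaStorage {
    /// Step 2: the forward pass QMatMul would use on CUDA, analogous to
    /// metal_fwd: `m` rows of activations against `n` quantized columns.
    /// Step 3 lives inside this function: launching a ggml/llama.cpp-derived
    /// vecdot or matmul kernel. A zero-filled stub keeps this compilable.
    fn fwd(&self, _xs: &[f32], m: usize, n: usize) -> Vec<f32> {
        vec![0.0; m * n]
    }
}

fn main() {
    let w = QCudaStorage { data: vec![0u8; 144], dtype: GgmlDType::Q4_0 };
    let y = w.fwd(&[0.0f32; 32], 1, 8);
    println!("dtype {:?}, out len {}", w.dtype, y.len());
}
```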
You can check out #1754 which contains a first implementation of cuda support for quantized models. It's certainly not optimal in terms of performance and there are a bunch of optimization/kernels to be added but I hope to merge a first cut later today.
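For anyone who wants to try the first cut, here is a hedged usage sketch: load one quantized tensor from a gguf file onto a CUDA device and push it through `QMatMul`. The file path and tensor name are made up, and the exact signatures (notably whether `Content::tensor` takes a device argument and whether `QMatMul::from_qtensor` returns a `Result`) vary across candle versions.

```rust
use candle_core::quantized::{gguf_file, QMatMul};
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    // First visible CUDA GPU.
    let device = Device::new_cuda(0)?;

    // Hypothetical file path and tensor name; both depend on your model.
    let mut file = std::fs::File::open("model-q8_0.gguf")?;
    let content = gguf_file::Content::read(&mut file)?;
    let qtensor = content.tensor(&mut file, "output.weight", &device)?;
    let (_out_dim, in_dim) = qtensor.shape().dims2()?;

    // Previously this forward pass failed with
    // "no cuda implementation for qmatmul" (#696); with #1754 it runs on GPU.
    let qmatmul = QMatMul::from_qtensor(qtensor)?;
    let xs = Tensor::zeros((1, in_dim), DType::F32, &device)?;
    let ys = qmatmul.forward(&xs)?;
    println!("output shape: {:?}", ys.shape());
    Ok(())
}
```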