huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

Quantized-t5 models on Cuda #2266

Open helizac opened 3 weeks ago

helizac commented 3 weeks ago

Hello! Are there any plans to implement quantized-t5 models on CUDA devices? I've been looking for a couple of days to find a solution or to implement CUDA support for https://github.com/huggingface/candle/blob/main/candle-examples/examples/quantized-t5/main.rs but I couldn't figure it out. It would be great to be able to run customized, quantized t5 models with GPU support.

Nikita-Sherstnev commented 1 week ago

Hi, it actually works with CUDA: just change both occurrences of `Device::Cpu` in the code to `Device::cuda_if_available(0)?`. At least it worked for me with jbochi/madlad400-3b-mt.
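Concretely, the change inside the example would look something like this (a sketch; the exact variable names and the crate alias, `candle_core` vs `candle`, depend on how the example imports it):

```rust
use candle_core::Device;

// Before: forces everything onto the CPU.
// let device = Device::Cpu;

// After: picks GPU 0 when the binary was built with CUDA support,
// and silently falls back to the CPU otherwise.
let device = Device::cuda_if_available(0)?;
```

Note that `cuda_if_available` falls back to the CPU when the binary was compiled without CUDA kernels, so the example also needs to be built with the `cuda` feature enabled, e.g. `cargo run --example quantized-t5 --release --features cuda`.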

helizac commented 1 week ago

I actually tried this. I also tried some implementations from other examples. Even with those changes, it only uses the CPU, and GPU usage stays flat the whole time.

I'm trying it on helizac/TURNA_GGUF, which I quantized with candle from boun-tabi-LMG/TURNA (a t5 model based on the UL2 framework).

If I find a solution for this situation, I will share it here. Thank you for your help, but for now the problem persists for this model.
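For anyone debugging the same symptom, a quick way to check whether a candle binary actually has CUDA compiled in is to print the selected device (a minimal sketch, assuming the `candle_core` crate; not code from this thread):

```rust
use candle_core::Device;

fn main() -> candle_core::Result<()> {
    // With the `cuda` feature enabled this should report a Cuda device;
    // if it reports Cpu, the call silently fell back because CUDA
    // support was not compiled into the binary.
    let device = Device::cuda_if_available(0)?;
    println!("selected device: {device:?}");
    Ok(())
}
```

If this prints a CPU device even on a machine with a working GPU, the fix is usually in the build flags rather than in the model code.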