lullabies777 opened 2 months ago
I tried running the Nemo-12b 4-bit model on a single T4 GPU, but the inference speed is very slow. Additionally, the 'forward' function takes much longer than 'generate'. Is there a speedup benchmark for the T4? I'm wondering if I'm doing this the right way.
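For context, here is a minimal sketch of how the two timings could be compared, assuming a `model` and `tokenizer` are already loaded on CUDA (the names and the 32-token budget are illustrative, not from the original report):

```python
import time
import torch

def timed(fn, **kwargs):
    # Run fn once and return (output, elapsed seconds),
    # syncing the GPU so the wall-clock timing is accurate.
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(**kwargs)
    torch.cuda.synchronize()
    return out, time.perf_counter() - start

inputs = tokenizer(["Hello, my name is"], return_tensors="pt").to("cuda")

with torch.no_grad():
    _, t_forward = timed(model, **inputs)  # one forward pass over the prompt
    _, t_generate = timed(model.generate, **inputs, max_new_tokens=32)  # full decode loop

print(f"forward: {t_forward:.3f}s, generate (32 tokens): {t_generate:.3f}s")
```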
Are you using FastLanguageModel.for_inference(model) for inference?
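For reference, a minimal sketch of the intended call order, following the usual Unsloth loading pattern; the checkpoint name below is an assumption, and the key point is calling `FastLanguageModel.for_inference(model)` after loading and before `generate`:

```python
from unsloth import FastLanguageModel

# Load a 4-bit checkpoint; this model name is an assumption, substitute your own.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Switch the model into Unsloth's fast inference mode before generating.
FastLanguageModel.for_inference(model)

inputs = tokenizer(["Hello, my name is"], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs))
```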
FastLanguageModel.for_inference(model)