lullabies777 opened 2 months ago
I tried running the Nemo-12b 4-bit model on a single T4 GPU, but the inference speed is very slow. Additionally, the 'forward' function takes much longer than 'generate'. Is there a speedup benchmark for the T4? I'm wondering if I'm doing this the right way.
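For context, here is a minimal sketch of how the two timings could be compared, assuming a `model` and `tokenizer` are already loaded on CUDA (the names and the 32-token budget are illustrative, not from the original report):

```python
import time
import torch

def timed(fn, **kwargs):
    # Run fn once and return (output, elapsed seconds),
    # syncing the GPU so the wall-clock timing is accurate.
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(**kwargs)
    torch.cuda.synchronize()
    return out, time.perf_counter() - start

inputs = tokenizer(["Hello, my name is"], return_tensors="pt").to("cuda")

with torch.no_grad():
    _, t_forward = timed(model, **inputs)  # one forward pass over the prompt
    _, t_generate = timed(model.generate, **inputs, max_new_tokens=32)  # full decode loop

print(f"forward: {t_forward:.3f}s, generate (32 tokens): {t_generate:.3f}s")
```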
Are you using FastLanguageModel.for_inference(model) for inference?
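For reference, a minimal sketch of the intended call order, following the usual Unsloth loading pattern; the checkpoint name below is an assumption, and the key point is calling `FastLanguageModel.for_inference(model)` after loading and before `generate`:

```python
from unsloth import FastLanguageModel

# Load a 4-bit checkpoint; this model name is an assumption, substitute your own.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Switch the model into Unsloth's fast inference mode before generating.
FastLanguageModel.for_inference(model)

inputs = tokenizer(["Hello, my name is"], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs))
```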
FastLanguageModel.for_inference(model)