Open e1ijah1 opened 2 days ago
I'm assuming non-finetuned variants behave the same way? It's probably a memory-bandwidth issue: the RTX 4090 doesn't support NVLink, I think, so inter-GPU transfers over PCIe may be bottlenecking both cards.
Have you tried using just 1 GPU?
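For reference, one way to test this is to pin llama.cpp to a single GPU via CUDA_VISIBLE_DEVICES. This is just a sketch; the model path, context size, and prompt are placeholders, so adjust them to your setup:

```shell
# Restrict llama.cpp to GPU 0 so no cross-GPU (PCIe) transfers occur.
# -ngl 99 offloads all layers to the GPU; the model path is a placeholder.
CUDA_VISIBLE_DEVICES=0 ./llama-cli \
  -m Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf \
  -ngl 99 -c 8192 \
  -p "Write a quicksort function in Python."
```

If utilization recovers on one GPU, the multi-GPU slowdown is likely transfer overhead rather than the model itself.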
You're right! When using a single GPU, I get normal GPU utilization, but there seems to be a problem with the responses from the Unsloth-finetuned model.
I did a comparison test:
- Fig 1: Response generated using llama.cpp with Qwen2.5-Coder-32B-Instruct-Q3_K_M.gguf
- Fig 2: Response generated using vllm with Qwen2.5-Coder-32B-Instruct-AWQ
The output quality from the Q3_K_M.gguf model appears to be degraded compared to the AWQ version. The responses seem less coherent and don't maintain the same level of quality as the original model.
Has anyone else encountered similar issues with the quantized GGUF version? I'm wondering if this is related to the quantization process or if there are specific settings I should adjust to improve the output quality.
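One way to check whether the degradation comes from the quantization itself (rather than the chat template or sampling settings) is to compare perplexity across quantization levels with llama.cpp's llama-perplexity tool. A sketch, assuming you have both quants downloaded and a test text file such as wiki.test.raw:

```shell
# Compare perplexity of two quantization levels on the same text.
# Lower is better; a large gap between Q3_K_M and Q4_K_M would point
# at the quantization rather than the inference settings.
./llama-perplexity -m Qwen2.5-Coder-32B-Instruct-Q3_K_M.gguf -f wiki.test.raw -ngl 99
./llama-perplexity -m Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf -f wiki.test.raw -ngl 99
```

Worth noting that Q3_K_M is an aggressive 3-bit quantization, so some quality loss relative to 4-bit AWQ is expected; a Q4_K_M or Q5_K_M GGUF may be a fairer comparison.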
I've encountered the same issue; it seems to be a llama.cpp problem. I tested with the official Qwen GGUF models as well and noticed repetitive, degraded outputs compared to the AWQ versions. The responses tend to loop or become less coherent.
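Looping output can also be a sampling issue rather than a quantization one. As a hedged starting point (the values below are illustrative, not an official Qwen recommendation), try llama-cli with explicit sampling settings and conversation mode so the chat template is applied:

```shell
# Explicit sampling settings; looping is often a sampling problem,
# not a quantization problem. Values here are illustrative only.
# -cnv enables conversation mode so the model's chat template is used.
./llama-cli -m Qwen2.5-Coder-32B-Instruct-Q3_K_M.gguf -ngl 99 -cnv \
  --temp 0.7 --top-p 0.8 --repeat-penalty 1.05
```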
Hi all,
I'm trying to run inference on Unsloth-finetuned models. I'm using llama.cpp with 2x RTX 4090 GPUs to benchmark the performance of Qwen2.5-Coder-14B-Instruct-128K-GGUF/Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf. I found that GPU utilization stays below 60% on both GPUs and inference speed is slower than expected. Here's my docker command:
Is this expected behavior when running Unsloth-finetuned models through llama.cpp? Are there any recommended settings to improve GPU utilization and performance?
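For the multi-GPU case, a few llama.cpp flags are worth experimenting with. A hedged sketch of a server invocation (the image tag and paths are assumptions; check them against your llama.cpp build):

```shell
# Sketch of a llama.cpp server run across 2 GPUs.
# --split-mode row splits individual tensors across GPUs (more transfers,
# sometimes faster); the default "layer" mode assigns whole layers per GPU.
# --tensor-split controls the VRAM ratio between the two cards.
docker run --gpus all -v /path/to/models:/models \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m /models/Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf \
  -ngl 99 --split-mode layer --tensor-split 1,1 \
  --flash-attn -c 16384 --host 0.0.0.0 --port 8080
```

Also note that with the default layer split, the two GPUs process their layer groups largely one after the other, so per-GPU utilization around 50-60% with two cards is commonly reported as expected behavior rather than a bug, especially without NVLink.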