unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Low GPU utilization when running Unsloth-finetuned Qwen2.5-Coder-14B-Instruct-128K-GGUF #1292

Open e1ijah1 opened 2 days ago

e1ijah1 commented 2 days ago

Hi all,

I'm trying to run inference on Unsloth-finetuned models. I'm using llama.cpp with 2x RTX 4090 GPUs to benchmark the performance of Qwen2.5-Coder-14B-Instruct-128K-GGUF/Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf. I found that GPU utilization stays below 60% on both GPUs and the inference speed is slower than expected.

Here's my docker command:

docker run --rm -p 8080:8080 \
        -v /weights:/models \
        --gpus '"device=0,1"' \
        ghcr.io/ggerganov/llama.cpp:server-cuda \
        -m models/Qwen2.5-Coder-14B-Instruct-128K-GGUF/Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf \
        -a 'Qwen/Qwen2.5-Coder-14B-Instruct' \
        -c 92160 \
        --host 0.0.0.0 \
        --port 8080 \
        --n-gpu-layers 99 \
        --parallel 2 \
        --cont-batching \
        --mlock

Is this expected behavior when running Unsloth-finetuned models through llama.cpp? Are there any recommended settings to improve GPU utilization and performance?

[Screenshot: GPU utilization during inference on both GPUs]
danielhanchen commented 1 day ago

I'm assuming non-finetuned variants are also like this? It's probably a memory bandwidth issue since RTX 4090s don't have NVLink, I think, so maybe the memory transfers are bottlenecking the GPUs.

Have you tried using just 1 GPU?
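For reference, a minimal single-GPU variant of the command above might look like the sketch below; it keeps the same image and mount, but the context size is reduced here purely as an assumption so the KV cache fits on one 24 GB card. The row-split option in the trailing comment is one thing to try if you keep both GPUs.

# Single-GPU baseline: expose only device 0 to the container
# (context reduced from 92160 as an assumption so the KV cache fits on one 24 GB card)
docker run --rm -p 8080:8080 \
        -v /weights:/models \
        --gpus '"device=0"' \
        ghcr.io/ggerganov/llama.cpp:server-cuda \
        -m models/Qwen2.5-Coder-14B-Instruct-128K-GGUF/Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf \
        -c 32768 \
        --host 0.0.0.0 \
        --port 8080 \
        --n-gpu-layers 99 \
        --mlock

# If keeping both GPUs, row-wise splitting (instead of the default per-layer split)
# can sometimes balance the cards better; append to the original command:
#        --split-mode row --main-gpu 0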

e1ijah1 commented 1 day ago

You're right! When using a single GPU, I get normal GPU utilization, but there seems to be a problem with the responses from the Unsloth-finetuned model.

I did a comparison test:

  • Fig 1: Response generated using llama.cpp with Qwen2.5-Coder-32B-Instruct-Q3_K_M.gguf
  • Fig 2: Response generated using vllm with Qwen2.5-Coder-32B-Instruct-AWQ

The output quality from the Q3_K_M.gguf model appears to be degraded compared to the AWQ version. The responses seem less coherent and don't maintain the same level of quality as the original model.

Has anyone else encountered similar issues with the quantized GGUF version? I'm wondering if this is related to the quantization process or if there are specific settings I should adjust to improve the output quality.

[Fig 1 and Fig 2: screenshots of the two responses]
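One way to check whether the drop in quality comes from the Q3_K_M quantization itself, rather than from serving settings, is to compare perplexity across quant levels with llama.cpp's perplexity tool. A rough sketch is below; the GGUF file names and the wikitext file are only illustrative placeholders.

# Compare perplexity of two quant levels on the same evaluation text
# (wiki.test.raw is the usual wikitext-2 test split; any representative text file works)
./llama-perplexity -m Qwen2.5-Coder-32B-Instruct-Q3_K_M.gguf -f wiki.test.raw -ngl 99
./llama-perplexity -m Qwen2.5-Coder-32B-Instruct-Q5_K_M.gguf -f wiki.test.raw -ngl 99
# A clearly higher perplexity for Q3_K_M would point at the quantization itself;
# similar numbers would suggest the problem is elsewhere (prompt template, sampling, serving settings).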

I'm assuming non-finetuned variants are also like this? It's probably a memory bandwidth issue since RTX 4090s don't have NVLink, I think, so maybe the memory transfers are bottlenecking the GPUs. Have you tried using just 1 GPU?

I've encountered the same issue; it seems to be a llama.cpp problem. I tested with the official Qwen GGUF models as well and noticed repetitive/degraded outputs compared to the AWQ versions. The responses tend to loop or become less coherent.
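In case it helps narrow this down, repetition can sometimes be reduced by passing explicit sampling parameters per request to the llama.cpp server's /completion endpoint. The prompt and values below are only illustrative starting points, not tuned recommendations.

# Request with an explicit repeat penalty and moderate temperature
curl http://localhost:8080/completion \
        --header "Content-Type: application/json" \
        --data '{
          "prompt": "Write a Python function that reverses a linked list.",
          "n_predict": 256,
          "temperature": 0.7,
          "repeat_penalty": 1.1
        }'

If the looping persists regardless of sampling settings and also shows up with the official Qwen GGUFs, that would point back at llama.cpp (or the prompt/chat template being applied) rather than the Unsloth finetune itself.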