ollama / ollama

Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.
https://ollama.com
MIT License

Inference on an AMD mid-range GPU achieves 1.5 tokens/s, while LM Studio can reach 65 tokens/s #7911

Open blizzardwj opened 1 day ago

blizzardwj commented 1 day ago

What is the issue?

I'm testing Llama 3.1 8B q4_k_m with the prompt "tell me a joke". The performance gap between Ollama and LM Studio is huge. I use Ollama + Nvidia for work, but for home use I went with a 7800 XT. I'm wondering whether the slowdown is due to my setup or a limitation in Ollama itself.

Both tests were run on Adrenalin Edition 24.10.1.

[screenshot: Ollama vs. LM Studio generation speed]

OS

Windows

GPU

AMD

CPU

AMD

Ollama version

0.4.7

rick-github commented 1 day ago

Server logs will aid in debugging. The usual reason for slow generation is the model running on the CPU rather than the GPU. For example, if LM Studio was running when you did the ollama test, there may not have been enough VRAM for ollama to (fully) load the model.
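On Windows, assuming a default install, the logs live under %LOCALAPPDATA%\Ollama; opening that folder gets you server.log, and the lines written at model load time should record how much of the model was offloaded to the GPU:

explorer %LOCALAPPDATA%\Ollama

For comparison, a run where the model fits entirely on the GPU reports eval rates like the following: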

$ ollama run llama3.1 --verbose
>>> Hi, tell me a joke
Here's one:

What do you call a fake noodle?

An impasta!

Hope that made you laugh! Do you want to hear another one?

total duration:       478.994822ms
load duration:        32.503088ms
prompt eval count:    16 token(s)
prompt eval duration: 69ms
prompt eval rate:     231.88 tokens/s
eval count:           32 token(s)
eval duration:        375ms
eval rate:            85.33 tokens/s
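Another quick check while the model is loaded is ollama ps; the PROCESSOR column shows how the model is split between CPU and GPU (the output below is illustrative, not from this machine):

$ ollama ps
NAME               ID              SIZE      PROCESSOR    UNTIL
llama3.1:latest    42182419e950    6.7 GB    100% GPU     4 minutes from now

Anything less than 100% GPU means part of the model is running on the CPU, which can drop generation to low single-digit tokens/s.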
