ollama / ollama

Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.
https://ollama.com
MIT License

Inference on an AMD mid-range GPU achieves 1.5 tokens/s, while LM Studio can reach 65 tokens/s #7911

Open blizzardwj opened 1 day ago

blizzardwj commented 1 day ago

What is the issue?

I'm testing Llama 3.1 8B q4_k_m with the prompt "tell me a joke". The performance gap between Ollama and LM Studio is huge. I use Ollama + Nvidia for work, but for home use I went with a 7800 XT. I'm wondering whether the slowdown is due to my setup or a limitation in Ollama itself.

Both tests were run on Adrenalin Edition 24.10.1.

[screenshot: Ollama vs. LM Studio generation speed]

OS

Windows

GPU

AMD

CPU

AMD

Ollama version

0.4.7

rick-github commented 1 day ago

Server logs will aid in debugging. The usual reason for slow generation is the model running on the CPU rather than the GPU. For example, if LM Studio was running when you did the ollama test, there may not have been enough VRAM for ollama to (fully) load the model.
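On Windows, assuming a default install, the logs live under %LOCALAPPDATA%\Ollama; opening that folder gets you server.log, and the lines written at model load time should record how much of the model was offloaded to the GPU:

explorer %LOCALAPPDATA%\Ollama

For comparison, a run where the model fits entirely on the GPU reports eval rates like the following: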

$ ollama run llama3.1 --verbose
>>> Hi, tell me a joke
Here's one:

What do you call a fake noodle?

An impasta!

Hope that made you laugh! Do you want to hear another one?

total duration:       478.994822ms
load duration:        32.503088ms
prompt eval count:    16 token(s)
prompt eval duration: 69ms
prompt eval rate:     231.88 tokens/s
eval count:           32 token(s)
eval duration:        375ms
eval rate:            85.33 tokens/s
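Another quick check while the model is loaded is ollama ps; the PROCESSOR column shows how the model is split between CPU and GPU (the output below is illustrative, not from this machine):

$ ollama ps
NAME               ID              SIZE      PROCESSOR    UNTIL
llama3.1:latest    42182419e950    6.7 GB    100% GPU     4 minutes from now

Anything less than 100% GPU means part of the model is running on the CPU, which can drop generation to low single-digit tokens/s.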
