blizzardwj opened this issue 1 day ago
Server logs will aid in debugging. The usual reason for slow generation is the model running on the CPU rather than the GPU. For example, if LM Studio was still running when you did the ollama test, there may not have been enough VRAM for ollama to (fully) load the model. (A quick way to check the offload is sketched after the transcript below.)
$ ollama run llama3.1 --verbose
>>> Hi, tell me a joke
Here's one:
What do you call a fake noodle?
An impasta!
Hope that made you laugh! Do you want to hear another one?
total duration: 478.994822ms
load duration: 32.503088ms
prompt eval count: 16 token(s)
prompt eval duration: 69ms
prompt eval rate: 231.88 tokens/s
eval count: 32 token(s)
eval duration: 375ms
eval rate: 85.33 tokens/s
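A quick way to confirm whether the model is (fully) on the GPU is to run ollama ps while the model is still loaded; the PROCESSOR column shows the CPU/GPU split, and anything other than "100% GPU" usually explains slow generation. The output below is illustrative only (the model ID, size, and expiry are placeholders, not from this system):

$ ollama ps
NAME               ID              SIZE      PROCESSOR    UNTIL
llama3.1:latest    46e0c10c039e    6.7 GB    100% GPU     4 minutes from now

A split such as "52%/48% CPU/GPU" would mean part of the model spilled into system RAM, which typically cuts the eval rate by an order of magnitude.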
What is the issue?
I'm just testing out Llama 3.1 8B q4_k_m with "tell me a joke", and the performance gap between LM Studio and Ollama is pretty huge. I use Ollama + Nvidia for work, but for home fun I went with a 7800XT. I'm wondering if the performance issue is due to my setup or a limitation in Ollama itself.
Both were tested on Adrenalin Edition 24.10.1.
OS: Windows
GPU: AMD
CPU: AMD
Ollama version: 0.4.7
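Since the reply above asks for server logs: on Windows they normally live under %LOCALAPPDATA%\Ollama (server.log and app.log), per Ollama's troubleshooting docs; the exact location may vary by install. Opening the folder from a terminal:

$ explorer %LOCALAPPDATA%\Ollama

server.log records whether the AMD GPU was detected and how many model layers were offloaded to it, which should settle whether the slowdown here is CPU fallback or something else.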