adi-lb-phoenix opened this issue 2 months ago
Hi @adi-lb-phoenix, could you please provide your env and device config? In our test, ollama was able to run codellama as expected on MTL Linux.
Hello @sgwhat. I have installed podman and distrobox on KDE Neon, and created an Ubuntu container using distrobox. IPEX-LLM is deployed inside that Ubuntu distrobox. Inside the Ubuntu distrobox:
uname -a
Linux ubuntu22_ollama.JOHNAIC 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
On the host system
Linux JOHNAIC 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
The GPU is an Intel Arc A770.
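For reference, here is a minimal sketch of how an environment like this can be created and how the kernel/GPU details above can be collected. The container name matches the hostname shown in the uname output; the image tag and the use of sycl-ls assume a stock Ubuntu 22.04 image with the oneAPI runtime installed, and are illustrative rather than taken from the original report.

# create and enter an Ubuntu 22.04 container with distrobox (podman backend)
distrobox create --name ubuntu22_ollama --image ubuntu:22.04
distrobox enter ubuntu22_ollama

# collect the kernel and device info to share in the issue
uname -a
lspci | grep -iE 'vga|display'   # should list the Intel Arc A770
sycl-ls                          # lists SYCL devices if the oneAPI runtime is installed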
We are currently investigating the cause of the codellama output issue on Linux with the Arc A770 and will notify you as soon as possible.
@sgwhat Thank you for picking this up. The issue has been observed not just with codellama but with other models as well.
In https://github.com/ggerganov/llama.cpp/issues/9505#issuecomment-2352561991, llama.cpp does not output garbage values.
When serving just one user, IPEX-LLM is faster than llama.cpp. Result from IPEX-LLM:
llama_print_timings: load time = 7797.13 ms
llama_print_timings: sample time = 30.64 ms / 400 runs ( 0.08 ms per token, 13055.26 tokens per second)
llama_print_timings: prompt eval time = 1322.78 ms / 13 tokens ( 101.75 ms per token, 9.83 tokens per second)
llama_print_timings: eval time = 11301.98 ms / 399 runs ( 28.33 ms per token, 35.30 tokens per second)
llama_print_timings: total time = 12711.93 ms / 412 tokens
Below is the result from llama.cpp:
llama_perf_sampler_print: sampling time = 31.73 ms / 413 runs ( 0.08 ms per token, 13015.66 tokens per second)
llama_perf_context_print: load time = 4317.89 ms
llama_perf_context_print: prompt eval time = 456.68 ms / 13 tokens ( 35.13 ms per token, 28.47 tokens per second)
llama_perf_context_print: eval time = 22846.95 ms / 399 runs ( 57.26 ms per token, 17.46 tokens per second)
llama_perf_context_print: total time = 23379.98 ms / 412 tokens
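As a quick sanity check on the decode numbers above (plain arithmetic from the printed eval timings, nothing taken from either library):

# tokens per second = eval runs / eval time in seconds
awk 'BEGIN { printf "ipex-llm:  %.2f tok/s\n", 399 / 11.30198 }'   # ~35.30
awk 'BEGIN { printf "llama.cpp: %.2f tok/s\n", 399 / 22.84695 }'   # ~17.46

So at a single request, IPEX-LLM decodes roughly twice as fast as upstream llama.cpp on this machine.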
I started a server with the command
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ./ollama serve
We then opened 4 terminals and executed ./ollama run codellama in each, after which the model loaded. On all 4 terminals we entered the prompt
>>> write a long poem.
and submitted it simultaneously (four parallel requests). The output is garbage values. A scripted way to reproduce this is sketched below.
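For anyone who wants to reproduce this without four interactive terminals, here is a minimal sketch that sends the same four parallel requests through the Ollama HTTP API. The default port, the /api/generate endpoint, and the request fields are standard Ollama; the script itself and the response file names are illustrative, not part of the original report.

#!/usr/bin/env bash
# fire 4 concurrent generate requests at the ollama server started above
for i in 1 2 3 4; do
  curl -s http://localhost:11434/api/generate \
    -d '{"model": "codellama", "prompt": "write a long poem.", "stream": false}' \
    > "response_$i.json" &
done
wait

# inspect the beginning of each response for garbage output
for i in 1 2 3 4; do
  echo "--- response $i ---"; head -c 300 "response_$i.json"; echo
done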