intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc

Garbage output when serving 4 parallel users. #12067

Open adi-lb-phoenix opened 2 months ago

adi-lb-phoenix commented 2 months ago

I started a server with the command `OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ./ollama serve`. We then opened 4 terminals and executed `./ollama run codellama`, after which the model loaded. On all 4 terminals we gave the prompt `>>> write a long poem` and submitted it simultaneously (four parallel requests). The output is garbage values. (Screenshot attached: Screenshot_20240911_152331)
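For anyone trying to reproduce this without four interactive terminals, the setup above is roughly equivalent to the sketch below. This is my own approximation, assuming the default Ollama HTTP endpoint on port 11434 and the `/api/generate` route; it simply fires four generation requests at once.

```bash
#!/usr/bin/env bash
# Reproduction sketch (assumption: Ollama is listening on the default 127.0.0.1:11434).
# Start the server first in another shell:
#   OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ./ollama serve

for i in 1 2 3 4; do
  curl -s http://127.0.0.1:11434/api/generate \
    -d '{"model": "codellama", "prompt": "write a long poem", "stream": false}' \
    > "response_$i.json" &   # run all four requests in parallel
done
wait                          # wait for all background requests to finish

# Inspect the outputs; with this bug the generated text comes back garbled
# when the requests overlap.
grep -o '"response":.*' response_*.json | head
```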

sgwhat commented 1 month ago

Hi @adi-lb-phoenix, could you please provide your env and device config? In our test, ollama was able to run codellama as expected on MTL Linux.

adi-lb-phoenix commented 1 month ago

Hello @sgwhat. I have installed podman and distrobox on KDE neon, and created an Ubuntu container with distrobox. IPEX-LLM is deployed inside that Ubuntu distrobox. Inside the Ubuntu distrobox:

uname -a
Linux ubuntu22_ollama.JOHNAIC 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

On the host system

Linux JOHNAIC 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

The GPU is an Intel Arc A770.
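In case more device details are useful, here is a sketch of the commands one could run inside the distrobox to dump GPU and runtime info (assuming the oneAPI `sycl-ls` and the OpenCL `clinfo` tools are installed; adjust paths as needed):

```bash
# Kernel and distro (already shown above)
uname -a

# SYCL / Level Zero devices visible to oneAPI (assumption: oneAPI base toolkit is installed)
sycl-ls

# OpenCL view of the Arc A770 (assumption: clinfo is installed)
clinfo | grep -i "device name"

# Version of the ollama binary shipped with ipex-llm
./ollama --version
```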

sgwhat commented 1 month ago

We are currently investigating the cause of the codellama output issue on Linux with the Arc A770 and will notify you as soon as possible.

adi-lb-phoenix commented 1 month ago

@sgwhat Thank you for picking this up. The issue has been observed not just with codellama but with other models as well.

adi-lb-phoenix commented 1 month ago

https://github.com/ggerganov/llama.cpp/issues/9505#issuecomment-2352561991 — as noted here, upstream llama.cpp does not output garbage values under the same parallel load.
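For reference, the upstream comparison was done with llama.cpp's built-in server; a sketch of an equivalent parallel setup is below (assuming a recent llama.cpp build where the server binary is `llama-server`; the model path is a placeholder).

```bash
# Serve the same GGUF model with 4 parallel slots.
# -np / --parallel sets the number of concurrent sequences, -c the total context size.
./llama-server -m ./codellama-7b.Q4_K_M.gguf -np 4 -c 8192 --host 127.0.0.1 --port 8080

# Then issue four concurrent completion requests against the server's
# OpenAI-compatible endpoint, e.g. with curl, the same way as for Ollama above.
```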

adi-lb-phoenix commented 1 month ago

When serving just one user, ipex-llm is faster than llama.cpp. Result from ipex-llm:

```
llama_print_timings:        load time =    7797.13 ms
llama_print_timings:      sample time =      30.64 ms /   400 runs   (    0.08 ms per token, 13055.26 tokens per second)
llama_print_timings: prompt eval time =    1322.78 ms /    13 tokens (  101.75 ms per token,     9.83 tokens per second)
llama_print_timings:        eval time =   11301.98 ms /   399 runs   (   28.33 ms per token,    35.30 tokens per second)
llama_print_timings:       total time =   12711.93 ms /   412 tokens
```

Below is the result from llama.cpp:

```
llama_perf_sampler_print:    sampling time =      31.73 ms /   413 runs   (    0.08 ms per token, 13015.66 tokens per second)
llama_perf_context_print:        load time =    4317.89 ms
llama_perf_context_print: prompt eval time =     456.68 ms /    13 tokens (   35.13 ms per token,    28.47 tokens per second)
llama_perf_context_print:        eval time =   22846.95 ms /   399 runs   (   57.26 ms per token,    17.46 tokens per second)
llama_perf_context_print:       total time =   23379.98 ms /   412 tokens
```
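To summarize the two single-user runs above (numbers copied from the logs; same 13-token prompt and 399 generated tokens in both runs):

| Metric | ipex-llm | llama.cpp |
| --- | --- | --- |
| Load time | 7797.13 ms | 4317.89 ms |
| Prompt eval | 9.83 tokens/s | 28.47 tokens/s |
| Eval (generation) | 35.30 tokens/s | 17.46 tokens/s |
| Total time | 12711.93 ms | 23379.98 ms |

So generation throughput with ipex-llm is roughly 2x llama.cpp here, although llama.cpp loads faster and evaluates the prompt faster in this particular run.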