intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc

The first token latency is not present in the output timing message #9203

Open shuangpe opened 1 year ago

shuangpe commented 1 year ago

I'd like to benchmark the optimized performance of the LLaMA2 model with BigDL acceleration on an SPR machine.

I followed the README in python/llm/example/CPU/Native-Models; the example executed normally and printed the timing message.

However, in the timing message, the prompt eval time (which is also the first token latency) is abnormal, as shown below.

bigdl-llm timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)

The prompt eval time is zero, and the reported token count does not include the prompt tokens; this differs from the timing output of ggml llama.cpp.
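
For reference, the first-token latency can also be timed by hand while the built-in message is broken. The sketch below is only an illustration: it assumes the native Llama class in bigdl.llm.models follows the llama-cpp-python interface (a model_path / n_threads constructor and a streaming call that yields one chunk per generated token), and the model path and thread count are placeholders.

import time

from bigdl.llm.models import Llama

# Hypothetical model path and thread count; adjust to your converted model.
llm = Llama(model_path="/path/to/output/model", n_threads=48)
prompt = "Once upon a time, there existed a little girl who liked to have adventures."

start = time.perf_counter()
first_token_latency = None
n_generated = 0
for _chunk in llm(prompt, max_tokens=32, stream=True):
    if first_token_latency is None:
        # Time from submitting the prompt to receiving the first generated token,
        # i.e. the prompt eval time that the built-in message reports as 0.00 ms.
        first_token_latency = time.perf_counter() - start
    n_generated += 1
total = time.perf_counter() - start

if first_token_latency is not None:
    print(f"first token latency: {first_token_latency * 1000:.2f} ms")
print(f"generated {n_generated} tokens in {total * 1000:.2f} ms")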

liu-shaojun commented 1 year ago

@shuangpe Thanks for reporting this issue. We have reproduced it on SPR and will fix it once we have bandwidth.

For now, you could benchmark with llm-cli by following https://github.com/intel-analytics/BigDL/tree/main/python/llm#4-cli-tool:

conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[all]

llm-convert "/path/to/model/" --model-format pth --model-family "llama" --outfile "/path/to/output/"

numactl -C 0-47 -m 0 llm-cli -m "/path/to/output/model" -x "llama" -t 48 -n 32 -p "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun."
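
To collect the numbers programmatically, a small wrapper like the sketch below could run the same command and pull out the prompt eval line. It is only a sketch: it assumes the timing lines keep the llama.cpp-style format shown above and that they may land on either stdout or stderr, and the model path is a placeholder.

import re
import subprocess

# Same command as above; the model path is a placeholder.
cmd = [
    "numactl", "-C", "0-47", "-m", "0",
    "llm-cli", "-m", "/path/to/output/model", "-x", "llama", "-t", "48", "-n", "32",
    "-p", "Once upon a time, there existed a little girl who liked to have adventures. "
          "She wanted to go to places and meet new people, and have fun.",
]

# Capture both streams, since the timing summary may be written to either one.
result = subprocess.run(cmd, capture_output=True, text=True)
output = result.stdout + result.stderr

# Match lines like: "prompt eval time =     0.00 ms /     1 tokens (...)"
match = re.search(r"prompt eval time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*tokens", output)
if match:
    ms, n_tokens = float(match.group(1)), int(match.group(2))
    print(f"prompt eval (first token) time: {ms:.2f} ms over {n_tokens} tokens")
else:
    print("timing line not found; raw output follows")
    print(output)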

We have tested the above commands and the timings are printed correctly. Feel free to contact me if you have further questions.