intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc

The first token latency is not present in the output timing message #9203

Open shuangpe opened 1 year ago

shuangpe commented 1 year ago

I'd like to benchmark the optimized performance of the LLaMA2 model with BigDL acceleration on an SPR machine.

I followed the README in python/llm/example/CPU/Native-Models; the example executed normally and printed the timing message.

However, in the timing message, the prompt eval time (which is also the first token latency) is abnormal, as shown below.

bigdl-llm timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)

The prompt eval time is zero, and the reported token count does not include the prompt tokens; this differs from the timing output of ggml llama.cpp.
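
For reference, the first-token latency can also be timed by hand while the built-in message is broken. The sketch below is only an illustration: it assumes the native Llama class in bigdl.llm.models follows the llama-cpp-python interface (a model_path / n_threads constructor and a streaming call that yields one chunk per generated token), and the model path and thread count are placeholders.

import time

from bigdl.llm.models import Llama

# Hypothetical model path and thread count; adjust to your converted model.
llm = Llama(model_path="/path/to/output/model", n_threads=48)
prompt = "Once upon a time, there existed a little girl who liked to have adventures."

start = time.perf_counter()
first_token_latency = None
n_generated = 0
for _chunk in llm(prompt, max_tokens=32, stream=True):
    if first_token_latency is None:
        # Time from submitting the prompt to receiving the first generated token,
        # i.e. the prompt eval time that the built-in message reports as 0.00 ms.
        first_token_latency = time.perf_counter() - start
    n_generated += 1
total = time.perf_counter() - start

if first_token_latency is not None:
    print(f"first token latency: {first_token_latency * 1000:.2f} ms")
print(f"generated {n_generated} tokens in {total * 1000:.2f} ms")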

liu-shaojun commented 1 year ago

@shuangpe Thanks for reporting this issue. We have reproduced it on SPR and will fix it once we have bandwidth.

For now, you could benchmark with llm-cli by following https://github.com/intel-analytics/BigDL/tree/main/python/llm#4-cli-tool:

conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[all]

llm-convert "/path/to/model/" --model-format pth --model-family "llama" --outfile "/path/to/output/"

numactl -C 0-47 -m 0 llm-cli -m "/path/to/output/model" -x "llama" -t 48 -n 32 -p "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun."
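
To collect the numbers programmatically, a small wrapper like the sketch below could run the same command and pull out the prompt eval line. It is only a sketch: it assumes the timing lines keep the llama.cpp-style format shown above and that they may land on either stdout or stderr, and the model path is a placeholder.

import re
import subprocess

# Same command as above; the model path is a placeholder.
cmd = [
    "numactl", "-C", "0-47", "-m", "0",
    "llm-cli", "-m", "/path/to/output/model", "-x", "llama", "-t", "48", "-n", "32",
    "-p", "Once upon a time, there existed a little girl who liked to have adventures. "
          "She wanted to go to places and meet new people, and have fun.",
]

# Capture both streams, since the timing summary may be written to either one.
result = subprocess.run(cmd, capture_output=True, text=True)
output = result.stdout + result.stderr

# Match lines like: "prompt eval time =     0.00 ms /     1 tokens (...)"
match = re.search(r"prompt eval time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*tokens", output)
if match:
    ms, n_tokens = float(match.group(1)), int(match.group(2))
    print(f"prompt eval (first token) time: {ms:.2f} ms over {n_tokens} tokens")
else:
    print("timing line not found; raw output follows")
    print(output)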

We have tested the above commands and the timings are printed correctly. Feel free to contact me if you have further questions.