shuangpe opened this issue 1 year ago
@shuangpe Thanks for reporting this issue. We have reproduced it on SPR and will fix it once we have bandwidth.
For now, you could benchmark with llm-cli by following https://github.com/intel-analytics/BigDL/tree/main/python/llm#4-cli-tool:
conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[all]
llm-convert "/path/to/model/" --model-format pth --model-family "llama" --outfile "/path/to/output/"
numactl -C 0-47 -m 0 llm-cli -m "/path/to/output/model" -x "llama" -t 48 -n 32 -p "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun."
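If the core and memory layout on your SPR machine differs, you can check the NUMA topology first and adjust the -C/-m values in the command above accordingly (a quick check, assuming numactl is installed):

numactl --hardware
lscpu | grep -i numa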
We have tested the above commands and the timing is printed correctly. Feel free to contact me if you have further questions.
I'd like to benchmark the optimized performance of the LLaMA 2 model with BigDL acceleration on an SPR machine.
I followed the README in python/llm/example/CPU/Native-Models; the example ran normally and printed the timing message.
However, in the timing message, the prompt eval time (which is also the first token latency) is abnormal, as shown below.
The prompt eval time is zero, and the token count does not include the prompt tokens, which differs from the output of ggml llama.cpp.
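As a cross-check while this is being fixed, first-token latency can be measured manually by timing the generation of a single new token. This is only a minimal sketch, assuming the bigdl-llm transformers-style API (AutoModelForCausalLM with load_in_4bit=True) is available in the environment and that /path/to/model/ is a placeholder for the same LLaMA 2 checkpoint used above:

import time
from transformers import LlamaTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = "/path/to/model/"  # placeholder path
prompt = "Once upon a time, there existed a little girl who liked to have adventures."

tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Timing a single new token covers the full prompt evaluation,
# so it approximates the first-token latency.
start = time.perf_counter()
model.generate(input_ids, max_new_tokens=1)
first_token_latency = time.perf_counter() - start
print(f"first token latency: {first_token_latency:.3f} s for {input_ids.shape[1]} prompt tokens")

Numbers from this sketch go through the Python API rather than the native llm-cli path, so they are only a sanity check against the zero prompt eval time, not a like-for-like benchmark.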