Version: `docker pull intel/xfastertransformer:1.6.0`
Model source: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
Hardware: Xeon HBM
Model conversion:
```bash
python -c 'import xfastertransformer as xft; xft.LlamaConvert().convert("/data/llama2-7B-Chat")'
```
Prompts: prompt_xft.json
Benchmarking:
```bash
OMP_NUM_THREADS=56 numactl -m 1 -C 56-111 python benchmark.py --token_path /data/llama2-7B-Chat --model_path /data/llama2-7B-Chat-xft --prompt_path prompt.json --model_name llama-2-7b --dtype bf16 --batch_size 1 --token_in 1024 --token_out 512 --beam_width 1 --iteration 3 --padding=False
```
Result: 1st token latency: 10943.5 ms. With the default prompt: 1st token latency: 530.78 ms.
The 1st token latency for the Chinese input is much higher than for the English input.
That's because the token length of your prompt is actually 1978: it contains 1243 Chinese characters, and the Llama-2 tokenizer typically encodes each Chinese character as several tokens, so the prefill (1st token) latency grows with the real token count rather than with `--token_in 1024`. Printing the Chinese character count and the tokenized input shape gives:

1243
torch.Size([1, 1978])
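A quick way to reproduce this check is sketched below. It assumes the prompt text sits under a `"prompt"` key in prompt_xft.json, which may not match the actual file layout; adjust the key to your file.

```python
import json
import re

from transformers import AutoTokenizer

# Load the prompt text; the "prompt" key is an assumption about the JSON layout.
with open("prompt_xft.json") as f:
    prompt = json.load(f)["prompt"]

# Count CJK characters in the prompt.
num_chinese = len(re.findall(r"[\u4e00-\u9fff]", prompt))
print(num_chinese)  # e.g. 1243

# Tokenize with the same tokenizer benchmark.py uses (--token_path).
tokenizer = AutoTokenizer.from_pretrained("/data/llama2-7B-Chat")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
print(input_ids.shape)  # e.g. torch.Size([1, 1978])
```

If the shape's second dimension is much larger than the nominal character count, the prompt is simply longer in tokens than it looks, which fully accounts for the higher 1st token latency.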