Version: `docker pull intel/xfastertransformer:1.6.0`
Model source: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
Hardware: Xeon HBM
Model conversion:
```bash
python -c 'import xfastertransformer as xft; xft.LlamaConvert().convert("/data/llama2-7B-Chat")'
```
Prompts: prompt_xft.json
Benchmarking:
```bash
OMP_NUM_THREADS=56 numactl -m 1 -C 56-111 python benchmark.py --token_path /data/llama2-7B-Chat --model_path /data/llama2-7B-Chat-xft --prompt_path prompt.json --model_name llama-2-7b --dtype bf16 --batch_size 1 --token_in 1024 --token_out 512 --beam_width 1 --iteration 3 --padding=False
```
Result: 1st token latency: 10943.5 ms. With the default prompt: 1st token latency: 530.78 ms.
The 1st token latency for the Chinese input is much higher than for the English input.
That's because the token length of your prompt is actually 1978: it contains 1243 Chinese characters, and the Llama-2 tokenizer typically encodes each Chinese character as several tokens, so the prefill (1st token) latency grows with the real token count rather than with `--token_in 1024`. Printing the Chinese character count and the tokenized input shape gives:

1243
torch.Size([1, 1978])
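A quick way to reproduce this check is sketched below. It assumes the prompt text sits under a `"prompt"` key in prompt_xft.json, which may not match the actual file layout; adjust the key to your file.

```python
import json
import re

from transformers import AutoTokenizer

# Load the prompt text; the "prompt" key is an assumption about the JSON layout.
with open("prompt_xft.json") as f:
    prompt = json.load(f)["prompt"]

# Count CJK characters in the prompt.
num_chinese = len(re.findall(r"[\u4e00-\u9fff]", prompt))
print(num_chinese)  # e.g. 1243

# Tokenize with the same tokenizer benchmark.py uses (--token_path).
tokenizer = AutoTokenizer.from_pretrained("/data/llama2-7B-Chat")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
print(input_ids.shape)  # e.g. torch.Size([1, 1978])
```

If the shape's second dimension is much larger than the nominal character count, the prompt is simply longer in tokens than it looks, which fully accounts for the higher 1st token latency.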