intel / xFasterTransformer


llama-2-7B benchmarking error with chinese prompts #380


qdym188 commented 2 months ago

Version: docker image `intel/xfastertransformer:1.6.0`
Model source: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
Hardware: Xeon HBM
Model conversion:

```bash
python -c 'import xfastertransformer as xft; xft.LlamaConvert().convert("/data/llama2-7B-Chat")'
```

Prompts: prompt_xft.json
Benchmarking:

```bash
OMP_NUM_THREADS=56 numactl -m 1 -C 56-111 python benchmark.py \
    --token_path /data/llama2-7B-Chat --model_path /data/llama2-7B-Chat-xft \
    --prompt_path prompt.json --model_name llama-2-7b --dtype bf16 \
    --batch_size 1 --token_in 1024 --token_out 512 --beam_width 1 \
    --iteration 3 --padding=False
```

Result: 1st token latency is 10943.5 ms. With the default prompt, 1st token latency is 530.78 ms.

The first-token latency for the Chinese input is much higher than for the English input.
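
Not part of the original report, but a minimal pre-flight sketch for this kind of comparison: tokenize the prompt with the model's own tokenizer to see how many tokens the prefill stage will actually process, and compare that against `--token_in`. The plain-text prompt file name below is hypothetical; the real `prompt.json` layout used by `benchmark.py` may differ.

```python
# Sketch: check the real token count of a prompt before benchmarking.
from transformers import AutoTokenizer

TOKEN_PATH = "/data/llama2-7B-Chat"     # same path as passed to --token_path
PROMPT_FILE = "chinese_prompt.txt"      # hypothetical plain-text prompt file

tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH)
prompt = open(PROMPT_FILE, encoding="utf-8").read()

n_tokens = len(tokenizer(prompt).input_ids)
print(f"prompt length in tokens: {n_tokens}")   # compare with --token_in 1024
```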

Duyi-Wang commented 2 months ago

That is because the actual token length of your prompt is 1978, not 1024: the prompt contains 1243 Chinese characters, and the Llama-2 tokenizer encodes each Chinese character as one or more tokens. The prefill stage therefore processes almost twice as many tokens as intended, which explains the higher first-token latency.

```
1243
torch.Size([1, 1978])
```
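
For reference, a sketch that would produce output like the above. The comment does not show how the Chinese characters were counted, so the CJK Unicode range used here is an assumption, and the prompt file name is hypothetical:

```python
# Sketch reproducing the check above; the \u4e00-\u9fff range for counting
# "Chinese characters" is an assumption, not taken from the comment.
import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/data/llama2-7B-Chat")
prompt = open("chinese_prompt.txt", encoding="utf-8").read()  # hypothetical file

print(len(re.findall(r"[\u4e00-\u9fff]", prompt)))             # -> 1243
print(tokenizer(prompt, return_tensors="pt").input_ids.shape)  # -> torch.Size([1, 1978])
```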