Closed: KiwiHana closed this issue 10 months ago
The first token of the first generate call includes ipex warmup time. When we benchmark, we always use the 2nd and later generate times; in your case, 0.5151s / 0.4186s is the first-token latency for llama2 and chatglm2.
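The warmup-exclusion pattern described above can be sketched as follows. This is a minimal illustration, not BigDL's actual benchmark harness; `run_generate` is a hypothetical stand-in for any `model.generate(...)` call:

```python
import time

def benchmark(run_generate, warmup=1, iters=5):
    """Time a generation callable, discarding warmup iterations.

    On ipex xpu, the first call includes one-time kernel/JIT compilation,
    so only the iterations after `warmup` are averaged.
    """
    timings = []
    for i in range(warmup + iters):
        start = time.time()
        run_generate()
        elapsed = time.time() - start
        if i >= warmup:  # skip the warmup run(s)
            timings.append(elapsed)
    return sum(timings) / len(timings)
```

With `warmup=1`, a 110s first call would not distort the reported average at all.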
This is on A730M, where ChatGLM2's 1st-token latency is < 1s. https://github.com/intel-analytics/BigDL/issues/8853
Do you know which team can provide the right environment dependencies to solve ChatGLM2's long 1st-token latency of 145s on A380?
After offline discussion and validation of Bloomz-560m with @KiwiHana, it seems that ipex xpu does take too much time to warm up on A380. You may consult the ipex gpu team for suggestions.
Double-checked with bloomz-560m on IPEX: bloomz-560m also has a long 1st-token latency of 110s. Will submit a ticket to the IPEX team.
Download model: https://huggingface.co/bigscience/bloomz-560m/tree/main
$ python test_bloomz.py
/home/ubuntu/miniconda3/envs/llm/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''. If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
Namespace(model_dir='/home/ubuntu/Downloads/kiwi/llm/bloomz-560m/', max_new_tokens=32, input_tokens='32')
Test /home/ubuntu/Downloads/kiwi/llm/bloomz-560m/...
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
<class 'transformers.models.bloom.modeling_bloom.BloomForCausalLM'>
torch.float16
torch.Size([1, 33])
torch.Size([1, 33])
=========First token cost 110.3764s=========
torch.Size([1, 33])
=========First token cost 0.0499s=========
torch.Size([1, 33])
=========First token cost 0.0200s=========
torch.Size([1, 33])
=========First token cost 0.0198s=========
torch.Size([1, 33])
=========First token cost 0.0191s=========
torch.Size([1, 33])
=========First token cost 0.0202s=========
torch.Size([1, 33])
=========First token cost 0.0201s=========
torch.Size([1, 33])
=========First token cost 0.0202s=========
torch.Size([1, 33])
=========First token cost 0.0202s=========
torch.Size([1, 33])
=========First token cost 0.0202s=========
Prompt: 我总是在晚上失眠,这个症状已经持续很长时间,所以晚上睡不着到底应该怎么处理,请告诉我一些可行的建议与方法,越详细越好
Output: ['我', '总', '是在', '晚上', '失眠', ',', '这个', '症状', '已经', '持续', '很', '长时间', ',', '所以', '晚上', '睡', '不着', '到底', '应该', '怎么', '处理', ',', '请', '告诉我', '一些', '可行的', '建议', '与', '方法', ',', '越', '详细', '越好', '']
Inference time: 0.02396416664123535 s
[136.0453770160675, 0.054793596267700195, 0.023982524871826172, 0.024051427841186523, 0.022850751876831055, 0.02391529083251953, 0.023853063583374023, 0.023738861083984375, 0.023859024047851562, 0.02396416664123535]
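For reference, the steady-state latency is what we report from a list like the one above; a minimal sketch of dropping the warmup measurement (the 136s outlier) before averaging:

```python
# Per-iteration end-to-end times from the bloomz-560m run above (seconds)
timings = [136.0453770160675, 0.054793596267700195, 0.023982524871826172,
           0.024051427841186523, 0.022850751876831055, 0.02391529083251953,
           0.023853063583374023, 0.023738861083984375, 0.023859024047851562,
           0.02396416664123535]

steady = timings[1:]                  # drop the first (warmup) run
avg = sum(steady) / len(steady)
print(f"steady-state avg: {avg:.4f}s")  # ~0.0272s, vs 136s including warmup
```

This makes the scale of the problem concrete: warmup dominates end-to-end time by roughly four orders of magnitude on A380.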
import time
import argparse

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 -- registers the xpu backend
from transformers import AutoTokenizer, BloomForCausalLM

from gpu_benchmark_util import BenchmarkWrapper

if __name__ == '__main__':
    parser = argparse.ArgumentParser('OPT generation script', add_help=False)
    parser.add_argument('-m', '--model-dir', default="/home/ubuntu/Downloads/kiwi/llm/bloomz-560m/", type=str)
    parser.add_argument('--max-new-tokens', default=32, type=int, help="output max new tokens")
    parser.add_argument('--input-tokens', default='32', type=str)
    args = parser.parse_args()
    print(args)

    # 32-token Chinese prompt ("I always have insomnia at night; this symptom has
    # lasted a long time, so please give me detailed, practical advice...")
    prompt_32 = "我总是在晚上失眠,这个症状已经持续很长时间,所以晚上睡不着到底应该怎么处理,请告诉我一些可行的建议与方法,越详细越好"
    prompt = prompt_32

    print(f"Test {args.model_dir}...")
    model = BloomForCausalLM.from_pretrained(args.model_dir, trust_remote_code=True)
    model = model.half().to('xpu')  # fp16 on the xpu device
    model = BenchmarkWrapper(model)
    print(model.dtype)

    tokenizer = AutoTokenizer.from_pretrained(args.model_dir, trust_remote_code=True)
    inputs = tokenizer([prompt], return_tensors="pt").to('xpu')
    print(inputs["input_ids"].shape)

    e2e_time = []
    with torch.inference_mode():
        for i in range(10):
            torch.xpu.synchronize()
            st = time.time()
            inputs = tokenizer([prompt], return_tensors="pt").to('xpu')
            print(inputs["input_ids"].shape)
            # temperature is ignored when do_sample=False (greedy decoding)
            output = model.generate(**inputs, do_sample=False, temperature=0.9,
                                    max_new_tokens=args.max_new_tokens)
            gen_ids = output[0]
            gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True)
            torch.xpu.synchronize()
            end = time.time()
            e2e_time.append(end - st)

    print('Prompt:', prompt)
    print('Output:', gen_text)
    print(f'Inference time: {end - st} s')
    print(e2e_time)
ChatGLM2 has long 1st-token latency of 145.3s on A380. LLama2-7B also has long 1st-token latency of 126.9s on A380.
bigdl-llm[xpu]==2.4.0b20230905 and 2.4.0b20230827
Ubuntu 22.04, driver 23.33.027067, kernel 5.15.47