intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

ChatGLM2 has long 1st token latency 145s on A380 while Llama2-7b 126.9s #8904

Closed. KiwiHana closed this issue 10 months ago.

KiwiHana commented 10 months ago

ChatGLM2 has a long 1st token latency of 145.3s on A380; Llama2-7B also has a long 1st token latency of 126.9s on A380.
bigdl-llm[xpu]==2.4.0b20230905 and 2.4.0b20230827
Ubuntu 22.04, driver 23.33.027067, kernel 5.15.47

$ python test_chatglm2.py
/home/ubuntu/miniconda3/envs/llm/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
Namespace(model_dir='/home/ubuntu/Downloads/kiwi/llm/chatglm2-6b/', max_new_tokens=32, input_tokens='32')
Test /home/ubuntu/Downloads/kiwi/llm/chatglm2-6b/...
Loading checkpoint shards: 100%|███████████████████████████████████████████████| 7/7 [00:07<00:00,  1.06s/it]
<class 'transformers_modules.modeling_chatglm.ChatGLMForConditionalGeneration'>
torch.float32
torch.Size([1, 32])
torch.Size([1, 32])
2023-09-06 09:10:27,374 - gpu_benchmark_util - WARNING - Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
/home/ubuntu/.cache/huggingface/modules/transformers_modules/modeling_chatglm.py:374: UserWarning: IPEX XPU dedicated fusion passes are enabled in ScriptGraph non profiling execution mode. Please enable profiling execution mode to retrieve device guard.
 (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/jit/fusion_pass.cpp:826.)
  query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)
=========First token cost 145.3891s=========
=========Rest tokens cost average 0.0584s (31 tokens in all)=========
torch.Size([1, 32])
2023-09-06 09:13:23,193 - gpu_benchmark_util - WARNING - Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
=========First token cost 0.4186s=========
=========Rest tokens cost average 0.0570s (31 tokens in all)=========
torch.Size([1, 32])
$ python test_llama2.py
=========First token cost 126.6797s=========
=========Rest tokens cost average 0.0733s (31 tokens in all)=========
=========First token cost 0.5151s=========
=========Rest tokens cost average 0.0721s (31 tokens in all)=========
cost 2.7540s
import torch
import intel_extension_for_pytorch as ipex
import os
import time
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import numpy as np
from itertools import chain
import pathlib
import argparse
import json
from gpu_benchmark_util import BenchmarkWrapper

if __name__ == '__main__':
    parser = argparse.ArgumentParser('OPT generation script', add_help=False)
    parser.add_argument('-m', '--model-dir', default="/home/ubuntu/Downloads/kiwi/llm/chatglm2-6b/", type=str)
    parser.add_argument('--max-new-tokens', default=32, type=int, help="output max new tokens")
    parser.add_argument('--input-tokens', default='32', type=str)
    args = parser.parse_args()
    print(args)

    # 32-token Chinese prompt, roughly: "I always have insomnia at night and this has lasted a long
    # time; what should I do about it? Please give me practical suggestions, as detailed as possible."
    prompt_32 = "我总是在晚上失眠,这个症状已经持续很长时间,所以晚上睡不着到底应该怎么处理,请告诉我一些可行的建议与方法,越详细越好"

    prompt = prompt_32

    print(f"Test {args.model_dir}...")
    # load_in_4bit=True in bigdl.llm.transformers will convert
    # the relevant layers in the model into int4 format
    model = AutoModel.from_pretrained(args.model_dir, load_in_4bit=True, optimize_model=False, trust_remote_code=True)
  #  model =  AutoModel.load_low_bit(args.model_dir, trust_remote_code=True, optimize_model=False)
    model = model.to('xpu')
    model = BenchmarkWrapper(model)
    print(model.dtype)
    tokenizer = AutoTokenizer.from_pretrained(args.model_dir, trust_remote_code=True)
    inputs = tokenizer([prompt], return_tensors="pt").to('xpu')
    print(inputs["input_ids"].shape)

    total_list = []
    e2e_time = []
    with torch.inference_mode():
        for i in range(10):
            torch.xpu.synchronize()
            st = time.time()
            inputs = tokenizer([prompt], return_tensors="pt").to('xpu')
            print(inputs["input_ids"].shape)
            # greedy decoding (do_sample=False); temperature has no effect without sampling
            output = model.generate(**inputs, do_sample=False, temperature=0.9, max_new_tokens=32)
            gen_ids = output[0]
            # batch_decode on a 1D id tensor decodes each token id separately,
            # so gen_text is a list of individual tokens rather than one string
            gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True)
            torch.xpu.synchronize()
            end = time.time()
            e2e_time.append(end-st)

    print('Prompt:', prompt)
    print('Output:', gen_text)
    print(f'Inference time: {end-st} s')
    print(e2e_time)
rnwang04 commented 10 months ago

The first token of the first generate() call includes the IPEX warmup time. When we benchmark, we always use the timing from the second (or later) generate() call, so in your case 0.5151s / 0.4186s are the first-token latencies for Llama2 and ChatGLM2.
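
For reference, a minimal sketch of a measurement loop that discards this one-time warmup before recording first-token latency is shown below (it reuses the model / tokenizer / prompt names from the script above and is only an illustrative sketch, not the BenchmarkWrapper implementation):

import time
import torch

def first_token_latency(model, tokenizer, prompt, warmup_runs=1, timed_runs=3):
    # Measure 1st-token latency on XPU, discarding the initial warmup generate() calls.
    inputs = tokenizer([prompt], return_tensors="pt").to('xpu')
    timings = []
    with torch.inference_mode():
        # warmup: the very first generate() pays the one-time IPEX/XPU compilation cost
        for _ in range(warmup_runs):
            model.generate(**inputs, max_new_tokens=1)
            torch.xpu.synchronize()
        for _ in range(timed_runs):
            torch.xpu.synchronize()
            st = time.time()
            model.generate(**inputs, max_new_tokens=1)  # 1 new token ~ first-token latency
            torch.xpu.synchronize()
            timings.append(time.time() - st)
    return timings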

KiwiHana commented 10 months ago

On A730M, ChatGLM2's 1st token latency is < 1s: https://github.com/intel-analytics/BigDL/issues/8853

Do you know which team can provide the right environment dependencies to fix ChatGLM2's long (145s) 1st token latency on A380?

rnwang04 commented 10 months ago

After offline discussion and validation with Bloomz-560m together with @KiwiHana, it seems that IPEX XPU does take too much time to warm up on A380. You may consult the IPEX GPU team for suggestions.
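
To see whether the cost is a generic IPEX XPU warmup rather than something in bigdl-llm, a rough standalone check like the sketch below could be timed on the A380 (plain PyTorch + intel_extension_for_pytorch only, arbitrary sizes; it exercises a simple matmul, not the model's fused kernels, so it is only an indicator):

import time
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device

x = torch.randn(1024, 1024, device='xpu')
w = torch.randn(1024, 1024, device='xpu')

for i in range(3):
    torch.xpu.synchronize()
    st = time.time()
    y = x @ w  # the first run may include one-time kernel/JIT warmup; later runs should be fast
    torch.xpu.synchronize()
    print(f"matmul run {i}: {time.time() - st:.4f}s")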

KiwiHana commented 10 months ago

Double-checked with bloomz-560m using plain IPEX (no bigdl-llm): bloomz-560m also has a long 1st token latency of ~110s. Will submit a ticket to IPEX.

Download the model: https://huggingface.co/bigscience/bloomz-560m/tree/main

$ python test_bloomz.py
/home/ubuntu/miniconda3/envs/llm/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
Namespace(model_dir='/home/ubuntu/Downloads/kiwi/llm/bloomz-560m/', max_new_tokens=32, input_tokens='32')
Test /home/ubuntu/Downloads/kiwi/llm/bloomz-560m/...
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
<class 'transformers.models.bloom.modeling_bloom.BloomForCausalLM'>
torch.float16
torch.Size([1, 33])
torch.Size([1, 33])
=========First token cost 110.3764s=========
torch.Size([1, 33])
=========First token cost 0.0499s=========
torch.Size([1, 33])
=========First token cost 0.0200s=========
torch.Size([1, 33])
=========First token cost 0.0198s=========
torch.Size([1, 33])
=========First token cost 0.0191s=========
torch.Size([1, 33])
=========First token cost 0.0202s=========
torch.Size([1, 33])
=========First token cost 0.0201s=========
torch.Size([1, 33])
=========First token cost 0.0202s=========
torch.Size([1, 33])
=========First token cost 0.0202s=========
torch.Size([1, 33])
=========First token cost 0.0202s=========
Prompt: 我总是在晚上失眠,这个症状已经持续很长时间,所以晚上睡不着到底应该怎么处理,请告诉我一些可行的建议与方法,越详细越好
Output: ['我', '总', '是在', '晚上', '失眠', ',', '这个', '症状', '已经', '持续', '很', '长时间', ',', '所以', '晚上', '睡', '不着', '到底', '应该', '怎么', '处理', ',', '请', '告诉我', '一些', '可行的', '建议', '与', '方法', ',', '越', '详细', '越好', '']
Inference time: 0.02396416664123535 s
[136.0453770160675, 0.054793596267700195, 0.023982524871826172, 0.024051427841186523, 0.022850751876831055, 0.02391529083251953, 0.023853063583374023, 0.023738861083984375, 0.023859024047851562, 0.02396416664123535]
import torch
import intel_extension_for_pytorch as ipex
import os
import time
from transformers import AutoModel,BloomForCausalLM
from transformers import AutoTokenizer
import numpy as np
from itertools import chain
import pathlib
import argparse
import json
from gpu_benchmark_util import BenchmarkWrapper

if __name__ == '__main__':
    parser = argparse.ArgumentParser('OPT generation script', add_help=False)
    parser.add_argument('-m', '--model-dir', default="/home/ubuntu/Downloads/kiwi/llm/bloomz-560m/", type=str)
    parser.add_argument('--max-new-tokens', default=32, type=int, help="output max new tokens")
    parser.add_argument('--input-tokens', default='32', type=str)
    args = parser.parse_args()
    print(args)

    # Same 32-token Chinese prompt as above (roughly: "I always have insomnia at night...
    # please give me practical, detailed suggestions.")
    prompt_32 = "我总是在晚上失眠,这个症状已经持续很长时间,所以晚上睡不着到底应该怎么处理,请告诉我一些可行的建议与方法,越详细越好"
    prompt = prompt_32

    print(f"Test {args.model_dir}...")
    model = BloomForCausalLM.from_pretrained(args.model_dir, trust_remote_code=True)  # plain transformers, no bigdl-llm
    model = model.half().to('xpu')
  #  model = model.to('xpu')
    model = BenchmarkWrapper(model)
    print(model.dtype)
    tokenizer = AutoTokenizer.from_pretrained(args.model_dir, trust_remote_code=True)
    inputs = tokenizer([prompt], return_tensors="pt").to('xpu')
    print(inputs["input_ids"].shape)

    total_list = []
    e2e_time = []
    with torch.inference_mode():
        for i in range(10):
            torch.xpu.synchronize()
            st = time.time()
            inputs = tokenizer([prompt], return_tensors="pt").to('xpu')
            print(inputs["input_ids"].shape)
            # greedy decoding (do_sample=False); temperature has no effect without sampling
            output = model.generate(**inputs, do_sample=False, temperature=0.9, max_new_tokens=32)
            gen_ids = output[0]
            # batch_decode on a 1D id tensor decodes each token id separately,
            # which is why the printed Output above is a list of individual tokens
            gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True)
            torch.xpu.synchronize()
            end = time.time()
            e2e_time.append(end-st)

    print('Prompt:', prompt)
    print('Output:', gen_text)
    print(f'Inference time: {end-st} s')
    print(e2e_time)