intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Strange CPU performance curve when using ChatGLM3 to run inference on inputs with tens of thousands of tokens #10683

Open YUO4YUM opened 7 months ago

YUO4YUM commented 7 months ago

When using ChatGLM3 to run inference on a 17000-token input, I encountered a strange CPU performance curve and was unable to obtain inference results for a long time.

environment

bigdl-llm==2.4.0

code

import torch
from bigdl.llm.langchain.llms import TransformersLLM

# `max_length` and `query` are defined earlier in the original script;
# the query holds the long (~17000-token) input text.
llm = TransformersLLM.from_model_id_low_bit(
    "../checkpoints/chatglm3-6b-32k-optimized",
    {"trust_remote_code": True, "max_length": max_length})

template = """
    You are a text summary expert, proficient in extracting events from the user's query. \n

    You need to extract the events that occurred from the user's query.\n

    % USER QUERY:

    {query}

    YOUR RESPONSE:

    """

with torch.inference_mode():
    import pdb; pdb.set_trace()
    prompt = template.format(query=query)
    llm_result = llm(prompt, max_new_tokens=128)
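
For reference, a quick way to confirm how many tokens the formatted prompt actually contains is to run it through the model's own tokenizer. A minimal sketch (not part of the original report), assuming the tokenizer files are available next to the original checkpoint:

from transformers import AutoTokenizer

# load the ChatGLM3 tokenizer from the original (non-optimized) checkpoint; adjust the path as needed
tokenizer = AutoTokenizer.from_pretrained("../checkpoints/chatglm3-6b-32k",
                                          trust_remote_code=True)
prompt = template.format(query=query)
print(f"prompt length: {len(tokenizer.encode(prompt))} tokens")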

performance curve with 17000 tokens

[image]

performance curve with 1700 tokens

[image]

When running inference on the 17000-token input, the CPU utilization jumped up and down strangely and inference took a very long time. When I reduced the input to 1700 tokens, the performance curve was normal and I got results quickly.

question

I would like to know: does BigDL-LLM support inference on inputs of more than 10000 tokens?

hzjane commented 7 months ago

@YUO4YUM I tried to run inference with 17000 tokens but I can't reproduce this issue. Maybe you need to follow this guide to upgrade from bigdl-llm to ipex-llm and use the latest ipex-llm to run the latest example code. If you still have problems, please provide the latest code and bash command so I can reproduce it.
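
For reference, the upgrade essentially replaces the bigdl-llm package and the bigdl.llm imports with their ipex-llm counterparts. A minimal sketch of the migrated loading code, assuming a CPU-only install (the exact pip extras and max_length value are illustrative; see the migration guide for the exact command):

# assumed upgrade command for a CPU environment
# pip install --pre --upgrade ipex-llm[all]

from ipex_llm.langchain.llms import TransformersLLM   # was: from bigdl.llm.langchain.llms import TransformersLLM

llm = TransformersLLM.from_model_id_low_bit(
    "../checkpoints/chatglm3-6b-32k-optimized",
    {"trust_remote_code": True, "max_length": 32768})  # max_length value is illustrative
# then call llm(prompt, max_new_tokens=128) exactly as before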

YUO4YUM commented 7 months ago

Thanks for the reply! @hzjane I tried ipex-llm for inference, and the CPU curve is normal this time, but I ran into a new problem.

bug report

Traceback (most recent call last):
  File "E:\intel-amd\long_text_infer.py", line 192, in <module>
    llm_result = llm(prompt, max_new_tokens=128)
  File "D:\anaconda3\envs\vchat\lib\site-packages\langchain\llms\base.py", line 786, in __call__
    self.generate(
  File "D:\anaconda3\envs\vchat\lib\site-packages\langchain\llms\base.py", line 582, in generate
    output = self._generate_helper(
  File "D:\anaconda3\envs\vchat\lib\site-packages\langchain\llms\base.py", line 488, in _generate_helper
    raise e
  File "D:\anaconda3\envs\vchat\lib\site-packages\langchain\llms\base.py", line 475, in _generate_helper
    self._generate(
  File "D:\anaconda3\envs\vchat\lib\site-packages\langchain\llms\base.py", line 961, in _generate
    self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
  File "D:\anaconda3\envs\vchat\lib\site-packages\ipex_llm\langchain\llms\transformersllm.py", line 260, in _call
    output = self.model.generate(input_ids, streamer=streamer,
  File "D:\anaconda3\envs\vchat\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\vchat\lib\site-packages\transformers\generation\utils.py", line 1538, in generate
    return self.greedy_search(
  File "D:\anaconda3\envs\vchat\lib\site-packages\transformers\generation\utils.py", line 2362, in greedy_search
    outputs = self(
  File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\mh-chen/.cache\huggingface\modules\transformers_modules\chatglm3-6b-32k-optimized\modeling_chatglm.py", line 940, in forward
    transformer_outputs = self.transformer(
  File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\mh-chen/.cache\huggingface\modules\transformers_modules\chatglm3-6b-32k-optimized\modeling_chatglm.py", line 833, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\mh-chen/.cache\huggingface\modules\transformers_modules\chatglm3-6b-32k-optimized\modeling_chatglm.py", line 643, in forward
    layer_ret = layer(
  File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\mh-chen/.cache\huggingface\modules\transformers_modules\chatglm3-6b-32k-optimized\modeling_chatglm.py", line 547, in forward
    attention_output, kv_cache = self.self_attention(
  File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\anaconda3\envs\vchat\lib\site-packages\ipex_llm\transformers\models\chatglm2_32k.py", line 148, in chatglm2_32k_attention_forward
    cache_k, cache_v = kv_cache
ValueError: not enough values to unpack (expected 2, got 1)
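
The failure itself is an ordinary tuple-unpacking error: the attention forward in chatglm2_32k.py expects the cached key/value pair as a 2-tuple, but receives a cache with a single element. A minimal illustration of the mechanism (not the library code):

kv_cache = (None,)               # hypothetical 1-element cache
try:
    cache_k, cache_v = kv_cache  # same unpacking pattern as in chatglm2_32k_attention_forward
except ValueError as e:
    print(e)                     # not enough values to unpack (expected 2, got 1)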

code

import ipex_llm
import torch
from transformers import AutoTokenizer
from bigdl.llm.langchain.llms import TransformersLLM
from bigdl.llm.transformers import AutoModel
from bigdl.llm import optimize_model

# ChatGLM3-6b-32k optimize
print("\033[1;32mOptimize LLM...\033[0m")
model = AutoModel.from_pretrained('./checkpoints/chatglm3-6b-32k',
                                    load_in_low_bit="sym_int8",
                                    trust_remote_code=True)
model.save_low_bit('./checkpoints/chatglm3-6b-32k-optimized')
tokenizer = AutoTokenizer.from_pretrained('./checkpoints/chatglm3-6b-32k',
                                            trust_remote_code=True)
tokenizer.save_pretrained('./checkpoints/chatglm3-6b-32k-optimized')

# load model
llm = TransformersLLM.from_model_id_low_bit(
    "./checkpoints/chatglm3-6b-32k-optimized",
    {"trust_remote_code": True, "max_length": 4096})

# the full text of the blog post linked under "Text source" below was pasted here as the query
query = """

"""

template = """
    You are a text summary expert, proficient in extracting events from the user's query. \n

    You need to extract the events that occurred from the user's query.\n

    % USER QUERY:

    {query}

    YOUR RESPONSE:

    """
with torch.inference_mode():
    prompt = template.format(query=query)
    llm_result = llm(prompt, max_new_tokens=128)
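
As a cross-check that bypasses the LangChain wrapper, the saved low-bit checkpoint can also be loaded and run directly. A minimal sketch, assuming load_low_bit is available in the installed version and that query holds the same blog text:

import torch
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModel

model = AutoModel.load_low_bit('./checkpoints/chatglm3-6b-32k-optimized',
                               trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('./checkpoints/chatglm3-6b-32k-optimized',
                                          trust_remote_code=True)

with torch.inference_mode():
    input_ids = tokenizer(template.format(query=query), return_tensors="pt").input_ids
    output = model.generate(input_ids, max_new_tokens=128)
    print(tokenizer.decode(output[0], skip_special_tokens=True))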

environment

ipex-llm==2.1.0b20240407
transformers==4.31.0

Text source

I copied the text of this blog post as the query: https://www.luminis.eu/blog/llm-series-part-1-a-comprehensive-introduction-to-large-language-models/

hzjane commented 7 months ago

@YUO4YUM I ran into this error too when running chatglm3-6b-32k, and we will fix it in this PR.

hzjane commented 7 months ago

@YUO4YUM We have fixed this issue; you can use the latest ipex-llm version to test.
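
For reference, a quick way to confirm which build is actually installed before re-running the script (using the pip distribution name shown in the environment section above):

from importlib.metadata import version

# should print a build newer than 2.1.0b20240407
print(version("ipex-llm"))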