YUO4YUM opened this issue 7 months ago
@YUO4YUM I tried to run inference with 17000 tokens but could not reproduce this issue. You may need to follow this guide to upgrade bigdl-llm to ipex-llm and run the latest example code with the latest ipex-llm. If you still have problems, please provide the latest code and bash command so that I can reproduce it.
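For reference, the upgrade mostly amounts to replacing the bigdl.llm namespace with ipex_llm; a minimal sketch of the import change (class names assumed unchanged, which matches the module paths shown in the traceback below):

# Before (bigdl-llm):
#   from bigdl.llm.langchain.llms import TransformersLLM
#   from bigdl.llm.transformers import AutoModel

# After (ipex-llm), same class names under the new top-level package:
from ipex_llm.langchain.llms import TransformersLLM
from ipex_llm.transformers import AutoModel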
Thanks for the reply! @hzjane I tried ipex-llm for inference; the CPU curve is normal this time, but I encountered a new problem:
Traceback (most recent call last):
File "E:\intel-amd\long_text_infer.py", line 192, in <module>
llm_result = llm(prompt, max_new_tokens=128)
File "D:\anaconda3\envs\vchat\lib\site-packages\langchain\llms\base.py", line 786, in __call__
self.generate(
File "D:\anaconda3\envs\vchat\lib\site-packages\langchain\llms\base.py", line 582, in generate
output = self._generate_helper(
File "D:\anaconda3\envs\vchat\lib\site-packages\langchain\llms\base.py", line 488, in _generate_helper
raise e
File "D:\anaconda3\envs\vchat\lib\site-packages\langchain\llms\base.py", line 475, in _generate_helper
self._generate(
File "D:\anaconda3\envs\vchat\lib\site-packages\langchain\llms\base.py", line 961, in _generate
self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
File "D:\anaconda3\envs\vchat\lib\site-packages\ipex_llm\langchain\llms\transformersllm.py", line 260, in _call
output = self.model.generate(input_ids, streamer=streamer,
File "D:\anaconda3\envs\vchat\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "D:\anaconda3\envs\vchat\lib\site-packages\transformers\generation\utils.py", line 1538, in generate
return self.greedy_search(
File "D:\anaconda3\envs\vchat\lib\site-packages\transformers\generation\utils.py", line 2362, in greedy_search
outputs = self(
File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\mh-chen/.cache\huggingface\modules\transformers_modules\chatglm3-6b-32k-optimized\modeling_chatglm.py", line 940, in forward
transformer_outputs = self.transformer(
File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\mh-chen/.cache\huggingface\modules\transformers_modules\chatglm3-6b-32k-optimized\modeling_chatglm.py", line 833, in forward
hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\mh-chen/.cache\huggingface\modules\transformers_modules\chatglm3-6b-32k-optimized\modeling_chatglm.py", line 643, in forward
layer_ret = layer(
File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\mh-chen/.cache\huggingface\modules\transformers_modules\chatglm3-6b-32k-optimized\modeling_chatglm.py", line 547, in forward
attention_output, kv_cache = self.self_attention(
File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\anaconda3\envs\vchat\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "D:\anaconda3\envs\vchat\lib\site-packages\ipex_llm\transformers\models\chatglm2_32k.py", line 148, in chatglm2_32k_attention_forward
cache_k, cache_v = kv_cache
ValueError: not enough values to unpack (expected 2, got 1)
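For context: the ValueError is Python's generic tuple-unpacking failure. chatglm2_32k_attention_forward expects kv_cache to contain a key cache and a value cache, but here it received a container with a single element. A minimal stand-alone illustration with a hypothetical placeholder value, not the real cache object:

# Hypothetical one-element stand-in for the cache; the attention forward
# expects two entries (key cache, value cache).
kv_cache = ("key_cache_only",)

try:
    cache_k, cache_v = kv_cache  # same unpacking as in chatglm2_32k.py line 148
except ValueError as e:
    print(e)  # not enough values to unpack (expected 2, got 1)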
import torch
from transformers import AutoTokenizer
from ipex_llm.langchain.llms import TransformersLLM
from ipex_llm.transformers import AutoModel

# ChatGLM3-6b-32k: optimize once and save a low-bit (sym_int8) copy
print("\033[1;32mOptimize LLM...\033[0m")
model = AutoModel.from_pretrained('./checkpoints/chatglm3-6b-32k',
                                  load_in_low_bit="sym_int8",
                                  trust_remote_code=True)
model.save_low_bit('./checkpoints/chatglm3-6b-32k-optimized')
tokenizer = AutoTokenizer.from_pretrained('./checkpoints/chatglm3-6b-32k',
                                          trust_remote_code=True)
tokenizer.save_pretrained('./checkpoints/chatglm3-6b-32k-optimized')

# Load the low-bit model through the LangChain wrapper
llm = TransformersLLM.from_model_id_low_bit("./checkpoints/chatglm3-6b-32k-optimized",
                                            {"trust_remote_code": True, "max_length": 4096})

# The text of the blog post linked below is pasted here as the query
query = """
"""

template = """
You are a text summary expert, proficient in extracting events from the user's query. \n
You need to extract the events that occurred from the user's query.\n
% USER QUERY:
{query}
YOUR RESPONSE:
"""

with torch.inference_mode():
    prompt = template.format(query=query)
    llm_result = llm(prompt, max_new_tokens=128)
ipex-llm==2.1.0b20240407
transformers==4.31.0
I copied the text of this blog post as the query: https://www.luminis.eu/blog/llm-series-part-1-a-comprehensive-introduction-to-large-language-models/
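To double-check how long the pasted prompt actually is for this model, the same tokenizer can be used to count tokens; a small sketch, assuming the prompt variable from the code above and the saved tokenizer path:

from transformers import AutoTokenizer

# Count how many tokens the formatted prompt produces for this model's tokenizer.
tokenizer = AutoTokenizer.from_pretrained('./checkpoints/chatglm3-6b-32k-optimized',
                                          trust_remote_code=True)
num_tokens = len(tokenizer.encode(prompt))
print(f"prompt length: {num_tokens} tokens")  # around 17000 in the failing case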
@YUO4YUM I met this error too when running chatglm3-6b-32k, and we will fix it in this PR.
@YUO4YUM We have fixed this issue; you can use the latest ipex-llm version to test.
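Before retesting, it may help to confirm which ipex-llm build is installed; a minimal sketch using standard packaging metadata (the pip command in the comment is the usual CPU nightly upgrade path and may differ for your environment):

# Upgrade first, e.g.:
#   pip install --pre --upgrade ipex-llm[all]
from importlib.metadata import version

installed = version("ipex-llm")
print("ipex-llm version:", installed)  # should be newer than 2.1.0b20240407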
When using ChatGLM3 to run inference on a 17000-token input, I encountered a strange CPU performance curve and was unable to obtain inference results for a long time.
environment
code
performance curve with 17000 tokens
performance curve with 1700 tokens
When running inference on a 17000-token input, CPU usage jumped up and down strangely and inference took a very long time. When I reduced the input to 1700 tokens, the performance curve was normal and I got results quickly.
question
I would like to know whether BigDL-LLM supports inference with inputs of more than 10000 tokens.